Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
(USC Thesis Other)
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
COMPUTATIONAL ALGORITHMS FOR STUDYING HUMAN GENETIC VARIATIONS - STRUCTURAL VARIATIONS AND VARIABLE NUMBER TANDEM REPEATS by Jingwen Ren A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2023 Copyright 2023 Jingwen Ren Dedication I dedicate this dissertation to my beloved parents and boyfriend. ii Acknowledgements First and foremost, I would like to express my deep gratitude to my advisor, Dr. Mark Chais- son, for his exceptional mentorship and unwavering support throughout my Ph.D. journey. Dr. Chaisson has been an inspiring role model to me because of his passion for research and his perseverance in overcoming challenges. In particular, I remember struggling with the revision of my rst paper, and it was Dr. Chaisson who expertly guided me through the entire process. His guidance was instrumental in the development of my research skills and in achieving the milestones I accomplished during my Ph.D. Without his support, none of this would have been possible. I would also like to extend my appreciation to my Ph.D. qualication and dissertation com- mittee members: Dr. Fengzhu Sun, Dr. Liang Chen, Dr. Andrew Smith, and Dr. Serghei Mangul, for their valuable feedback and guidance in shaping my research. I am deeply grateful to Dr. Minping Qian, who introduced me to the magnicent world of computational biology. Under her guidance as an undergraduate student, I had the opportunity to conduct research that opened my eyes to the vast possibilities in this eld. My gratitude extends to Dr. Andrew Smith for teaching me about algorithms in computational biology and introducing me to the fascinating world of this eld. iii I would also like to express my gratitude to my labmate, Bida Gu, for his collaboration with me on the vamos project. Bida’s meticulous attention to detail has been truly inspiring. I am also immensely grateful for the support and camaraderie of all my current and former labmates at Chaisson lab, including Dr. Tsung-Yu Lu, Robel Dagnew, Jianzhi Yang, Bida Gu, Keon Rabbani, and Walfred Ma. Our shared experiences over the past six years will always hold a special place in my heart. Last but certainly not least, I would like to express my utmost gratitude to my parents, De Ren and Shuyu Li, and my boyfriend, Zifan Zhu, for their unwavering love and support throughout this journey. Their constant presence and encouragement have been an immeasurable source of strength and inspiration to me. iv TableofContents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 SVs and VNTRs are important sources of human genetic diversities . . . . . . . . 1 1.2 Long read alignment enables more accurate detection and genotype of SVs . . . . 5 1.3 Genotyping VNTR using long reads and assemblies . . . . . . . . . . . . . . . . . 8 1.4 Other authors and contributors to the dissertation . . . . . . . . . . . . . . . . . . 11 Chapter 2: lra: A Long Read Aligner for sequences and contigs . . . . . . . . . . . . . . . 12 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.1 Building a minimizer index and sequence matching . . . . . . . . . . . . . 15 2.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2.1 Rough clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2.2 Fine clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.3 Cluster splitting and chaining . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.4 Cluster renement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.5 Alignment renement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.6 Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.6.1 SDP algorithm with concave gap cost function - dening subproblems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2.6.2 SDP algorithm with concave gap cost function - conquering subproblems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.6.3 Extension to inversion cases . . . . . . . . . . . . . . . . . . . . 29 2.2.6.4 Time complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3.1 Evaluation of lra alignments . . . . . . . . . . . . . . . . . . . . . . . . . . 32 v 2.3.2 Variant discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.2.1 SV analysis of simulation reads . . . . . . . . . . . . . . . . . . . 35 2.3.2.2 SV analysis of long reads . . . . . . . . . . . . . . . . . . . . . . 36 2.3.2.3 SV analysis of assembly contigs . . . . . . . . . . . . . . . . . . 39 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Chapter 3: vamos: VNTR annotation using ecient motif sets . . . . . . . . . . . . . . . . 46 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.1 Generating a reference motif set . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.2 Ecient motif selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2.2.1 Integer linear programming formulation . . . . . . . . . . . . . 54 3.2.3 Annotating motif composition on sequence data . . . . . . . . . . . . . . . 56 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.1 Ecient motif discovery for a reference assembly panel . . . . . . . . . . 57 3.3.2 Allelic diversity in 64 haplotype-resolved de novo assemblies . . . . . . . . 60 3.3.3 Annotating diverged populations . . . . . . . . . . . . . . . . . . . . . . . 61 3.3.4 Analysis of aligned long-reads . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.5 Comparison with interval-based merging of population calls . . . . . . . . 63 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Chapter 4: Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.1 Future work for the lra project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 Future work for the vamos project . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Chapter 5: Supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1 Supplementary materials for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.1 Software availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.2 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.3 Algorithm details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.1.3.1 A detailed example of visualizing subproblems division . . . . . 83 5.1.4 Impact of local minimizer index . . . . . . . . . . . . . . . . . . . . . . . . 86 5.1.5 Assembly exclusive calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.1.6 Analysis of variant calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2 Supplementary materials for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.1 Software availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.2 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.3 Theorem of ecient motif set selection problem . . . . . . . . . . . . . . . 97 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 vi ListofTables 2.1 Performance of alignment methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.2 Statistics of structural variant call sets. . . . . . . . . . . . . . . . . . . . . . . . . 41 2.3 Truvari classication of variant calls. . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.4 Classication of inversions detected in HG002 using PacBio and ONT reads . . . 42 2.5 Assembly based calls comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.6 Read based support of assembly calls. . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.1 The mean ecient motif set size and compression ratio. . . . . . . . . . . . . . . . 59 3.2 Comparison between vamos and greedy method. . . . . . . . . . . . . . . . . . . . 59 3.3 Average number of alleles per locus when all 58 HGSVC and 90 HPRC haplotypes were annotated using the original and ecient set of motifs. . . . . . . . . . . . . 61 3.4 Average number of alleles per locus. . . . . . . . . . . . . . . . . . . . . . . . . . 65 S1 The count of indels from PacBio CLR alignments. . . . . . . . . . . . . . . . . . . 86 S2 minimap2 assembly exclusive large SV calls coordinates. . . . . . . . . . . . . . . 87 S3 Assembly exclusive large SV calls counts. . . . . . . . . . . . . . . . . . . . . . . . 87 S4 lra assembly exclusive large SV calls counts. . . . . . . . . . . . . . . . . . . . . . 88 S5 Shared variant calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 S6 Calls unique to a callset in tandem repeats and segmental duplications. . . . . . . 91 vii S7 Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with simulated SVs. . . . . . . . . 92 S8 Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with complicated nested SVs. . . 93 S9 Comparison of the breakpoints on real datasets. . . . . . . . . . . . . . . . . . . . 94 S10 Comparison of the breakpoints on simulated SVs. . . . . . . . . . . . . . . . . . . 95 S11 Truvari classication of cuteSV variant calls. . . . . . . . . . . . . . . . . . . . . . 96 viii ListofFigures 2.1 Example of clustering anchors prior to CG-SDP. . . . . . . . . . . . . . . . . . . . 20 2.2 Visualization of subproblems divisions. . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 Mapping accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4 The precision and recall of HG002 SV call sets. . . . . . . . . . . . . . . . . . . . . 43 3.1 vamos workow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Visualization of allele structure change withq . . . . . . . . . . . . . . . . . . . . 68 3.3 Diversity of ecient motif sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.4 Distribution of number of VNTR alleles for 58 HGSVC and 90 HPRC human haplotypes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.5 Error of vamos annotation in simulated data . . . . . . . . . . . . . . . . . . . . . 70 3.6 Analysis of raw sequencing reads of NA24385. . . . . . . . . . . . . . . . . . . . . 71 3.7 Comparison of interval-based merging and vamos. . . . . . . . . . . . . . . . . . . 72 S1 A detailed example of visualizing subproblems division. . . . . . . . . . . . . . . . 85 S2 Mapping accuracy of lra w/ and w/o rening local minimizers. . . . . . . . . . . . 87 S3 Example of insertion exclusively appear in lra but not minimap2 alignment. . . . 89 S4 Example of insertion exclusively appear in lra but not minimap2 alignment. . . . 90 ix Abstract Human genomes are rich in genetic variations, and understanding their relationship with phe- notypes has been a primary goal of genetic research. Over that last two decades, Genome-wide association studies (GWAS) have focused primarily on single nucleotide polymorphisms (SNPs) and small indels that co-occur with specic disease or trait phenotypes. Despite the wealth of data generated from GWAS, these markers explain only a fraction of the heritability of complex traits and disorders. Consequently, researchers have shifted their attention to other genetic varia- tions such as structural variations (SVs) and variable number tandem repeats (VNTRs) to address the "missing heritability" problem. However, identifying and characterizing SVs and VNTRs us- ing short-read sequencing (SRS) technology has been challenging, primarily due to limitations in read length and reference bias. Recent advancements in long-read sequencing (LRS) technol- ogy and assembly algorithms have facilitated better resolution of SVs and VNTRs. In this thesis, we present algorithms that can eectively resolve large SVs (>=50bp) and VNTRs using LRS and assemblies. The proposed methodology has the potential to improve our understanding of the functional impact of SVs and VNTRs. x Chapter1 Introduction 1.1 SVsandVNTRsareimportantsourcesofhumangenetic diversities Human genomes are a vast reservoir of genetic variations. Over the years, researchers have long been devoted to understanding the underlying correlations between genotypes and phenotypes. Early attempts to use linkage analysis of aected pedigrees to map gene and gene variants aect- ing monogenic disorders were successful [11]. However, traditional linkage studies have several limitations that hinders its utility. First, the resolution for mapping casual variants is typically low due to the limited number of recombination events in pedigrees. Second, individual casual variants often have small eect sizes, which makes it dicult to detect for complex traits and disorders [101]. Over the past two decades, genome-wide association studies (GWAS) have ex- tensively explored the impact of single nucleotide polymorphisms (SNPs) on complex traits and diseases like Alzheimer’s [30], type 2 diabetes [85], and schizophrenia disease [70]. While SNP- based GWAS have yielded promising results, the contribution of SNPs to heritability is limited. 1 In complex traits and disease disorders such as height [105] and schizophrenia [54], common SNPs account for only a fraction of the variance. To address the "missing heritability" problem in complex traits and disorders, researchers have turned their attention to other types of genetic variations, including structural variations (SVs) and variable number tandem repeats (VNTRs) [34, 29, 103], which hold the promise of shedding light on the underlying genetic architecture of complex traits and diseases. SVs are genetic variants with size greater than 50bp, which includes insertions, deletions, inversions, duplications and translocations. SVs contribute signicantly to the genomic diversity in human genomes [4, 15, 83, 96] and are implicated in numerous diseases, such as cognitive disabilities, predispositions to obesity, and cancer [64, 103]. Studies have demonstrated that SVs are three times more likely to be associated with a GWA study signal than SNPs, with larger SVs (>20 kb) exhibiting up to a 50-fold greater potential to aect gene expression than a SNP [18]. Therefore, a comprehensive understanding of the entire spectrum of SVs is essential to human health research. The advent of short-read sequencing (SRS) technology has enabled the construction of SV sets and genotyping across a population using short-read DNA sequence data from projects like 1000 Genome Project [68, 20, 96]. However, identication and characterization of SVs using SRS has two main challenges [97]. First, the read length of SRS is often insucient to cover large SVs. Second, SVs are often enriched in repetitive regions, such as VNTRs [1, 15, 27, 79], where degenerate alignment and reference bias can occur, as these regions are underrepresented in the consensus reference and are often collapsed in unphased genome assemblies. Recent advancements in long-read sequencing (LRS) technologies and improved assembly algorithms have facilitated the resolution of SVs in complex and repetitive regions, including 2 VNTRs [3, 17, 27, 67, 69, 88, 93]. For instance, study from Human Genome Structural Varia- tion Consortium (HGSVC) indicates 64 haplotypes-resolved assemblies generated from LRS dis- covered 68% more novel SVs than SRS [27]. SV discovery usually starts with mapping samples sequences (reads or contigs) to the reference genome and inferring variations as dierences be- tween sequences and reference in the alignment. In this thesis, we present a fast and ecient algorithm for aligning long reads and assembled contigs to a linear reference genome, with the goal to generate improved alignment around large SVs ( 50bp). In the following section 2, we discuss the signicance, challenges, and opportunities associated with this approach. Despite the improvement of the long-read and assembly alignments in large SVs, accurately identifying and detecting SVs directly from alignment in highly repetitive and variable regions, such as VNTRs, remain a challenge. VNTRs are repetitive sequences consisting of inexact tandem duplication of repeat units larger than 6bp. VNTRs are widely distributed in the human genome, comprising approximately 3% of its total length, and are frequently located in protein-coding exons [49, 81] as well as regulatory regions proximal to a gene [38, 100]. VNTRs are highly polymorphic due to variations in repeat composition and repeat units (RU) counts among individuals, resulting in multiallelic variants. Protein-altering VNTRs have been found to be strongly associated with various human pheno- types, such as height, hair morphology, and biomarkers of health [72, 77]. Additionally, VNTRs have been implicated in multiple human diseases, such as Huntington’s, Fragile XSyndrome [60], Type I diabetes [99], schizophrenia [94], and Alzheimer’s [24]. These ndings suggest that VN- TRs may play important roles in human biology and disease susceptibility. However, genotyping VNTRs can be challenging due to the hypervariability of VNTRs in repeat motif composition and RU counts. 3 One commonly used method for genotyping VNTRs is based on alignment, which denes genotype as the presence of known SVs. This approach involves mapping reads or assemblies from samples to a linear reference genome, identifying SVs from alignments, and merging SVs from dierent reads or samples based on their size and breakpoint distance [50, 15, 45, 27, 16]. However, this method has several limitations. For example, variations in RU counts can cause inconsistent read alignment around SVs, and the merging of SVs can collapse dierent alleles. Additionally, the presence of SVs does not fully capture the extent of sequence mutations in VNTRs. Alternative methods for genotyping VNTRs, such as GangSTR [71], adVNTR [8], adVNTR- NN [7], and danbing-tk [61], dene genotype as the number of RU (aka. VNTR length) instead of the presence of known SVs. However, almost all of them focus on estimating VNTR lengths from SRS. Only adVNTR and adVNTR-NN can be applied to LRS, but they can only handle relative low complexity VNTRs. Moreover, relying solely on RU counts has limitations as it cannot capture the full spectrum of sequence mutations present in VNTRs. As a result, there is a clear need for alternative methods that incorporate sequence-level variation, especially by leveraging the benets of LRS. In this thesis, we propose a novel approach for genotyping VNTRs that leverages long reads and haplotype-resolved assemblies to capture variation in both the repeat composition and RU counts. This method is particularly useful as larger and more comprehensive studies of long reads and assemblies become increasingly available. Our approach can be applied to these datasets, which can further enhance the investigation of the role of VNTRs in complex traits and disor- ders. Section 3 provides an in-depth discussion on the signicance, challenges, and potential opportunities of this methodology. 4 1.2 Long read alignment enables more accurate detection andgenotypeofSVs The discovery and genotyping of SVs are challenging due to their enrichment in repetitive re- gions, like VNTRs, where the complex SVs can occur [1, 15, 27, 39, 79]. The initial attempts of long-read assembly studies found that around 60% SVs 50 bases in humans are in VNTR loci [15, 27]. Early SV discovery methods relied on microarrays were limited to a small num- ber of samples. With the development of SRS technology, it became possible to construct SVs sets and genotype across a population using short-read DNA sequence data from projects like 1000 Genome Project [68, 20, 96]. Despite the progress, SRS has limitations in resolving large and complex SVs especially in repetitive regions, because the insucient read length is unable to cover expanded VNTR alleles. Moreover, these regions are underrepresented in the consensus reference, where degenerate alignment and reference bias are possible [19, 14, 48, 53, 76, 98]. In the last decade, LRS technologies such as Pacic Biosciences (PacBio) and Oxford Nanopore (ONT) have emerged as routine approaches for sequencing genomes, enabling more comprehen- sive studies and characterizations of SVs. The technologies produce reads with an average length of 20kb or more and an error rate 15% or less. A study discovered 107,590 SVs, of which 68% were not found by SRS, by leveraging long-read assemblies and strand-specic sequencing tech- nologies [27, 96]. These ndings demonstrate the advantages of LRS technologies in discovering and genotyping SVs, particularly those in repetitive regions like VNTRs. A comprehensive un- derstanding of the full spectrum of SVs is crucial to human health, and LRS technologies oer a promising avenue to achieve this goal. 5 Accurately aligning long reads and assembly contigs to a linear reference genome is a crucial rst step in characterizing and genotyping SVs using LRS. In this thesis, we focus on improving alignment across SVs with LRS. Needle-Wunsch dynamic programming algorithm was developed in 1970s for globally com- paring two biological sequences [73], while Smith-Waterman algorithm was invented a decade later to determine the similar sub-regions of two sequences [92]. Although both algorithms are elegant and eective, their quadratic time complexity makes them too computationally expen- sive for aligning long reads and assembly contigs. A common approach for long-read alignment is seeding and chaining, pioneered by BLAST [66]. This method speeds up the process of search- ing similarities between two sequences by reducing the search space and focusing on the most promising regions of similarity. Seeding involves identifying short exact matches, or "seeds", between the two sequences, which can be found using various data structures such as a BWT-FM index, sux arrays [13] and minimizers [86, 56]. Chaining involves linking together the seeds to form longer regions of similarity, taking into account gaps and mismatches between the seeds. In this thesis, we adopted this common approach - seeding and chaining in our long-read alignment algorithm. However, alignment in repetitive regions, such as VNTRs, can be challenging, due to the hy- pervariability of RU counts. This can result in gaps being placed in dierent locations or multiple small gaps being placed instead of a single large one, depending on the alignment algorithm and gap penalties used. These issues lead to problems in subsequent SV detection and genotyping, as multiallelic SVs are often collapsed into a single allele when merged across reads/haplotypes by sizes and breakpoint distances [15, 27]. 6 To overcome this issue, a concave gap penalty function [89, 56] is more preferred over an ane one to model SVs. Such function increases the gap penalty gradiently slower as the gap length increases, thus it considers longer indels resulted from genetic variation as a single unit, which is more representative of biological reality than a combination of multiple single deletions or insertions. Concave gap penalty function is able to identify larger SVs within repetitive regions, improving subsequent SV discovery and merging accuracy. Existing aligners, minimap2 [56], ngmlr [89] have adopted a concave gap penalty function, and are demonstrated to be accurate and eective for mapping LRS across SVs. However, both minimap2 and ngmlr rely on heuristics for chaining, which can lead to suboptimal results. For example, minimap2 looks back up to 50 anchors to nd the best predecessor for an anchor, which may result in splitting of large SVs and suboptimal chain. ngmlr maintains two additional ma- trices as the indel length estimates for each cell, to avoid scanning the full row and column to compute the correct value of any cell. But the accuracy of these estimates may be limited, leading to suboptimal chain. Further improvements in long-read alignment are needed to develop more accurate and eective methods for chaining with a concave gap penalty function. As a matter of fact, an exact solution exists, which is called sparse dynamic programming with a concave gap penalty (CG-SDP) [32], and is slightly less ecient than linear-cost SDP. However, as presented in the original paper, the algorithm requires asynchronous processing and has never been imple- mented for sequence alignment in computational biology. Furthermore, the algorithm does not take into account inversions and translocations. In this thesis, we explore the application of CG-SDP to long-read and assembly alignment and present a new alignment method — lra. We have simplied the implementation of the CG- SDP algorithm, and extended it to accommodate genome arrangements such as inversions and 7 translocations. Our results show that lra improves SV discovery while maintaining a comparable runtime to state-of-the-art methods. 1.3 GenotypingVNTRusinglongreadsandassemblies Despite the potential importance of VNTRs in genetic basis of human traits and complex dis- eases, their association with such traits and diseases remains largely unknown, primarily due to diculties in genotyping VNTRs [29, 34, 103]. The hypervariability of RU counts and repeat com- position among individuals makes it challenging to align VNTR sequences to a linear consensus reference genome due to degenerate alignment and reference bias. Recent studies have attempted to overcome this issue by exploring novel methods for VNTRs genotyping and studying their relationship with gene expression [15, 27, 45, 71, 8, 7, 61]. One such method is the alignment-based method, which denes genotype as the presence of known SVs. This approach involves mapping long reads/assemblies from samples to a linear reference genome, identifying SVs from alignments, and merging SVs from dierent samples based on their size and breakpoint distance [16, 15, 27, 45, 50]. For instance, a study of 3,622 Icelandic individuals sequenced with LRS used a combination of pre-ltering and clique-based clustering to identify SV alleles in VNTR sequences that associate with height, atrial brillation, and recombination by VNTR length [10]. Another recent study of long-read data for 31 samples used a SV proximity graph and a constrained minimum spanning forest algorithm to discover many SVs associated with gene expression [50]. However, these method have some limitations. First, the changes in RU counts make it dicult to align even long reads consistently, as the gaps in the reads can be placed in dierent locations, depending on the alignment algorithm and gap penalties used. 8 Second, the SV merging heuristics across haplotypes can collapse dierent mutations, resulting in fewer total alleles. Finally, the presence of SVs only reveals changes in RU counts but not polymorphisms in repeat compositions, thus does not reveal the full extent of sequence mutations in VNTRs. Alternative methods for VNTR genotyping include GangSTR, adVNTR, adVNTR-NN, and danbing-tk dene genotype as the number of RU (aka. VNTR length). Existing methods like GangSTR [34], adVNTR, adVNTR-NN [8, 7], danbing-tk [61] have shown promise in estimat- ing VNTR length. GangSTR extracts information from pair-end reads alignment into a unied probabilistic model and estimate the VNTR length by maximum likelihood. adVNTR, adVNTR- NN leverages HMM to determine the VNTR length from short reads/ long reads. danbing-tk [61] maps short reads to a repeat-pangenome graph (RPGG) to estimate VNTR length. However, these methods have limitations. GangSTR, danbing-tk are designed specially for short reads, thus they cannot handle expanded VNTRs beyond the read length of most SRS technologies. GangSTR, ad- VNTR and adVNTR-NN support only relative low complexity tandem repeats that are formed by a single motif. Most importantly, the narrow denition of VNTR genotype as RU counts cannot describe the full extent of the sequence mutation in VNTRs. A study on schizophrenia identied an association with the CACNA1C gene that was based on the presence of dierent motif alleles rather than the length of an intronic VNTR. This nding suggests that relying solely on VNTR length to identify alleles can overlook important associations. To address the limitation of the narrow denition of VNTR genotype, the authors of danbing- tk has improved their genotyping method by dening VNTR genotype as a motif count vector, where each motif represents a sequence node in the RPGG graph. They are able to nd 9,422 out of 39,125 VNTRs are associated with nearby gene expression through motif mutation, of which 9 only 23.4% associations are accessible from VNTR length [63]. Although this method has made signicant progress of studying the relation of VNTRs with gene expression, it is designed only for SRS. As larger scale long-read sequencing studies become available, a more eective genotype method suitable for long reads may help shed light on the role of VNTRs in human genetics, since LRS and their assemblies has the potential to resolve large and complex VNTR alleles. In this thesis, we propose a novel method to genotype VNTRs using LRS or haplotype-resolved assemblies. We dene VNTR genotype as a string of motifs, which captures variations in both RU counts and repeat motif composition. In a high level, we develop an automated approach for constructing a repeat motif database for each individual VNTR locus from a reference set of haplotype-resolved assemblies. To annotate a new VNTR sequence, we use the StringDecomposer algorithm [26], which takes as input the query sequence and the library of repeat motifs and determines the VNTR genotype by identifying the closest string of motifs in the database based on the edit distance between the raw query sequence and the annotated string. One naive approach to creating a motif database for each VNTR locus is to compile a non- redundant set of repeat motifs identied by a repeat discovery tool such as Tandem Repeats Finder (TRF) [9], applied to a reference set of VNTR sequences. However, this method may generate an extensive list that includes many rare motifs, which may obscure the pattern of repetition. For example, the sequence ACGGTACGGTACCGTACGT may be decomposed into [ACGGT, ACGGT, ACCGT, ACGT], however (ACGGT) 4 (4 repetitions) is more concise and has a low divergence from the original sequence. Thus we dene an ecient motif set as a subset of original motifs, in which rare motifs are replaced by more common ones while maintaining a bounded total replacement cost. 10 To nd such ecient motif sets, we have developed a toolkit, VNTR Annotation Using Ef- cient Motifs Set (vamos) that discovers original motifs using a reference set of genomes, and nds ecient motif sets given a divergence parameter. We generated ecient motif set for each VNTR locus from 64 haplotype-resolved assemblies sequenced with LRS by the Human Genome Structural Variant Consortium (HGSVC) [27] under three levels of divergence parameter (). As a proof of concept, we use our method to quantify diversity of VNTR sequences in the haplotype- resolved assemblies, and evaluate its application on long reads and simulated new genomes. 1.4 Otherauthorsandcontributorstothedissertation Dr. Mark Chaisson, Bida Gu. 11 Chapter2 lra: ALongReadAlignerforsequencesandcontigs 2.1 Introduction Studies of genetic variation often begin by aligning sequences from a sample back to a reference genome, and inferring variation as dierences in the alignment. Long read, single molecule se- quencing (LRS) is becoming established as a routine approach for sequencing genomes. The two technologies that produce LRS technologies, Pacic Biosciences (PacBio) and Oxford Nanopore (ONT) generate reads over 50kb at error rate 15% or less. Aligning these sequences is a compu- tationally challenging task for which several methods are available including minimap2, ngmlr, and BLASR [56, 89, 13, 2]. They are demonstrated to be quite fast and accurate, but have limita- tions, particularly when there are large sequence dierences between the read and the reference. This problem is amplied in complex, repetitive regions such as variable-number tandem repeats, that only make up 3% of the human genome, but account for nearly 70% of observed structural variation: insertions and deletions at least 50 bases (SV), and in larger SV [87]. A common approach for mapping LRS reads is seeding and chaining, where an approximate alignment is formed based on a subset (chain) of exact matches between the sequence and the 12 genome. The exact matches may be found using various data structures including variable-length matches using a BWT-FM index or sux arrays [13], and minimizer based indexing [56]. The chaining algorithm used by BLASR uses a linear cost gap function for sparse-dynamic program- ming (SDP) [6]. While the chaining algorithm is ecient, it has long been known that linear-cost gap functions do not accurately reect biological variation [33], and has been shown to decrease sensitivity for detecting SV from LRS alignments [89]. Both the minimap2 and ngmlr aligners use gap penalties that are a concave function of gap length, and are demonstrated to be quite eective for mapping LRS reads across SV with align- ment gaps that reect biological variation. In minimap2, a heuristic algorithm is used for chaining, while ngmlr adopts a Smith-Waterman-like dynamic programming algorithm. An exact solution to sparse dynamic programming with a concave gap (CG-SDP) function exists [32], and is slightly less ecient than linear-cost SDP. However, as presented, the algorithm requires asynchronous processing and has never been implemented for sequence alignment in computational biology. Furthermore, the algorithm does not take into account genome arrangements such as inversions. An additional challenge for variant discovery is ecient alignment of contigs assembled from LRS that now have contiguity on par with the initial release of the human genome [91, 52, 17]. Existing methods exist to align whole genomes, but do not implement concave gap penalties [65], are low-resolution [42], or split alignments across large variants [56]. To explore the application of the exact solution of seed chaining sparse dynamic programming with a concave gap function to read and assembly alignment, we developed an alignment method - lra. We have simplied the implementation of the CG-SDP algorithm [32], and extended to allow for inversions and translo- cations. Finally, we demonstrate that this approach can improve SV discovery while having a runtime on par with state of the art methods. 13 2.2 Methods The alignment of reads and contigs to a reference are generally dened by the maximally scoring local alignment of a queryq to a set of target sequences collectively referred to as a targett with a match bonus and penalties for mismatch, gaps, and inversions/translocations. lra uses a heuristic to nd an approximate local alignment employing the commonly used seeding, chaining, and renement approach that has been applied to all scales of alignment from short-read, long-read, and whole-genome alignment [46, 90, 13]. Each alignment proceeds in four broad steps: seed sequence matching, clustering, chaining, and nally alignment renement. Many recent advances have been made in sequence mapping using a subsampled index on a reference using minimizers or locally sensitive hashing [86, 41]. A minimizer index is parameter- ized by a k-mer sizek and window sizew, and indexes the position of the lexicographically least canonical k-mer in every sequence of lengthw across the genome. We develop a variant on the approach of minimizers that uses adaptive thresholds to limit the total number of positions sam- pled in unique regions of a genome, and increase the sampling of positions near paralog-specic variants that distinguish repetitive regions. Optimal sets of matches between the query and reference are selected in two phases using CG-SDP. Given a set of fragments =f 1 ;:::; n g, each fragment is dened by a start point and an end point on a Cartesian plane and a weight: i = ((x s i ;y s i ); (x e i ;y e i );l i ). The basic CG-SDP formulation is to dene a chain of fragmentsC [1;:::;n] that maximizes jCj X j=1 l C j jCj1 X j=1 gap( C j ; C j +1 ) (2.1) 14 such that C j+1 is above and to the right of C j , andgap is a concave function. An algorithm was presented for chaining using a concave gap cost model [32], however there are no alignment methods that implement this approach. This method has been implemented in lra and extended to allow for inversions in the optimal chain. Additionally, the original description of the algorithm requires asynchronous processing, which we have updated to use standard serial computation. The details of determining minimizers and the chaining algorithm are given below. 2.2.1 Buildingaminimizerindexandsequencematching We add three additional parameters to generate a minimzer index:F M ,N M andW G that limit the density of minimizers that are selected. An initial set of minimizers is determined in the standard approach, with minimizer k-mer parameter k and window w [86]. Next, minimizers of multi- plicity larger than F M are removed. Then the reference is partitioned into intervals of length W G , and the remaining minimizers starting in each interval are selected in order of their multi- plicity in the genome until the rstN M minimizers are obtained, in a more simplied version of weighted minimizer sampling [43]. Dierent parameter settings are used to index a genome for dierent sequencing technologies and contig alignments. When indexing the genome for align- ing HiFi reads, 867M minimizers from total 1015M were left after ltering by frequency threshold F M . In total, 117M minimizers were selected after the nal ltering based onN M andW G . All minimizers from a query sequence are matched against the ltered set of minimizers from the reference. The result of the sequence matching is a set of anchorsA =f 1 ;:::; n g, where i is a tuple (x i ;y j ;k), whereq [x i ;x i+1 ;:::;x i+k1 ] of the query matchest [y j ;y j+1 ;:::;y j+k1 ] of the target. 15 The chains of anchors for low-accuracy single-pass SMS reads may be sparse or have spurious o-diagonal matches in repetitive sequences. The detailed alignment may be calculated using dy- namic programming within the region that is chained, however to limit the computation required for rening the alignment we implement an additional index of local minimizers that are used to rene an alignment once a coarse-chaining has been done. The local index is parameterized by k l < k,w l < w, and a tiling lengthW l , and is composed of a collection of separate minimizer indexes for sequence intervals of lengthW l tiling across both the query and the target. 2.2.2 Clustering Although CG-SDP can be applied to all anchorsA, for eciency a greedy approach is used to cluster anchors that would likely be together on an optimal chain. These clusters may be used to lter out spurious matches in low accuracy reads, or may be chained directly on high accuracy reads and contig mapping. When forming alignments from chained clusters, it is necessary to have a cluster rening step that divides rough clusters into non-overlapping ne clusters to avoid chaining that skips biological variation in repetitive sequences. 2.2.2.1 Roughclustering Rough clustering partitions anchors into clusters representing approximate intervals on the query and target that are aligned (Fig 2.1a), and serves to exclude noisy anchors unlikely to be chained in an alignment by CG-SDP. Denoting the forward diagonal of each anchor i asf i = y i x i , and the reverse diagonal r i = x i + y i , a sorted anchor order O = [o 1 ;:::;o n ] is dened by ordering anchors by forward diagonal and then x coordinate. A reverse sorted order O rev = [o rev 1 ;:::;o rev n ] is similarly dened sorting on reverse diagonal and x coordinate. This will be 16 used to detect alignments on the reverse strand, but because the operations are the same as on the forward strand, only subsequent steps using the forward sorted order are given. The set of rough clusters is dened by partitioningO into non-overlapping intervals such that every anchor indexed in an interval has a diagonal withinD R of the preceding anchor in the interval. Intervals are greedily assigned with rst interval starting at the rst index inO, and subsequent intervals starting on after a gap of more thanD R between anchors. Intervals with few elements (dened by aminClusterSize parameter) are discarded, and the rough clustersR =fR 1 ;:::;R NR g are dened from the set of anchors included in each interval. The value ofD R is chosen so that rough clusters are likely to contain at leastminClusterSize true anchors from a read (default to 3 for CLR and ONT, and 10 for HiFi/contigs). For CLR and ONT data, we empirically determined a sucientD R is 200, and 150 for HiFi and assembly contig alignment. For low accuracy reads, chains are formed by running CG-SDP on all matches retained in rough clustering. For high accuracy reads, the clusters must be post-processed with ne-clustering prior to being chained. 2.2.2.2 Fineclustering For mapping hich accuracy reads, each rough cluster is processed independently by dividing into non overlapping ne clusters, where each ne cluster consists of anchors on a close diagonalD F , with endpoints that do not overlap. The rst step of CG-SDP will be applied to chain the ne clusters and nd an approximate alignment betweenq andt. Each ne clusterC j is dened by all of the anchors contained in the cluster, and endpoints (x s j ;y s j ) and (x e j ;y e j ), wherex s j ,y s j are the minimumx,y coordinates of the starting points of all the anchors inC j , andx e j ,y e j are the max- imumx,y coordinates of the ending points of all the anchors inC j . To dene the ne clusters, anchors in each rough cluster are rst sorted by Cartesian coordinate. Within each rough cluster, 17 an anchor is dened as unique when the k-mer of the match is not repeated in the cluster. Fine clusters are initialized as runs of unique anchors in the Cartesian ordering that are on a close diag- onal, and the distance between the end of one anchor and the start of the next is small (Fig 2.1b). Every pair of ne clusters C j and C k are merged if their endpoints have diagonal dierences smaller thanD F and are within Cartesian distanceG dist (Fig 2.1c), and all non-unique anchors within the trapezoid dened by (x e j ;y e j D F ); (x e j ;y e j +D F ); (x s k ;y s k D F ); (x s k ;y s k +D F ) are included into the merged ne cluster (Fig 2.1d and e). The remaining non-unique anchors that are not added into the ne cluster are discarded. In the step of ne clustering, we empirically found D F = 500 was able to distinguish clustering anchors in dierent tandem repeats and allowed a sucient number of repetitive anchors to be included in the ne clusters. 2.2.3 Clustersplittingandchaining Each ne clusterC j 2C denes a super-fragmentF j that unlike the minimizer fragments which have starting and endpoints along the same diagonal, the endpoints ofC j and may not be along a diagonal. The set of all such fragments isF , and an optimal chain of fragments will be dened using CG-SDP. However, considering the rectangles dened by the boundaries of each fragment, the CG-SDP algorithm only selects fragments with non-overlapping rectangles in the optimal chain. Due to the repetitive nature of genomes, this may result in erroneously skipped fragments. To account for this, ne clusters are split at overlapping boundaries (Fig 2.1g). The start and end coordinates of all ne clusters are stored in a set that is queried to nd boundary points appearing in the range of each ne cluster. The coordinates of each split cluster are set according 18 to the rst/last anchors appearing after/before the boundary point. An optimal chain of fragments F SDP 2F with corresponding clustersC SDP are then found using CG-SDP onF . 2.2.4 Clusterrenement The optimal chain of super-fragmentsC SDP contains the anchors from which the optimal align- ment will be dened. A pairwise alignment may be created using dynamic programming on the substrings between the minimizer matches, however for high-error rate reads matches are sparse and the length of substrings may be too large to eciently compute, or have too large of a diago- nal gap to use banded alignment. A second anchoring step using the local minimizer index is used to detect shorter and more dense anchors. The local minimizer index containsd jtj W l e andd jqj W l e separate minimizer indexes for the target and query sequences that index each tiling substring of lengthW l , accounting for the relative positions of the substrings in each sequence. Every pair of substrings fromq andt that have an anchor inC OPT are compared using their local minimizer indexes to generate a resulting set of anchors A local . To reduce runtime of CG-SDP on A local , anchors that are adjacent in Cartesian sorted order and on the same diagonal are merged. The length of any merged anchor is the dierence from the last endpoint to the rst starting point. The resulting merged fragments are chained using CG-SDP, and the anchors on this chain are denoted asA localopt . 2.2.5 Alignmentrenement Banded alignment is used to create a pairwise alignment between anchors inA localopt . We as- sume that large gaps between anchors may be modeled using an ane alignment that allows 19 Figure 2.1: Example of clustering anchors prior to CG-SDP. a, Two rough clusters (blue and orange), which are far from each other on the reference. b, The initial four ne clusters dened from the contiguous stretches of unique anchors. c, ne cluster-1 and ne cluster-2 are merged because their diagonal dierence is smaller than D F and projected distances between their endpoints is smaller thanG dist . ne cluster-3 and ne cluster-4 are not merged due to the large diagonal dierence. d, Non-unique anchors in the trapezoid between ne cluster-1 and ne cluster-2 are added to the merged ne cluster-1, along with non-unique anchors in the trapezoid dened by the start of rough cluster-2 and the start of ne cluster-3. e, Three ne clusters are obtained after rough clustering and ne clustering. f-h, Splitting of overlapping ne-clusters: f, overlap of clusters. g, Boundaries of split clusters dened by a start (red dot) and an end (blue dot). h, the optimal chain of split super-fragments. 20 a single large penalty-free gap. Given two sequencesq local andt local , a match matrixM, a gap penalty , and an alignment band B, assume that q local < t local B. A lower-score ma- trixS lower is calculated for a typical banded alignment bandB and gap penalty, e.gS lower i;j = maxfS lower i1;j1 +M q local i ;t local j ;S lower i;j1 ;S lower i1;j g ifjijj B,1 otherwise. Next, an upper-diagonal matrixS upper i;j is calculated that allows for a single transition from the lower matrix with banded alignment. Denoting the length of theq local andt local asl q andl t : S upper i;j = 8 > > > > > > > > < > > > > > > > > : maxfS upper i1;j1 +M q local i ;t local j ;S upper i;j1 ;S upper i1;j g ifij (l t l q )x j ,y i >y j (conversely (x j ;y j ) is below (x i ;y i )). The chaining score is dened by Equation 2.1 wheregap( C j ; C j +1 ) is a concave function of (y s C j+1 x s C j+1 ) (y e C j x e C j ) , the dierence between the forward diagonals of the endpoint of C j and the starting point of C j +1 . The naive way to solve this problem takesO(n 2 ) in time. By applying CG-SDP [32], the runtime isO(n log(n) 2 ). Below, the solution provided by Eppstein, Galil, and Giancarlo is described both 21 with increase clarity from the original description, with a modication that enables calculation with synchronous computation. Finally, the method is extended to allow for inversions, similar to [12]. 2.2.6.1 SDPalgorithmwithconcavegapcostfunction-deningsubproblems For simplicity, assume both sequences are the same lengthl and that all points are in [0;l) (e.g. shifted by the minimumx andy coordinate), and are on all grid. To speed the chaining al- gorithm, the search for the fragment that precedes another on an optimal chain is divided into multiple overlapping subproblems that may be solved independently and more eciently than the naive scan, and the globally optimal score for each point is selected from each of the subproblems that overlap it. Each subproblem divides a block oft sub columns or rows of the search space into anA-part andB-part coveringd l sub 2 e andb l sub 2 c columns/rows respectively, whereA-part con- tains all the endpoints andB-part contains all the startpoints in the corresponding columns/rows. When the size of a subproblem is only one column/row, theA-part of the corresponding subprob- lem is set to be empty. Each subproblem is described by the labeld2fcolumn;rowg, the starting and ending rows and columns of the subproblemA:s,A:e,B:s,B:e, and a set of lists collectively referenced asDATA that are used in the calculation of optimal chains. The full set of subprob- lemsSub(d;s;e) are generated recursively as: 8 > > > > > > > > < > > > > > > > > : f(d;;;;;s;e;DATA)g ifs ==e; f d;s;b s+e 2 c;b s+e 2 c + 1;e;DATA [Sub(d;s;b s+e 2 c)[Sub(d;b s+e 2 c + 1;e)g otherwise (2.3) 22 The more detailed pseudocode for this partitioning is given in S1 Algorithm. Fig 2.2 visualizes the column and row subproblem division of a simple case with six fragments. Each subproblem is processed by nding the optimal endpoint from theA-part that precedes each starting point in correspondingB-part. For a starting pointp i that is assigned to theB- parts of a set of column-based subproblems, the union of theA-parts of those subproblems form the entire plane to the left of the point. Similarly for a starting point assigned to theB-part of a set of row subproblems, the union of theA-parts of those subproblems form the plane below the point. After solving for the optimal preceding endpoint in all of the subproblems in whichp i is contained, the one with maximum score among these endpoints is the global optimal, which represents the optimal chain from all endpoints below and to the left ofp i . Once the subproblems are dened, the list elements ofDATA are allocated and initialized. The following elements are associated with theA-part of a column/row subproblem: • D I : The forward diagonals in increasing/decreasing order overlapped by endpoints in the A-part. • D V : The optimal chaining score for diagonals overlapped by endpoints inA-part. D V [s] holds the optimal value of forward diagonalD I [s]. • D P : Index of point with optimal score along diagonal.D P [s] references the point with the best score ofD V [s]. The following elements are associated with theB-part of a column/row subproblem: • E I : The forward diagonals in increasing/decreasing order overlapped by starting points in theB-part. 23 • E V : Similar toD V , the locally best chaining value. • E P : Index of the forward diagonal inD I that leads to the best chaining valueE V [s] for the forward diagonalE I [s]. • E B : A block structure used in the calculation ofE V . • E L : Index of the most recent value inD V used to compute values inE V . For each point, we have two lists that reference the subproblems a point is contained in: • SA, A list of subproblems that contain a point that is in anA-part of a subproblem. • SB, A list of subproblems that contain a point that is in aB-part of a subproblem To make the description of the method more concise, the operation'(D I ;f) and'(E I ;f) are dened to give the index of diagonalf inD I andE I , respectively. 2.2.6.2 SDPalgorithmwithconcavegapcostfunction-conqueringsubproblems The original description of CG-SDP, uses asynchronous processing during solving subproblems [32]. Here we give an alternative description of the algorithm, and provide an approach to solve subproblems in a way that allows synchronous processing of points in Cartesian order for a more simple implementation. The framework for solving a subproblem is rst described, and then an order of processing points is given to solve for optimal fragment chaining. For any subproblem, the optimal chains between endpoints in theA-part to starting points in theB-part are found using theD I ,D V ,D P ,E I ,E V ,E P arrays, variableE L and block listE B . Consider two pointsp a inA-part, andp b inB-part that are on diagonalsf a andf b , respectively. An invariant for any subproblem is that if the optimal chain ending with the fragment that has 24 0 1 2 3 4 5 6 7 8 9 10 11 12 0 1 2 4 3 5 6 7 A0 B0 A1 B1 A2 B2 c c c c c c 8 A0 r B0 r 1 2 3 6 5 9 11 12 4 8 7 10 DI 1 DV 2 DP 6 E I -6 -5 E V 0 0 E P -1 -1 A0 c B0 c ( ) , DI DV DP E I E V E P A2 c B2 c ( ) , 0 -5 0 2 2.8 2 8 -4 -6 -1 2 Start 1 3 4 5 7 11 P sub,ind leaf leaf leaf End 2 6 8 9 10 12 P V 2 2 2 B1 c ,2 B0 r ,1 2.8 2.8 2.8 9 10 0.8 B2 c ,1 2.8 DI DV DP E I -4 -1_1 E V E P A1 c B1 c ( ) , 0 2 DI DV 2 0 DP -1 E I E V 0 0.8 E P -1 A0 r B0 r ( ) , 0 -4 -6 1 -1 -5 0.8 2 2 0 0 0 0 -1 -1 0.8 1 2 8 E L -1 E L 0 E B (-1, -1) E B (0, 0) E L 0 E L 1 E B (0, 2) E B (0, 1) (1, 2) Figure 2.2: Visualization of subproblems divisions. The data structures needed for each subproblem: D I , D V , D P , E I , E V , E P , E L and E B after the processing of all the 12 points. Points are numbered in the order of processing (Cartesian sorted order). 12 points are assigned into three column subproblems (A c 0 ;B c 0 ), (A c 1 ;B c 1 ), (A c 2 ;B c 2 ) and one row subproblem (A r 0 ;B r 0 ), where starting points are assigned to A-part and endpoints are assigned to B-part. Leaf sub- problems are not shown for simplicity. Start and End are used for the traceback of the op- timal chain. Start stores sub - the index of the subproblem which yields the optimal chain- ing score up to a starting point and ind - '(E I ;f i ), where f i is the diagonal of the starting point. End stores the optimal value for each endpoint. For this toy example, the match bonus of every fragment is 2 and the gap cost function of appending fragment j to fragment i is gap( i ; j ) = 0:25log (y e i x e i ) (y s j x s j ) + 1 + 1, where (x e i ;y e i ) is the endpoint of i and (x s j ;y s j ) is the starting point of j .End indicates that there are three optimal chains achiev- ing the optimal value: 2.8. By tracing back, chain-1: point-1, point-2, point-3, point-6, chain-2: point-1, point-2, point-5, point-9 and chain-3: point-7, point-10, point-11, point-12. endpointp a has been solved and is referenced asScore(p a ), the score of chainingp a withp b is Score(p a ) +gap(jf b f a j). Because the gap cost only depends on diagonal and not the coordi- nates of a point, all points inA-part on the same diagonalf a will have the same cost to chain with p b . More generally, the score to chain any starting point inB-part to an ending point inA-part is equal to the score of chaining a diagonal inB to a diagonal inA. The score of an optimal chain up to pointp b inB-part,E V [j], wherej ='(E I ;f b ), is found for column-based problems as: 25 E V [j] = max k:D I [k]E I [j] D V [k] +gap(jE I [j]D I [k]j) (2.4) For row-based subproblems: E V [j] = max k:D I [k]>E I [j] D V [k] +gap(jE I [j]D I [k]j) (2.5) The naive approach to solve for all values of E V scales quadratically as O(jE V jjD V j). A problem of this form: E[j] = min kj D[k] +w(k;j); (2.6) wherew is a concave function andD[k] is a linear transformation ofE[k], was solved inO(jDj logjEj) time using the auxiliary block data structureE B [35, 40]. In this solution, iteration is performed with one pass over theD array. After each iterationi, the invariant holds: each element inE has been set to reference which element of the prexD[1;:::;n] gives the minimum value for Equation (2.6). Eciency is gained by not updatingE explicitly but by storing indexes in the data structureE B that may be updated in log(jEj) time by applying a functionUpdate(D[i];E B ) on each iteration. The data structureE B has an operation that can reconstructE,E[k] = (E B ;k). The details of E B and Update may be derived from [40]. In the application to CG-SDP, D is replaced byD V andE byE V , and Equation (2.4) and Equation (2.5) solve for the optimal chains between diagonals from anA-part of a subproblem to diagonals in theB-part of the column and row subproblem respectively. An important detail in our implementation is that in column-based 26 problems anUpdate operation only aects references inE V that are on a greater or equal diago- nal than the current element inD V , and for row-based problems elements inE V are only aected for diagonals that are less than the current element inD V . However, it is not possible to apply Equation (2.4) and Equation (2.5) directly. Because points are processed in Cartesian order, but D V in forward diagonal order, values ofD V are not solved in increasing order and not all values ofD V are guaranteed to be solved when values ofE V are computed. In [32] this is accounted for by asynchronous computation. Below, an approach is described to solve with a standard model of computation by callingUpdate(D V [i];E B ) using subsets ofD V as they become known. Con- trary to the customary model of divide and conquer, where subproblems are completely solved before combining into a global solution, this solves portions of subproblems on each iteration. Points are processed in order ofx and theny coordinate, determining the value of the optimal chain up to the current processed point. The value of a starting point is the value of the optimal chain that chains the starting point to an endpoint below it. When no endpoints are below a starting point, the value of this starting point is trivially set to 0. The value of an endpoint is simply the value of the optimal chain preceding the corresponding starting point plus the value of the fragment. When processing an endpointp i , the starting pointp s of the corresponding fragment i has been solved because points are processed in Cartesian order. The value of the chain at the end- point isScore(p s )+l i , wherel i is the match bonus of fragment i . In order forp i to be used when solving for starting points that are above it, the value ofD V must be updated in subproblems for whichp i is a point in theA-part. TheSA list is used to determine which subproblems contain p i in aA-part. Suppose the forward diagonal ofp i isf i . In each subproblem inSA,Score(p i ) is 27 passed toD V [j] andD V [j] is set to maxfScore(p i );D V [j]g, wherej = '(D I ;f i ), andD P [j] is set toi ifD V [j] gets updated. When processing a starting pointp i , the optimal value must be calculated in each of theB- parts of subproblems that includep i , and the global optimal value will be selected among them. For any subproblem, this can be achieved by solving forE V [j] using Equation (2.4) for column- based or Equation (2.5) for row-based subproblems, wherej = '(E I ;f i ) andf i is the diagonal ofp i . For a column-based subproblem, this requires that the values ofD V [k] have been solved whereD I [k] f i , and for row-based subproblems, the values ofD V [k] must have been solved whereD I [k] > f i . These correspond diagonals overlapping points that fall in the region below and to the left of the p i and are exactly what have been processed when solving for points in Cartesian sorted order. Thus all required values inD V arrays (though not the entireD V array) are available when solving forp i . The value ofE V [j] is optimal onceUpdate(D V [k];E B ) has been called on all diagonals that contain potential predecessors to points on diagonal E I [j]. The function Update(D V [k];E B ) must be called only once per element inD V and in increasing order. However because points are processed in Cartesian sorted order, values of E V are solved in arbitrary order, and call- ing Update(D V [k];E B ) can reference elements in D V multiple times. To account for this, we maintain an index E L for each subproblem that keeps track of the last element of D V which has been processed byUpdate(D V [k];E B ). Before solving any points,E L is initialized to -1 in every subproblem, and when processing points will be assigned to reference the most recently updated diagonal fromD V . When processing starting pointp i in a column-based subproblem, E V [j] must be solved, wherej ='(E I ;f i ). IfD I [E L ]>f i ,Update has been called on all values ofD V thatE V [j] relies on, and the state ofE B contains the optimal value forE V [j] such that 28 it may be calculated immediately from (E B ;j). Otherwise, Update(D V [k];E B ) is called for E L < k < C, whereD I [C] is the rst diagonal from the left that is larger thanf i in arrayD I , andE V [j] may be calculated from (E B ;j). Similarly, for row-based subproblems iff i D I [E L ], the value ofE V [j] may be assigned immediately from (E B ;j) wherej ='(E I ;f i ). Otherwise, Update(D V [k];E B ) is called forE L < k < C, whereD I [C] is the rst diagonal from the left that is smaller than or equal to f i in array D I , and E V [j] can be retrieved from (E B ;j). By comparing valuesE V [j] from all the subproblems in theSB list of startpointp i , the maximum value forp i will be obtained and stored. The pseudocode and detailed example of solving points and conquering subproblems are given in the S2 and 5.1.3.1. 2.2.6.3 Extensiontoinversioncases This extension is inspired by the work [12]. As explained in the previous section, when chaining fragments in forward direction, two points - a lower left start point (x s1 i ;y s1 i ) and a upper right endpoint (x e1 i ;y e1 i ) would be associated to each fragment i . (x s1 i ;y s1 i ) can be chained to an endpoint below and to the left of it and (x e1 i ;y e1 i ) can be the predecessor of a starting point above and to the right of it. To allow inversions to happen, fragments must be allowed to be chained in the reverse direction. To account for this, we associate two more points to each fragment i , that are a upper left start point (x s2 i ;y s2 i ) and a lower right endpoint (x e2 i ;y e2 i ). The start point (x s2 i ;y s2 i ) can only be chained to endpoint (x e2 j ;y e2 j ) of some other fragment j that satises x e2 j <x s2 i ;y e2 j >y s2 i . The endpoint (x e2 i ;y e2 i ) can precede a starting point (x s2 k ;y s2 k ) of some other fragment k that satisesx s2 k > x e2 i ;y s2 k < y e2 i . (x s1 i ;y s1 i ), (x e1 i ;y e1 i ), (x s2 i ;y s2 i ) and (x e2 i ;y e2 i ) can be represented as s1(i), e1(i), s2(i) and e2(i) respectively. Chaining fragment i to fragment j in the reverse direction equals to chaining the start points2( i ) to the endpointe2( j ). The 29 cost of appendings2( i ) tos2( j ) in the reverse direction isgap rev ( j ; i ), which is a concave function of (x e2 j +y e2 j ) (x s2 i +y s2 i ) , the dierence between the reverse diagonals of fragments i and j . In a short word, s2() and e2() of each fragment would be responsible for the possible chaining of in the reverse direction, whiles1() ande1() would be responsible for the forward direction. We used the same column and row subproblem dividing scheme to sort alls1 ande1 points and assign them into column-1 and row-1 subproblems. Then alls2 ande2 points are sorted and assigned to column-2 and row-2 subproblems in the same way. ArraysD I ,D V ,D P ,E I ,E V ,E P , the block structureE B and variableE L are allocated and initialized for each subproblem. Lists SA 1 andSA 2 reference column-1/row-1 and column-2/row-2 subproblems that each endpointe1 ande2 is in respectively. Similarly, listsSB 1 andSB 2 references column-1/row-1 and column- 2/row-2 subproblems that starting pointss1 ands2 in respectively. The only dierence between column-2/row-2 and column-1/row-1 subproblems is that column-2/row-2 stores reverse diagonal instead of forward diagonal inD I andE I arrays. The steps to solve subproblems to allow chaining in both forward and reverse directions are highly similar to section SDP algorithm with concave gap cost function - conquering subprob- lems. Points are processed in order ofx and theny coordinate. When a starting points1( i ) is being processed, E V [j], wherej = '(E I ;f i ) andf i is the forward diagonal, will be computed from column-1/row-1 subproblems in SB 1 . The maximum E V [j] of those subproblems is the value of optimal chain up tos1( i ),Score(s1( i )), wheres1( i ) is forwardly chained to an end- pointe1( j ) below and to the left of it. Similarly, when a starting points2( i ) is being processed, E V [j], wherej ='(E I ;r i ) andr i is the reverse diagonal, will be computed from column-2/row-2 subproblems inSB 2 . The maximumE V [j] of those subproblems is the value of optimal chain up 30 tos2( i ),Score(s2( i )), wheres2( i ) is reversely chained to an endpointe2( k ) above and to the left of it. After solvings1( i ) ands2( i ) for fragment i , the value of the optimal chain up to fragment i can be calculated asScore( i ) = maxfScore(s1( i ));Score(s2( i ))g +l( i ), wherel( i ) is the match bonus of fragment i . This optimal chain is chosen from all the possible chains that fragment i is forwardly or reversely chained to the predecessor in the left. When solvinge1( i ), Score( i ), will be passed to arrayD V [j], wherej ='(E I ;f i ) andf i is the forward diagonal, to column-1/row-1 subproblems inSA 1 . AndD P [j] will be updated to the index of pointe1( i ), if Score( i )>D V [j]. Similarly, whene2( i ) is being processed,Score( i ) will be passed to array D V of column-2/row-2 subproblems inSA 2 andD P will be updated. Therefore, the addition of two pointss2() ande2() for each fragment make it possible to allow to be chained in the reverse direction. Meanwhile, the overall time complexity and storage remain bounded byO(n(log(n) 2 )), wheren is the total number of fragments. 2.2.6.4 Timecomplexity Assume there aren fragments in total, listSB/SA containsO(log(n)) subproblems that a point is in. In section SDP algorithm with concave gap cost function - conquering subproblems, we men- tion that when a starting pointp i with forward diagonalf i is being processed,Update(D V [k];E B ) is called forE L <k<C, whereD I [C] is the rst diagonal from the left that is larger than/smaller than or equal tof i in column/row subproblems. AndE V [j] can be retrieved inO(log(n)) time from the block structureE B by calling (E B ;j), wherej = '(E I ;f i ). ProcedureUpdate may be called several times for a starting point in each subproblem. In order to make it easy to quan- tify the total time complexity ofUpdate procedures, we consider that eachUpdate procedure is 31 called right afterD V [j] is updated by some endpointp i with diagonalf i , wherej = '(E I ;f i ). When an endpoint is being solved, there are O(log(n)) subproblems associated with it and in each subproblemUpdate takesO(log(n)) to conduct. Therefore, it takesO((log(n)) 2 ) time to solve the subproblems that are associated with an endpoint. When a starting point is being pro- cessed, there areO(log(n)) subproblems it is in and in each subproblemE V [j] can be computed from the block structureE B inO(log(n)) time. Therefore, it takesO((log(n)) 2 ) time to tackle subproblems that are associated with a starting point. Since there aren fragments in total, the time complexity of processing all the points and subproblems is bounded byO(n log(n) 2 ). 2.3 Results 2.3.1 Evaluationoflraalignments We compared alignment metrics and variant discovery on simulated data sampling from the hu- man genome build GRCh38, and real sequencing data from HG002 including three sequencing datasets: PacBio consensus reads (HiFi), PacBio single-pass reads (CLR), and ONT reads, as well as on contigs from a haplotype-resolveddenovo assembly [17]. The simulated data were mapped to GRCh38 to evaluate mapping accuracy with a genome that includes centromeric sequences. All real data were mapped to build GRCh37 to use curated variants for accuracy analysis [107]. The PacBio HiFi data are characterized by high-accuracy reads (>99%) with an average length of 19kb, compared to the CLR reads that have an accuracy around 80% and an average read length of 21kb. The ONT data: 88% accuracy and 1.2kb average read length. All read datasets are above 40 coverage of a human genome. 32 The alignment results and structural variant callsets were compared to those generated by minimap2 and ngmlr. When computing SAM formatted alignments with minimap2, the lra and minimap2 runtime are comparable. lra and minimap2 reach similar speed on HiFi and ONT data, although all are within a factor of 1.12. minimap2 is 3.7 times faster than lra on CLR dataset (Table 2.1). lra is 4-6 times faster than ngmlr. The dierence of bases aligned by lra and min- imap2 are within 0.06-1%. We compared the mapping accuracy of lra, minimap2 and ngmlr on simulated HiFi, CLR data for read lengths between 5-50kb and ONT data for read lengths between 1-50kb. HiFi and CLR reads were simulated by PBSIM (https://github.com/pfaucon/ PBSIM-PacBio-Simulator) and ONT data were simulated by alchemy2, which is distributed with lra source. We used paftools.js mapeval to evaluate the mapping accuracy across three overlap percentages (10%, 40%, 70%) to measure the change of mapping accuracy with overlap percent- age evaluation metric. The mapping accuracy of lra for reads with mapping quality at least 20 is 99.89%, 99.97% and 99.97% for HiFi, CLR and ONT respectively, when a 40% overlap with the simulated interval is required for a correct alignment (Fig 2.3). As the overlap percentage goes up from 10% to 70%, the mapping accuracy of lra alignment decreases 0.1%, 0.09%, 0.03% for HiFi, CLR and ONT at mapqv 20 due to dierences in lengths of sequences aligned in repetitive regions. In lra, a second local minimizer index is used to rene chained anchors and improve alignments (Methods). We evaluated the mapping accuracy of lra without the step of rening by local min- imizers ( 5.1.4). For simulated HiFi data, the mapping accuracy has almost no dierence on all 10%, 40%, 70% overlap percentage metrics, while mapping accuracy at the 70% dropped below 0.90 and 0.97 for simulated CLR and ONT reads, respectively. 33 Table 2.1: Performance of alignment methods. HG002 HiFi HG002 CLR HG002 ONT HG002 HiFi hiasm hap1 HG002 HiFi hiasm hap2 lra runtime 942m 7428m 1528m 105m 90m memory 12.01G 16.46G 13.96G 31.10G 28.66G # of mapped reads 1.84M 4.95M 1.37M 396 411 # of mapped bases 35.85Gb 86.82Gb 25.45Gb 2.87Gb 2.98Gb minimap2 runtime 890m 1973m 1358m 100m 82m memory 19.04G 18.49G 22.88G 20.95G 21.90G # of mapped reads 1.85M 5.06M 1.41M 492 496 # of mapped bases 36.20Gb 87.36Gb 25.73Gb 2.92Gb 3.01Gb ngmlr runtime 5087m 33047m 8862m - - memory 13.6G 14.67G 17.00G - - # of mapped reads 1.78M 4.84M 2.07M - - # of mapped bases 34.59Gb 83.75Gb 24.22Gb - - Each dataset was aligned to GRCh37 by all methods with 16 threads, with - indicating a software crash. Total CPU time is reported. The optimal values in each class are given in bold. The HiFi hiasm assembly is a haplotype-resolved de novo assembly of HG002 using HiFi reads, with haplotype N50 values of 50.55Mb and 42.92Mb. 0 20 40 60 mapping quality 0.98 0.99 1.00 accuracy a overlap percentage: 10.0% 0 20 40 60 mapping quality 0.98 0.99 1.00 accuracy b overlap percentage: 40.0% 0 20 40 60 mapping quality 0.98 0.99 1.00 accuracy c overlap percentage: 70.0% Figure 2.3: Mapping accuracy. Mapping accuracy of lra, minimap2 and ngmlr on simulated HiFi, CLR reads for lengths between 5-50kb and ONT reads for lengths between 1-50kb. Simulated reads were mapped to genome GRCh38. A read is considered as correctly mapping if the reported mapping interval has 10%; 40%; 70% overlap with the truth interval. paftools.js mapeval was used to evaluate the mapping accuracy. 2.3.2 Variantdiscovery A common application of LRS in human genetics is detecting structural variation [15, 89]. We evaluated SV detection from mapped reads across combinations of technologies, aligners, and SV calling algorithms on both simulated datasets and real dataset. Reads were aligned with lra, pbmm2/minimap2, and ngmlr, and variants were detected with pbsv, Snies and cuteSV [104, 89, 45]. The pbmm2 method encapsulates minimap2 with technology and SV discovery specic 34 parameters. The pbsv method did not run using ONT reads, so pbsv analysis of ONT data is omitted. 2.3.2.1 SVanalysisofsimulationreads In the simulation study, we used SUVIVOR (https://github.com/fritzsedlazeck/SURVIVOR) to sim- ulate insertions, deletions, inversions and more complicated nested SVs, including deletion-inversion- deletions (INVDEL) and inverted duplications (INVDUP). We simulated 195 insertions and dele- tions of lengths between 50-10000 bases, 97 inversions of lengths between 600-2000 bases and 100 INVDEL and INVDUP of lengths between 600-1000 bases respectively using the human genome GRCh37 chromosomes 20-22 as a reference. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2. Truvari (https://github.com/spiralgenetics/truvari) was used to benchmark the insertions, deletions and inversions callset. Both pbsv and cuteSV calling results show similar F1 score (>= 0.974, 0.979, 0.974) for lra, minimap2 and ngmlr align- ments on insertions and deletions over all data types. All three aligners achieved comparable F1 scores (>= 0.97) on inversions from cuteSV calling results over all data types ( S7 Table). There was a lower F1 score for variants detected by Snies on lra alignment of the simulated CLR reads, which is inconsistent with other two callers. Complicated nested SVs like INVDEL presented a greater challenge, particularly for lower accuracy reads. Using simulated HiFi reads, 98 of the IN- VDEL and 94 of the INVDUP were correctly called by lra, with all alignment methods detecting at least 93 variants in each data set. The majority of the INVDEL variants detected by lra align- ments using CLR (79) data missed one deletion, while variant calls from minimap2 and ngmlr correctly detect 94 and 74 INVDEL variants, respectively. A dierent pattern existed for ONT data, where 1-2 correct calls were detected from minimap2/ngmlr, and 37 by lra. The INVDUP 35 calls were successfully detected from lra and ngmlr alignments, and represented as insertions by minimap2. The lra alignments resulted in 99 and 100 calls from the CLR and ONT simulated reads, and ngmlr 98 calls from both datasets ( S8 Table). 2.3.2.2 SVanalysisoflongreads We used pbsv and Snies to detect variation on real HiFi, CLR, and ONT data; the cuteSV vari- ant calls had lower accuracy on PacBio data across multiple alignment methods ( S11 Table) and were excluded from analysis. The majority of SVs are under 500 bases [15] and are spanned by LRS alignments. Consequently, variant calls with similar representations of gaps should have converging callsets, which may be conrmed by comparing SV calls from dierent combinations of algorithms. The most consistent variant calling was found using pbsv on both types of PacBio using lra and pbmm2 alignments. These callsets ranged from 21,439-22,325 SVs, with a dierence of 2.6-3.3% between alignment algorithms and 0.7-1.5% between data types (Table 2.2). Compared to the pbsv calls, there was greater variation between callsets generated using Snies across both data types and alignment algorithms, with a range of 16,689-27,001 SVs per callset. There was less variation between the size of callsets of dierent data types and the same alignment algo- rithm (average dierence of 624 across pbsv callsets), compared to dierent alignment algorithms applied to the same data (an average dierence of 2,290 across pbsv callsets). The dierences be- tween callsets not due to gap placement were broadly measured by comparing callsets with a low-stringency lter: within 50% of length and at most 1kb apart. Using this denition of over- lap, the callsets from PacBio data using pbsv were most similar with no more than 20% of calls unique to an alignment method. When stratifying PacBio variants unique to a dataset overlap 36 with genomic features, 29-46% of variants overlap tandem repeats, and 15-56% overlap segmen- tal duplications (SD) ( S6 Table). This indicates that calls in tandem repeats may have dierent breakpoints and are dicult to compare using standard overlap approaches [107]. A greater call count may reect more sensitive detection of SV, or simply fragmentation of variants. To assess the accuracy of variant callsets, we compared callsets against the GIAB Tier 1 calls [107] using Truvari. The lra and pbmm2 based callsets outperform ngmlr based calls in precision and recall on all data types (Table 2.3 and Fig 2.4). While the F1 scores are eectively equivalent between lra and pbmm2 based variant calls on HiFi (0.970 vs 0.968) and CLR (0.967 vs 0.967) using pbsv, there is a substantial improvement on calls on ONT data made by Snies (0.942 vs 0.910). This indicates that the greater call count on ONT data includes is at least partially attributed to increasing recall, particularly in larger (>500 base) insertions, without aecting precision on the high-quality regions that were ascertained. While the F1 scores are nearly equivalent on PacBio data, the combination of algorithms may contribute to a more complete evaluation of an LRS genome. Calls from the lra and pbmm2 alignments contribute 83 and 43 unique calls, respectively, that were annotated as true positives in the HiFi/pbsv call sets, and 134, 133 unique calls from the CLR/pbsv call sets. The average length of the uniquely called true positive SV among these callsets is 1986bp, highlighting the challenge of detecting longer SV from read based alignments. To further assess the potential for callsets to be more comprehensive, benchmark variants annotated as false-negatives were compared to variants discovered directly from read alignments. Each call where at least 20% of the reads overlapping the call had an SV of the same type with the ratio of lengths between 0.6- 1.4 was annotated as a supported false-negative (s/FN). Between 14.5%-83.2% of annotated false negatives from pbsv variant callsets, and 35.0%-93.6% of false-negative calls in Snies callsets 37 were considered s/FN. Furthermore, there were 13% more supported false-negative calls in the PacBio lra/pbsv callsets compared to the pbmm2/pbsv callsets, possibly the result of tuning the pbsv variant discovery algorithm to the input generated by a particular alignment algorithm. The accuracy of breakpoints was measured by comparing the boundaries of true positive SVs to the breakpoints of the GIAB truth data for each aligner/caller combination. Breakpoint accuracy was measured by the percentage of SVs with perfect breakpoint boundaries and the av- erage shifting distance between the left-most coordinate of SV boundaries. The average shifting distances of lra is 53bp across all data type and SV calling algorithms, compared to an average shifting distance of 46bp across all aligner/data/SV calling methods. The percentages of SVs with perfect breakpoints called from lra alignments ranged between 31-64% across all data and SV call- ing algorithms ( S9 Table), which indicates in addition to alignment, the accuracy of breakpoints also depends on SV caller. A comparison of callsets highlights how detecting inversions and SV in segmental duplica- tions remains challenging. Inversion calls were generated by Snies. We ltered calls that over- lapped the centromere. Each resulting callset was manually curated using dot-plots of the genome assemblies against the reference. Calls were classied as true positives if at least one haplotype demonstrated the inversion, an inverted duplication if the dot-plot signature was a xed inverted duplication or inverted transposition, false-positive if both haplotypes did not show an inversion or an inverted duplication, and NA if not possible to validate with the assembly, commonly in pericentric regions. Across all data types, ngmlr based alignments discovered the most inver- sions and inverted duplications annotated as true-positives, with calls ranging from 8077-44,538 bases (Table 2.4). On average, 75 inversions and inverted duplications were detected using ngmlr 38 alignments, versus 61 for both minimap2 and lra, indicating that additional development may be required to accurately detect rearranged DNA with minimal computational burden. 2.3.2.3 SVanalysisofassemblycontigs When sequencing depth is suciently high coverage, comprehensive haplotype-resolveddenovo assemblies can be used to detect variation instead of read alignments. Both lra and minimap2 were used to align contigs of a haplotype-resolveddenovo assembly of HG002 constructed by the hi- asm assembler [17]. The haplotype assemblies resolve 2.97Gb and 3.07Gb, with N50 values 50.55 and 42.92 Mb. Both alignment methods have similar runtime to align de novo assembly contigs to GRCh37, although lra requires 1.5-fold the memory. Over 95% of each haplotype assembly was mapped by each method; 2.92/2.97 Gb were mapped by minimap2, and 2.87/2.97Gb mapped by lra. It was not possible to use ngmlr to align contigs. Variants were detected from the lra and minimap2 alignments using htsbox pileup -cuf. Calls from dierent haplotypes that overlap at least 30% were determined as homozygous calls and the rest were classied as heterozygous calls. Additionally, calls in centromere regions were removed for both minimap2 and lra callsets. This generated 25,982 calls by lra, and 26,341 by minimap2 . Truvari analysis of the calls gives an F1 score of 0.955 by lra calls, and 0.933 for minimap2 calls (Table 2.5). There are 233 calls from the lra assembly callset missing from the true-positive annotations of the lra HiFi/pbsv callset. Upon inspection, the majority of these missed calls are due to a combination of the process used to merge haplotype calls into a diploid callset, shifted breakpoints relative to the GIAB annotation, and haplotype dropout/swapping of de novo assembly sequences. Diversity studies are increasingly using SMS de novo assemblies to measure variation [4, 27] rather than from sequencing reads. An integrated analysis of the regions of the genome spanned 39 by assembly alignments and the raw-read base calls from multiple alignment and SV discovery methods can help reveal the extent that variation is discovered and potentially missing from a high-quality haplotype-resolved de novo assembly. Across all read technologies, alignment algo- rithms, and variant discovery methods, 167 unique calls representing 0.6% of an average read- based SV callset were in regions not mapped by either haplotype of the assembly. Of these, 113 are in segmental duplications (SD). Because SDs make up roughly 5% of the human genome, the majority of SD regions that are mapped by SMS reads in this study are assembled in the hiasm genome. The lra alignments detect 53 and 60 large SV (>20kb) totaling over 3Mb from the hap- lotype 1 and haplotype 2 assemblies that are not found in the read SV callsets ( 5.1.5). Because the length of SV negatively correlates with allele frequency, and is associated with enrichment in genome-wide association studies signicant loci [96], these larger variants represent a class of variation that is biologically important for discovery. The high-condence callset used in Truvari analysis contains fewer than 10,000 SVs, less than half what are expected to be found in a human genome. To gauge the specicity of assembly- based calls outside this set, SV contained in read alignments were compared to the SV discovered from assemblies. An SV detected from an individual read supported an assembly based SV if the call was the same type (e.g. insertion or deletion), was within 1kb, and the ratio of SV lengths was between 0.5-2. When using PacBio CLR aligned reads, nearly all (99.998-100%) of SV annotated by Truvari as true-positives had at least four supporting reads aligned by lra or minimap2. Using this approach, 98.3% of lra deletion SV, and 98.7% of minimap2 deletion SV are supported by reads, and 99.1%/99.7% of insertion SV from lra/minimap2, are supported by reads (Table 2.6). This is greater than the precision measured by Truvari on assembly-based calls: 0.955 for lra, and 0.933 for minimap2, indicating an under estimate of the callset precision. When inspecting the SV calls 40 that are not supported by read alignments, many are due to dierential placement of gaps causing disagreement of SV length between the read and assembly-based calls. Table 2.2: Statistics of structural variant call sets. lra pbmm2 ngmlr Number of calls Average call size Total size Number of calls Average call size Total size Number of calls Average call size Total size pbsv CLR INS 13177 560 7.38M 12828 523 6.72M 11416 537 6.14M DEL 9148 688 6.30M 8926 603 5.39M 8569 594 5.10M HiFi INS 13133 577 7.58M 12655 488 6.19M 10342 506 5.24M DEL 9037 619 5.60M 8784 583 5.12M 8445 575 4.86M Snies CLR INS 16377 1074 17.59M 11592 785 9.10M 8318 600 5.00M DEL 10624 820 8.71M 9195 578 5.32M 8371 739 6.19M HiFi INS 12004 805 9.67M 10902 459 5.01M 9716 436 4.24M DEL 8883 1073 9.54M 7907 430 3.40M 7698 672 5.18M ONT INS 13052 1084 14.15M 12103 488 5.91M 10314 476 4.91M DEL 10129 752 7.63M 9238 550 5.08M 8286 637 5.28M Summary of SV callsets from dierent aligners (lra/pbmm2/ngmlr) + SV callers (pbsv/Snies). Table 2.3: Truvari classication of variant calls. pbsv Snies HiFi CLR HiFi CLR ONT lra pbmm2 ngmlr lra pbmm2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP base 9466 9408 8433 9438 9414 9076 8924 8700 8015 8969 8082 6533 9085 8887 8471 TP call 9506 9449 8500 9516 9507 9208 8964 8713 8028 9101 8147 6567 9102 8902 8488 FP 404 390 396 448 418 428 392 357 800 2023 1492 1298 550 940 815 FN 175 233 1208 203 227 565 717 941 1626 672 1559 3108 556 754 1170 s/FN 108 93 176 169 151 268 573 598 573 629 1423 2387 477 464 410 precision 0.959 0.96 0.955 0.955 0.958 0.956 0.958 0.961 0.909 0.818 0.845 0.835 0.943 0.904 0.912 recall 0.982 0.976 0.875 0.979 0.976 0.941 0.926 0.902 0.831 0.930 0.838 0.678 0.942 0.922 0.879 F1 score 0.970 0.968 0.913 0.967 0.967 0.952 0.942 0.931 0.869 0.871 0.842 0.748 0.943 0.91 0.895 Truvari comparisons between lra, pbmm2/minimap2 and ngmlr using the Genome in a Bottle benchmark SV set. Optimal results in each category are shown in bold. TP-base means true positive calls in the benchmark SV curation set, while TP-call means true positive calls in the SV set from each aligner. False positive means the number of non-matching calls from the SV set from each aligner. False negative means the number of non-matching calls from the SV curation set. 2.4 Discussion The initial description of SDP was given in 1992 in two publications, one covering an ane-cost gap function, and another with a concave-cost gap function [31, 32]. While there have been 41 Table 2.4: Classication of inversions detected in HG002 using PacBio and ONT reads Dataset Aligner TP Dup NA FP HiFi lra 30 21 8 26 minimap2 59 4 4 8 ngmlr 58 17 2 16 CLR lra 65 9 9 36 minimap2 69 1 9 20 ngmlr 66 17 9 17 ONT lra 53 7 7 18 minimap2 44 7 1 6 ngmlr 58 10 3 9 TP, true positive using manual curation. Dup, inverted duplication misclassied as an inversion on both haplotypes, NA not possible to manually curate, FP - no signature of inversion in read nor assembled haplotype dot-plots. Table 2.5: Assembly based calls comparison. HG002 HiFi hiasm assembly lra minimap2 Insertion, hom 6276 6494 Insertion, het 8852 8783 Deletion, hom 3518 3895 Deletion, het 7336 7169 TP base 9310 9035 TP call 9412 9062 FP 546 692 FN 331 606 precision 0.945 0.929 recall 0.966 0.937 F1 score 0.955 0.933 htsboxpileup-cuf is used to call SVs from contig lra and minimap2 alignment. Calls from lra and minimap2 assembly alignments are classied as homozygous and heterozygous by comparing calls from two haplotypes. Truvari comparison results between these two callsets are shown. many implementations of ane cost SDP, no sequence alignment methods have been imple- mented using SDP with a concave-cost gap function. In the original description of the algorithm, the processing of a starting point is blocked until all subproblems the point relies on are solved, 42 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls 0 300 600 900 1,200 1,500 Variant calls -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value a lra (HG002 HiFi) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value b minimap2 (HG002 HiFi) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value c ngmlr (HG002 HiFi) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value d lra (HG002 CLR) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value e minimap2 (HG002 CLR) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value f ngmlr (HG002 CLR) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value g lra (HG002 ONT) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value h minimap2 (HG002 ONT) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value i ngmlr (HG002 ONT) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value j lra (HG002 hifiasm assembly) -10,000 -7,500 -5,000 -2,500 -1,000 -750 -500 -250 250 500 750 1,000 2,500 5,000 7,500 10,000 Variant length (bp) 0% 20% 40% 60% 80% 100% Value k minimap2 (HG002 hifiasm assembly) htsbox_Precision htsbox_Recall sniffles_Precision sniffles_Recall pbsv_Precision pbsv_Recall Figure 2.4: The precision and recall of HG002 SV call sets. The precision and recall of SV call sets from HG002 HiFi, CLR, ONT, hiasm assembly, measured by Truvari using Genome in a Bottle benchmark SV callsets. and then the process is unblocked and processing resumes. We nd that this blocking and un- blocking strategy is not necessary, and the addition of data structures to keep track of the state of computation of subproblems enables solving the problem with a standard model of computation. We used two additional strategies to eectively employ SDP in lra: an iterative renement pro- cess where a large number of anchors from the initial minimizer search are grouped into a small super-fragments that are chained using CG-SDP, and once a rough alignment has been found, a 43 Table 2.6: Read based support of assembly calls. Assembly SV type total ltered supported fraction Haplotype 1 lra DEL 6101 5874 5767 0.982 Haplotype 1 minimap2 DEL 6078 5924 5841 0.986 Haplotype 2 lra DEL 6287 6061 5963 0.984 Haplotype 2 minimap2 DEL 6267 6107 6033 0.989 Haplotype 1 lra INS 9736 9354 9275 0.992 Haplotype 1 minimap2 INS 9882 9619 9587 0.997 Haplotype 2 lra INS 10032 9651 9568 0.991 Haplotype 2 minimap2 INS 10183 9890 9859 0.997 The two haplotype assemblies are considered separately to avoid complications of merging into a diploid callset. Calls are produced by lra and minimap2, with only deletion and insertion SV classes considered. The total calls are the original calls produced by each method, and ltered are calls excluding centromeres and 50kb of anking sequence. The supported SV have at least four reads supporting the call from either lra or minimap2 alignments. new set of matches with smaller anchors is calculated using the local minimizer indexes. As a result, alignment is of similar speed to state of the art algorithms, without the need for single- instruction multiple-data (SIMD) processing; runtime was slower when using an SIMD alignment library [95] possibly due to the overhead of invoking the library functions. The lra alignments have competitive runtime and memory usage compared to minimap2. Us- ing two dierent SV discovery algorithms, pbsv and Snies, we show it is possible to use lra alignments to discover SV using PacBio HiFi, CLR, and Oxford Nanopore reads, as well as di- rectly from aligned de novo assembly contigs. The performance for SV detection using PacBio reads and the pbsv algorithm is similar between lra and minimap2, with lra demonstrating a small gain in recall over larger SV events. Importantly, the availability of multiple alignment al- gorithms can help improve the results of studies that require high sensitivity and specicity, such as Mendelian analysis. The greater improvement in SV discovery metrics on ONT data aligned by lra additionally highlights the utility of using multiple algorithms to analyze sequencing data. 44 Finally, as de novo assembly of genomes becomes more routine, it is important to have accu- rate methods of SV detection from contig alignments. We used lra to align a haplotype-resolved genome assembled from PacBio HiFi reads and detect SV. When compared to the GIAB SV bench- mark data, the lra alignments show slightly lower precision and recall than read-based SV detec- tion. However, when the assembly-based SV are compared to SV detected in an orthogonal read dataset, nearly all (> 98.8%) variants discovered from assemblies are supported by read align- ments. This indicates few dierences between the HiFi based assembly and the reference are due to assembly error, and the annotated precision of the callset is likely an under estimate. A variant callset is eectively a list of operations that may be applied to the reference genome to recon- struct a sample genome. Because the HiFi assembly has few assembly errors on the same size scale as an SV, this supports the development of an alternative model for validating SV callsets in which the reconstructed genome is compared to the haplotype-resolved assembly, rather than by comparing callsets. This may be used to validate calls inside the high-condence regions dened where a haplotype-resolved assembly has been condently generated. 45 Chapter3 vamos: VNTRannotationusingecientmotifsets 3.1 Introduction Variable number-tandem repeats (VNTRs) are a class of repetitive DNA composed of short DNA sequences called motifs repeated many times in tandem. By convention, VNTRs are composed of motifs at least six bases; shorter motifs are classied as short tandem repeats (STRs). The repetitive nature of these sequences primes them for mutations through strand slippage during replication and unequal crossing over that increase or decrease motif copy number or introduce mutations of motifs [55]. Variation of VNTR sequences has been found to impact physiology and cellular function. Disease studies have found associations of VNTR length or composition with diabetes [99], schizophrenia [94], and Alzheimer’s [24]. Additionally, methods developed to analyze VNTR variation using high-throughput short read sequencing data found widespread association between VNTR length and gene expression [7, 61, 37]. Finally, variation directly in coding sequences are found to be linked with human traits including height and hair patterns [72, 10]. 46 The overall knowledge of genetic diversity in these sequences lags behind non-repetitive DNA, primarily due to diculties in genotyping VNTRs using short read data. The hypervariabil- ity of repeat counts and repeat composition among individuals makes it challenging to interpret VNTR variation from read alignments due to alignment degeneracy and reference bias. As a con- sequence, STR and VNTR sequences have been masked in many large scale short read studies using low complexity lters [5, 108]. Specic methods have been written to analyze repeat unit variation in STRs, however these methods achieve precise counts of repeat units only for loci shorter than the read length [34] or insert-size [23], or for specic repeat patterns [25]. Meth- ods developed to genotype VNTR variation using short-reads, only provide an estimate of motif count based on read depth inferred from genomic alignments [37], hidden Markov models [8, 7], and alignment to pangenomes [61]. Contrary to short-read sequencing, variation of VNTR loci is routinely resolved using long- read sequencing (LRS) [104, 89, 45, 16, 51] and assemblies [15, 27] based on the property that long read alignments or assemblies span across most VNTR loci. The initial long-read assembly studies found that insertions and deletions greater than 50 bases (structural variation) in humans are enriched in VNTR loci [79, 15]. More recently it was found that over 61% of structural vari- ants discovered in 32 haplotype-resolved assemblies produced by the Human Genome Structural Variation Consortium [27] are inside VNTRs. The loci that harbor these structural variants are highly polymorphic. An analysis of30,000 VNTRs in these assemblies (corresponding to loci studied in [27]) shows9 alleles per locus, when considering each distinct sequence as an al- lele. This analysis with recent long-read data in human agrees with the long-studied observation that VNTR mutation rate in prokaryotes is up to six orders of magnitude greater than single- nucleotide polymorphisms [102, 44]. 47 Variation is discovered using long-reads or their assemblies by mapping reads/assemblies to genomes using methods that allow for large insertions and deletions [56, 84], and recording vari- ants by their position along with the number of bases gained or lost directly from alignments [89, 45]. When discovering variation from unassembled reads, variant signatures from separate reads are combined into individual calls allowing for inexact matching of variants to account for alignment dierences driven by sequencing error [104, 89, 45, 16, 51]. Similarly, when combin- ing variation discovered from multiple individuals into population-scale databases, variants in dierent samples are merged into distinct alleles depending on similarity of variant calls [15, 27, 51]. In both instances, merging variant signatures in tandem repeat loci is challenging in repeti- tive DNA because gaps in the reads/assemblies may be placed in dierent locations, depending on the alignment algorithm and gap penalties used [15]. The variant merging heuristics across haplotypes can collapse dierent mutations, resulting in a reduced measure of diversity of these sequences. Finally, recording variants strictly as gains and losses of DNA masks polymorphisms in repeat compositions, thus does not reveal the full extent of sequence mutations in VNTRs. For example, variation in an intronic VNTR of CACNA1C is associated with schizophrenia by repeat composition, rather than length [94], and pangenome-analysis of VNTR composition cis- association expression of 174 genes with changes in VNTR composition[62]. A recent study of 3,622 Icelandic individuals sequenced with LRS used a combination of pre-ltering and clique- based clustering to provide a ner separation of variant alleles for VNTR sequences that associate with height, atrial brillation, and recombination [10]. As larger scale long-read sequencing stud- ies become available, a more eective genotyping method suitable for long reads may help shed 48 light on the role of VNTRs in human genetics, since LRS and their assemblies has the potential to resolve large and complex VNTR alleles. Rather than building a catalog of variation from dierences in alignments, a more accurate approach to describe VNTR variation is to annotate their motif composition. By comparing tan- dem repeat sequences according to repeat units, the full diversity of a population or study co- hort may be revealed as both dierences in repeat length and composition. One naive approach to creating a motif database for each VNTR locus is to compile a non-redundant set of repeat motifs identied by a repeat discovery tool such as Tandem Repeats Finder (TRF) [9], applied to a reference set of VNTR sequences. However, this method may generate an extensive list that includes many rare motifs, which may obscure the pattern of repetition. For example, the sequence ACGGTACGGTACCGTACGT may be decomposed into [ACGGT, ACGGT, ACCGT, ACGT], however (ACGGT) 4 (4 repetitions) is more concise and has a low divergence from the original sequence. Furthermore, annotation by TRF across multiple samples may have inconsistent peri- odicity and starting frames. Thus we dene an ecient motif set as a subset of original motifs consistently dened across a reference panel, in which rare motifs are replaced by more common ones while maintaining a bounded total replacement cost. To nd such ecient motif sets, we have developed a toolkit, VNTR Annotation Using E- cient Motifs Set (vamos) that nds ecient motif sets using a reference panel of diversity genomes. We have integrated the StringDecomposer algorithm [26] into vamos to annotate new genomes sequenced from aligned LRS reads or their assemblies using ecient motif sets. We generated an ecient motif set for VNTR loci from 148 haplotype-resolved assemblies sequenced with LRS by the Human Genome Structural Variant Consortium (HGSVC) [27] and the Human Pangenome 49 Reference Consortium (HPRC) [59] and under three levels of divergence, as well as the origi- nal motifs. As a proof of concept, we used vamos to create a combined VNTR callset across the HGSVC and HPRC assemblies to quantify diversity of VNTR sequences, and compared this to the diversity measured by a separate approach that combines calls based on merging similar variants). We additionally evaluated the applicatoin of vamos to unassembled long read data. 3.2 Methods 3.2.1 Generatingareferencemotifset We used a total of 148 haplotype-resolved assemblies covering nonredundant samples from the HGSVC (N=58) and HPRC (N=90) projects to build a catalog of VNTR motifs. The motif sets were calculated using sequences orthologous to 692,882 VNTR loci dened by the simple repeat track in GRCh38 [47]. Since the GRCh38 VNTR annotations may be missing VNTR annotations due to misassembly and collapsed repeats, we also ran TRF on the 148 assemblies. This identied 5,294 additional loci where the lengths on the reference are less than half of the average lengths on assemblies. VNTR loci that are on alternative chromosomes, in centromere regions, longer than 10kb, or are annotated as more than 40% transposable element were removed and not studied, resulting in a total of 583,316 remaining loci that were considered for analysis. Given the location of a single VNTR in GRCh38, whole-genome alignments were used to de- termine the orthologous boundaries across assemblies. The VNTR sequence in the reference and all orthologous VNTR sequences from assemblies are collectively referred to as a single VNTR lo- cus. Each locus was processed rst by individual assembly and then collectively across assemblies using TRF, motif ltering, and re-annotation in order to identify their initial motif composition in 50 Figure 3.1: vamosworkowofoneVNTRlocus. A.Processingandltering. TRF may re- port multiple potentially overlapping annotations or not annotate the entire VNTR sequence. So, for each assembly consensus motifs from all TRF annotations are collected to re-annotated the full VNTR sequence by StringDecomposer. Motifs resulted from re-annotation are combined across assemblies as the original motif database. B.Ecientmotifselection. An ecient motif set is selected as a subset of the original motif set under a parameterq representing the compression strength. C. New sequence annotation. New sequences are annotated by StringDecomposer algorithm into a string of ecient motifs that is closet to the raw nucleotide sequence by edit distance. 51 a manner that is unied across assemblies (Figure 3.1A). The VNTR sequences were rst extracted from each assembly and annotated in isolation using TRF, where each TRF annotation denes a repeat interval and a consensus motif over that interval. Due to changes in repeat pattern or other interruptions in repeat sequence, TRF may report multiple potentially overlapping annotations or not annotate the entire VNTR sequence. To account for this, the motif composition of each VNTR sequence was re-annotated using the StringDecomposer method with its TRF consensus motifs modied to account for cyclical rotations and ltered to remove homopolymers and re- dundant motifs that are approximate concatenates of smaller annotations (longer than 1.5 times the length and encompassing 80% a smaller consensus motif). As a result, each VNTR sequence was partitioned into a set of non-overlapping repeat motifs. A new consensus was generated over the re-annotated motifs by the tool abPOA [36]. Since motifs on the sequence boundaries are often partial, we further eliminated boundary motifs that are shorter than 2=3 or longer than 4=3 of the overall consensus. The full set of reference motifs for a locus were then formed as the non-redundant union of individual sequence motifs from all assemblies, followed by an additional cyclic rotation adjustment so that all motifs are in-frame with the consensus generated by abPOA. To avoid degenerate annotations, homopolymer and dinucleotide loci were excluded (N=165,308). Finally, we found that the ecient motif selection and StringDecomposer annotation did not have practical runtimes for loci that had many motifs, and to improve runtime loci that have more than 500 motifs were additionally excluded (N=506). 52 3.2.2 Ecientmotifselection The ecient motif selection process is applied independently to each VNTR locus. For a given VNTR locus, letV =fv 1 ;:::;v J g represent a collection of orthologous VNTR sequences from a reference set of assemblies. After processing and ltering in section 3.2.1, the set of all distinctly observed motifs in V is referred as original motif set, dened by =fm 1 ;m 2 ;:::;m p g, with count of each motif represented byo i . The ecient motif set is a subset (compressed) from the original motif set (Figure 3.1B). During compression, rare motifs are replaced with more common motifs if they are similar, as this suggests that the rare motifs may have mutated from the more common ones or arose from sequencing error. In contrast, rare motifs that are dissimilar to other motifs are retained, as it’s less likely that they arose from the mutations of other motifs. The cost of replacing a motifm j by another motifm i is dened as the edit distance betweenm i andm j , denoted by ij . We formulate the ecient motif selection problem as an optimization problem. Specically, we dene an ecient motif set e is a minimizer of the weighted sum of the ecient motif set size and the total motif replacement cost. We require the total replacement cost to be bounded by a parameter to control the compression. we also require if a motifm j is replaced with motifm i , all occurrences ofm j must be replaced. Furthermore, a motif can be only replaced by another motif with higher count, not the other way around. The parameter is specic to each locus, and represents the upper bound of total motif replacement cost. Loci with longer or more divergent motifs require larger for the ecient motif set to be likely smaller than the original motif set. is designed to control the number of motifs that can be removed from original motif set and set a reasonable removing cost per motif based on the motif divergence at each locus. LetM be the full list of all observed motifs in 53 V . A user-specied global parameterq2 [0; 1] is used to calculate a locus-specic by setting = (Q(q)jjMjjq). The valueQ(q), represents theq-quantile pairwise edit distance of all pairwise edit distances of motifs inM and reects the motif divergence at a locus, indicating the upper bound of the allowed removing cost per motif. The second termjjMjjq indicates the upper bound of number of motifs that can be removed. The value of grows with increasingq. In the limit, a single motif will be selected for e and the ecient motif set will lose the ability to represent variation in composition for a VNTR locus. Theorem 1 (Appendix) guarantees that if ecient motif set exists, the nucleotide sequences translated from vamos annotation string of ecient motifs, dier from the original sequences by at most . Additionally, Theorem 1 also indicates that a concatenation of counterpart ecient motif of each original motif in the sequence, referred asreplacementannotation, is a good approx- imation of the original raw sequence, with the divergence bounded by . However, it should be noted that theoretically, the vamos annotation is even closer to the original raw sequence. 3.2.2.1 Integerlinearprogrammingformulation Theorem 1 implies that it is possible to search for an ecient motif set by bounding on the cost of replacing motifs independent of their context inV . We prove that the ecient motif set selection problem is a NP-hard problem when the cost function div is a general function not limited to edit distance in theorem 2 (Appendix). Fortunately, it can be formulated as an integer linear programming (ILP) problem (equation 3.2), which can be eciently solved using the Google OR-tools (CP-SAT solver) [80] . 54 The problem may be formulated as a linear programming (LP) problem with an indicator func- tion as objective (equation 3.1), which is not yet in ILP form due to the presence of an indicator function in the objective. min x ij p X j=1 1( p X i=1 x ij 1) + p X i=1 o i p X j=1 x ij ij s.t. p X i=1 o i p X j=1 x ij ij < p X j=1 x ij = 1;81ip x ij (o i o j ) 0;81i;jp x ij 2f0; 1g;81ip (3.1) However, equation 3.1 can be transformed to an equivalent equation 3.2 by introducing addi- tional variablesy i = 1( P p i=1 x ij 1). L andQ in equation 3.2 are the lower bound and upper bound of 1 P p i=1 x ij (In this case,L can be 1p andQ can be 1). Equation 3.2 is clearly in ILP form. 55 min y j n X j=1 y j + p X i=1 o i p X j=1 x ij ij s.t. p X i=1 o i p X j=1 x ij ij < p X j=1 x ij = 1;81ip 1 p X i=1 x ij Q(1y j );81jp 1 p X i=1 x ij (L 1)y j + 1;81jp x ij (o i o j ) 0;81i;jp x ij 2f0; 1g;81ip y j 2f0; 1g;81jp (3.2) 3.2.3 Annotatingmotifcompositiononsequencedata The vamos software adapts the StringDecomposer algorithm to annotate the motif composition of tandem repeat sequences for the use cases of alignments of haplotype-resolved assemblies and long (single-molecule sequencing) reads to a reference genome. As input vamos requires se- quence alignments and a le with coordinates of reference VNTRs and a motif list for each VNTR. The tandem repeat sequences to annotate are extracted from the input alignments. When anno- tating assemblies, sequences are annotated directly as read from the alignments. When annotat- ing read alignments, all reads covering a VNTR locus are rst collected. Reads are partitioned by haplotype using phase tags if they have been phased using WhatsHap [78] or HapCut2 [28]), or by a max-cut heuristic that initializes partitions with the two reads with the most disagreeing heterozygous SNVs and assigns remaining reads to a partition based on shared SNVs. The VNTR 56 sequences from each of the reads are extracted based on alignments, and the StringDecomposer annotation is executed on the abPOA consensus [36] of the VNTR sequences in each partition. The output of a run is in Variant Call File (VCF) format [22] with one variant entry per reference VNTR. The ecient motif list for a VNTR is stored in the INFO eld along with the run-length encodings of each observed motif composition. The values of the genotype index the correspond- ing run-length encoding for each haplotype. In this manner, a combined-sample VCF maintains a record of distinctly observed alleles. 3.3 Results 3.3.1 Ecientmotifdiscoveryforareferenceassemblypanel Ecient motif set discovery and allele annotation were ran on the 48 HPRC and HGSVC assem- blies. After preprocessing, an average of 360; 864 6072 annotated VNTR loci were retained per haplotype. The union of annotations of all 148 assemblies yielded a nal set of 390,115 loci, of which only 15,514 were trivial repeats with a single motif. On average, there were 8:97 26:57 motifs per locus. The ltered motif sets were used as input for ecient motif discovery. To examine the motif diversity at varying levels of compression and population sampling, we generated ecient motif sets for an increasing number of reference assemblies, withq values of 0.1,0.2, and 0.3. Examples of VNTR decompositions using all observed motifs compared to ecient set of motifs annotations are shown in Figure 3.2; theACAN exonic VNTR found to be associated with height [10, 72], and an intronic VNTR in WDR7 found to be a modier of amyotrophic lateral sclerosis [21]. 57 To assess the global diversity of VNTR motifs, we used genome-wide total motif count to measure the change in ecient motifs as more assemblies were added to the discovery panel from one to 148 assemblies. As expected, the number of the original motifs (q = 0) increased as genomes were added (Figure 3.3), reecting the inclusion of rare motifs with each additional assembly. The number of ecient motifs also increased as more assemblies are included, with total diversity showing asymptotic leveling at 148 genomes. Specically, after inclusion of 30 as- semblies, 99.07% ecient motifs were selected compared to the entire set when 148 assemblies were incorporated with aq value of 0.1 (Figure 3.3). On average, the ecient motif set size per VNTR locus ranges from 4.36 at q = 0:1 to 2.6 atq = 0:3 when 148 assemblies are included, while the original motif set size averages 8.97 per VNTR locus (Table 3.1). The mean compression ratio of motifs (ecient / original motif size) ranges from 0.78-0.63 (q = 0:1 0:3) (Table 3.1). One possible approach to compile an ecient motif set is a greedy approach, which selects the most frequently occurring motifs at each locus. To compare the performance of ecient motif set selected by vamos algorithm and the greedy method, we generated ecient motifs withq value of 0.1 by vamos and motif sets of the same size using the greedy method. Both motif sets were used to annotate a total of 136,748 VNTR loci for 148 HGSVC and HPRC haplotype-assemblies, excluding homogeneous loci with less than ve original motifs. To evaluate the quality of the annotations, we computed the edit distance between the nucleotide sequences translated from the annotated string of motifs and the assembled raw sequences. We dened an annotation to be superior to another if its edit distance to the true VNTR sequence is less than 80% of the corresponding edit distance of the other sequence. Under this metric vamos outperforms the greedy approach on 6:3 0:1% (8097), which is roughly 1.3 times the number of loci where the greedy approach is 58 superior (Table 3.2). One advantage of using ecient motifs over the greedy approach is that it takes into account the cost of replacing each original motif, resulting in not only highly frequent motifs being reserved but also less frequent ones with long length and dissimilarity from other motifs. For instance, suppose a locus has three motifsAAAG,AAAC, andTGTGACCTGCAC with counts 10, 3 and 1. The greedy approach picks top two frequent motifs AAAG and AAAC, whereas vamos selectsAAAG,TGTGACCTGCAC. By including the rare motif TGTGACCTGCAC, vamos ecient motifs allow for representation of more diverse sequences that include this motif. Another advantage of vamos is the automatic decision of the number of ecient motifs to select at each locus, which benets from the ILP formulation and the association with a total replacement cost upper bound that can adapt to the complexity of original motif set for each locus. Table 3.1: The mean ecient motif set size and compression ratio. q 0.1 0.2 0.3 Mean ecient motif set size 4.36 3.26 2.60 Mean compression ratio 0.78 0.69 0.63 The mean ecient motif set size and compression ratio under three levels ofq (0.1,0.2,0.3), when 148 haploid assemblies are incorporated. Table 3.2: Comparison between vamos and greedy method. Average number of loci per assembly vamos< greedy 15,170 vamos< 80%greedy 8,097 greedy< vamos 13,479 greedy< 80% vamos 6,308 The quality of an annotation was measured calculating the edit distance between the nucleotide sequences translated from the annotated string of motifs and the original assembled sequences. An annotation is signicantly better if the corresponding edit distance is less than 80% of that of the other. 59 3.3.2 Allelicdiversityin64haplotype-resolved de novoassemblies Because previous studies using long-read sequencing and assembly to quantify human diversity using methods that merged separate distinct alleles into the same variant [27], we sought to use vamos to quantify human diversity in VNTR sequences with motif-resolution in the HGSVC and HPRC haplotype-resolved assemblies. Each assembly was annotated independently using the original motif set and ecient motif sets dened byq = 0:1. The average number of alleles per locus under the original motif set was 4.9 (7.8 excluding constant loci), and under the ecient motif set was 4.1 (6.4 excluding constant loci) (Table 3.3, Figure 3.4). Combining the HGSVC and HPRC data set resulted in about 26-33% more alleles than each individual set for q=0, and 21-33% more alleles with q=0.1, as evidence of the high variability of VNTRs. The number of dierent alleles correlates with the average motif count using both the original (r 2 = 0:21;p< 2:210 16 ) and ecient (r 2 = 0:21;p< 2:2 10 16 ) motif sets. In contrast, there were 3.3 alleles per locus when comparing by exact length, and 5.4 alleles per locus when comparing by exact sequence. To measure how dierent alleles are, we also grouped together annotations if their edit dis- tance with respect to the motif composition was up to 0-3 motifs (Table 3.3). When considering annotations using the original motif set, there was a 47% reduction in the average number of al- leles per locus when grouping alleles that dier by up to two motifs by an edit distance allowing for insertion and deletion in addition to motif mismatch. Similarly, there was a 44% reduction in the average number of alleles in the ecient motif set under a similar grouping by edit distance. Because the annotation by ecient motif sets already have a reduced diversity due to motif com- pression, the similar relative reduction by edit distance in both annotations indicates relatively 60 few motif mismatches and that copy number variation instead accounts for the majority of allelic variation, as expected by the slippage mutational mechanism of tandem repeat sequences. Table 3.3: Average number of alleles per locus when all 58 HGSVC and 90 HPRC haplotypes were annotated using the original and ecient set of motifs. Edit distance Motif set 0 1 2 3 Original set (q = 0) 4.9 (7.8) 3.4 (5.2) 2.6 (3.8) 2.2 (3.1) Ecient set (q = 0:1) 4.1 (6.4) 2.9 (4.3) 2.3 (3.3) 2.0 (2.7) Annotations were grouped into one allele if their edit distance with respect to the motif composition is 0-3 motifs. Statistics calculated by excluding constant loci that have only one allele measured by 0 edit distance under the original set are shown in brackets. 3.3.3 Annotatingdivergedpopulations We used simulation analysis to evaluate annotations of new genomes not included the reference assemblies using simulated VNTR sequences so that the ground truth motifs are known. VNTR sequences were simulated based on 50,000 randomly selected loci described in Section (3.2.1) and the corresponding replacement annotations described in Section (3.2.2) were compared with the vamos annotations using StringDecomposer. Motif sets with a single motif were excluded from simulation since variations introduced by simulation may largely be concealed by the motif homogeneity in such cases. From each of the 50,000 selected motif sets, VNTR sequences were generated by sampling and concatenating 50-100 motifs according to their underlying frequencies in the reference assemblies. Population diversity where sequences have motifs not reected in the original or ecient mo- tif sets was simulated by adding random mutations into the simulated VNTR sequences. Speci- cally, four settings for mutation rate: a 1% and 2% mutation rate sampled uniformly from single 61 base substitution, deletion, or duplication combined with an optional 1% rate of insertions 2-4 bases (Figure 3.5). The annotations were assessed using simulated assemblies (error-free), as well as 30-fold cov- erage of reads from HiFi using pbsim [75] with average 98.1% accuracy and ONT using alchemy2 (distributed with lra [84] and averaging 88.8% accuracy). Similar to the analysis of aligned reads, the accuracy of the annotation was impaired by sequencing error (Figure 3.5). The addition of population diversity through sequence divergence further decreased the annotation accuracy. Overall, vamos exhibited robust performance on simulated assemblies and high quality sequenc- ing reads ( 98% accuracy). The baseline error rate for annotating sequences without divergence was 6.1% for PacBio HiFi, and 15.6% for ONT. When there was sequence divergence through muta- tion and indels, the error rate only rose modestly, from 6.6-10.0% for PacBio HiFi, and 15.7-18.9% for ONT, indicating that relatively large dierences in sequence variation may still have simi- lar motif annotations in the ecient motif set. As technologies improve the relative dierences between approaches are likely to decrease. The instances where there was a high divergence between the simulated motifs and the motif decomposition were typically due to long motifs in the original motif set being annotated by multiple shorter motifs in the ecient motif set. 3.3.4 Analysisofalignedlong-reads Because studies may elect to perform low-coverage sequencing to increase sample size, we ana- lyzed long-read data of NA24385 sampled at 10-30X for both HiFi and ONT platforms to inves- tigate the accuracy of VNTR annotation on real mapped reads. Since analysis of assembly data gave the best overall performance by simulation, we compared annotations from reads data to 62 those from the HPRC NA24385 assembly (Figure 3.6). Regions where reads could not be phased, but still covered a VNTR locus were annotated with homozygous calls. The higher sequenc- ing coverage leads to better phasing and construction of more accurate consensus sequences for improved overall performance at higher sequencing depth. Increasing sequencing depth from 10X to 20X greatly decreased the number of uncovered loci for both platforms, yet such benet was not as signicant when sequencing depth was further increased from 20X to 30X. Under the sequencing depth of 30X, the percentage of VNTR loci with over 80% of the covered reads suc- cessfully phased was 78% and 71% for the HiFi and ONT data, respectively. However, although both platforms had over 97% of the VNTR loci covered at 10X or above, this number dropped to 91% for the ONT data but 83% for the HiFi data after phasing, showing the advantage of longer reads for phasing. The HiFi data generally had more loci not covered than the ONT data, though for annotated loci the HiFi data produced slightly better agreement to the assembly annotations. We examined the cases showing small divergence (i.e., edit distance 2) between the reads and the assembly results and found such loci enriched at regions of low sequence complexities (e.g., AT or GC rich regions), where motifs are highly similar and thus chances of motif misplacement increases. 3.3.5 Comparisonwithinterval-basedmergingofpopulationcalls We compared the vamos annotations on the 148 assemblies to a method made to merge variant calls across populations, Jasmine [51]. To generate the variant calls for each assembly, we used the dipcall [57] and applied a lter to exclude variants under 20bp. The variant calls were then merged across samples using the jasmine tool with the –dup_to_ins option. Although we 63 excluded smaller variants from analysis, the jasmine method was written to use input from a companion program iris, which uses variant calls produced by Snies [89] and the default setting of Snies is to detect variants at least 35 bases making our analysis an overestimate of default behavior. This resulted in a total of 410,418 variants, out of which 153,880 were located in 65,584 VNTRs. The variants in the Jasmine callset were fully phased, so the distinct alleles could be enumer- ated based on the pattern of inherited variants each assembly had that overlap with VNTR loci. On average each locus had 5.5 distinct alleles in the Jasmine combined callset when enumerating inherited variants, and 4.0 distinct alleles per locus when aggregating the total gain or loss of variants on each locus. In contrast for vamos, we observed an average of 16.7 alleles per locus when considering distinct annotations, and 7.6 alleles per locus of when only considering the length of each annotation (Table 3.4). When using the ecient motifs, there was a 15-29% de- crease in the number of distinct alleles annotated by vamos, and 1.3-2.6% decrease in the number of length alleles observed (Table 3.4). On 96% loci, we observed more alleles from the vamos an- notation (Figure 3.7). This indicates the issue of over-merging in interval-based method, which relies on heuristics to merge variants based on variant size and breakpoint distance, and can lead to omission allelic variation at VNTR loci. 3.4 Discussion We developed vamos as a tool to automatically curate VNTR motif sets from long-read population assemblies and annotate new genomes using these sets. We show that compared to interval-based merging of variant calls, vamos annotations capture a greater amount of diversity of VNTRs. 64 Table 3.4: The average number of alleles per locus obtained from a combined variant callset based on merging of variants by Jasmine, and vamos motif annotations. variant>=20bp variant>=30bp Total number of variants 410,418 288,172 Total number of variants 153,880 117,431 falling into VNTRs Total number of VNTRs 65,584 46,597 intersected with variants # alleles by # alleles by # alleles by # alleles by variant total variant length variant total variant length interval-based method 5.5 4.0 5.8 4.1 (Jasmine) # alleles by # alleles by # alleles by # alleles by vamos motif string length motif string length q=0 16.7 7.6 19.9 8.5 q=0.1 14.1 7.5 16.7 8.3 q=0.2 13.0 7.4 15.3 8.3 q=0.3 11.8 7.4 13.8 8.2 For the Jasmine calls, the number of alleles was determined using two dierent denitions. The rst denition was based on the presence of known variants in the callset, referred to as “allele by variant" in the table. The second denition was based on the aggregated length of inherited variants, referred to as “allele by total variant length". In contrast, for the vamos method, the table evaluated two denitions of an allele. The rst denition was based on the number of motifs in the annotated string, denoted as “allele by length" in the table. The second denition was based on the annotated string of motifs from vamos, denoted as “allele by motif string". The analysis was conducted using the subset of VNTR loci for which Jasmine records a variants (N=65,584 for variants 20bp and N=46,597 for variants 30bp). 65 The method is shown to be robust for annotating on unassembled low-coverage long read data, making it feasible to study large cohorts without high-coverage data anddenovo assembly. New long-read sequencing and assembly eorts are generating large databases of structural variation enriched at variable number of tandem repeats. These variants tend to be multi-allelic, however current catalogs of variation merge multiple dierent variant calls into a single representative call that limits the ability to associate dierent alleles with traits. One solution to this is to consider each unique VNTR sequence as an allele. Under this estimate, there are an average of 5.4 alleles per locus in VNTRs among the set of loci examined in this study. Here, we show there is a 24% reduction in annotated allele diversity when small-scale variation is abstracted by the ecient motif set annotation. This dierence is subtle, however as the number of genome assemblies increases, the benet of an abstract encoding of VNTR alleles will also grow, for example in the application of long-read sequencing to association analysis. The set of regions that vamos calls depends on the initial preprocessing of assemblies. The number of loci annotated in each genome depends on the overall quality of the assembly and the contiguity of the whole-genome alignments between the assembly and the reference. While we were able to partially address missing VNTR loci by including regions masked by TRF in the HGSVC and HPRC assemblies, but missing from GRCh38, future annotations based on the CHM13 T2T assembly [74] will likely increase the number of loci included in the ecient motif database. Additionally, higher quality assemblies will expand the regions where an annotation database may be constructed, such as VNTRs in segmental duplications. The appropriate parameters for selecting an ecient motif set will be specic to the context which the analysis is performed. Generally, for the reference assemblies that were considered in this study, values ofq 0:1 preserved common motifs and excluded rare ones. Depending 66 on the aims of the study, a higher value ofq will enable characterization of domain structure or other higher-order patterns in repetitive DNA, and low, or complete motif sets can be used to study patterns of rare variation and mutation rates. The use of a standardized motif database (original or ecient) to annotate VNTRs will obscure rare motif variants or repeat interruptions (such as those that may arise in pathogenic disease loci). Future versions of the software may annotate dierences between samples and the closest matching motif. Furthermore, it is possible to use custom motif databases may be necessary to use when there is a standard representation of motif variation among experts of a particular locus that is not represented by the original TRF annotations. Finally, there is no distinction between removing low frequency motifs that arose from error versus true biological mutations. In this instance, annotations on the original motif set could be retained to be validated by orthogonal sequencing data. The runtime of ILP solver is highly variable across loci. It takes a few minutes to run for loci with up to hundreds of motifs, but for loci with thousands of motifs, it can take hours. However, each locus may be calculated in parallel, and it only needs to run once on a set of reference genomes. The vamos software is distributed with ecient motif sets computed forq = 0:1, 0:2 and 0:3, along with the original motif set for the 148 haplotype-assemblies, and may be updated as additional human genomes are assembled. 67 Figure 3.2:Visualizationofallelestructurechangewithfourlevelsofq(q = 0; 0:1; 0:2; 0:3). q = 0 means no compression on the original motif database. With q growing larger, only truly representative motifs are selected and rare/contamination motifs possibly resulting from sequence error disappear and the allele’s domain structure starts to grow more clear. The ACAN and WDR7 VNTR sequences from 148 HGSVC and HPRC assemblies were annotated into a string of ecient motifs selected at each level of q. Repeat motifs is color-coded and each in- dividual annotation is plotted as a series of colors (row). Repeat motif unit is shown on x-axis. A. ACAN (chr15:88855422-88857301 on GRCh38) repeat alleles ranked by decreased length. B. WDR7 (chr18:57024494-57024955 on GRCh38) repeat alleles ranked by decreased length. 68 Figure 3.3: Diversityofecientmotifsets. Diversity of ecient motif sets under four levels of compression (q = 0; 0:1; 0:2; 0:3) as more assemblies are incorporated, measured by genome- wide number of ecient motifs. The curve ofq = 0 reects the number of original motifs. As expected, the number of the original motifs increases as genomes are added, due to the inclusion of rare motifs. The ecient motif set size increases as more assemblies are included, with total diversity in this category showing asymptotic leveling at 30 genomes. Asq grows larger, more compression is posed on the original motifs, resulting in less ecient motifs. 69 Figure 3.4: Distribution of number of VNTR alleles for 58 HGSVC and 90 HPRC haplo- types. VNTR sequences were annotated by vamos contig using both ecient motifs (q = 0:1) and original motifs (q = 0). Alleles from the call set were counted by the annotation strings. Figure 3.5: Error of vamos annotation in simulated data for assemblies, HiFi, and ONT readsundervedivergencesettings. The ground truth annotation is dened by replacing the original simulated motifs by their counterpart ecient motifs forq = 0:1 The error is calculated by dividing the edit distance on motifs between the vamos annotation and ground truth to the number of motifs repeated in the ground truth. The divergence and gap percentages refer to the number of base mutation and indel spaces added per hundred bases of the locus. 70 Figure 3.6:AnalysisofrawsequencingreadsofNA24385. Raw sequencing reads of NA24385 were phased by HapCut2 and annotated by vamos read. Edit distance was calculated by aligning annotation strings to results from the HPRC NA24385 assembly for each VNTR locus. 71 Figure 3.7: Comparison of interval-based merging and vamos. Number of alleles compar- ison between interval-based merging of variants (Jasmine) and motif annotation (vamos). The number of alleles obtained by merging variants (x-axis, Jasmine), and by annotating motifs (y- axis, vamos). The Jasmine calls reect a combined callset on variants 20 bases, and the vamos annotation using an ecient motif set withq = 0:1. 72 Chapter4 Conclusionsandfuturework Human genomes are known to harbor a diverse array of genetic variations. Over the years, under- standing the association between genotypes and phenotypes is the key goal in genetic research. However, conventional computational genetics, such as linkage analysis [11] and GWAS [30, 70, 85], have been focused primarily on studying SNPs and small indels, while neglecting other types of genetic variants, including SVs and VNTRs. These missing components of genetic variation may contribute signicantly to complex traits and disorders, and thus require greater attention [29, 34, 103]. In Chapter 2, we present our eort of developing a long-read aligner called lra, which is designed to facilitate accurate sequence alignment around large SVs ( 50bp) using a speedup sparse dynamic programming algorithm with a concave gap function (CG-SDP) for optimal chain determination. We have simplied the implementation of the CG-SDP algorithm described in the original description [32] and extended to allow for inversions and translocations. In the original description of the algorithm, the processing of a starting point is blocked until all subproblems the point relies on are solved, and then the process is unblocked and processing resumes. We nd that this blocking and unblocking strategy is unnecessary, and the addition of data structures to keep 73 track of the state of computation of subproblems enables solving the problem with a standard model of computation. Our benchmark study demonstrates that lra oers promising results for accurate SVs detection using a variety of sequencing data types, including PacBio HiFi, CLR, and ONT reads, as well as de novo assembly contigs. We observe a signicant improvement in SV discovery, especially in large SVs, with an increase in recall of 0.6% on HiFi dataset and 3.2% and 2.2% gain in F1 score on ONT dataset anddenovo assembly, respectively. Overall, our results demonstrate the potential of lra as a valuable alternative to other aligners for accurate SV detection and highlight its potential applications in improving the accuracy of SV call sets in de novo assembly [106, 27]. In Chapter 3, we present our eort for genotyping VNTRs using long reads and assembly contigs. Our approach, named vamos, considers variations in both RU counts and repeat motif variation. vamos constructs original motif database using a reference set of assembly genomes, builds an ecient motif database from all observed motifs under a divergence parameter, and annotates VNTR sequences as a string of these ecient motifs. We constructed original and e- cient motif sets for 407,087 VNTR loci from HGSVC 64 haplotype-resolved assemblies under three divergent parameters . We evaluate vamos annotation on both simulated and real datasets, in- cluding long reads and haplotype-resolved assemblies that harbour a wide range of allele lengths. Our results demonstrate that vamos consistently generates accurate genotype results. Further- more, by quantifying the diversity of VNTR sequences in the HGSVC assemblies, we estimated an average of 3.1-4.0 alleles per locus, which contrasts with current studies that merge multi-allelic calls based on VNTR lengths that have 2.0-2.8 alleles per locus. We note that the appropriate parameters for selecting an ecient motif set will depend on the specic context of the analysis. Generally, for the reference assemblies considered in this study, 74 we found that values ofq 20% preserved common motifs while excluding rare ones. However, a higher value of q may be suitable for characterizing domain structure or other higher-order patterns in repetitive DNA, while lower value ofq can be used to study patterns of rare variation and mutation rates since it leads to more complete motif sets. These research eorts provide important tools for the functional characterization of SVs and VNTRs in complex traits and diseases, and hold great promise for a wide range of future appli- cations. We next discuss possible future directions of these existing works. 4.1 Futureworkforthelraproject One direction for the future is to investigate the application of the CG-SDP algorithm to sequence- to-graph aligners. These aligners maps reads to a pangenome graph that integrates multiple genomes from the same species, providing a more comprehensive representation of genetic di- versity than the traditional linear consensus reference. Reference bias can be a signicant prob- lem with the linear consensus reference due to under-representation of genetic diversity. By contrast, pangenome reference graphs oer the potential to represent genomic diversity in the human population in a single data structure. Several existing tools are available to map sequences to a pangenome reference graph, in- cluding minigraph [58], graph aligner [82]. Graph aligner adopts an ane gap penalty, while minigraph combines a thorough search in a small window with a concave gap penalty and an approximate search in a large window with a linear gap penalty. While this heuristic is fast and eective in most cases, it can miss the optimal chain in segmental duplication regions. To address this issue and pursue the exact optimal chain with a concave gap penalty, the CG-SDP algorithm 75 could be applied in graph alignment. Such an approach has the potential to improve the accuracy and sensitivity of sequence-to-graph alignment, providing a more comprehensive and accurate representation of genomic diversity. 4.2 Futureworkforthevamosproject Although we have shown the eectiveness of vamos for genotyping VNTRs, the lack of publicly available large cohort long-read datasets currently limits our ability to directly investigate the association of VNTRs with phenotypic traits or diseases using the genotypes generated by vamos. Therefore, our future research eorts will focus on leveraging large-scale long-read studies and case-control disease studies to explore the functional eects of VNTRs. One advantage of the vamos annotation is that it contains information about variations in both the number of repeat units and repeat composition of VNTR sequences. This feature allows us to investigate the associations of repeat counts, repeat motif variations, to phenotypic traits or disease status. The association test for repeat counts and phenotype is straightforward. Regarding repeat motif, we can dene the dosage of each motif as the number of times it appear in the annotation, and test the association of the motif dosage with the phenotype. This approach will provide valuable insights into the functional signicance of VNTRs and their potential roles in disease pathogenesis. 76 Chapter5 Supplementarymaterials 5.1 SupplementarymaterialsforChapter2 5.1.1 Softwareavailability lra is available on Biocondahttps://anaconda.org/bioconda/lra and GitHubhttps://github.com/ ChaissonLab/LRA. 5.1.2 Dataavailability • HG002 HiFi dataset can be found athttps://s3-us-west-2.amazonaws.com/human-pangenomics/ NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_HiFi/20kb/m64011_190901_095311. Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_HiFi/20kb/m64011_190830_220126.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_HiFi/25kb/m64011_190712_225711.Q20.fastq, 77 https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_HiFi/25kb/m64011_190726_220327.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_HiFi/19kb/m64011_190714_120746.Q20.fastq, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_HiFi/19kb/m64011_190728_111204.Q20.fastq. • HG002 CLR dataset can be found athttps://s3-us-west-2.amazonaws.com/human-pangenomics/ NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/PacBio_CLR/WUSTL_SV-HG002-CLR/3_C01/ m64043_191012_102127.subreads.bam, https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_ NA24385_son_v1/PacBio_CLR/WUSTL_SV-HG002-CLR/1_A01/m64043_191010_174437.subreads.bam. • HG002 ONT dataset can be found athttps://s3-us-west-2.amazonaws.com/human-pangenomics/ NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/nanopore/HG002_ucsc_Jan_2019_Guppy_3. 0.fastq.gz. • HG002 hiasm dataset can be found athttps://zenodo.org/record/4393631/files/NA24385. HiFi.hifiasm-0.12.hap1.fa.gz, https://zenodo.org/record/4393631/files/NA24385.HiFi.hifiasm-0.12.hap2.fa.gz. 78 5.1.3 Algorithmdetails AlgorithmS1 Dening subproblems Require: X - the sorted set of starting and ending points from the set of anchors ; [s;e] - the column/row range;d2fcol;rowg; Ensure: SUB(d) - the desired set of column/row subproblems obtained by assigning points from column/rows to column/rowe, which is initialized as;; 1: procedureSub(d;s;e;X) 2: ife ==sthen 3: Construct subproblem (d;;;;;s;e;DATA) with DATA= (;;;;;;E I ;E P ;E V ) as fol- lowing: 4: Save the forward diagonals of all the starting points from column/rowe in arrayE I ; 5: Sort the forward diagonals in arrayE I in the increasing/decreasing order; 6: Initialize every entry inE V to 0; 7: Initialize every entry inE P to -1; 8: SUB(d) SUB(d)[f(d;;;;;s;e;DATA)g; 79 9: else 10: whilee>sdo 11: Construct subproblem (d;s;b(s +e)=2c;b(s +e)=2c + 1;e;DATA) with DATA= (D I ;D P ;D V ;E I ;E P ;E V ) as following: 12: Store the forward diagonals of all the ending points from column/row s to col- umn/rowb(s +e)=2c in arrayD I ; 13: Sort the forward diagonals in arrayD I in the increasing/decreasing order; 14: Initialize every entry inD V to 0; 15: Initialize every entry inD P to -1; 16: Save the forward diagonals of all the starting points from column/rowb(s + e)=2c + 1 to column/rowe in arrayE I ; 17: Sort the forward diagonals in arrayE I in the increasing order; 18: Initialize every entry inE V to 0; 19: Initialize every entry inE P to -1; 20: SUB(d) SUB(d)[f(d;s;b(s +e)=2c;b(s +e)=2c + 1;e;DATA)g; 21: Sub(d;s;b(s +e)=2c;X); 22: Sub(d;b(s +e)=2c + 1;e;X); 23: endwhile 24: endif 25: return SUB(d); 26: endprocedure 80 AlgorithmS2 Sparse Dynamic Programming with convex gap cost Get the points setX from the set of anchors ; SortX in Cartesian order; Assume all points are arranged from col 0 to colt; SUB ;; . SUB is the set storing all subproblems SUB SUB[Sub(col; 0;t;X); SortX in anti-Cartesian order; Assume all points are arranged from row 0 to rowq; SUB SUB[Sub(row; 0;q;X); for eachp i 2CartesianSort(X)do ifp i is an endpointthen Score(p i ) Score(p s ) +l i , wherep s is the corresponding startpoint andl i is match bonus of the corresponding anchor; End(p i ) Score(p i ); .End stores the optimal chaining score for each endpoint for each SUB[j]2SAdo ifScore(p i )>D V [j]then D P [j] i; D V [j] Score(p i ), wherej ='(D I ;f i ) andf i is the forward diagonal ofp i ; endif endfor 81 elseifp i is a starting pointthen maxvalue 0; for each SUB[h]2SB do j '(E I ;f i ), wheref i is the forward diagonal ofp i ; if SUB[h] is a col-based subproblemthen ifD I [E L ]>f i then E V [j] (E B ;j); else for eachk2 (E L ;C), whereC = min s D I [s]>f i do Update(D V [k];E B ); endfor E L C 1; E V [j] (E B ;j); endif elseif SUB[h] is a row-based subproblemthen ifD I [E L ]f i then E V [j] (E B ;j); else for eachk2 (E L ;C), whereC = min s D I [s]f i do Update(D V [k];E B ); endfor E L C 1; E V [j] (E B ;j); endif endif ifE V [j]>maxvaluethen Start[p i ] (SUB[h];j) endif endfor endif endfor 82 5.1.3.1 Adetailedexampleofvisualizingsubproblemsdivision A detailed example of visualizing subproblems division is given by Fig S1. The data structures for each subproblem:D I ,D V ,D P ,E I ,E V ,E P ,E L ,E B and the process of subproblems solving. The horizontal axis represents the query, while the vertical axis represents the target. Points are num- bered in Cartesian sorted order, which is the processing order. 12 points are assigned into three column subproblems (A c 0 ;B c 0 ), (A c 1 ;B c 1 ), (A c 2 ;B c 2 ) and one row subproblem (A r 0 ;B r 0 ), where start- ing points are assigned toA-part and endpoints are assigned toB-part. Leaf subproblems are not shown for simplicity. Start and End are used for the trace-back of the optimal chain.Start stores sub – the index of the subproblem which yields the optimal chaining score up to a starting point andind – the index off i in arrayE I , that is'(E I ;f i ), wheref i is the diagonal of the starting point.End stores the optimal value for each endpoint. For this toy example, gap cost of append- ing fragment j to fragment i is gap( i ; j ) = 0:25log (y e i x e i ) (y s j x s j ) + 1 + 1, where (x e i ;y e i ) is the endpoint of i and (x s j ;y s j ) is the startpoint of j . m, shows the regions that each subproblem covers and the initialized data structures for each subproblem. There are of three column subproblems and one row subproblems (leaf subproblems are not shown for simplicity): (A c 0 ;B c 0 ), (A c 1 ;B c 1 ), (A c 2 ;B c 2 ) and (A r 0 ;B r 0 ). a-l shows how the data structures of subproblems that are associated with the point being processed in each step are updated. Note that entries that are updated are highlighted by orange. a, shows forstartpoint1, it is a leaf subproblem that yields the value of the optimal chain up tostartpoint 1. b, shows when processingendpoint 2, the optimal value up to it isScore(startpoint 1) + 2, where 2 is the match bonus of the frag- ment. ArrayEnd is then updated. Sinceendpoint 2 is inA c 0 ,A c 1 ,A r 0 , the correspondingD V entries are updated to 2 and correspondingD P entries update to the index ofendpoint 2. c, 83 shows when processingstartpoint3, it is in theB-parts of subproblems (A c 1 ;B c 1 ) and (A r 0 ;B r 0 ). startpoint 3 is located inE I [2] of (A c 1 ;B c 1 ), soUpdate(D V [0];E B ) would be called to get the value of E V [2]. The purple color highlighting shows what forward diagonals in E V would be updated byD V [0].E P [2] would be updated to point toD I [0].startpoint3 is located inE I [2] of (A r 0 ;B r 0 ) and no forward diagonals inD I used to updateE V [2]. Therefore, inStart,sub andind forstartpoint3 are updated toB c 1 ; 2.d, shows when processingstartpoint4, it is inB c 1 and lo- cates inE I [0] of (A c 1 ;B c 1 ). Since there is no forward diagonal inD I can be used to updateE V [0], it is a leaf subproblem that yields the optimal chaining value up tostartpoint4 inStart. e, shows when processingstartpoint 5, it is inB c 1 andB r 0 . In (A c 1 ;B c 1 ), there is no forward diagonal can be used to updateE V [1]. In (A r 0 ;B r 0 ),Update(D V [0];E B ) is called to update the block structure E B , soE V [1] andE V [2] would be computed fromE B . InStart,sub andind forstartpoint 5 are updated toB r 0 ; 1. f,g,h,i,j,k,l: show the subproblems solving and data structures updating for the rest of points. After processing all 12 points, three optimal chains can be obtained by trac- ing back, which arechain 1 = [startpoint 1;endpoint 2;startpoint 3;endpoint 6], chain 2 = [startpoint 1;endpoint 2;startpoint 5;endpoint 9] and chain 3 = [startpoint 7;endpoint 10;startpoint 11;endpoint 12]. 84 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) (m) Figure S1: A detailed example of visualizing subproblems division. 85 5.1.4 Impactoflocalminimizerindex We analyzed the use of a local minimizer index for alignments of 263,000 PacBio CLR reads from HG002 chromosome 20 and simulated HiFi, CLR, and ONT reads. Although banded alignment is used to rene alignments in each case, the alignments that skip the additional chaining step with local minimizer matches have an 11.5-fold increase in medium size (10-50 base) deletions and 3.8-fold increase in medium sized insertions compared to alignments that use the local minimizer index (Table S1). Although this eect could be minimized using a larger band when rening alignments, this would have a much higher computational burden. Table S1: The count of indels from PacBio CLR alignments. Using local index Matches Deletions (all) Deletions 10-50 bp Insertions (all) Insertions 10-50 bp Yes 5.58B 198M 114K 170M 972K No 5.47B 216M 1.31M 200M 3.67M To further investigate the impact that local minimizers have on mapping quality, we measured the accuracy of alignments of simulated reads with and without the local minimizer index (Fig S2). 5.1.5 Assemblyexclusivecalls Compared to read alignment, assembly alignment has the simplicity of detecting variants directly from assembly alignments rather than forming a consensus of SV signatures from multiple reads. To gauge the specicity of assembly-based calls with no raw reads alignments support, we used Truvaribench to nd large SV calls (>= 20000 bases) only in each assembly haplotype alignment but not in HiFi/CLR read pbsv-calls for lra and minimap2 respectively. We display the counts of 86 0 20 40 60 mapping quality 0.90 0.95 1.00 accuracy a overlap percentage: 10.0% 0 20 40 60 mapping quality 0.90 0.95 1.00 accuracy b overlap percentage: 40.0% 0 20 40 60 mapping quality 0.90 0.95 1.00 accuracy c overlap percentage: 70.0% Figure S2: Mapping accuracy of lra. lra without the step of rening local minimizers, minimap2 and ngmlr on simulated HiFi, CLR reads for lengths between 5-50kb and ONT reads for lengths between 1-50kb. Simulated reads were mapped to genome GRCh38. A read is considered as correctly mapping if the reported mapping interval has 10%; 40%; 70% overlap with the truth interval. paftools.js mapeval was used to evaluate the mapping accuracy. assembly exclusive large SV calls for both lra and minimap2 (Table S3) and the genome coordi- nates of all such calls (Table S4 and S2). We also show several IGV screenshot where the assembly alignments show clear insertions/deletions in regions with no read alignment pbsv-calls (Fig S3 and S4)). Table S2: minimap2 assembly exclusive large SV calls coordinates. We used Truvari bench to nd large SV calls (>= 20000 bases) only in each haplotype alignment but not in HiFi/CLR read pbsv-calls for minimap2. All the genome coordinates of such calls are given below. minimap2 hap1 hap2 11:3674982-3705654 8:145091981-145111902 X:154976-173756 - 8:145091981-145111902 - Table S3: Assembly exclusive large SV calls counts. We used Truvari bench to nd large SV calls (>= 20000 bases) only in each assembly haplotype alignment but not in HiFi/CLR read pbsv-calls for lra and minimap2 respectively. lra minimap2 hap1 53 3 hap2 60 1 87 Table S4: lra assembly exclusive large SV calls counts. We used Truvari bench to nd large SV calls (>= 20000 bases) only in each haplotype alignment but not in HiFi/CLR read pbsv-calls for lra. All the genome coordinates of such calls are given below. lra hap1 hap2 3:57387036-57435371 12:8558485-8590846 20:54123233-54164549 7:98387266-98418964 11:3292907-3338726 7:142486372-142506795 11:3675340-3706579 21:11097542-11160785 11:18963939-18986081 21:14369638-14407903 11:70801579-70840687 18:65121221-65141721 8:2250148-2266001 15:75549963-75574240 8:6868084-6887194 X:3794818-3833246 8:86780666-86808014 X:114959695-114989535 8:86795475-86819747 X:148906424-148953566 8:145091981-145111902 X:153482307-153520111 15:102292812-102314488 2:19195289-19213771 7:74962806-74990794 2:92269995-92294337 7:98387266-98418978 2:169727087-169744070 7:142486372-142506798 9:73334694-73352687 13:57714736-57734449 9:137041193-137067504 19:7515550-7532254 3:57387036-57435372 19:34883088-34900034 3:195232056-195249800 19:37793776-37839249 4:9254709-9278459 19:55345568-55361211 4:49282710-49300465 14:106094430-106114188 4:132678548-132698553 4:49122506-49143435 8:145091981-145111902 22:23852646-23876941 5:21481673-21502808 5:12685942-12706067 5:70068238-70088683 5:21481923-21503055 5:94550118-94566747 5:69511859-69527482 10:27605760-27653051 5:70067784-70108676 10:37467951-37491453 5:94550118-94566748 13:112957399-112975655 16:33293751-33330123 1:2606562-2622757 2:19206009-19224510 1:110224909-110243353 2:92269995-92294314 1:121363555-121385437 2:169727087-169744065 1:143236902-143265608 9:73316730-73334723 1:148012547-148037641 9:135950709-135971104 1:148277867-148319102 Y:6559132-6574907 1:148290499-148319089 Y:22280073-22329633 1:207702938-207721493 X:231384-270899 14:106089337-106108921 X:2649604-2699486 14:106800271-106824194 6:80084902-80112738 19:7019185-7037577 6:161033867-161067185 19:7061694-7080819 12:8574169-8594807 19:7515550-7533066 12:10575957-10591368 19:21755939-21803995 1:16892289-16912661 19:34883088-34900034 1:121397737-121416393 19:36793371-36836323 1:146421568-146451608 19:55345858-55362099 1:207702938-207721493 11:3268153-3313942 1:246982645-247000283 11:18963939-18986084 10:27605760-27653054 11:60973285-60992117 10:45837663-45853704 11:70801579-70840683 21:11097544-11113501 11:87688378-87734401 21:14369638-14407887 20:54123233-54164712 21:15258940-15288279 22:23852646-23876916 18:65121221-65141610 6:58142577-58165006 - 6:80084902-80112738 - 6:157732494-157765464 - 6:161064125-161108458 - 16:32297406-32325807 16:32525817-32551589 - 16:33237891-33254312 - 16:33248007-33293676 88 Figure S3: Example of insertion exclusively appear in lra but not minimap2 alignment. The top track shows the GIAB HG002 curated SV set. The next 2 tracks are lra alignment of HG002 hiasm assembly two haplotypes and bottom two tracks are minimap2 alignment. There is a clear insertion of 48336 bases in lra alignment, which also appears in the GIAB HG002 curated SV set. minimap2 alignment shows a inversion instead at the same location. This large insertion is missing in the HiFi and CLR read alignment. 89 Figure S4: Example of insertion exclusively appear in lra but not minimap2 alignment. The top track shows the GIAB HG002 curated SV set. The next 2 tracks are lra alignment of HG002 hiasm assembly two haplotypes and bottom two tracks are minimap2 alignment. There is a clear insertion of 48789 bases in lra alignment, which also appears in the GIAB HG002 curated SV set. minimap2 alignment splits the contig at the same location. This large deletion is missing in the HiFi and CLR read alignment. 5.1.6 Analysisofvariantcalls Table S5: Shared variant calls. Considering HiFi/CLR + pbsv callsets and ONT + Snies callsets, the size of the intersection of every two callsets are shown. The intersection is calculated using Truvari bench on two called vcfs. HiFi CLR ONT lra pbmm2 ngmlr lra pbmm2 ngmlr lra pbmm2 ngmlr lra 22159 20469 17776 22319 19868 18162 23170 15375 14092 pbmm2 20469 21430 17826 19869 21748 18344 15371 21338 14104 ngmlr 17777 17827 18779 18165 18344 19981 14091 14107 18598 90 Table S6: Calls unique to a callset in tandem repeats and segmental duplications. Entry with pbmm2 as row and lra as column means the number of unique calls in pbmm2 callset when comparing pbmm2 callset and lra callset. HiFi CLR ONT lra pbmm2 ngmlr lra pbmm2 ngmlr lra pbmm2 ngmlr lra 0 948 988 0 1811 1778 0 5899 4477 pbmm2 1677 0 938 2384 0 1610 7743 0 4469 ngmlr 4368 3588 0 4113 3376 0 9060 7203 0 In tandem repeats lra 0 344 414 0 799 809 0 2510 2315 pbmm2 483 0 391 899 0 748 2934 0 2228 ngmlr 1251 1090 0 1479 1322 0 3177 2683 0 In segmental duplications lra 0 385 155 0 509 302 0 1083 367 pbmm2 944 0 150 912 0 318 1960 0 403 ngmlr 1398 833 0 1291 903 0 2440 1600 0 91 Table S7: Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with simulated SVs: indels (insertions and deletions), inversions. All SVs were simulated by SUVIVOR. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2, which is distributed with lra source. We simulated 195 Indels (insertions and deletions) of lengths between 50-10000 bases, 97 inversions of lengths between 600-2000 bases. HiFi INDEL INV pbsv cuteSv snies . pbsv cuteSv snies aligner lra pbmm2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra pbmm2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP base 195 191 191 195 194 193 171 188 185 90 90 89 97 96 96 97 95 96 TP call 195 191 191 195 194 193 171 188 185 90 90 89 195 182 183 97 95 96 FP 0 4 0 0 2 6 6 1 1 0 0 2 1 5 3 3 2 4 FN 0 4 4 0 1 2 24 7 10 7 7 8 0 1 1 0 2 1 precision 1 0.979 1 1 0.990 0.970 0.966 0.995 0.995 1 1 0.978 0.995 0.973 0.984 0.970 0.979 0.960 recall 1 0.979 0.979 1 0.995 0.990 0.877 0.964 0.949 0.928 0.928 0.917 1 0.990 0.990 1 0.980 0.990 F1 score 1 0.979. 0.990 1 0.992 0.980 0.919 0.980 0.971 0.963 0.963 0.946 0.997 0.981 0.987 0.985 0.979 0.975 CLR INDEL INV pbsv cuteSV snies pbsv cuteSV snies aligner lra pbmm2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra pbmm2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP base 195 190 193 195 195 193 154 179 181 90 90 87 96 97 95 96 97 94 TP call 195 190 193 195 195 193 155 179 181 90 90 87 195 195 189 99 97 94 FP 0 0 3 0 3 2 3 8 5 0 0 3 2 0 3 74 2 2 FN 0 5 2 0 0 2 41 9 14 7 7 10 1 0 2 2 0 3 precision 1 1 0.985 1 0.985 0.990 0.981 0.957 0.973 1 1 0.967 0.990 1 0.994 0.572 0.980 0.979 recall 1 0.974 0.990 1 1 0.990 0.790 0.918 0.928 0.928 0.928 0.897 0.990 1 0.948 0.990 1 0.969 F1 score 1 0.987 0.987 1 0.992 0.990 0.875 0.937 0.950 0.963 0.963 0.930 0.990 1 0.971 0.725 0.990 0.974 ONT INDEL INV cuteSV snies cuteSV snies lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP base 188 191 189 140 143 155 97 97 96 80 93 95 TP call 188 191 189 140 143 155 193 193 192 83 93 95 FP 3 2 4 3 6 17 2 2 3 35 1 10 FN 7 4 6 55 52 40 0 0 1 17 4 2 precision 0.984 0.990 0.979 0.979 0.959 0.901 0.990 0.990 0.985 0.703 0.989 0.9047 recall 0.964 0.979 0.969 0.718 0.733 0.795 1 1 0.990 0.825 0.959 0.979 F1 score 0.974 0.985 0.974 0.828 0.831 0.845 0.995 0.995 0.987 0.759 0.974 0.940 92 Table S8: Comparison of the Truvari result between all combinations of aligners and SV callers on simulated HiFi, CLR and ONT dataset with complicated nested SVs: deletion-inversion-deletions (INVDEL), inverted-duplications (INVDUP), which were simulated by SUVIVOR. HiFi and CLR reads were simulated by PBSIM and ONT reads were simulated by alchemy2, which is distributed with lra source. We simulated 100 deletion-inversion-deletions and inverted-duplications of lengths between 600-1000 bases respectively. Inversion-deletion is a type of nested SV where an inversion is anked by 2 deletions and inversion-duplication means the duplicated sequence is inverted. For deletion-inversion-deletions, the inversion and two deletions have all been found in order to be counted as a TP. For cases where the inversion and only one anked deletion are found were counted as Partial. For inverted-duplications, both the inversion and duplica- tion need to be found in order to be counted as a TP. We found that minimap2 alignment nd inverted-duplications as insertions, therefore, we didn’t count that as TP. INVDEL INVDUP HiFi CLR ONT HiFi CLR ONT aligner lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP 98 93 93 1 94 74 37 1 2 94 0 98 99 0 97 100 58 98 Partial 2 5 6 79 6 21 62 97 97 - - - - - - - - - FN 0 2 1 20 0 5 1 2 1 6 100 2 1 100 3 0 42 2 93 Table S9: Comparison of the breakpoints on real datasets. The breakpoints accuracy analysis was conducted by comparing the boundaries of true positive SVs from Truvari result to the curated SVs’ breakpoints for each aligner/caller combination. Breakpoints accuracy is measured by the percentage of SVs with perfect breakpoint boundaries and the average shifting distance between the left-most coordinate of SV boundaries. pbsv snies HiFi CLR HiFi CLR ONT lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr SVs% with zero breakpoint shifting 64 64 65 59 58 60 38 68 6 33 55 7 31 57 6 Average breakpoint shifting distance(bp) 45.6 42.44 38.07 47.38 47.00 35.00 52.82 38.42 56.98 59.68 40.49 40.38 60.93 41.25 52.94 Total TP 9466 9408 8433 9438 9414 9076 8924 8700 8015 8969 8082 6533 9085 8887 8471 94 Table S10: Comparison of the breakpoints on simulated SVs: indels, inversions, deletion- inversion-deletions, inverted-duplications. The breakpoints accuracy analysis was conducted by comparing the boundaries of true positive SVs to the bondaries of ground truth SV for each aligner/caller combination. Breakpoints accuracy is measured by the percentage of SVs with per- fect breakpoint boundaries and the average shifting distance between the left-most coordinate of SV boundaries. INDEL HiFi CLR ONT aligner lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr total TP 171 188 185 154 179 181 140 143 155 SV% of zero shifting distance of breakpoint 84.2 81.9 51.9 72.1 72.1 42.0 75.0 74.8 48.4 average shing distance of breakpoint 0.25 0.25 0.78 0.37 0.37 1.10 1.72 0.90 0.93 INV HiFi CLR ONT aligner lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr total TP 97 95 96 95 97 94 95 93 95 SV% of zero shifting distance of breakpoint 5.2 6.3 6.3 8.4 33.0 26.6 27.4 20.4 27.4 average shing distance of breakpoint 4.40 1.19 1.15 7.97 0.80 0.84 3.02 0.97 0.92 INVDEL HiFi CLR ONT aligner lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr total TP 298 287 289 161 294 264 235 197 200 SV% of zero shifting distance of breakpoint 7.8 15.4 19.9 1.1 20.3 19.4 11.3 15.8 9.1 average shing distance of breakpoint 279.11 261.16 284.40 302.18 278.75 238.12 272.72 238.48 285.09 INVDUP HiFi CLR ONT aligner lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr total TP 189 100 198 176 100 195 200 155 198 SV% of zero shifting distance of breakpoint 12.7 78.0 27.8 33.5 79.0 30.3 19.0 57.4 26.3 average shing distance of breakpoint 4.38 0.27 1.37 3.04 0.32 0.82 1.61 0.48 0.90 95 Table S11: Truvari classication of cuteSV variant calls. Truvari comparisons between lra, min- imap2 and ngmlr using the Genome in a Bottle benchmark SV set. Optimal results in each cat- egory are shown in bold. TP-base means true positive calls in the benchmark SV curation set, while TP-call means true positive calls in the SV set from each aligner. False positive means the number of non-matching calls from the SV set from each aligner. False negative means the number of non-matching calls from the SV curation set. cuteSV HiFi CLR ONT lra minimap2 ngmlr lra minimap2 ngmlr lra minimap2 ngmlr TP base 9402 9370 9190 4103 4139 8980 9303 9314 9256 TP call 9386 9360 9182 4100 4158 8980 9293 9344 9249 FP 456 562 1448 243 217 1101 774 1003 1282 FN 239 271 451 5538 5502 661 338 327 385 precision 0.954 0.943 0.863 0.944 0.95 0.891 0.923 0.903 0.878 recall 0.975 0.972 0.953 0.426 0.4293 0.931 0.965 0.966 0.96 F1 score 0.964 0.957 0.906 0.587 0.5914 0.91 0.944 0.934 0.917 96 5.2 SupplementarymaterialsforChapter3 5.2.1 Softwareavailability vamos is available on GitHub github.com/chaissonlab/vamos. 5.2.2 Dataavailability • 64 haplotype-resolved assemblies produced by the Human Genome Structural Variation Consortium [27] can be downloaded fromhttp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_ collections/HGSVC2/release/v1.0/assemblies/. • The haplotpye-resolved assemblies from the Human Pangenome Reference Consortium may be downloaded from https://github.com/human-pangenomics/HPP_Year1_Assemblies/ blob/main/assembly_index/Year1_assemblies_v2_genbank.index • The NA24385 data is available from NCBI project PRJNA586863. • Motif set les and combined vcf of VNTR annotations are available through zenodo at zenodo.org/record/7884547. 5.2.3 Theoremofecientmotifsetselectionproblem Theorem1. If an ecient motif set e exists, which minimizes the sum ofjj e jj and total motifsreplacementcost P p i=1 o i P p j=1 x ij ij andsatisesthethreerequirements,thenthereexists e v j 2 e such that P j div(v j ;e v j ) P p i=1 o i P p j=1 x ij ij ;v j 2V. 97 Proof. Assume an ecient motif set e exists. Each VNTR sequencev j can be represented as a sequence of original motifsm 1 j m l j , eachm i j . By substituting eachm i j with its counterpart ecient motif f m i j , we can get e v j = f m 1 j f m l j . For each v j and e v j , it is clear thatdiv(v j ;e v j ) P i div(m i j ; f m i j ), otherwise the combination of edit operations from individ- ual motif alignments would infer a more parsimonious edit distance for v j and e v j . Therefore, P j=1 div(v j ;e v j ) P i div(m i j ; f m i j ) = P p i=1 o i P p j=1 x ij ij < . Theorem2. The ecient motif set selection problem is NP-hard. Proof. To prove the ecient motif set selection problem is NP-hard, we can prove a special case of the problem is NP-hard. When is set to1, the minimization objective turns into simply minimizing the size of the ecient motif set. When all motif countso i are the same, the second requirement in 3.2.2 will always be satised. The decision version of simplied ecient motif set selection problem can be formulated in the following way: given a k, can we nd a subset e withjj e jj k, such that the total replacing cost P p i=1 o i x ij ij < . The decision version of the set cover problem can be formulated as the following: given ak, a set of elementsfe 1 ;e 2 ;:::;e p g and a collection ofp subsetsfS 1 ;S 2 ;:::;S p g of elements, whose union equals the universe, can we nd a sub-collection ofp sets whose union equals the universe and the size of the sub-collection is less than or equal tok. To prove the optimization version of the problem is NP-hard, we can prove the decision ver- sion of simplied ecient motif selection problem is NP-complete. To prove the NP-completeness of the decision version, rst, we show the problem belongs to NP . 98 Suppose we have a solution of variablesfx ij g i;j=1;:::;p , we can check ifjj e jjk and the total cost P p i=1 o i P p j=1 ij x ij in polynomial time. Second, we prove the problem is NP-complete by reducing the decision version of the special set cover problem with equal size of elements and sets to it. Let a set cover problem instance be a set ofp elementsfe 1 ;e 2 ;:::;e p g and a collection ofp sets of elementsfS 1 ;S 2 ;:::;S p g whose union equals the universe of elements. For any set cover problem instance, we can create an instance of ecient motifs selection problem: dene a set of p motifs asfm 1 ;m 2 ;:::;m p g and associated counts o i = 1 for8i = 1;:::;p. Set = 0:5 and ij = 1(e i 62S j ). Supposefx ij g i;j=1;:::;p is a solution of the constructed instance of the ecient motifs selection problem. We can prove a sub-collection of sets which is dened asfS t g t2Sub , Sub = fj 2 f1;:::;pgj1( P p i=1 x ij 1)g is a valid solution of the set cover instance. • Sincejj e jj = P p j=1 1( P p i=1 x ij 1)k, thusjjSubjjk. • Since each motifm i must have a replacement, there exists somej such thatx ij = 1. Since P p i=1 P p j=1 ij x ij = 0:5 and ij = 1(e i 62S j ), thenx ij = 1 =) ij = 0 =)e i 2S j . Additionally by the denition ofSub, suchS j 2fS t g t2Sub . Therefore, for eache i , there existsS j 2fS t g t2Sub such thate i 2S j . Next, suppose a sub-collectionfS t g t2Sub ;Subf1;:::;pg is a solution of the set cover prob- lem instance. We denefx ij g i;j=1;:::;p (variables indicating if motifm i is replaced bym j ) as fol- lows: • Ifj = 2Sub,x ij = 0. 99 • Ifj2Sub ande i = 2S j ,x ij = 0. • Ifj2Sub,e i 2S j , andS j is the only set infS t g t2Sub that coverse i , thenx ij = 1. • Ifj2 Sub,e i 2 S j , and there are more than one set infS t g t2Sub that coverse i , suppose S j min has the minimum index among these sets, set x ij min = 1; for any other S j T , set x ij T = 0. We next prove the denedfx ij g i;j=1;:::;p is a valid solution of the constructed instance of the ecient motif selection problem. •jj e jj = P p j=1 1( P p i=1 x ij 1)jjSubjjk. • For eache i , there must exist somej such thate i 2 S j . So there must exist somej T (not necessarilyj) such thatx ij T = 1. And according to the denition, there is only onej T such thatx ij T = 1. • For everyi;j, sincex ij = 1 =)e i 2S j =) ij = 0, P p i=1 P p j=1 ij x ij = 0 = 0:5. To sum up, we prove the decision version of simplied ecient motif selection problem is NP-complete, thus the optimization version of the problem is NP-hard. Therefore, ecient motif selection problem is NP-hard. 100 Bibliography [1] Can Alkan, Bradley P Coe, and Evan E Eichler. “Genome structural variation discovery and genotyping”. In: Nature reviews genetics 12.5 (2011), pp. 363–376. [2] Mohammed Alser, Jeremy Rotman, Dhrithi Deshpande, Kodi Taraszka, Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang, Victor Xue, Sergey Knyazev, Benjamin D Singer, et al. “Technology dictates algorithms: recent developments in read alignment”. In: Genome biology 22.1 (2021), p. 249. [3] Shanika L Amarasinghe, Shian Su, Xueyi Dong, Luke Zappia, Matthew E Ritchie, and Quentin Gouil. “Opportunities and challenges in long-read sequencing data analysis”. In: Genome biology 21.1 (2020), pp. 1–16. [4] Peter A Audano, Arvis Sulovari, Tina A Graves-Lindsay, Stuart Cantsilieris, Melanie Sorensen, AnneMarie E Welch, Max L Dougherty, Bradley J Nelson, Ankeeta Shah, Susan K Dutcher, et al. “Characterizing the major structural variant alleles of the human genome”. In: Cell 176.3 (2019), pp. 663–675. [5] A. Auton et al. “A global reference for human genetic variation”. In: Nature 526.7571 (2015), p. 68. [6] Brenda S Baker and Raaele Giancarlo. “Sparse dynamic programming for longest common subsequence from fragments”. In: Journal of algorithms 42.2 (2002), pp. 231–254. [7] Mehrdad Bakhtiari, Jonghun Park, Yuan-Chun Ding, Sharona Shleizer-Burko, Susan L Neuhausen, Bjarni V Halldórsson, Kári Stefánsson, Melissa Gymrek, and Vineet Bafna. “Variable number tandem repeats mediate the expression of proximal genes”. In: Nature communications 12.1 (2021), p. 2075. [8] Mehrdad Bakhtiari, Sharona Shleizer-Burko, Melissa Gymrek, Vikas Bansal, and Vineet Bafna. “Targeted genotyping of variable number tandem repeats with adVNTR”. In: Genome research 28.11 (2018), pp. 1709–1719. 101 [9] Gary Benson. “Tandem repeats nder: a program to analyze DNA sequences”. In: Nucleic acids research 27.2 (1999), pp. 573–580. [10] Doruk Beyter, Helga Ingimundardottir, Asmundur Oddsson, Hannes P Eggertsson, Eythor Bjornsson, Hakon Jonsson, Bjarni A Atlason, Snaedis Kristmundsdottir, Svenja Mehringer, Marteinn T Hardarson, et al. “Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits”. In: Nature Genetics 53.6 (2021), pp. 779–786. [11] David Botstein and Neil Risch. “Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease”. In: Nature genetics 33.3 (2003), pp. 228–237. [12] Michael Brudno, Chuong B Do, Gregory M Cooper, Michael F Kim, Eugene Davydov, Eric D Green, Arend Sidow, Seram Batzoglou, NISC Comparative Sequencing Program, et al. “LAGAN and Multi-LAGAN: ecient tools for large-scale multiple alignment of genomic DNA”. In: Genome research 13.4 (2003), pp. 721–731. [13] Mark J Chaisson and Glenn Tesler. “Mapping single molecule sequencing reads using basic local alignment with successive renement (BLASR): application and theory”. In: BMC bioinformatics 13.1 (2012), p. 238. [14] Mark JP Chaisson, John Huddleston, Megan Y Dennis, Peter H Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, et al. “Resolving the complexity of the human genome using single-molecule sequencing”. In: Nature 517.7536 (2015), pp. 608–611. [15] Mark JP Chaisson, Ashley D Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar L Rodriguez, Li Guo, Ryan L Collins, et al. “Multi-platform discovery of haplotype-resolved structural variation in human genomes”. In: Nature communications 10.1 (2019), pp. 1–16. [16] Yu Chen, Amy Y Wang, Courtney A Barkley, Yixin Zhang, Xinyang Zhao, Min Gao, Mick D Edmonds, and Zechen Chong. “Deciphering the exact breakpoints of structural variations using long sequencing reads with DeBreak”. In: Nature Communications 14.1 (2023), p. 283. [17] Haoyu Cheng, Gregory T Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. “Haplotype-resolved de novo assembly using phased assembly graphs with hiasm”. In: Nature Methods 18.2 (2021), pp. 170–175. [18] Colby Chiang, Alexandra J Scott, Joe R Davis, Emily K Tsang, Xin Li, Yungil Kim, Tarik Hadzic, Farhan N Damani, Liron Ganel, Stephen B Montgomery, et al. “The impact of structural variation on human gene expression”. In: Nature genetics 49.5 (2017), pp. 692–699. 102 [19] Donald F Conrad, Dalila Pinto, Richard Redon, Lars Feuk, Omer Gokcumen, Yujun Zhang, Jan Aerts, T Daniel Andrews, Chris Barnes, Peter Campbell, et al. “Origins and functional impact of copy number variation in the human genome”. In: Nature 464.7289 (2010), pp. 704–712. [20] 1000 Genomes Project Consortium et al. “An integrated map of genetic variation from 1,092 human genomes”. In: Nature 491.7422 (2012), p. 56. [21] Meredith M Course, Kathryn Gudsnuk, Samuel N Smukowski, Kosuke Winston, Nitin Desai, Jay P Ross, Arvis Sulovari, Cynthia V Bourassa, Dan Spiegelman, Julien Couthouis, et al. “Evolution of a human-specic tandem repeat associated with ALS”. In: The American Journal of Human Genetics 107.3 (2020), pp. 445–460. [22] Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks, Mark A DePristo, Robert E Handsaker, Gerton Lunter, Gabor T Marth, Stephen T Sherry, et al. “The variant call format and VCFtools”. In: Bioinformatics 27.15 (2011), pp. 2156–2158. [23] Harriet Dashnow, Brent S Pedersen, Laurel Hiatt, Joe Brown, Sarah J Beecroft, Gianina Ravenscroft, Amy J LaCroix, Phillipa Lamont, Richard H Roxburgh, Miriam J Rodrigues, et al. “STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci”. In: Genome Biology 23.1 (2022), pp. 1–20. [24] Arne De Roeck, Lena Duchateau, Jasper Van Dongen, Rita Cacace, Maria Bjerke, Tobi Van den Bossche, Patrick Cras, Rik Vandenberghe, Peter P De Deyn, Sebastiaan Engelborghs, et al. “An intronic VNTR aects splicing of ABCA7 and increases risk of Alzheimer’s disease”. In: Acta neuropathologica 135 (2018), pp. 827–837. [25] Egor Dolzhenko, Viraj Deshpande, Felix Schlesinger, Peter Krusche, Roman Petrovski, Sai Chen, Dorothea Emig-Agius, Andrew Gross, Giuseppe Narzisi, Brett Bowman, et al. “ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions”. In: Bioinformatics 35.22 (2019), pp. 4754–4756. [26] Tatiana Dvorkina, Andrey V Bzikadze, and Pavel A Pevzner. “The string decomposition problem and its applications to centromere analysis and assembly”. In: Bioinformatics 36.Supplement_1 (2020), pp. i93–i101. [27] Peter Ebert, Peter A Audano, Qihui Zhu, Bernardo Rodriguez-Martin, David Porubsky, Marc Jan Bonder, Arvis Sulovari, Jana Ebler, Weichen Zhou, Rebecca Serra Mari, et al. “Haplotype-resolved diverse human genomes and integrated analysis of structural variation”. In: Science 372.6537 (2021), eabf7117. [28] Peter Edge, Vineet Bafna, and Vikas Bansal. “HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies”. In: Genome research 27.5 (2017), pp. 801–812. 103 [29] Evan E Eichler, Jonathan Flint, Greg Gibson, Augustine Kong, Suzanne M Leal, Jason H Moore, and Joseph H Nadeau. “Missing heritability and strategies for nding the underlying causes of complex disease”. In: Nature Reviews Genetics 11.6 (2010), pp. 446–450. [30] Tesfai Emahazion, Lars Feuk, Magnus Jobs, Sarah L Sawyer, David Fredman, David St Clair, Jonathan A Prince, and Anthony J Brookes. “SNP association studies in Alzheimer’s disease highlight problems for complex disease analysis”. In: TRENDS in Genetics 17.7 (2001), pp. 407–413. [31] David Eppstein, Zvi Galil, Raaele Giancarlo, and Giuseppe F Italiano. “Sparse dynamic programming I: Linear Cost Functions”. In: Journal of the ACM (JACM) 39.3 (1992), pp. 519–545. [32] David Eppstein, Zvi Galil, Raaele Giancarlo, and Giuseppe F Italiano. “Sparse dynamic programming II: convex and concave cost functions”. In: Journal of the ACM (JACM) 39.3 (1992), pp. 546–567. [33] Walter M Fitch and Temple F Smith. “Optimal sequence alignments”. In: Proceedings of the National Academy of Sciences 80.5 (1983), pp. 1382–1386. [34] Kelly A Frazer, Sarah S Murray, Nicholas J Schork, and Eric J Topol. “Human genetic variation and its contribution to complex traits”. In: Nature Reviews Genetics 10.4 (2009), pp. 241–251. [35] Zvi Galil and Raaele Giancarlo. “Speeding up dynamic programming with applications to molecular biology”. In: Theoretical computer science 64.1 (1989), pp. 107–118. [36] Yan Gao, Yongzhuang Liu, Yanmei Ma, Bo Liu, Yadong Wang, and Yi Xing. “abPOA: an SIMD-based C library for fast partial order alignment using adaptive band”. In: Bioinformatics 37.15 (2021), pp. 2209–2211. [37] Paras Garg, Alejandro Martin-Trujillo, Oscar L Rodriguez, Scott J Gies, Elina Hadelia, Bharati Jadhav, Miten Jain, Benedict Paten, and Andrew J Sharp. “Pervasive cis eects of variation in copy number of large tandem repeats on local DNA methylation and gene expression”. In: The American Journal of Human Genetics 108.5 (2021), pp. 809–824. [38] Rita Gemayel, Marcelo D Vinces, Matthieu Legendre, and Kevin J Verstrepen. “Variable tandem repeats accelerate evolution of coding and regulatory sequences”. In: Annual review of genetics 44 (2010), pp. 445–477. [39] “Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls”. In: Nature 464.7289 (2010), pp. 713–720. [40] Dan Guseld. “Algorithms on stings, trees, and sequences: Computer science and computational biology”. In: Acm Sigact News 28.4 (1997), pp. 41–60. 104 [41] Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M Phillippy. “A fast approximate algorithm for mapping long reads to large reference databases”. In: International Conference on Research in Computational Molecular Biology. Springer. 2017, pp. 66–81. [42] Chirag Jain, Sergey Koren, Alexander Dilthey, Adam M Phillippy, and Srinivas Aluru. “A fast adaptive algorithm for computing whole-genome homology maps”. In: Bioinformatics 34.17 (2018), pp. i748–i756. [43] Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian P Walenz, Sergey Koren, and Adam M Phillippy. “Weighted minimizer sampling improves long read mapping”. In: Bioinformatics 36.Supplement_1 (2020), pp. i111–i118. [44] Justin Jee, Aviram Rasouly, Ilya Shamovsky, Yonatan Akivis, Susan R Steinman, Bud Mishra, and Evgeny Nudler. “Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing”. In: Nature 534.7609 (2016), pp. 693–696. [45] Tao Jiang, Yongzhuang Liu, Yue Jiang, Junyi Li, Yan Gao, Zhe Cui, Yadong Liu, Bo Liu, and Yadong Wang. “Long-read-based human genomic structural variation detection with cuteSV”. In: Genome biology 21.1 (2020), pp. 1–24. [46] W James Kent, Robert Baertsch, Angie Hinrichs, Webb Miller, and David Haussler. “Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes”. In: Proceedings of the National Academy of Sciences 100.20 (2003), pp. 11484–11489. [47] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. “The human genome browser at UCSC”. In: Genome research 12.6 (2002), pp. 996–1006. [48] Jerey M Kidd, Tina Graves, Tera L Newman, Robert Fulton, Hillary S Hayden, Maika Malig, Joelle Kallicki, Rajinder Kaul, Richard K Wilson, and Evan E Eichler. “A human genome structural variation sequencing resource reveals insights into mutational mechanisms”. In: Cell 143.5 (2010), pp. 837–847. [49] Andrew Kirby, Andreas Gnirke, David B Jae, Veronika Barešová, Nathalie Pochet, Brendan Blumenstiel, Chun Ye, Daniel Aird, Christine Stevens, James T Robinson, et al. “Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing”. In: Nature genetics 45.3 (2013), pp. 299–303. [50] Melanie Kirsche, Gautam Prabhu, Rachel Sherman, Bohan Ni, Sergey Aganezov, and Michael C Schatz. “Jasmine: Population-scale structural variant comparison and analysis”. In: BioRxiv (2021), pp. 2021–05. 105 [51] Melanie Kirsche, Gautam Prabhu, Rachel Sherman, Bohan Ni, Alexis Battle, Sergey Aganezov, and Michael C Schatz. “Jasmine and Iris: population-scale structural variant comparison and analysis”. In: Nature Methods (2023), pp. 1–10. [52] Mikhail Kolmogorov, Jerey Yuan, Yu Lin, and Pavel A Pevzner. “Assembly of long, error-prone reads using repeat graphs”. In:Naturebiotechnology 37.5 (2019), pp. 540–546. [53] Jan O Korbel, Alexander Eckehart Urban, Jason P Aourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M Kim, Dean Palejev, Nicholas J Carriero, Lei Du, et al. “Paired-end mapping reveals extensive structural variation in the human genome”. In: Science 318.5849 (2007), pp. 420–426. [54] S Hong Lee, Teresa R DeCandia, Stephan Ripke, Jian Yang, Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC-SCZ), International Schizophrenia Consortium (ISC), Molecular Genetics of Schizophrenia Collaboration (MGS), Patrick F Sullivan, Michael E Goddard, Matthew C Keller, et al. “Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs”. In: Nature genetics 44.3 (2012), pp. 247–250. [55] Gene Levinson and George A Gutman. “Slipped-strand mispairing: a major mechanism for DNA sequence evolution.” In:Molecularbiologyandevolution 4.3 (1987), pp. 203–221. [56] Heng Li. “Minimap2: pairwise alignment for nucleotide sequences”. In: Bioinformatics 34.18 (2018), pp. 3094–3100. [57] Heng Li, Jonathan M Bloom, Yossi Farjoun, Mark Fleharty, Laura Gauthier, Benjamin Neale, and Daniel MacArthur. “A synthetic-diploid benchmark for accurate variant-calling evaluation”. In: Nature methods 15.8 (2018), pp. 595–597. [58] Heng Li, Xiaowen Feng, and Chong Chu. “The design and construction of reference pangenome graphs with minigraph”. In: Genome biology 21.1 (2020), pp. 1–19. 106 [59] Wen-Wei Liao, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, Julian K. Lucas, Jean Monlong, Haley J. Abel, Silvia Buonaiuto, Xian H. Chang, Haoyu Cheng, Justin Chu, Vincenza Colonna, Jordan M. Eizenga, Xiaowen Feng, Christian Fischer, Robert S. Fulton, Shilpa Garg, Cristian Groza, Andrea Guarracino, William T Harvey, Simon Heumos, Kerstin Howe, Miten Jain, Tsung-Yu Lu, Charles Markello, Fergal J. Martin, Matthew W. Mitchell, Katherine M. Munson, Moses Njagi Mwaniki, Adam M. Novak, Hugh E. Olsen, Trevor Pesout, David Porubsky, Pjotr Prins, Jonas A. Sibbesen, Chad Tomlinson, Flavia Villani, Mitchell R. Vollger, Human Pangenome Reference Consortium, Guillaume Bourque, Mark JP Chaisson, Paul Flicek, Adam M. Phillippy, Justin M. Zook, Evan E. Eichler, David Haussler, Erich D. Jarvis, Karen H. Miga, Ting Wang, Erik Garrison, Tobias Marschall, Ira Hall, Heng Li, and Benedict Paten. “A Draft Human Pangenome Reference”. In: bioRxiv (2022).doi: 10.1101/2022.07.09.499321. [60] Erick W Loomis, John S Eid, Paul Peluso, Jun Yin, Luke Hickey, David Rank, Sarah McCalmon, Randi J Hagerman, Flora Tassone, and Paul J Hagerman. “Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene”. In: Genome research 23.1 (2013), pp. 121–128. [61] Tsung-Yu Lu, Human Genome Structural Variation Consortium Munson Katherine M. 2 Lewis Alexandra P. 2 Zhu Qihui 3 Tallon Luke J. 4 Devine Scott E. 4 Lee Charles 3 5 6 Eichler Evan E. 2 7, and Mark JP Chaisson. “Proling variable-number tandem repeat variation across populations using repeat-pangenome graphs”. In: Nature Communications 12.1 (2021), p. 4250. [62] Tsung-Yu Lu, Paulina N Smaruj, Georey Fudenberg, Nicholas Mancuso, and Mark J Chaisson. “The motif composition of variable-number tandem repeats impacts gene expression”. In: Genome Research (2023), gr–276768. [63] Tsung-Yu Lu, Paulina N Smaruj, Georey Fudenberg, Nicholas Mancuso, and Mark JP Chaisson. “The motif composition of variable-number tandem repeats impacts gene expression”. In: BioRxiv (2022), pp. 2022–03. [64] Dheeraj Malhotra and Jonathan Sebat. “CNVs: harbingers of a rare variant revolution in psychiatric genetics”. In: Cell 148.6 (2012), pp. 1223–1241. [65] Guillaume Marçais, Arthur L Delcher, Adam M Phillippy, Rachel Coston, Steven L Salzberg, and Aleksey Zimin. “MUMmer4: A fast and versatile genome alignment system”. In: PLoS computational biology 14.1 (2018), e1005944. [66] Scott McGinnis and Thomas L Madden. “BLAST: at the core of a powerful and diverse set of sequence analysis tools”. In: Nucleic acids research 32.suppl_2 (2004), W20–W25. 107 [67] Jason D Merker, Aaron M Wenger, Tam Sneddon, Megan Grove, Zachary Zappala, Laure Fresard, Daryl Waggott, Sowmi Utiramerur, Yanli Hou, Kevin S Smith, et al. “Long-read genome sequencing identies causal structural variation in a Mendelian disease”. In: Genetics in Medicine 20.1 (2018), pp. 159–163. [68] Ryan E Mills, Klaudia Walter, Chip Stewart, Robert E Handsaker, Ken Chen, Can Alkan, Alexej Abyzov, Seungtai Chris Yoon, Kai Ye, R Keira Cheetham, et al. “Mapping copy number variation by population-scale genome sequencing”. In: Nature 470.7332 (2011), pp. 59–65. [69] Takeshi Mizuguchi, Takeshi Suzuki, Chihiro Abe, Ayako Umemura, Katsushi Tokunaga, Yosuke Kawai, Minoru Nakamura, Masao Nagasaki, Kengo Kinoshita, Yasunobu Okamura, et al. “A 12-kb structural variation in progressive myoclonic epilepsy was newly identied by long-read whole-genome sequencing”. In: Journal of human genetics 64.5 (2019), pp. 359–368. [70] Derek W Morris, Alana Rodgers, Kevin A McGhee, Siobhan Schwaiger, Paul Scully, John Quinn, David Meagher, John L Waddington, Michael Gill, and Aiden P Corvin. “Conrming RGS4 as a susceptibility gene for schizophrenia”. In: American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 125.1 (2004), pp. 50–53. [71] Nima Mousavi, Sharona Shleizer-Burko, Richard Yanicky, and Melissa Gymrek. “Proling the genome-wide landscape of tandem repeat expansions”. In: Nucleic acids research 47.15 (2019), e90–e90. [72] Ronen E Mukamel, Robert E Handsaker, Maxwell A Sherman, Alison R Barton, Yiming Zheng, Steven A McCarroll, and Po-Ru Loh. “Protein-coding repeat polymorphisms strongly shape diverse human phenotypes”. In: Science 373.6562 (2021), pp. 1499–1505. [73] Saul B Needleman and Christian D Wunsch. “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. In: Journal of molecular biology 48.3 (1970), pp. 443–453. [74] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. “The complete sequence of a human genome”. In: Science 376.6588 (2022), pp. 44–53. [75] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio reads simulator—toward accurate genome assembly”. In: Bioinformatics 29.1 (2013), pp. 119–121. 108 [76] Andy W Pang, Jerey R MacDonald, Dalila Pinto, John Wei, Muhammad A Raq, Donald F Conrad, Hansoo Park, Matthew E Hurles, Charles Lee, J Craig Venter, et al. “Towards a comprehensive structural variation map of an individual human genome”. In: Genome biology 11 (2010), pp. 1–14. [77] Catherine Paquet, Andre Krumel Portella, Spencer Moore, Yu Ma, Alain Dagher, Michael J Meaney, James L Kennedy, Robert D Levitan, Patricia P Silveira, and Laurette Dube. “Dopamine D4 receptor gene polymorphism (DRD4 VNTR) moderates real-world behavioural response to the food retail environment in children”. In: BMC Public Health 21.1 (2021), pp. 1–9. [78] Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo Van Iersel, Leen Stougie, Gunnar W Klau, and Alexander Schönhuth. “WhatsHap: weighted haplotype assembly for future-generation sequencing reads”. In: Journal of Computational Biology 22.6 (2015), pp. 498–509. [79] Matthew Pendleton, Robert Sebra, Andy Wing Chun Pang, Ajay Ummat, Oscar Franzen, Tobias Rausch, Adrian M Stütz, William Stedman, Thomas Anantharaman, Alex Hastie, et al. “Assembly and diploid architecture of an individual human genome via single-molecule technologies”. In: Nature methods 12.8 (2015), pp. 780–786. [80] Laurent Perron and Vincent Furnon. OR-Tools. Version v9.6. Google, Mar. 13, 2023.url: https://developers.google.com/optimization/. [81] Helge Ræder, Stefan Johansson, Pål I Holm, Ingfrid S Haldorsen, Eric Mas, Véronique Sbarra, Ingrid Nermoen, Stig Å Eide, Louise Grevle, Lise Bjørkhaug, et al. “Mutations in the CEL VNTR cause a syndrome of diabetes and pancreatic exocrine dysfunction”. In: Nature genetics 38.1 (2006), pp. 54–62. [82] Mikko Rautiainen and Tobias Marschall. “GraphAligner: rapid and versatile sequence-to-graph alignment”. In: Genome biology 21.1 (2020), p. 253. [83] Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry, T Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, et al. “Global variation in copy number in the human genome”. In: nature 444.7118 (2006), pp. 444–454. [84] Jingwen Ren and Mark JP Chaisson. “lra: A long read aligner for sequences and contigs”. In: PLOS Computational Biology 17.6 (2021), e1009078. [85] Inga Reynisdottir, Gudmar Thorleifsson, Rafn Benediktsson, Gunnar Sigurdsson, Valur Emilsson, Anna Sigurlin Einarsdottir, Eyrun Edda Hjorleifsdottir, Gudbjorg Th Orlygsdottir, Gudrun Thora Bjornsdottir, Jona Saemundsdottir, et al. “Localization of a susceptibility gene for type 2 diabetes to chromosome 5q34–q35. 2”. In: The American Journal of Human Genetics 73.2 (2003), pp. 323–335. 109 [86] Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. “Reducing storage requirements for biological sequence comparison”. In: Bioinformatics 20.18 (2004), pp. 3363–3369. [87] William J Rowell, AM Wenger, A Kolesnikov, P Chang, A Carroll, RJ Hall, and P Peluso. “Comprehensive variant detection in a human genome with highly accurate long reads”. In: EUROPEAN JOURNAL OF HUMAN GENETICS. Vol. 27. NATURE PUBLISHING GROUP MACMILLAN BUILDING, 4 CRINAN ST, LONDON N1 9XW, ENGLAND. 2019, pp. 1723–1723. [88] Alba Sanchis-Juan, Jonathan Stephens, Courtney E French, Nicholas Gleadall, Karyn Mégy, Christopher Penkett, Olga Shamardina, Kathleen Stirrups, Isabelle Delon, Eleanor Dewhurst, et al. “Complex structural variants in Mendelian disorders: identication and breakpoint resolution using short-and long-read genome sequencing”. In: Genome medicine 10 (2018), pp. 1–10. [89] Fritz J Sedlazeck, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt Von Haeseler, and Michael C Schatz. “Accurate detection of complex structural variations using single-molecule sequencing”. In: Nature methods 15.6 (2018), pp. 461–468. [90] Fritz J Sedlazeck, Philipp Rescheneder, and Arndt Von Haeseler. “NextGenMap: fast and accurate read mapping in highly polymorphic genomes”. In: Bioinformatics 29.21 (2013), pp. 2790–2791. [91] Kishwar Shan, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E Olsen, Colleen Bosworth, Joel Armstrong, Kristof Tigyi, Nicholas Maurer, Sergey Koren, et al. “Nanopore sequencing and the Shasta toolkit enable ecient de novo assembly of eleven human genomes”. In: Nature Biotechnology (2020), pp. 1–10. [92] Temple F Smith, Michael S Waterman, et al. “Identication of common molecular subsequences”. In: Journal of molecular biology 147.1 (1981), pp. 195–197. [93] Jun Sone, Satomi Mitsuhashi, Atsushi Fujita, Takeshi Mizuguchi, Kohei Hamanaka, Keiko Mori, Haruki Koike, Akihiro Hashiguchi, Hiroshi Takashima, Hiroshi Sugiyama, et al. “Long-read sequencing identies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease”. In: Nature genetics 51.8 (2019), pp. 1215–1221. [94] Janet HT Song, Craig B Lowe, and David M Kingsley. “Characterization of a human-specic tandem repeat associated with bipolar disorder and schizophrenia”. In: The American Journal of Human Genetics 103.3 (2018), pp. 421–430. [95] Martin Šošić and Mile Šikić. “Edlib: a C/C++ library for fast, exact sequence alignment using edit distance”. In: Bioinformatics 33.9 (2017), pp. 1394–1395. 110 [96] Peter H Sudmant, Tobias Rausch, Eugene J Gardner, Robert E Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, et al. “An integrated map of structural variation in 2,504 human genomes”. In: Nature 526.7571 (2015), pp. 75–81. [97] Arvis Sulovari, Ruiyang Li, Peter A Audano, David Porubsky, Mitchell R Vollger, Glennis A Logsdon, Wesley C Warren, Alex A Pollen, Mark JP Chaisson, Evan E Eichler, et al. “Human-specic tandem repeat expansion and dierential gene expression during primate evolution”. In: Proceedings of the National Academy of Sciences 116.46 (2019), pp. 23243–23253. [98] Brian Teague, Michael S Waterman, Steven Goldstein, Konstantinos Potamousis, Shiguo Zhou, Susan Reslewic, Deepayan Sarkar, Anton Valouev, Christopher Churas, Jerey M Kidd, et al. “High-resolution human genome structure by single-molecule analysis”. In: Proceedings of the National Academy of Sciences 107.24 (2010), pp. 10848–10853. [99] Janniche Torsvik, Stefan Johansson, Anders Johansen, Jakob Ek, Jayne Minton, Helge Ræder, Sian Ellard, Andrew Hattersley, Oluf Pedersen, Torben Hansen, et al. “Mutations in the VNTR of the carboxyl-ester lipase gene (CEL) are a rare cause of monogenic diabetes”. In: Human genetics 127.1 (2010), pp. 55–64. [100] Petros Vaadis, Simon T Bennett, John A Todd, Joseph Nadeau, Rosemarie Grabs, Cynthia G Goodyer, Saman Wickramasinghe, Eleanor Colle, and Constantin Polychronakos. “Insulin expression in human thymus is modulated by INS VNTR alleles at the IDDM2 locus”. In: Nature genetics 15.3 (1997), pp. 289–292. [101] Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. “Five years of GWAS discovery”. In: The American Journal of Human Genetics 90.1 (2012), pp. 7–24. [102] Amy J Vogler, Christine Keys, Yoshimi Nemoto, Rebecca E Colman, Zack Jay, and Paul Keim. “Eect of repeat copy number on variable-number tandem repeat mutations in Escherichia coli O157: H7”. In: Journal of bacteriology 188.12 (2006), pp. 4253–4263. [103] Joachim Weischenfeldt, Orsolya Symmons, Francois Spitz, and Jan O Korbel. “Phenotypic impact of genomic structural variation: insights from and for human disease”. In: Nature Reviews Genetics 14.2 (2013), pp. 125–138. [104] Aaron M Wenger, Paul Peluso, William J Rowell, Pi-Chuan Chang, Richard J Hall, Gregory T Concepcion, Jana Ebler, Arkarachai Fungtammasan, Alexey Kolesnikov, Nathan D Olson, et al. “Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome”. In: Nature biotechnology 37.10 (2019), pp. 1155–1162. 111 [105] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, et al. “Common SNPs explain a large proportion of the heritability for human height”. In: Nature genetics 42.7 (2010), pp. 565–569. [106] Jianzhi Yang and Mark JP Chaisson. “TT-Mars: structural variants assessment based on haplotype-resolved assemblies”. In: Genome Biology 23.1 (2022), p. 110. [107] Justin M Zook, Nancy F Hansen, Nathan D Olson, Lesley Chapman, James C Mullikin, Chunlin Xiao, Stephen Sherry, Sergey Koren, Adam M Phillippy, Paul C Boutros, et al. “A robust benchmark for detection of germline large deletions and insertions”. In: Nature biotechnology (2020), pp. 1–9. [108] Justin M Zook, Jennifer McDaniel, Nathan D Olson, Justin Wagner, Hemang Parikh, Haynes Heaton, Sean A Irvine, Len Trigg, Rebecca Truty, Cory Y McLean, et al. “An open resource for accurately benchmarking small variant and reference calls”. In: Nature biotechnology 37.5 (2019), pp. 561–566. 112
Abstract (if available)
Abstract
Human genomes are rich in genetic variations, and their correlations with phenotypes have long been a topic of interest among researchers. Over that last two decades, Genome-wide association studies (GWAS) have focused primarily on single nucleotide polymorphisms (SNPs) and small indels that co-occur with specific disease or trait phenotypes. Despite the wealth of data generated from GWAS, these markers explain only a fraction of the heritability of complex traits and disorders. Consequently, researchers have shifted their attention to other genetic variations such as structural variations (SVs) and variable number tandem repeats (VNTRs) to address the "missing heritability" problem. However, identifying and characterizing SVs and VNTRs using short-read sequencing (SRS) technology has been challenging, primarily due to limitations in read length and reference bias. Recent advancements in long-read sequencing (LRS) technology and assembly algorithms have facilitated better resolution of SVs and VNTRs. In this thesis, we present algorithms that can effectively resolve large SVs (>=50bp) and VNTRs using LRS and assemblies. The proposed methodology has the potential to significantly improve our understanding of the functional impact of SVs and VNTRs.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Complex mechanisms of cryptic genetic variation
PDF
Application of machine learning methods in genomic data analysis
PDF
Understanding the genetic architecture of complex traits
PDF
Exploring the genetic basis of complex traits
PDF
De novo peptide sequencing and spectral alignment algorithm via tandem mass spectrometry
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Unlocking capacities of genomics datasets through effective computational methods
PDF
Benchmarking of computational tools for ancestry prediction using RNA-seq data
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Exploration of human microbiome through metagenomic analysis and computational algorithms
PDF
Computational algorithms and statistical modelings in human microbiome analyses
PDF
Techniques for de novo sequence assembly: algorithms and experimental results
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
Asset Metadata
Creator
Ren, Jingwen
(author)
Core Title
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2023-05
Publication Date
05/09/2023
Defense Date
05/03/2023
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
alignment,computational algorithm,human genetic variations,OAI-PMH Harvest,structural variations,variable number tandem repeat
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chaisson, Mark (
committee chair
), Mangul, Serghei (
committee member
), Smith, Andrew (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
jingwenr@usc.edu,jingwenren7@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113120650
Unique identifier
UC113120650
Identifier
etd-RenJingwen-11810.pdf (filename)
Legacy Identifier
etd-RenJingwen-11810
Document Type
Dissertation
Format
theses (aat)
Rights
Ren, Jingwen
Internet Media Type
application/pdf
Type
texts
Source
20230511-usctheses-batch-1041
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
alignment
computational algorithm
human genetic variations
structural variations
variable number tandem repeat