Validating Structural Variations: from Traditional Algorithms to Deep
Learning Approaches
by
Jianzhi Yang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
May 2024
Copyright 2024 Jianzhi Yang
To Song, Yang, the love of my life, and my dear parents Yang, Yi and Wang, Xiaoling.
Acknowledgments
I am deeply grateful to my advisor, Dr. Mark Chaisson, for his guidance and patience during
my entire PhD study. His insight, knowledge, instructions, and encouragement have been so
important along my PhD journey. I have learned to be a researcher from the best advisor
in the world.
I would like to thank my qualifying exam and dissertation committee, Dr. Fengzhu Sun,
Dr. Geoffrey Fudenberg, Dr. Liang Chen and Dr. Muhao Chen. Thank you for your
constructive feedback and essential suggestions for the qualifying exam and this thesis.
I would like to thank Dr. Andrew Smith, for all the instructions in classes and lab
meetings. And I want to thank all the professors in the Department of Quantitative and
Computational Biology (QCB) for your teaching and help during the past six years. I am
also thankful to all the QCB staff for your help and support.
I am thankful to my lab mates, Dr. Tsung-Yu Lu, Dr. Jingwen Ren, Robel Dagnew, Keon
Rabbani, Bida Gu, Walfred Ma, Ishaan Thota and Seung Jae Lee, for your collaboration
and help, also for every single lab meeting and lab party!
Finally, my family has provided great support for me. I would like to say a great thanks
to my parents and grandparents for everything. I am also grateful to my wife for her
love and support. Lastly, I want to acknowledge our cat Mars and dog Tuantuan, for
their companionship. Thank you all for your care, support and encouragement along this
unforgettable journey.
Table of Contents
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Background
    1.2 Structural Variation Detection
    1.3 Structural Variation Validation
    1.4 Thesis Outline
Chapter 2: Structural Variants Assessment Based on Haplotype-resolved Assemblies
    2.1 Introduction
    2.2 Methods
        2.2.1 Workflow
        2.2.2 Validation
    2.3 Results
        2.3.1 Validation of calls on GIAB HG002 assembly
        2.3.2 Performance on 10 HGSVC sample genomes
            2.3.2.1 Comparison to VaPoR and dipcall
    2.4 Discussions
Chapter 3: Applications of TT-Mars in large-scale sequencing studies
    3.1 Introduction
    3.2 Validation Results
    3.3 Discussions
Chapter 4: Filtering of Structural Variation ex post facto by Deep Learning Neural Networks
    4.1 Introduction
    4.2 Methods
        4.2.1 Training data sets and feature extraction
        4.2.2 Model Architectures
    4.3 Results
        4.3.1 Training and Validation on HPRC Samples
        4.3.2 Benchmark by TT-Mars
        4.3.3 Evaluation of Structural Variation from global diversity sequencing
    4.4 Discussion
Chapter 5: Conclusions and Discussions
Appendix A: Chapter 2 Supplementary Materials
    A.1 Curation of dipcall+truvari and TT-Mars SV annotation
        A.1.1 Dipcall+Truvari FP and TT-Mars TP results
        A.1.2 Dipcall+Truvari TP and TT-Mars FP results
Bibliography
List of Tables
2.1 Fraction of the genome that is excluded by TT-Mars, including regions that are not covered by one or two haplotypes of the assembly and the centromere.

2.2 TT-Mars results on callsets from four SV discovery algorithms on HG002, including all calls larger than 10 bp produced by each method.

2.3 Comparison of TT-Mars and GIAB benchmarks on HG002. The two methods have the same classification results on more than 96% of the analyzed calls, while TT-Mars analyzed more candidate calls on all four callsets. A length filter (50 bp - 10 Mbp) is applied and only deletions and insertions are included to match the truvari parameter settings.

2.4 TT-Mars validation results of simulated true and false duplications.

2.5 Comparison of TT-Mars and VaPoR. The two methods agree on most calls and can analyze a similar number of calls across short-read callsets. On the long-read callset, TT-Mars evaluates 39,187 additional variants, the majority of which are under 100 bases.

4.1 Devas and Samplot-ML benchmarking results on DELs by TT-Mars.

4.2 The FDR of different types of SVs with and without Devas, as well as recall values (all evaluated by TT-Mars).
List of Figures
2.1 TT-Mars Workflow. a, Assembly contigs are aligned to the reference, and the shorter of two overlapping contigs is trimmed to generate a unique mapping. Regions on the reference that are not covered by contigs are excluded. b, The alignment is used to construct an orthology map at specific intervals (e.g. every 20 bases). c, For an SV called at an interval [s, e] on the reference, TT-Mars searches the orthology map for matches outside the interval that most closely reflect the length of the SV. In this example, the interval [a, b] immediately flanking a deletion SV maps to an interval on the assembly that does not reflect the SV, but a wider search in the orthology map shows that [c, b] spans a deletion in the assembly. d-g illustrate validation details with a deletion example. d, TT-Mars takes the candidate call with w flanking bases on both sides. The interval on the reference is compared with an interval on the assembly (e) before and after the SV operation. In f and g, the deletion operation removes the corresponding sequence on the reference, and the call is validated if the modified reference is more similar to the assembly than the original.

2.2 SV operations of deletions, insertions, inversions and tandem duplications. For each type of SV, an operation is defined as the direct edit of the reference sequence reflecting the specific nature of the corresponding SV. If an SV is true, the reference sequence after the SV operation will be more similar to the assembly contig than the reference sequence before the operation. For example, a deletion operation will delete the corresponding reference sequence, and the result is expected to match the contig sequence, which does not contain the deleted sequence.

2.3 The true positive rate of pbsv from TT-Mars versus the portion of the genome covered by TT-Mars confident regions.

2.4 The distribution of assembly scores across the sample genome HG002, defined by contiguity of sequence alignment. We use PacBio reads to assess the assembly scores. Raw reads (before haplotype partition) are mapped back to the assemblies using lra with the options -CCS/CLR -p s. The genome is divided into 100-base bins. A read that spans a bin with at least 1 kb flanking length on both sides is considered a valid read. We count the number of valid reads as the quality score for each bin. Noisy alignment regions and low-coverage regions are given low quality scores.

2.5 An example of classification of individual deletion calls by TT-Mars on HG002, made by DELLY (a) and LUMPY (b). Each dot represents a candidate call. The classification boundaries determined empirically for human data are shown by the light green region. Panel b highlights different classifications made by TT-Mars and GIAB+truvari. c, The length distribution of SVs in the HG002 callsets by four SV detection algorithms. The distributions of calls that are analyzed or where TT-Mars does not provide an annotation (NA) are shown.

2.6 a, SV metrics for four algorithms on 10 HGSVC genome samples; triangles mark samples using HiFi assemblies. Benchmark results are given as a distribution by TT-Mars. b, The length distribution of results by dipcall+truvari, TT-Mars and VaPoR for pbsv HG00096 calls. The red solid and dashed lines indicate that dipcall+truvari and TT-Mars congruent calls have similar length distributions. The green and blue dashed lines show that VaPoR has more NA and disagreed results with TT-Mars for small SVs (size < 100 bases).

2.7 Count of small SVs for all the SVs and true positive SVs for ten sample genomes.

2.8 Loci of dipcall+truvari and TT-Mars calls from the HG00096 pbsv call set. Green bars are the calls that the two methods agree on; red bars are the calls that the two methods disagree on.

2.9 Scatter plot of TT-Mars scores on the pbsv HG002 callset.

2.10 Length distribution of calls analyzed by both TT-Mars and truvari and analyzed by TT-Mars only on the pbsv HG002 callset.

2.11 The number of calls, TT-Mars analysis results and BAM size over ten sample genomes.

2.12 Comparison of dipcall+truvari and TT-Mars on short-read (a) and long-read (b) callsets of 10 sample genomes. The combinations of TP and FP for dipcall+truvari/TT-Mars annotations are given with a horizontal scatter to distinguish points in individual categories.

2.13 An example where TT-Mars correctly validates an insertion but dipcall+truvari does not, in a tandem repeat region where the SV is split into two smaller insertions on the assembly alignment. The insertion with length 997 bases is on chromosome 4, coordinate 1046348. The first track is the pbsv callset, and the second track is the dipcall callset, followed by alignments used by TT-Mars (two tracks) and dipcall (two tracks). The bottom track is a tandem duplication region.

3.1 Validation of SV call sets using haplotype-resolved assemblies for deletion (DEL) and duplication (DUP) calls in the sample genomes NA12878 and NA19238. A: Evaluation by TT-Mars. B: Validations from both the TT-Mars and Truvari methods. C: The size distribution of all calls (red), the count validated by TT-Mars (green), and the count validated by Truvari (blue) for the combination of both genomes. [56]

3.2 TT-Mars evaluation of Hapdup and Sniffles2 calls. The calls were either validated by the alignment (green), not validated (orange), or could not be annotated by TT-Mars (blue). [74]

4.1 Overview of Devas. Training data preparation, model structures and available sequencing datasets.

4.2 Training performance on 40 HPRC samples and benchmarking results by TT-Mars on 7 testing samples. a: A breakdown of the number of positive and negative samples by SV type. Within the positive samples, there are 24,336 deletions (DEL), 4,349 duplications (DUP), 1,357 inversions (INV) and 18,498 insertions (INS). Within the negative samples, there are 7,580 deletions (DEL), 6,663 duplications (DUP), 3,312 inversions (INV) and 6,811 insertions (INS). The violin plot represents the distribution of the number of SVs across the 40 samples. b: Size distribution of training data. Length distributions of all the types of SVs are as expected, and the small peak in the INS length distribution represents mobile elements. The distribution of positive and negative samples does not have an obvious pattern. c: Validation results of 7 unseen randomly selected samples, benchmarked by TT-Mars. Overall Devas has an F1 score of 89.86% and accuracy of 83.73%. DEL and INS validation results are slightly better than those for DUPs and INVs.

4.3 Validation results of 7 unseen randomly selected samples, benchmarked by TT-Mars, across size categories for each type of SV. All four types keep stable results across metrics including ACC (accuracy), PRE (precision), REC (recall), TNR (true negative rate) and F1 (F1 score). Only large INVs show a minor decrease in the metrics; medium-size INVs have the highest F1 score of 84.29% while large INVs have the lowest F1 score of 77.86% among the three groups.

4.4 Genomic loci distribution of positive (a) and negative (b) training samples on selected regions of randomly chosen chromosome 1 and chromosome 7. Chromosome 1 region: 10,000,000-100,000,000. Chromosome 7 region: 8,000,000-50,000,000. Each region covers a significant part of the underlying chromosome. There is no obvious pattern in the loci of positive or negative samples.

4.5 Validation results by Devas of samples with allele counts 1 to 10 from the 1kGP samples. Overall 98,908 out of 112,263 SVs are validated by Devas. All the different types of SVs achieved a high validation rate except for the INVs.

4.6 Validation results of Devas on 1kGP sample SV callsets. a: The total number of SVs and validated SVs for AC 1-10. As the value of AC increases, the total SV count decreases and the validated rate has an increasing trend from 90.56% to 97.29%. b: Validation rate of trio-violation SVs and no-violation SVs. The validation rate of no-violation SVs is noticeably higher. c: Comparison between Devas and TT-Mars for the sample genomes overlapping HPRC assembly samples. Overall the two methods output 86.00% matched results. The percentages of matched results are 92.77%, 89.80%, and 82.41% for DELs, DUPs, and INSs respectively. There are only 4 INVs included in the comparison, 3 matched and 1 mismatched. d: For SVs validated by TT-Mars, the total number of SVs and validated SVs for AC 1-10. The validation rate is higher compared to a, which plots the validation rate for all the SVs. e: Distribution of Gnocchi z-scores for valid and invalid SVs by Devas. Scores for both valid and invalid SVs exhibit distributions similar to a normal distribution. Welch's t-test shows that the mean of valid scores (-0.074) is significantly smaller than the mean of invalid scores (0.012) with P-value 0.00084. f: Loci distribution of valid and invalid SVs on chromosome 1. No obvious patterns are shown.

4.7 Distributions of the validation rate of each SV with allele count from 1 to 10. For each SV, the validation rate is calculated as the number of sample genomes supporting it (validated) divided by the total number of sample genomes for that particular SV. Most SVs have a high validated rate.

4.8 Distributions of Gnocchi z-scores of valid SVs for each allele count from 1 to 10. There is no obvious bias in the plots, and all mean values of the scores in each allele count category are negative.

4.9 Distributions of Gnocchi z-scores of invalid SVs for each allele count from 1 to 10. There is no obvious bias in the plots. The means of AC=5 and AC=8 invalid SVs are -0.16 and -0.19 respectively. This is likely because of the small sample size of each AC category of invalid SVs. Removing these two groups, the remaining invalid Gnocchi scores still have a positive mean value close to 0.42.

4.10 Examples of valid and invalid SVs on 1KG sample SV callsets. The left column shows valid SVs and the right column shows invalid SVs; from top to bottom there are deletions, duplications and insertions respectively.
Abstract
Structural variation has been intensively studied over the past years. However, it is still challenging to detect and characterize structural variation with high accuracy.

To meet the need for benchmarking structural variations, this dissertation addresses the validation of structural variations through traditional algorithms and deep learning natural language processing models. We developed TT-Mars and Devas to assess different types of structural variations from a variety of resources. Based on haplotype-resolved assemblies, TT-Mars provides validation of structural variations by evaluating variant calls based on how well each call reflects the content of the ground-truth assembly. It is able to provide benchmarking results for structural variation detection tools on a number of genomes. Devas is a deep learning model with a Transformer architecture using sparse attention. This sequence-based approach reduces the false discovery rate while ensuring a high detection rate of true structural variations. With short-read sequencing data as the input, Devas can be generalized to callsets of large-scale cohorts and outputs robust results. Both methods have been demonstrated to be reliable on different datasets. TT-Mars has been published and used in multiple studies. The integration of TT-Mars and Devas presents a robust framework that advances structural variation detection and contributes to our understanding of genomic complexities.
Chapter 1
Introduction
1.1 Background
The human genome comprises approximately 3 billion DNA base pairs, containing the genetic
materials passed down from ancestors [1]. There is abundant variation among individuals
spanning a large scale ranging from single nucleotide changes to large and complex variations
[2]. These genetic differences contribute to the rich diversity across human populations,
related to a wide range of traits from physical appearances to diseases [3]. While most
genetic variations are benign, some are important to individual health and are the focus of
extensive research aimed at understanding genetic disorders [4].
Structural variations (SVs) are genomic alterations that involve changes in the structure
of DNA segments ranging from 50 base pairs (bp) to several mega-bp. These variations
include insertion (INS), deletion (DEL), duplication (DUP), inversion (INV), translocation,
and other complex variations [5]. SVs are widely diversified in type and size and the effects
of SVs on the genome are greater than those of any other type of sequence variant [6]. In recent years, large-scale sequencing studies have enabled association of genetic variation, such as common structural variations (SVs) [3] or rare variation of any type [7], with traits and diseases. Studies
have shown that SVs have a considerable role in genetic diversity [8, 9], developmental
disorders [10, 11], and cancer [12]. SVs have also been shown to influence diseases including
autism [13, 14] and Alzheimer’s disease, and can provide the genetic basis for schizophrenia
[12, 15]. SVs play an important role in neurodegenerative disease risk, such as the repeat
expansion in C9orf72 and the tri-nucleotide repeat in ATXN2, both of which are associated
with sporadic amyotrophic lateral sclerosis (ALS) [16]. Compared to variations such as single
nucleotide variants (SNVs), SVs have been historically more difficult to detect, particularly
due to the vast diversity of sizes and breakpoint complexity that SVs span [17]. Various
types of complex SVs lie in uncharacterized parts of the human genome, and they therefore remain a prominent area of research [18].
DNA sequencing technologies have improved steadily since the introduction of the Sanger
sequencing method in the 1970s, which once served as the gold standard for sequencing
genomes [19]. An automated version of the Sanger sequencing method was instrumental in
sequencing the human genome during the Human Genome Project [20].
Emerging in the mid-2000s, Next-Generation Sequencing (NGS) technologies such as
Illumina sequencing have drastically improved genome sequencing by enabling parallel sequencing of millions of DNA fragments [21]. Illumina NGS technology simplifies and accelerates the creation of sequencing libraries by fragmenting DNA and attaching specialized adapters to both ends [22]. Sequencing by synthesis (SBS) is then used to incorporate chemically modified nucleotides so that a fluorescent signal can be detected to identify the underlying base, after which computational tools analyze the data and output accurately predicted nucleotide bases [23]. These technologies reduced the cost and time required for
genome sequencing, making them available for large-scale studies including whole-genome
sequencing and targeted sequencing [22].
In recent years, single-molecule sequencing (SMS) has emerged as an alternative to NGS.
Technologies like Pacific Biosciences (PacBio) and Oxford Nanopore offer SMS that can span
thousands of bases to several million bases [24, 25]. PacBio uses Single Molecule, Real-Time (SMRT) sequencing technology, which attaches primers to ligated hairpin adapters and creates a circular template for a variety of insert sizes. The SMRT chip is loaded with DNA polymerase, and fluorescently labeled nucleotides are incorporated and detected in real time [26]. Oxford Nanopore Technologies (ONT) identifies nucleotides by measuring the ionic current of a DNA molecule as it passes through nanopores on a synthetic membrane, and it can provide even longer reads [27]. Long-read sequencing is particularly beneficial for resolving
structurally complex genomic regions, detecting large structural variations, and spanning
repetitive sequences. Despite their advantages in read length, these technologies initially
struggled with higher error rates and costs per base compared to NGS, although accuracy
has improved significantly with recent advancements [26]. One of the early applications
of SMS has been to generate high quality structural variant callsets [28, 29, 30, 31, 32, 33]
because long reads or their de novo assemblies are more capable of spanning SVs, particularly
in complex and repetitive regions [5, 34].
Genome assembly is a computational process where DNA sequencing data are combined
to reconstruct the original chromosome sequences [35]. Recent advances in single molecule
sequencing and assembly algorithms have enabled more routine generation of high-quality
haplotype-resolved assemblies [33]. More recently, assemblies have been used to generate SV
callsets, leveraging the improved de novo assembly quality gained from advances in algorithms
and technologies [36, 37]. The Human Genome Structural Variation Consortium (HGSVC)
has generated 32 haplotype-resolved assemblies of human genomes at Phred quality scale
over 40 with contig N50 values over 25 Mb [33]. The Human Pangenome Reference Consortium has used a combination of Pacific Biosciences HiFi sequencing reads and the hifiasm
method [38] to generate 47 haplotype-resolved assemblies with base quality approaching
QV50 and assembly N50 over 40 Mb. Algorithms have been developed to better handle repetitive sequences and to phase haplotypes over long stretches of DNA; these advances enable more efficient handling of long reads and complex genomic structures [39]. These high-quality
resources provide opportunities for us to develop robust and reliable analysis tools for SVs.
1.2 Structural Variation Detection
With the development of these technologies, the identification and characterization of SVs
have improved in precision and accuracy. SVs can now be detected through various techniques, including Illumina paired-end and split-read sequencing, and single-molecule
real-time (SMRT) sequencing.
The study of SVs is a rapidly evolving field of research with several ongoing efforts to
catalog and annotate SVs. Detection of SVs using short read sequencing data from Illumina
platforms has been a topic of intense research and development over the past decade. A
number of algorithms have been developed to identify and classify SVs using short read
signatures [6]. One of the earliest algorithms for detecting SVs was BreakDancer [40], which
uses discordant read pairs and split-reads to identify deletions, insertions, inversions, and
translocations. Another popular algorithm is DELLY [41], which uses a combination of split-read and read-pair analysis to identify SVs. Other widely used algorithms include Pindel
[42], LUMPY [43], and Manta [44].
Long-read sequencing technologies, such as PacBio SMRT and ONT, have emerged as
powerful tools for the detection of SVs, as they can span complex repetitive regions and
resolve larger events that are difficult to detect using short-read sequencing [5]. Multiple
SV detection algorithms were developed for long-read sequencing data, including Picky [45],
Sniffles [30], PBHoney [46], smartie-sv [47], and NanoSV [48]. NanoSV is designed to detect various types of SVs from Nanopore or PacBio long-read sequencing [48]. PBHoney and smartie-sv can output accurate insertions and deletions [46, 47]. Picky can produce inversion calls, and Sniffles is able to call all types of SVs using long-read alignments [45]. Long-read
sequencing has shown promising results in improving the accuracy and resolution of SV
detection although these technologies are relatively expensive and not as widely available as
Illumina short-read sequencing [5].
In recent years, deep learning models have been extensively studied and have shown exceptional capabilities in handling large and complex datasets [49]. To leverage their ability to
automatically learn and model complex patterns from data without explicit programming,
deep learning approaches have been increasingly applied in genomics, demonstrating success in tasks such as sequence analysis and variant calling [50]. DeepVariant [51] has been
demonstrated to have a dominant performance in single nucleotide polymorphisms (SNPs)
and small indel calling. It takes pile-up read images as input and utilizes a convolutional
neural network (CNN) architecture to learn features. Cue [52] uses an hourglass CNN structure with generated input images to detect large and complex SVs. For SV detection, deep
learning methods have the potential to overcome the limitations of traditional computational
approaches by efficiently learning from complex patterns in NGS data [50].
The evaluation of short-read SV calling methods has shown that fewer than 1% of SVs are captured by all of four popular detection algorithms [43]. The accuracies of NanoSV and smartie-sv using long-read sequencing are both less than 70% [53]. Although Sniffles2 achieved the highest accuracy when compared with other tools, it could not resolve highly rearranged regions where SVs can overlap one another [54]. Therefore, some tools use multiple algorithms to call SVs, then merge the outputs to increase the precision and/or the recall [55]. Recently, the Trans-Omics for Precision Medicine (TOPMed) consortium
conducted a study of SVs across 138,134 individuals [56], revealing 355,667 SVs on autosomes
and the X chromosome. Their novel SV pipeline includes a multi-caller strategy on Illumina
short-read sequencing data. Vista [57] uses novel combination algorithms to merge the
results of individual SV callers and outputs reliable results, achieving a robust F1 score
compared to other consensus SV detection methods.
1.3 Structural Variation Validation
Although the aforementioned computational methods can detect many types of SV, no single
algorithm can accurately and sensitively detect all types and all sizes of SVs. The performance of these algorithms has been evaluated in multiple benchmarking studies [58], which
have shown that the accuracy of SV calling methods can vary depending on various factors
including the type and the size of SVs, the loci of SVs in the genome and the characteristics
of the sequencing data. The 1000 Genomes Project Consortium identified more than 20,000
unreported SVs from short-read data, with a false-positive rate as high as 89% [46, 59]. The
output SV call sets from different methods are also highly varied [43]. The characteristics of short-read sequencing, including mappability and short read length, limit the ability of variant detection algorithms, particularly in complex regions such as highly repetitive genomic regions.
Despite recent development of single-molecule sequencing (SMS) and its application in
producing high-quality structural variant callsets [33, 60], SV discovery on the scale of
biobanks relies on short-read sequencing (SRS) [61, 62]. To date, multiple large whole-genome sequencing (WGS) studies have been conducted, including the Trans-Omics for Precision Medicine (TOPMed) consortium (53,831 samples) [63], the All of Us Research Program (245,388 samples) [64], and the UK Biobank (UKBB) (150,119 samples) [61]. These large-scale studies provide the power to associate rare structural variants with traits [61], as well
as aggregation analysis of rare or singleton variants to study constraint of variation across
the genome [65, 3]. In contrast to long-read sequencing that can identify variation directly
from aligned reads or their assemblies [33, 30], most SVs are indirectly inferred from short-read data through discordantly aligned paired-end sequences, split-read mapping, and read
depth [41, 40, 66, 67, 43, 68]. Detecting SVs using SRS has posed greater challenges than
indels and single nucleotide variations (SNVs) [69, 5, 70, 6] due to the extensive range of
sizes and complexity of breakpoints associated with SVs compared to the relatively short
length of SRS.
While a number of studies have shown the importance of SVs in the human genome and
the association between SVs and traits, it is still challenging to produce accurate callsets
from whole-genome short-read sequencing data. The limitations and challenges in SV
detection by SRS and SMS data, combined with their importance in clinical sequencing, motivate the need for efficient and precise tools to evaluate SV callset accuracy. Evaluating the
performance of SV calling methods is crucial for determining their accuracy and reliability.
Several studies have been conducted to benchmark SV callers using simulated data and
real data. A common framework for this benchmark is to establish a ground truth as a
consensus between calls from multiple sequencing technologies and SV discovery algorithms,
and to compare new calls to the ground truth. One of the first approaches to generate a gold
standard was the svclassify method that used machine learning to classify SVs as true positive
(TP), false positive (FP) or unclear [71]. Recently the Genome in a Bottle Consortium
(GIAB) made a high-quality benchmark set of large (≥ 50 bp) insertions and deletions using
multiple SV callsets produced by a wide range of analysis methods [72]. It can be used as
a truth set when benchmarking arbitrary SV callsets generated by different combinations of
algorithms and sequencing data [72, 58]. The accompanying computational method, Truvari
[72], is used to compare SV calls based on agreement between the breakpoints in the test and
benchmark calls, with the option to compare sequences of variants. This approach has been
an invaluable standard for benchmarking method accuracy; however, the notion of comparing
callsets may be considered as a proxy comparison against the overarching goal of determining
how well an algorithm estimates the content of a genome with a particular sequence input.
To be considered a true positive, the test variant must be within a specified size and distance
of a ground truth call [72]. This truth set has become the standard set for benchmarking,
however, it is limited to one sample genome and lacks flexibility in repetitive regions of a genome. The tool Truvari requires manual setting of the validation parameters to achieve the best results in different validation cases; therefore, the validation results can vary with the researcher's understanding of the SV call sets and can lack consistency [58]. Moreover, in
repetitive regions there may be multiple placements of breakpoints that have equal support
for a variant, or similar placement of breakpoints that depend on parameters for scoring
alignments [73]. In part due to the breakpoint degeneracy, and difficulty in calling variants
in repetitive regions, the benchmark callset excludes many variants in repetitive regions [72],
although repetitive regions are enriched for SVs [31].
Motivated by the challenges mentioned above, this thesis aims to present two methods of
validating structural variants. First we introduce TT-Mars, structural variants assessment
based on haplotype-resolved assemblies, that utilizes high-quality, haplotype-resolved assemblies to evaluate SVs. Unlike traditional approaches that rely solely on comparing variant
calls, TT-Mars uses these assemblies as a ground truth input, enabling the benchmarking
of SV callers across multiple genomes. This method excels in accurately assessing SVs in
repetitive regions by comparing the sequences implied by an SV call directly to the assemblies. This is particularly advantageous as it avoids the direct comparison of variant calls,
which can be problematic in regions with complex genomic environments. TT-Mars complements existing validation methods such as those using the curated gold-standard callset
from the GIAB Consortium. It extends the possibility of benchmarking any genome with a
sufficiently high-quality haplotype-resolved assembly, enhancing validation capabilities, especially in repetitive areas where explicit breakpoints are ambiguous.
The efficacy of TT-Mars has been documented in its publication in Genome Biology
in 2022 [69] and has been applied in recent studies of structural variations [74, 56]. In
comparisons within high-confidence regions annotated by GIAB, TT-Mars shows consistent results with the Truvari analysis using the GIAB gold-standard callset. Moreover,
it aligns well with two other benchmarking methods: VaPoR [75], a long-read validation
tool, and dipcall+truvari [76], which uses assembly-based variant calls as a standard. Compared to VaPoR, TT-Mars requires less input and shorter runtime to achieve similar results, and it is less dependent on alignment gap parameters than dipcall+truvari. Utilizing
ten different assemblies, we evaluated the distribution of call accuracy for three short-read
SV calling algorithms, LUMPY [43], Wham [68], and DELLY [41], as well as one long-read SV detection algorithm, pbsv [77]. The software for TT-Mars is available at GitHub
(https://github.com/ChaissonLab/TT-Mars.git), which also provides utilities for downloading assemblies and alignment maps.
Since short-read sequencing remains the primary technology for sequencing large populations, many algorithms have been developed to detect SVs using different types of information from short-read sequencing, but there are reports of both high rates of false positives
[5] and low recall rates [30]. A lower FDR is necessary for studies of rare and de novo SVs.
A common approach to lowering FDR for SV discovery has been to integrate multiple SV
callers and produce callsets as the consensus of multiple callers [78, 57, 14]. These methods
have been shown to have an FDR as low as 3% on de novo variation [14]. The FDR has
been established using direct molecular validation (e.g. PCR) [14, 13, 79], and orthogonal
data including microarrays [4], and more recently comparison to long-read callsets [3, 17,
80]. We have demonstrated that high-quality haplotype-resolved assemblies can be used to
measure SV callset accuracy across multiple genomes using the method TT-Mars [69]. The
availability of a large cohort of 47 high-quality, haplotype-resolved, diverse genomes paired with short-read callsets creates the possibility of building a large corpus of calls adjudicated by TT-Mars.
The Transformer model is a popular deep learning architecture with a unique attention
mechanism in its layers [81]. It has revolutionized the field of deep learning, particularly in
the domain of natural language processing (NLP) since its introduction. The Transformer model
is built to capture long-range dependencies in input text or sequence data. It has demonstrated remarkable generalization capability in tasks ranging from language translation to
protein structure prediction [82]. This capability is crucial in understanding the context
and relationships in sequencing data. We adapted the attention mechanism and propose a deep learning method to validate structural variations (Devas), based on the Transformer encoder, to assess different types of SVs from short-read sequencing data. This model
aims to validate and benchmark different SV detection methods on thousands of short-read
sequencing samples.
Devas is trained on the reliable large corpus of adjudicated calls produced by TT-Mars,
and it is capable of producing accurate validations for candidate sets with short-read sequencing data as the input. Devas generalizes the validation ability of TT-Mars to much larger population samples through deep learning. Unlike other deep learning methods, which often treat the structural variation problem as an image understanding task [51, 52], Devas models it as a natural sequence processing task. The attention-based Transformer neural network was adapted to learn and comprehend the feature sequences derived from short-read sequencing data and thereby confirm or reject input SVs. In particular, a sparse attention mechanism [83], which restricts attention to a subset of positions to reduce the required computation time and memory, is used to capture the distant relationships in our long input sequences. Devas was shown to be able to validate DEL, DUP, INV and INS with
comparable results to TT-Mars on testing samples. We demonstrated that Devas achieved
a 3-fold reduction in the FDR of DELs compared to Samplot-ML [84], while retaining over
93% of the true variants. It was utilized to analyze a large callset from the 1KG project
samples, providing insights into rare structural variations.
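To make the model family concrete, the sketch below shows a minimal Transformer-encoder classifier over per-position feature sequences in PyTorch. It is an illustration under assumptions, not the Devas implementation: the class name and dimensions are hypothetical, and standard dense attention stands in for the sparse attention [83] used by Devas, purely to keep the example self-contained.

```python
# Hypothetical sketch of a Transformer-encoder SV classifier (not Devas itself).
# Dense attention is used here; Devas uses sparse attention [83].
import torch
import torch.nn as nn

class SVClassifier(nn.Module):
    def __init__(self, feature_dim=10, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(feature_dim, d_model)   # per-position feature embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)              # reject vs. confirm the SV

    def forward(self, x):                              # x: (batch, seq_len, feature_dim)
        h = self.encoder(self.embed(x))
        return self.head(h.mean(dim=1))                # pool over positions, then classify

model = SVClassifier()
features = torch.randn(4, 1000, 10)   # 4 candidate SVs, 1000 positions, 10 features each
logits = model(features)              # (4, 2) class scores
```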
1.4 Thesis Outline
This thesis is organized into five chapters. The first chapter is this introduction, providing
background information on structural variations and the development of sequencing technologies. It outlines the motivations for the study of robust and reliable SV validation methods
and their application in large-scale studies. The second chapter focuses on the validation
method with haplotype-resolved assemblies as input. In the third chapter, two important
applications of the method are introduced. Next in the fourth chapter, we introduce our
deep learning method, detailing how it extends the capabilities of the previous method to
accommodate a larger scale of data while achieving comparable results. The final chapter
offers a summary and discussion, encapsulating the key findings and implications of the
research.
Chapter 2
Structural Variants Assessment Based
on Haplotype-resolved Assemblies
2.1 Introduction
Large-scale sequencing studies enable association of genetic variation such as common structural variations (SVs) [3], or rare variation of any type [7], with traits and diseases. SVs include
deletions, insertions, duplications, and rearrangements of at least 50 bases that as a class have
a considerable role in genetic diversity [8, 9], developmental disorders [10, 11], and cancer
[12]. Compared to variations such as SNVs, SVs have been historically more difficult to
detect using high-throughput short-read data, particularly due to the vast diversity of sizes
and breakpoint complexity that SVs span [17]. In recent years, single-molecule sequencing
(SMS) has been used to generate high quality structural variant callsets [30, 31, 32, 33] because long reads or their de novo assemblies span SVs, particularly in complex and repetitive
regions [5, 34].
Despite recent performance gains in SMS instrument throughput [85, 86], short-read
sequencing remains the primary technology for sequencing large populations [65, 3]. Many
algorithms have been developed to detect SVs using different types of information from
short-read sequencing [43, 41, 44], but there are reports of both high rates of false positives
[5] and low recall rates [30]. Additionally, detecting complex SVs is still challenging [30] and
callers can produce different SV callsets on the same genome sample [72], which makes it
more difficult to output a complete high-quality SV callset.
The limitations and challenges in SV detection, combined with their importance in clinical sequencing, motivate the need for efficient and precise tools to evaluate SV callset accuracy.
A common framework for this benchmark is to establish a ground truth as a consensus
between calls from multiple sequencing technologies and SV discovery algorithms, and to
compare new calls to the ground truth. One of the first approaches to generate a gold standard was the svclassify method that used machine learning to classify SVs as true positive
(TP), false positive (FP) or unclear [71]. Recently the Genome in a Bottle Consortium
(GIAB) made a high-quality benchmark set of large (≥ 50 bp) insertions and deletions using
multiple SV callsets produced by a wide range of analysis methods [72]. This benchmark
set can be used to evaluate arbitrary SV callsets generated by different combinations of algorithms and sequencing data. The accompanying method, Truvari [72], is used to compare
SV calls based on agreement between the breakpoints in the test and benchmark calls, with
the option to compare sequences of variants. To be considered a true positive, the test variant must be within a specified size and distance of a ground truth call [72]. This approach
has been an invaluable standard for benchmarking method accuracy; however, the notion of
comparing callsets may be considered as a proxy comparison against the overarching goal
of determining how well an algorithm estimates the content of a genome with a particular
sequence input. In repetitive regions there may be multiple placements of breakpoints that
have equal support for a variant, or similar placement of breakpoints that depend on parameters for scoring alignments [73]. In part due to the breakpoint degeneracy, and difficulty
in calling variants in repetitive regions, the benchmark callset excludes many variants in
repetitive regions [72], although repetitive regions are enriched for SVs [31].
As de novo assembly quality has increased with improvements in algorithms and SMS
technologies, assemblies have been used to generate SV callsets [36, 37]. SV calls based on
haplotype-resolved assemblies are factored into the GIAB truth set, and haplotype-resolved
assemblies have been used to create benchmark callsets for structurally divergent regions [36],
however, the difficulties in comparing calls in repetitive regions remain. Here we propose an
alternative approach to validate variant calls by comparing the sequences implied by an SV
call to assemblies rather than comparing variant calls. This complements the validation
method by the curated gold-standard callset from the GIAB such that any genome with a
haplotype-resolved assembly of sufficiently high quality may be included as a benchmark, and
can help validate SVs in repetitive regions because explicit breakpoints are not compared.
While this resource has historically not been available, recent advances in single molecule
sequencing and assembly have enabled more routine generation of high-quality haplotype-resolved assemblies. The Human Genome Structural Variation Consortium (HGSVC) has
generated 32 haplotype-resolved assemblies of human genomes at Phred quality scale over
40 with contig N50 values over 25 Mb [33]. The Human Pangenome Reference Consortium is
using a combination of Pacific Biosciences HiFi sequencing reads and the hifiasm method [38]
to generate haplotype-resolved assemblies with base quality approaching QV50 and assembly
N50 over 40 Mb. We have implemented our approach as a method, TT-Mars (structural
variants assessment based on haplotype-resolved assemblies), which assesses candidate SV
calls by comparing the putative content of a genome given an SV call against a corresponding
haplotype-resolved de novo assembly.
We demonstrate that within the regions annotated as high-confidence by GIAB, TT-Mars
has consistent results with truvari analysis using the GIAB gold-standard callset. TT-Mars
also has consistent results with two other benchmarking methods: VaPoR [75], which is a
long-read validation tool, and dipcall+truvari [76] using assembly-based variant calls as a
gold standard. We demonstrate that compared to VaPoR, TT-Mars requires smaller input
and shorter runtime to achieve comparable results, and validation of calls using TT-Mars is
less dependent on alignment gap parameters compared to dipcall+truvari. Using 10 assemblies, we evaluate the distribution of call accuracy for three different short-read SV calling
algorithms, LUMPY [43], Wham [68], and DELLY [41], as well as one long-read SV detection algorithm, pbsv [77]. The software is available at https://github.com/ChaissonLab/TT-Mars.git, which provides a utility to download assemblies and alignment maps.
2.2 Methods
2.2.1 Workflow
TT-Mars uses a haplotype-resolved long-read assembly, and a lift-over map to the human
reference to assess each candidate SV call (Figure 2.1a-2.1c). Each SV is considered independently as an edit operation on the reference genome (e.g. deletion, duplication, etc.) as
shown in Figure 2.2. For each type of SV, an operation is the direct edit of the reference sequence reflecting the specific nature of the corresponding SV. For example, an insertion operation will insert the corresponding sequence fragment into the reference, and the result is expected to match the contig sequence, which contains the inserted sequence. SV calls are
evaluated by comparing both the reference modified by the edit operation and the original
reference genome to the assembly (Figure 2.1d-2.1g).
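As a minimal illustration of these edit operations, the sketch below applies each SV type to a reference sequence held as a Python string. The coordinates, the revcomp helper, and the function name apply_sv are illustrative assumptions, not TT-Mars's actual implementation.

```python
# Minimal sketch of the SV edit operations on a reference string (illustrative only).
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def apply_sv(ref, sv_type, start, end, ins_seq=""):
    """Return the reference sequence modified by the SV at [start, end)."""
    if sv_type == "DEL":                # remove the deleted bases
        return ref[:start] + ref[end:]
    if sv_type == "INS":                # insert the called sequence at start
        return ref[:start] + ins_seq + ref[start:]
    if sv_type == "INV":                # reverse-complement the interval
        return ref[:start] + revcomp(ref[start:end]) + ref[end:]
    if sv_type == "DUP":                # tandem copy in tail-to-head orientation
        return ref[:start] + ref[start:end] * 2 + ref[end:]
    raise ValueError(sv_type)

# For a true deletion, the edited reference should match the assembly haplotype:
ref = "AAAACGTTTTGGGG"
print(apply_sv(ref, "DEL", 4, 8))       # -> AAAATTGGGG
```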
To speed up the reference/assembly comparison, only the local region surrounding the SV
call is compared, using an assembly-genome orthology map constructed from whole-genome
alignments (Figure 2.1b). Each classification is made as a comparison between the relative
scores of the alignment of the reference/modified reference and assembly at the SV locus,
and/or the relative length, as detailed below.
2.2.2 Validation
The haplotype-resolved assemblies are pre-processed by TT-Mars to generate an orthology
map between the assembly and the reference for regions that are annotated as assembled
without error and have a clear 1-1 relationship.
Figure 2.1: TT-Mars Workflow. a, Assembly contigs are aligned to the reference, and the shorter of two overlapping contigs is trimmed to generate a unique mapping. Regions on the reference that are not covered by contigs are excluded. b, The alignment is used to construct an orthology map at specific intervals (e.g. every 20 bases). c, For an SV called at an interval [s, e] on the reference, TT-Mars searches the orthology map for matches outside the interval that most closely reflect the length of the SV. In this example, the interval [a, b] immediately flanking a deletion SV maps to an interval on the assembly that does not reflect the SV, but a wider search in the orthology map shows that [c, b] spans a deletion in the assembly. d-g illustrate validation details with a deletion example. d, TT-Mars takes the candidate call with w flanking bases on both sides. The interval on the reference is compared with an interval on the assembly (e) before and after the SV operation. In f and g, the deletion operation removes the corresponding sequence on the reference, and the call is validated if the modified reference is more similar to the assembly than the original.
Figure 2.2: SV operations of deletions, insertions, inversions and tandem duplications. For each type of SV, an operation is defined as the direct edit of the reference sequence reflecting the specific nature of the corresponding SV. If an SV is true, the reference sequence after the SV operation will be more similar to the assembly contig than the reference sequence before the operation. For example, a deletion operation will delete the corresponding reference sequence, and the result is expected to match the contig sequence, which does not contain the deleted sequence.
The contigs are aligned using lra [87] with the options -CONTIG -p s, from which an orthology map at fixed intervals (20 bp
by default) is generated using samLiftover (https://github.com/mchaisso/mcutils) (Figure
2.1b). When two contigs have overlapping alignments on the reference, the alignment of
the shorter contig is trimmed to exclude the overlap (Figure 2.1a). Alignments of reads
back to assemblies are used to flag potential misassemblies that should not be used to
validate calls. Correctly assembled short intervals (e.g. 100 base windows) should have
reads that map contiguously from at least one haplotype. We define an interval as correctly
assembled if at least five reads are aligned starting/ending at least 1k bases before/after
the interval. Intervals are misassembled if this condition does not hold, and the intervals in
the flanking 1-3kb are supported. These parameters are defined by the data to stringently
detect misjoins in assemblies and not be affected by read mapping artifacts in repetitive DNA.
Next, a high-confidence filter is created by excluding centromeres, intervals not mapped by both haplotypes on the autosomes and the X chromosome in females, and low-quality assembled regions flagged by raw-read alignments (Figure 2.4). SV calls are flagged as analyzed if they are within the high-confidence filter.
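The spanning-read criterion could be sketched as follows with pysam; the BAM path, contig name, and exact accounting are assumptions, and the real implementation may treat clipping and haplotype assignment differently.

```python
# Hedged sketch: a bin counts as correctly assembled if at least five reads
# align across it with >= 1 kb of flanking alignment on both sides.
import pysam

MIN_READS, FLANK = 5, 1000

def bin_is_assembled(bam_path, contig, bin_start, bin_end):
    bam = pysam.AlignmentFile(bam_path)          # requires a BAM index for fetch()
    spanning = 0
    for read in bam.fetch(contig, bin_start, bin_end):
        if (read.reference_start <= bin_start - FLANK
                and read.reference_end >= bin_end + FLANK):
            spanning += 1
    return spanning >= MIN_READS

# e.g. bin_is_assembled("reads_on_asm.bam", "contig_1", 100_000, 100_100)
```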
Figure 2.3: The true positive rate of pbsv from TT-Mars versus the portion of the genome covered by TT-Mars confident regions.
The orthology map is accessed as a function Lookup(c, p) that for a position p on chromosome c returns the assembly contig and position on the contig corresponding to the last
position on the orthology map at or before p. To evaluate each SV call, we compare the
local region surrounding the SV in either the original or modified reference to both assembled haplotypes (Figure 2.1b, 2.1c). For calls with a defined starting position s and ending
position e, such as deletion calls (Figure 2.1d), this region may be defined as a region on the
reference defined by these positions, expanded by a small number of bases on either side (w
bases, w=500), and the corresponding intervals on the assembly defined as Lookup(c, s − w)
and Lookup(c, e + w). We assume that true-positive calls will produce a modified reference sequence that matches one or both haplotypes with high identity, as shown in the top
figures in Figure 2.1e-2.1g, and that false-positive calls will produce a modified reference
sequence that is different from both haplotypes as shown in the bottom figures in Figure
2.1e-2.1g. However, in highly repetitive regions, such as variable-number tandem-repeats,
the alignment used to generate the orthology map may be different from that which is used
to generate the SV call, and the interval defined directly from fixed offsets from the SV
boundaries may not reflect this.
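The Lookup(c, p) query amounts to a predecessor search over the fixed-interval records, as in the following sketch; the in-memory record layout is assumed for illustration and is not the format TT-Mars actually stores.

```python
# Sketch of Lookup(c, p): return the map entry at or before position p.
import bisect

# orthology_map[chrom] = records (ref_pos, contig_name, contig_pos),
# sorted by ref_pos and spaced at the fixed lift-over interval (20 bp default).
orthology_map = {
    "chr1": [(0, "ctg7", 1200), (20, "ctg7", 1220), (40, "ctg7", 1245)],
}

def lookup(chrom, pos):
    records = orthology_map[chrom]
    keys = [r[0] for r in records]
    i = bisect.bisect_right(keys, pos) - 1
    if i < 0:
        return None                  # pos precedes the first mapped position
    _, contig, contig_pos = records[i]
    return contig, contig_pos

print(lookup("chr1", 35))            # ('ctg7', 1220): last record at or before 35
```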
To account for potentially inconsistent breakpoints, we search for a combination of boundaries that maximizes the congruence between the modified reference and assembly. Two
criteria are used to quantify the discrepancy of the reference and contig intervals. First,
relative length is defined as:
\[
\text{relative length} = \frac{\text{contig interval length} - \text{reference interval length}}{\text{SV length}}. \tag{2.1}
\]
Second, the contig interval is aligned to the reference interval before and after the SV
operation. The relative score is calculated by
\[
\text{relative score} = \frac{\text{alignment score after} - \text{alignment score before}}{\lvert \text{alignment score before} \rvert}, \tag{2.2}
\]
where alignment score after is the alignment score of the contig interval against the reference interval after the SV operation, and alignment score before is the alignment score before the SV operation. Figure 2.1c illustrates how inconsistent alignments in repetitive regions may cause incorrect inference of SV breakpoints in a lifted region.

Figure 2.4: The distribution of assembly scores across the sample genome HG002, defined by contiguity of sequence alignment. (a) Haplotype 1 assembly score; (b) Haplotype 2 assembly score. We use PacBio reads to assess the assembly scores. Raw reads (before haplotype partition) are mapped back to the assemblies using lra with the options -CCS/CLR -p s. The genome is divided into 100-base bins. A read that spans a bin with at least 1 kb flanking length on both sides is considered a valid read. We count the number of valid reads as the quality score for each bin. Noisy alignment regions and low-coverage regions are given low quality scores.
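Equations 2.1 and 2.2 transcribe directly into code as below; producing the alignment scores themselves (by aligning the contig interval to the reference interval before and after the SV operation) is outside the scope of this sketch.

```python
def relative_length(contig_interval_len, ref_interval_len, sv_len):
    # Equation 2.1: how well the interval length difference matches the SV length.
    return (contig_interval_len - ref_interval_len) / sv_len

def relative_score(score_after, score_before):
    # Equation 2.2: improvement of the alignment score after the SV operation.
    return (score_after - score_before) / abs(score_before)

# For a true 300-base insertion, the contig interval is about 300 bases longer
# than the reference interval, so Eq. 2.1 evaluates close to 1:
print(relative_length(1300, 1000, 300))   # 1.0
```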
TT-Mars validates separate classes of SVs by slightly different strategies. Deletion and
sequence-resolved insertion calls are annotated as true positive if the relative length is close
to one and the relative score is positive. Specifically, the true positive region of these two
types is defined as (Figure 2.5a):
\[
\begin{cases}
-\alpha \times \text{relative score} + 1 - \beta \;\le\; \text{relative length} \;\le\; \alpha \times \text{relative score} + 1 + \beta, & 0 \le \text{relative score} \le \gamma;\\
1 - \delta \;\le\; \text{relative length} \;\le\; 1 + \delta, & \text{relative score} > \gamma,
\end{cases} \tag{2.3}
\]
where α, β, γ and δ are empirical parameters.
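The true-positive region of Equation 2.3 then becomes a simple predicate, sketched below; the numeric parameter values are placeholders, not the empirical settings tuned for human data in TT-Mars.

```python
ALPHA, BETA, GAMMA, DELTA = 0.5, 0.2, 1.0, 0.3   # placeholder parameter values

def in_true_positive_region(rel_len, rel_score):
    """Equation 2.3: decide TP from relative length and relative score."""
    if 0 <= rel_score <= GAMMA:
        # A band around relative length 1 that widens as the score grows.
        return (-ALPHA * rel_score + 1 - BETA
                <= rel_len <=
                ALPHA * rel_score + 1 + BETA)
    if rel_score > GAMMA:
        return 1 - DELTA <= rel_len <= 1 + DELTA
    return False   # negative relative score: the edit made the alignment worse
```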
Insertion SV calls that do not include the inserted sequence, such as many insertion calls
made by short-read callers, are evaluated by relative length only, while inversion calls that
do not change the length of the sequence are evaluated by relative score.
TT-Mars validates duplication calls by considering possibilities of both tandem duplication and interspersed duplication. A tandem duplication operation adds a copy of the
duplicated sequence to the reference at the duplication locus in a tail-to-head orientation.
The start and end coordinates of the duplication that are lifted over include both copies
of the tandem duplication in true-positive calls. Similar to deletions and sequence-resolved
insertions, it is validated by using both the relative length and the relative score. Interspersed duplications are evaluated by aligning the duplicated sequence to the assemblies
using minimap2 [88] allowing for multiple alignments, and are validated if at least 90%
of the duplicated sequence aligns to a sequence in the assembly annotated as an insertion
Genome    HG00096  HG00171  HG00513  HG00731  HG00732
Fraction  9.9%     10.0%    9.3%     9.5%     9.3%
Genome    HG00864  HG01596  HG03009  HG01114  HG01505
Fraction  9.7%     10.1%    10.0%    10.2%    9.9%
Table 2.1: Fraction of the genome excluded by TT-Mars, including regions not covered by
one or both haplotypes of the assembly and the centromeres.
relative to the reference.
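A minimal sketch of this interspersed-duplication check, assuming the minimap2 alignments have been parsed into PAF-like tuples and the assembly's insertion annotations into intervals; the names and the simplified overlap test are illustrative, not the TT-Mars implementation:

```python
# Sketch: validate an interspersed duplication if >= 90% of the duplicated
# sequence aligns within an assembly region annotated as an insertion.
def validate_interspersed_dup(dup_len, alignments, insertion_intervals, min_frac=0.9):
    for q_start, q_end, t_name, t_start, t_end in alignments:
        aligned_frac = (q_end - q_start) / dup_len
        # Does this alignment land inside an annotated insertion interval?
        hits_insertion = any(t_name == name and t_start < i_end and i_start < t_end
                             for name, i_start, i_end in insertion_intervals)
        if aligned_frac >= min_frac and hits_insertion:
            return True
    return False
```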
The approach of comparing genome content rather than callsets inherently does not
support measurement of missed calls, or false negatives (FN). To enable FN reporting, TT-Mars
provides the option of comparing against a callset offered as ground truth. Variants
of at least 50 bases produced by dipcall on the autosomes and X chromosome are used as the
truth set for annotating false negatives. Calls overlapping contigs that are trimmed out
(Figure 2.1a) are ignored since they may produce duplicated truth calls. Because variants
from the candidate callset are not validated by matching calls, dipcall variants within a
specified number of bases of validated candidate calls (default 1 kb) are considered matched.
Remaining unmatched calls are reported as FN.
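The matching step can be sketched as follows, under the simplifying assumption that calls are reduced to (chromosome, position) pairs; TT-Mars's actual bookkeeping may differ:

```python
# Sketch of FN reporting: a truth call is 'matched' if a validated candidate
# lies within max_dist bases on the same chromosome; the rest are FN.
def report_false_negatives(truth_calls, validated_calls, max_dist=1000):
    fns = []
    for chrom, pos in truth_calls:
        matched = any(c == chrom and abs(p - pos) <= max_dist
                      for c, p in validated_calls)
        if not matched:
            fns.append((chrom, pos))
    return fns
```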
2.3 Results
2.3.1 Validation of calls on GIAB HG002 assembly
On human genome sample HG002, we used TT-Mars to validate insertion, deletion, inversion
and duplication calls produced by four methods: LUMPY, Wham, DELLY and pbsv (Table
2.2). Translocation/break-end calls and calls on the Y chromosome were ignored. The
runtime scales with the number of calls: most short-read datasets can be evaluated within one
hour using a single core, and pbsv (long-read) callsets within 9 hours, both using up to 28 GB
of memory.
The analyzed rates are in the range of 92.2-96.7%. Among the analyzed calls, the TP
Figure 2.5: An example of classification of individual deletion calls by TT-Mars on HG002,
made by (a) DELLY and (b) LUMPY. Each dot represents a candidate call. The classification
boundaries determined empirically for human data are shown by the light green region. Panel
b highlights differing classifications made by TT-Mars and GIAB+truvari. (c) The length
distribution of SVs in the HG002 callsets by four SV detection algorithms. The distributions
of calls that are analyzed or where TT-Mars does not provide an annotation (NA) are shown.
Callers TP FP NA Total
Wham 1130 144 74 1348
LUMPY 2990 825 324 4139
DELLY 12784 1833 812 15429
pbsv 45625 5057 1741 52423
Table 2.2: TT-Mars results on callsets from four SV discovery algorithms on HG002, including all calls larger than 10 bp produced by each method.
rates range between 78.4-90.0%. Among the calls that are not analyzed (NA), the majority
(65.7-78.4%) are in regions not covered by one or both assemblies, and 14.8-31.4% are in
centromeres. Figure 2.5a shows an example scatter plot of validation results from TT-Mars,
where the TP and FP calls separate into two clear clusters. Figure 2.5c gives the length
distribution of calls that are analyzed and missed by TT-Mars. The distribution conforms
with the length distribution of the benchmark SV callset [72].
We compared the TT-Mars results to the GIAB HG002 benchmark callset validated using
truvari (Table 2.3), set to search for calls within 1 kb (the refdist parameter) and without
sequence identity comparison. For long-read callsets, truvari was used with the option that
compares the SV sequences for validation. Since the GIAB benchmark set only contains
deletions and insertions, validation of other types of SVs is not included. Generally, the two
validation methods have consistent TP and FP results. For SVs evaluated by both methods
(i.e., excluding the NA calls), 96.0%-99.6% of calls have the same classification in the four
callsets (Figure 2.5b). For the three short-read callsets, there is a net increase of 121-1,476
calls analyzed by TT-Mars, and on the pbsv callset, TT-Mars analyzes 10,966 more calls
(Figures 2.9 and 2.10). The regions annotated as confident for validating variant calls by
TT-Mars and excluded by truvari include 141 Mbp. These sequences predominantly overlap
segmental duplications (99 Mbp) and highly repetitive sequences such as variable-number
tandem repeats (4 Mbp).
                             TT-Mars
                Wham                          LUMPY
            TP    FP   NA   Sum           TP    FP   NA   Sum
GIAB  TP    882    3   21    906         2091    8   51  2150
      FP      1   13    0     14           22  218    3   243
      NA    118   24   20    162          582  215   99   896
      Sum  1001   40   41   1082         2695  441  153  3289

                DELLY                         pbsv
            TP    FP   NA   Sum           TP     FP    NA    Sum
GIAB  TP   2927   11   57   2995         8903    94   191   9188
      FP      4  412    6    422          286   165    16    467
      NA   1179  360  145   1684         9710  1463   721  11894
      Sum  4110  783  208   5101        18899  1722   928  21549

Table 2.3: Comparison of TT-Mars and GIAB benchmarks on HG002. The two methods
have the same classification results on more than 96% of the analyzed calls, while TT-Mars
analyzed more candidate calls on all four callsets. A length filter (50 bp - 10 Mbp)
is applied, and only deletions and insertions are included, to match the truvari parameter
settings.
2.3.2 Performance on 10 HGSVC sample genomes
Gold-standard callsets such as those produced by the GIAB require intense effort to
create. Because TT-Mars only needs haplotype-resolved assemblies, multiple samples may be
evaluated in order to report benchmark results as a distribution. Recently, the HGSVC has
generated 32 high-quality haplotype-resolved assemblies of human genomes [33]. We used
TT-Mars to evaluate three short-read and one long-read SV-discovery algorithms on ten
samples sequenced and assembled by the HGSVC (Figure 2.6a). The pbsv callsets, except
the HG002 callset, were generated by the HGSVC [33].
Broadly, these calls are generated using the pbmm2 alignment method (based on minimap2
[88]) and an SV detection algorithm designed for PacBio raw read data. A second long-read
SV discovery method, cuteSV, had a low TP rate across every evaluation method and is
excluded from the results. The numbers of analyzed, true-positive and genotype-matched
sites assessed by TT-Mars are consistent across different samples, for both the short- and long-
Figure 2.6: (a) SV metrics for four algorithms on 10 HGSVC genome samples; triangles
mark samples using HiFi assemblies. Benchmark results are given as a distribution by
TT-Mars. (b) The length distribution of results by dipcall+truvari, TT-Mars and VaPoR for pbsv
HG00096 calls. The red solid and dashed lines indicate that dipcall+truvari and TT-Mars
congruent calls have similar length distributions. The green and blue dashed lines show that
VaPoR has more NA calls and more disagreements with TT-Mars for small SVs (size < 100 bases).
read callsets. The fraction of the genome excluded due to assembly coverage ranges from
9.3% to 10.2% (Table 2.1). The fraction of SV calls that are analyzed counts calls made
by an algorithm in less repetitive regions of the genome that are amenable to long-read
assembly, in particular excluding centromeres and high-identity long segmental duplications.
Wham callsets have the highest analyzed rate (93.3% on average) among the three short-read
algorithms.
The pbsv callsets have analyzed rates in the range of 93.0-95.2%. The TP rates on
short-read callsets range between 63.9-85.2%, while pbsv calls have TP rates from 71.4-94.3%
considering all calls, and 91.7-94.8% for calls ≥ 30 bases. The sample HG00513 has lower TP
rates across all short-read SV discovery algorithms, but also has roughly double the input
data size of the other samples, which may have required parameter optimization (Figure 2.11).
DELLY and pbsv output genotypes. Among the TP calls, genotype-matched rates of DELLY
range from 90.1% to 92.1%, and of pbsv from 73.9-90.5% (78.3-91.2% for SVs ≥ 30 bp). TT-Mars also
Figure 2.7: Counts of small SVs among all SVs and among true-positive SVs for the ten sample genomes.
Figure 2.8: Loci of dipcall+truvari and TT-Mars calls from the HG00096 pbsv callset.
Green bars are the calls the two methods agree on; red bars are the calls the two methods
disagree on.
Figure 2.9: Scatter plots of TT-Mars scores on the pbsv HG002 callset. (a) Relative length
and relative score of SVs analyzed by both truvari and TT-Mars. (b) Relative length and
relative score of SVs analyzed by TT-Mars and excluded by truvari.
reports the following recall results: the average recall of Wham, LUMPY, DELLY and pbsv
is 7.7%, 13.1%, 16.0% and 88.7%, respectively.
The availability of assembled sequences allows TT-Mars to validate duplications, specifically
checking whether the genomic organization is tandem or interspersed. The average numbers
of duplications called by Wham, LUMPY, DELLY and pbsv on the ten HGSVC genome
samples are 1,463, 4,577, 761 and 1,458, respectively. Duplications that are tandemly
organized may be validated as inserted elements at the location of the insertion because of the
tail-to-head organization. Interspersed duplications are more challenging to validate because
the target destination is unknown.
Because the GIAB benchmark callset does not include duplication calls, we estimated
the performance of TT-Mars using simulations. We ran 1,000 simulations of interspersed
duplications with an exact sequence copy and length drawn uniformly from 100 to 100,000
bases on the HG002 assembly. Not all duplications were analyzed by TT-Mars because no
requirements were placed on the target site of duplications. In total, 82.8% of duplications were
analyzed, of which 98.9% were validated (Table 2.4), indicating that interspersed duplications
Figure 2.10: Length distribution of calls analyzed by both TT-Mars and truvari, and of
calls analyzed by TT-Mars only, on the pbsv HG002 callset.
TP FP NA Total
True DUP simulation 819 9 172 1000
False DUP simulation 0 882 118 1000
Table 2.4: TT-Mars validation results of simulated true and false duplications.
Figure 2.11: The number of calls, TT-Mars analysis results and BAM size over the ten
sample genomes. (a) BAM size and TP rate among short-read algorithms. (b) The numbers
of all calls, analyzed calls and TP calls per genome.
from non-repetitive euchromatic regions of the genome may be validated by TT-Mars. We
also simulated 1,000 false duplications; TT-Mars analyzed 88.2% of them, all of which were
annotated as FP (Table 2.4).
2.3.2.1 Comparison to VaPoR and dipcall
We compared our method to two other approaches for validating variant calls that use
information from long reads or their assembly: VaPoR, which compares SV calls with individual
long reads, and dipcall+truvari, where a callset from a de novo assembly is used as a gold
standard for evaluating an SV callset. Both methods are compared with TT-Mars on callsets
produced by the four SV discovery methods (Wham, LUMPY, DELLY and pbsv). Only
insertion and deletion variants were in the dipcall output, so other types are not included in
the comparison.
Due to the runtime of VaPoR, we limited analysis to data from one sample for each
Figure 2.12: Comparison of dipcall+truvari and TT-Mars on (a) short-read and (b) long-read
callsets of 10 sample genomes. The combinations of TP and FP for dipcall+truvari/TT-Mars
annotations are given with a horizontal scatter to distinguish points in individual categories.
SV discovery method (Table 2.5). On short-read datasets, TT-Mars and VaPoR provide the
same validation result on 80.5%-92.0% of calls analyzed by both methods; however, there
was a net increase of 302-1,863 calls that could be analyzed by TT-Mars. On the long-read
dataset, the two methods agree on 67.6% of calls analyzed by both methods, and TT-Mars
analyzes 39,187 more calls. Because of the efficiency of dipcall, we were able to compare all
SV callsets on the ten sample genomes. The validation of short-read callsets is similar, with
the methods agreeing on validation results for 96.1%-99.8% of calls analyzed by both methods
(Figure 2.12a). This is an expected result because of the relatively low repetitive nature of
sequence where short-read algorithms detect SVs [73], and because both validation methods
rely on the same assemblies. Similar to the VaPoR comparison, there was a net increase
of 16-373 calls for which TT-Mars provides a classification compared to dipcall+truvari.
On the ten pbsv callsets, TT-Mars and dipcall+truvari have the same classification results for
83.8%-86.9% of the analyzed calls (Figure 2.12b), although TT-Mars analyzes 1,497-2,229
more insertions/deletions than dipcall+truvari.
The increased number of calls from long-read callsets provides granularity with which the
                             TT-Mars
             Wham (HG00171)                LUMPY (HG00096)
            TP    FP   NA    Sum          TP     FP    NA    Sum
VaPoR TP   1503   98   76   1677         2614    288   282   3184
      FP     42  108   29    179          163    434   276    873
      NA    267  149   80    496          345    515  1370   2230
      Sum  1812  355  185   2352         3122   1237  1928   6287

             DELLY (HG03009)               pbsv (HG00096)
            TP    FP    NA   Sum          TP      FP     NA     Sum
VaPoR TP   5211   405   480  6096        42268    8694   3531  54493
      FP   1009   627   581  2217        12243    1320    870  14433
      NA   2343   581   991  3915        37646    5942   2765  46353
      Sum  8563  1613  2052 12228        92157   15956   7166 115279

Table 2.5: Comparison of TT-Mars and VaPoR. The two methods agree on most calls and
can analyze a similar number of calls across short-read callsets. On the long-read callset,
TT-Mars evaluates 39,187 additional variants, the majority of which are under 100 bases.
validation approaches may be compared. We analyzed the classification results of TT-Mars,
VaPoR, and dipcall+truvari on the pbsv callset on HG00096 (arbitrarily selected from the ten
benchmark samples due to VaPoR runtime). There was no association with either length
(Figure 2.6b) or genomic organization (Figure 2.8) for whether TT-Mars and dipcall+truvari
provided a validation; however, there is considerable enrichment of disagreement between
methods in tandem duplications. For example, in HG00096, 92.8% of calls with disagreeing
validation overlap tandem repeats. The majority of the variants where VaPoR does
not provide a call while TT-Mars does are less than 100 bases. To gauge the true results when
the two methods disagree, we randomly selected 50 (out of 2,434) calls with TT-Mars TP and
dipcall+truvari FP results for manual inspection using IGV [89]. There are various sources
for the discrepancy of validation results. In 27 cases, a call made by pbsv is fragmented into
multiple calls by dipcall (Figure 2.13), highlighting how the assembly-based approach can
account for validation in repetitive DNA where breakpoints are uncertain. In 19 other cases
there are correct variants nearby, and the validation depends on the genomic region
considered by truvari. The remaining four cases do not show a matched SV in the alignments
generated by minimap2 that are used for validation, but do show matched SV(s) in the lra
[87] based alignments, indicating how call-based validation, rather than assembly-based
validation, relies on agreement of gap penalties between detection and validation algorithms.
We also manually inspected the converse case: 50 (out of 280) calls with TT-Mars FP and
dipcall+truvari TP results. Of these, 48 calls have matched SVs in the assembly alignment
used by dipcall (using minimap2) and 45 calls do not have matched SVs in the assembly
alignment used by TT-Mars (lra). These represent cases where the assembly has an
unexpectedly complicated alignment to the reference, leading to spurious gaps.
2.4 Discussions
Our method relies on highly accurate genome assemblies that do not have rearrangements
that could miss a validation. We measured assembly quality by mapping all reads back to each
haplotype and checking for contiguity of reads aligned across each position in the genome.
Misassemblies not supported by reads are expected to show punctate drops in coverage.
Overall the assembly quality was quite high: only ten sites in HG002 were identified as
possible misassemblies, and no SV overlapped these sites.
The majority of calls evaluated by both GIAB+truvari and TT-Mars (96.0%-99.6%)
have the same classification, while TT-Mars evaluates 121-10,966 additional calls per
dataset. Furthermore, we are able to evaluate 302-39,187 additional calls per callset compared
to the long-read validation tool VaPoR, while not requiring users to download entire read
datasets. We show that the assembly-content approach improves over benchmark datasets
created from assembly alignment in repetitive regions of the genome, because variants
split into multiple calls in the assembly alignment are matched in the breakpoint-searching
procedure.
Figure 2.13: An example where TT-Mars correctly validates an insertion but dipcall+truvari
does not, in a tandem repeat region where the SV is split into two smaller insertions in the
assembly alignment. The insertion of length 997 bases is on chromosome 4, coordinate
1046348. The first track is the pbsv callset and the second track is the dipcall callset,
followed by alignments used by TT-Mars (two tracks) and dipcall (two tracks). The bottom
track is a tandem duplication region.
Duplication calls are a particularly challenging class of variant to validate. This has been
done using comparison to array data [4], and more recently through semi-automated and
automated visual inspection of calls [84]. Ground-truth SV callsets created from assemblies
detect duplications as insertions relative to the reference and annotate them by the site of
insertion in the genome, in contrast to duplication calls in short-read callsets, which annotate
the source interval on the genome that is duplicated. By checking for both tandem duplications,
which can be confirmed as insertions in assembly callsets, and interspersed duplications
using the assemblies, TT-Mars can help provide insight on the classes of variation detected by
SV discovery algorithms. One caveat of this analysis is that duplications are enriched in repetitive
sequences [90], and some validations will require high-quality assemblies using ultra-long and
accurate (HiFi) sequencing technologies.
Our analysis enables SV benchmarks to be reported as a spread over multiple genomes,
rather than a point estimate on one curated dataset, and provides insight on how extensible
methods are to additional genomes beyond a single curated callset. Fortunately, most
methods had accuracy within 5%-10% across all genomes, indicating that results are extensible
across genomes and that, for example, the results in current large-scale short-read genomics
studies likely hold valid. Currently, TT-Mars only validates deletion, insertion, inversion, and
duplication calls. As additional methods emerge to detect more complex forms of variation
[91], these models can be incorporated into the edit operations used by TT-Mars.
TT-Mars inherently provides only a rough estimate of sensitivity, because false-negative
detection does not fit into the paradigm of comparing inferred content and requires variants
to be called. This estimate simply considers as false negatives the variants detected by
haplotype-resolved assemblies that are not within the vicinity of the validated calls, and one
should consider classes of variant that may have multiple representations when reporting
results. Furthermore, because of the difference in scale of the number of variants discovered
in short- and long-read studies, it is important to consider the sensitivity for medically
relevant variants compared to the total variant count.
To conclude this chapter, we have demonstrated an approach to validate SVs using
haplotype-resolved de novo assemblies, in which a call is validated by comparing the inferred
content of a genome to the assembly rather than by comparing variant calls. This approach is
implemented in the software TT-Mars, which is distributed along with the accompanying data
files required to benchmark variant calls on up to ten samples. It can validate SVs from both
short- and long-read calling algorithms with high accuracy compared to other methods, while
maintaining broad usability. TT-Mars enables SV benchmarks to be reported as a
distribution over different genomes and provides information on the performance of methods
on multiple samples without curated benchmark sets.
Chapter 3
Applications of TT-Mars in large-scale
sequencing studies
3.1 Introduction
Large-scale studies are crucial in advancing our understanding of rare structural variants
and their relations to traits [61]. It is essential to establish and report the False Discovery
Rate (FDR) to interpret the results: the FDR indicates the expected fraction of false SVs
among all reported SVs, and researchers can present more reliable results with a low FDR.
TT-Mars was used to control the FDR of SVs in the two studies described below. It produced
precise validation results, contributed to the production of higher-quality SV results, and
confirmed the reliability of the reported SVs.
3.2 Validation Results
Motivated by the challenges and limitations of SV studies, a recent study analyzed
whole-genome sequence data from more than 138k samples from the National Heart, Lung,
and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program [56]. The
study generated more than 355k SVs and identified impacts of these SVs on multiple genes
and medically relevant regions. 59.34% of the SVs are novel, and 690 SV hot spots and
deserts are identified [56]. This work is currently under review.
TT-Mars was used to benchmark the SV callsets along with Truvari. The positive
predictive value (PPV) estimated by TT-Mars averaged 0.90 for NA12878 deletion calls and
0.87 for NA19238 deletion calls. TT-Mars identified a large false discovery rate (FDR) for
duplications in the original callset, matching the validation result from Truvari: both
methods output a PPV smaller than 0.15 for the duplications from 32 callsets (each
sample was sequenced by different sequencing centers) of samples NA12878 and NA19238.
Most of the calls validated by Truvari were also validated by TT-Mars, but a portion
of duplications validated by TT-Mars were missed by Truvari under default parameters,
because TT-Mars is more flexible in determining the underlying true breakpoints of potential
duplications. To achieve results comparable to TT-Mars (Figure 3.1b), Truvari requires
manual selection of certain parameters and risks overtuning.
Based on the validation results from TT-Mars and Truvari, the duplication SV calls were
revised by discarding regions containing false-positive duplications. The final version of the
callsets contains duplications with a PPV of about 86.5% (Figure 3.1a). Figure 3.1c shows the
size distribution of SVs for both genomes, illustrating that TT-Mars is capable of validating
more large SVs than Truvari.
SVs that were not validated by TT-Mars or Truvari were further analyzed by Samplot
[84]; the results showed that they fall in regions of low mappability or repetitive regions [56].
TT-Mars demonstrates that it can produce robust and stable validation results for novel SV
callsets from different methods and samples. Compared to Truvari, its results proved
reliable, and TT-Mars retains its advantages by successfully analyzing more SVs
and accurately validating flexible SV breakpoints.
To overcome the limitations of short-read sequencing, long-read sequencing technologies
have recently been used for SV detection. Long reads, or their de novo assemblies, span
SVs, particularly in complex and repetitive regions. Longer reads are potentially more
Figure 3.1: Validation of SV callsets using haplotype-resolved assemblies for deletion
(DEL) and duplication (DUP) calls in the sample genomes NA12878 and NA19238. A:
Evaluation by TT-Mars. B: Validations from both the TT-Mars and Truvari methods. C:
The size distribution of all calls (red), the count validated by TT-Mars (green), and the
count validated by Truvari (blue) for the combination of both genomes. [56]
Figure 3.2: TT-Mars evaluation of Hapdup and Sniffles2 calls against HPRC assemblies
for HG002, HG00733, and HG02723. The calls were either validated by the alignment
(green), not validated (orange), or could not be annotated by TT-Mars (blue). [74]
capable of capturing signatures of larger and more complex SVs [5]. As part of a project
for the NIH Center for Alzheimer's and Related Dementias (CARD), researchers recently
developed a wet-lab and computational protocol for genomic studies using Oxford Nanopore
Technologies (ONT) long-read sequencing, which addresses two major obstacles to using
long-read sequencing technologies: high cost and limited scalability. They generated SV
calls on several sample genomes, and TT-Mars confirmed that, compared to other tools,
their callsets have high quality, as shown in Figure 3.2 [74]. Structural variants from
Hapdup and Sniffles2 for samples HG002, HG00733, and HG02723 were evaluated by
TT-Mars. The results demonstrated that the Hapdup-based SVs have high quality.
3.3 Discussions
To conclude this short chapter, the validation results on the TOPMed and CARD SV
datasets demonstrated that TT-Mars is able to produce accurate validation results to help
control the FDR in large-cohort sequencing studies. It is a robust tool for confirming the
reliability of the reported SVs in both studies. As more high-quality assemblies become
available, TT-Mars will be able to contribute at an even larger scale.
Chapter 4
Filtering of Structural Variation ex post
facto by Deep Learning Neural Networks
4.1 Introduction
Despite the recent development of single-molecule sequencing (SMS) and its application to
producing high-quality structural variant callsets [33, 60], SV discovery on large-scale cohorts
relies on short-read sequencing (SRS) [61, 62]. For example, multiple large whole-genome
sequencing studies have been conducted, including the TOPMed consortium [63], the All of Us
Research Program [64], and the UKBB [61]. These large-scale studies enhance the ability to
study rare structural variants and their relations to traits [61], and to perform aggregation
analysis of rare or singleton variants to examine the constraints on variation throughout the
genome [65, 3].
Given the prevalence of short-read sequencing (SRS) in large population studies, a number
of algorithms have been developed to detect SVs using SRS [6]. One reason so many
methods have been written is that each suffers from high false-positive rates [5,
30]. Detecting SVs using SRS poses greater challenges than detecting indels and single-nucleotide
variations (SNVs) [69, 5, 70, 6] due to the extensive range of sizes and the complexity of
breakpoints associated with SVs compared to the relatively short length of SRS reads. As a result,
estimates of the false discovery rate (FDR) for individual algorithms range from 1% to more than 10% [73,
17], which is 1-2 orders of magnitude greater than the FDR measured for single-nucleotide
polymorphisms [92].
Traditionally, efforts to lower FDR in SV discovery have involved integrating outputs
from multiple SV callers to form a consensus callset, an approach that has demonstrated
FDRs as low as 3% for de novo variation [14, 78, 57]. Each ensemble approach requires a
step that adjudicates calls using a complex set of scoring filters that are manually tuned or
are learned using machine learning approaches such as random forests and support-vector
machines [78, 93].
Newer approaches have used deep learning to learn signals from true and false
variant calls [84]. This allows one to improve an SV callset by predicting whether a call
not specifically observed in the training data is correct. One primary shortcoming
of previous work is the lack of large, gold-standard training data with which to build deep-learning
models. Direct validation using PCR is limited to a small number of loci [14, 13], and until
recently the truth set available from GIAB [72] was limited to a single genome.
Our work with TT-Mars has shown that high-quality haplotype-resolved assemblies can
produce validated SV callsets for any genome [69] that has a high-quality
haplotype-resolved assembly and a paired SRS dataset. While the initial number of such
genomes was limited [73], the sequencing of the human pangenome [60] has produced a
resource of 40 such genomes that may be used as a gold standard for training.
Early methods to detect small-scale variation, including SNVs, using machine learning
[51, 92] were based on fixed-input-size recurrent neural network architectures [49]. While
these have been applied to SV discovery and genotyping [84, 94], because structural variation
spans a wide distribution of lengths, alternative deep-learning architectures that can
account for long-range patterns in input data may model SVs more appropriately.
An alternative source of modeling approaches is natural language processing (NLP), which
was primarily designed to interpret human language [95] and has been widely extended to
handle tasks over sequences [96, 97]. The Transformer is an NLP architecture developed to
facilitate faster and more efficient training, improving over earlier approaches because it is
designed to learn dependencies between different parts of the input [98]. It has demonstrated
remarkable generalization capability in tasks ranging from language translation to protein
structure prediction [82].
Inspired by its robust performance, we adapted the Transformer encoder architecture to
develop a structural variant (SV) validation model. Input features are extracted from
each base along the interval of a structural variant call, allowing variable-length and lengthy
input sequences, and we employ a sparse attention mechanism [99] to train on input sequences
of up to 2,000 bases. This approach enhances the model's ability to learn crucial relationships
between DNA bases while using memory efficiently.
In this chapter, we propose a deep learning method to validate structural variations
(Devas), based on the Transformer encoder, to validate different types of SVs from short-read
sequencing data. The model aims to adjudicate variant calls made from short-read
datasets without relying on population frequency or ensemble approaches. Unlike other deep
learning approaches that treat SV detection as an image recognition task, Devas approaches
it as a sequence processing problem, taking advantage of the Transformer's ability to model
complex patterns and relationships within data. This methodology allows Devas to efficiently
extend the validation capabilities of TT-Mars to novel sample populations. Devas is
trained on reliable sets generated by TT-Mars with SV calling methods including DELLY
[41] and Manta [44], and it is capable of producing robust validation results with short-read
sequencing data as the input. Moreover, Devas integrates a sparse attention mechanism,
which restricts attention to a subset of positions to manage computation and memory more
efficiently, making it particularly suited for the long sequences typical of genomic
data [83].
Figure 4.1: Overview of Devas: training data preparation (short-read sequencing and truth
sets from TT-Mars, drawing on 1kGP Project and TOPMed sequencing data), feature
extraction, and the model structure (positional encoding, encoder layers with multi-head or
sparse attention, layer normalization, feed-forward layers, and a classification head).
As shown in the following sections, Devas is able to benchmark SV calls from a variety of
calling methods on thousands of sample genomes. The model was benchmarked against TT-Mars
to show that it generates robust validation results for short-read sequencing tools. The
validation generalizes to any sample genome with only a short-read sequencing
BAM file as input. We demonstrate that Devas can lead to a three-fold reduction in
FDR over samplot-ml [84], and apply Devas to predict the validity of SV calls from a recent
publication on 1000-genomes data [80], demonstrating an enrichment of false-positive variant
calls among singletons.
4.2 Methods
4.2.1 Training data sets and feature extraction
The 1000 Genomes Project (1kGP) [100] stands as the most extensive openly available
resource of whole-genome sequencing (WGS) data. With a recent update, 3,202 samples
and 602 complete trios are available with 30X Illumina sequencing. We built a robust
training set from TT-Mars [69] validation results of DELLY [41] and Manta [44] on 40
sample genomes that had both short-read and high-quality haplotype-resolved sequencing
data available (Figure 4.1). DELLY and Manta are SV detection methods for short-read data
that have achieved top performance [57]. TT-Mars output validated and invalidated deletions,
insertions and duplications from the candidate DELLY and Manta callsets. With high-quality
long-read assemblies as the reference, the TT-Mars results were demonstrated to be robust [69].
Among the training samples, there are in total 31,916 DELs, 11,012 DUPs, 4,669 INVs
and 25,309 INSs, as shown in Figure 4.2a. Using TT-Mars to adjudicate variant calls, we
found the percentages of true calls among the variant classes to be 76.25% for DELs,
39.49% for DUPs, 29.06% for INVs, and 73.09% for INSs. Figure 4.2a also shows the distribution of
the number of SVs of each type across the training samples. Most samples contribute a
similar number of each type of SV to the training set, confirming that TT-Mars produces
stable validation across samples. Figure 4.2b illustrates that the length distributions
of all SV types are as expected; the small peak in the INS length distribution
represents mobile elements. The positive and negative samples also have similar size
distributions.
Important signals of the validated (True) and invalidated (False) SVs were extracted
from the Illumina sequencing data for each base, including read depth (rd), soft-clipped reads
(sc), paired-end read insert size (insert) and a local embedding of 3-mer sequences (emb).
Research has shown that 3-mer embeddings are helpful for modeling and classifying DNA sequences
[101]. 3-mers are also chosen to avoid a much larger vocabulary space and much sparser
embeddings. The read depth is extracted for each base. The number of soft-clipped reads
is calculated over windows of length 25 bp: all reads overlapping a window are retrieved,
and the number of reads that are soft-clipped by more than 2% of the read length is
counted as the sc value for that window. The base positions in the current window (25 bases)
share this number as their soft-clipped read count. Similar to the sc counting, the insert
size is also calculated over a 25-bp window. To avoid duplicated calculation, the insert
sizes of all the forward reads (of the paired-end reads) are retrieved for each window.
Different from the sc counting, the average insert size of the retrieved reads becomes the
insert size of the 25 base positions in the current window. Finally, we obtain an embedding
vector of the local 3-mer for each base position: the 3-mer consists of the reference base
before the position, the reference base at the position and the reference base after the
position. We trained Word2vec [102] embedding models to learn 3-mer relations and
represent each 3-mer as a vector of size five. For each type of SV, we trained a Word2vec
model on the reference sequences overlapping our training samples, expecting the models to
learn to interpret each 3-mer with and without overlapping SVs. To summarize, we build a
sequence of feature vectors for each SV, including a flanking region at the beginning and
the end. For each base within the region, a feature vector of eight floating-point values is
built: a concatenation of the read depth, the number of soft-clipped reads, the average
insert size and a vector of size five representing the underlying 3-mer. The model can
handle SVs of different lengths; the features are scaled, concatenated and padded to form
an input feature matrix whose columns are the feature vectors of each base position. The
model takes the feature matrix as input and outputs a binary classification result for each
SV, as described in the following sections.
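A simplified sketch of this per-base feature extraction is shown below, assuming an indexed BAM (via pysam), the full chromosome sequence as a string, and a trained gensim Word2vec model mapping 3-mers to five-dimensional vectors; the helper is illustrative and omits the scaling and padding steps described above:

```python
import numpy as np
import pysam

def extract_features(bam_path, ref_seq, chrom, start, end, kmer_model, win=25):
    """Per-base feature matrix: depth, soft-clip count, insert size, 3-mer embedding."""
    n = end - start
    feats = np.zeros((n, 8), dtype=np.float32)
    bam = pysam.AlignmentFile(bam_path)
    # Per-base read depth, summed over the four nucleotide-specific arrays.
    feats[:, 0] = np.sum(bam.count_coverage(chrom, start, end), axis=0)
    # Window-level soft-clip counts and mean insert size, shared by all bases
    # inside each 25-bp window, as described in the text.
    for w0 in range(start, end, win):
        w1 = min(w0 + win, end)
        sc, inserts = 0, []
        for read in bam.fetch(chrom, w0, w1):
            clipped = sum(l for op, l in (read.cigartuples or []) if op == 4)
            if read.query_length and clipped > 0.02 * read.query_length:
                sc += 1
            if not read.is_reverse and read.template_length > 0:
                inserts.append(read.template_length)
        feats[w0 - start:w1 - start, 1] = sc
        feats[w0 - start:w1 - start, 2] = np.mean(inserts) if inserts else 0.0
    # Local 3-mer embedding (previous base, current base, next base).
    for i in range(n):
        kmer = ref_seq[start + i - 1:start + i + 2].upper()
        if kmer in kmer_model.wv:
            feats[i, 3:8] = kmer_model.wv[kmer]
    return feats
```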
4.2.2 Model Architectures
We propose a novel Transformer classifier model that integrates a sliding-window attention
mechanism [99], designed to efficiently process sequences of substantial length. The model
can handle input sequences of length up to 2,000, making it suitable for classifying SVs as
True or False. The core architecture of the model begins with a linear transformation layer
that maps the extracted input features from raw sequencing data into a higher-dimensional
embedding space. To incorporate base-wise order information, a positional encoding is added
to the features, enabling the model to account for the relative or absolute position of tokens
within the input feature matrix. The positional encoding equations for even and odd indices
are written as Eq. 4.1 and Eq. 4.2 [81]:
\[
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \tag{4.1}
\]
\[
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \tag{4.2}
\]
where $pos$ denotes the position of the token (the feature vector of a DNA base) in the
sequence, $i$ indexes the dimension of the embedding space, and $d_{\text{model}}$ is the total
number of dimensions of the embedding space (the model depth). These equations use a
decaying factor $10000^{2i/d_{\text{model}}}$ so that the sinusoid frequencies diminish for higher
dimensions, allowing the model to more easily learn relative positions. The positional
encodings are added to the input embeddings as shown in Figure 4.1:
\[
x' = x + PE(pos, i). \tag{4.3}
\]
Finally, dropout is applied to the input for regularization:
\[
x'' = \mathrm{Dropout}(x'). \tag{4.4}
\]
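Eqs. 4.1-4.4 can be written compactly in code; a minimal PyTorch sketch with illustrative dimensions (not the exact Devas configuration):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding per Eqs. 4.1 and 4.2 (d_model assumed even)."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions (Eq. 4.1)
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions (Eq. 4.2)
    return pe

x = torch.randn(2000, 8)               # extracted per-base features
x = torch.nn.Linear(8, 64)(x)          # linear embedding layer
x = x + positional_encoding(2000, 64)  # Eq. 4.3
x = torch.nn.Dropout(p=0.1)(x)         # Eq. 4.4
```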
The core of the model is the multi-head attention or sparse attention mechanism, which
processes the input feature matrix for each candidate SV and computes an attention matrix to
learn local contextual information. When using sparse attention, the localized attention
matrix offers computation and memory efficiency advantages over global attention mechanisms,
particularly for the lengthy inputs of long SVs, where focusing on immediate contextual
information is more practical. Following the attention mechanism, the model employs layer
normalization (Eq. 4.5) and a feed-forward layer to aggregate the computed attention
information, following the design of the standard Transformer network [81]:
\[
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \tag{4.5}
\]
where $x_i$ is the $i$-th feature of the input, $\mu$ and $\sigma^2$ are the mean and variance,
and $\epsilon$ is a small constant for numerical stability.
The output vector is mapped to the output classes by a linear classification layer with a
cross-entropy loss function (Figure 4.1). The model was trained with the Adam optimizer [103]
and an exponentially decaying learning rate schedule. The design allows adjustment of
parameters such as the model depth (default 8), the window size (default 32) and the number
of attention heads (default 4). In the following sections, we demonstrate that this novel SV
classifier achieves performance comparable to the robust TT-Mars method on the HPRC
samples, and that it achieves a solid understanding of SVs in SRS data.
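The overall shape of the classifier can be condensed into the following PyTorch sketch, using the defaults stated above (depth 8, 4 attention heads); the sliding-window sparse attention of [99] is replaced here by PyTorch's standard dense attention for brevity, so this illustrates the architecture rather than the exact Devas implementation:

```python
import math
import torch
import torch.nn as nn

class SVClassifier(nn.Module):
    def __init__(self, in_dim=8, d_model=64, n_heads=4, n_layers=8,
                 max_len=2000, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)  # per-base features -> embedding
        # Sinusoidal positional encoding (Eqs. 4.1-4.2), precomputed once.
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.dropout = nn.Dropout(0.1)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classify = nn.Linear(d_model, n_classes)  # True / False SV

    def forward(self, x):  # x: (batch, seq_len, in_dim), padded per batch
        h = self.dropout(self.embed(x) + self.pe[: x.size(1)])
        h = self.encoder(h)
        return self.classify(h.mean(dim=1))  # mean-pool over positions
```

Training would pair this with torch.optim.Adam and, for example, torch.optim.lr_scheduler.ExponentialLR for the decaying learning rate.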
4.3 Results
4.3.1 Training and Validation on HPRC Samples
Forty sample genomes were randomly chosen from the 47 reliable phased, diploid assemblies
from the HPRC [60]; the remaining 7 sample genomes were held out for testing. Combining
the SV callsets from DELLY and Manta, the training set contains 48,540 positive samples
and 24,336 negative samples. A breakdown of the numbers of positive/negative samples by
SV type and the number of training samples per genome are shown in Figure 4.2a. The
genomic loci distributions of positive and negative training samples on chromosome 1 and
chromosome 7 are shown in Figure 4.4; no clear pattern is apparent for either positive or
negative samples. Weighted learning was used to compensate for the unbalanced numbers of
positive/negative samples and of different SV types, to improve the fairness of the model
and to enhance recognition of the smaller classes. Insertions and deletions account for the
majority of training samples, with more positive samples than negative samples. There are
fewer duplications and inversions in the training sets, and they have more negative samples
compared to insertions and deletions, but the training results confirmed that the sample size
is sufficiently large. Different weights were assigned to each sub-class when calculating the
cumulative loss. The lengths of the training SVs are distributed as expected (Figure 4.2b).
The length range of insertions is restricted to 50 bp to 500 bp because DELLY does not
produce large insertions, due to a high false-positive rate [41]. Inversions have lengths between
50 bp and 2,000 bp, while deletions and duplications are restricted to 400-2,000 bp.
4.3.2 Benchmark by TT-Mars
Our previously developed validation tool, TT-Mars, has been shown to produce trustworthy
validation results using HPRC assemblies and has recently been used in multiple
Figure 4.2: Training performance on 40 HPRC samples and benchmarking results by TT-Mars
on 7 testing samples. a: A breakdown of the numbers of positive and negative samples
by SV type. Among the positive samples, there are 24,336 deletions (DEL), 4,349 duplications
(DUP), 1,357 inversions (INV) and 18,498 insertions (INS). Among the negative samples,
there are 7,580 deletions, 6,663 duplications, 3,312 inversions and 6,811 insertions. The
violin plots represent the distribution of the number of SVs across the 40 samples. b: Size
distribution of the training data. The length distributions of all SV types are as expected,
and the small peak in the INS length distribution represents mobile elements. The distributions
of positive and negative samples show no obvious differences. c: Validation results on 7
unseen, randomly selected samples, benchmarked by TT-Mars. Overall, Devas has an F1
score of 89.86% and an accuracy of 83.73%. The DEL and INS validation results are slightly
better than those for DUPs and INVs.
Figure 4.3: Validation results on 7 unseen, randomly selected samples, benchmarked by
TT-Mars, broken down into size categories for each SV type. All four types maintain stable
results across the metrics, including ACC (accuracy), PRE (precision), REC (recall), TNR
(true negative rate) and F1 (F1 score). Only large INVs show a minor decrease: medium-size
INVs have the highest F1 score of 84.29%, while large INVs have the lowest F1 score of
77.86% among the three groups.
Figure 4.4: Genomic loci distribution of positive (a) and negative (b) training samples on
selected regions of randomly chosen chromosomes 1 and 7. Chromosome 1 region:
10,000,000-100,000,000. Chromosome 7 region: 8,000,000-50,000,000. Each region covers a
significant part of the underlying chromosome. There is no obvious pattern in the loci of
positive or negative samples.
studies [69]. For the roughly 15,000 testing samples, Devas took about 6 hours to pre-process
the required features and less than 2 minutes to validate all the SVs. We used five metrics
to evaluate the performance of Devas: accuracy (ACC), precision (PRE), recall (REC),
true negative rate (TNR) and F1 score (F1), defined in the following equations:
\[
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{4.6}
\]
\[
\mathrm{PRE} = \frac{TP}{TP + FP}, \tag{4.7}
\]
\[
\mathrm{REC} = \frac{TP}{TP + FN}, \tag{4.8}
\]
\[
\mathrm{TNR} = \frac{TN}{TN + FP}, \tag{4.9}
\]
\[
\mathrm{F1} = 2 \times \frac{\mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}. \tag{4.10}
\]
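These definitions translate one-for-one into code; a small reference helper:

```python
# Direct translation of Eqs. 4.6-4.10 from TP/TN/FP/FN counts.
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PRE": pre,
        "REC": rec,
        "TNR": tn / (tn + fp),
        "F1": 2 * pre * rec / (pre + rec),
    }
```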
Table 4.1: Devas and Samplot-ML benchmarking results on DELs by TT-Mars.
Accuracy Precision Recall F1 Score FPR
Devas 91.52% 96.09% 93.42% 94.73% 16.86%
Samplot-ML 87.44% 88.26% 97.68% 92.73% 59.40%
Figure 4.2c shows the benchmarking results with a breakdown by DEL, DUP, INV and
INS on the testing samples. When benchmarked by TT-Mars, Devas shows an accuracy of
83.57%, a precision of 96.13%, a recall of 84.35%, a true negative rate of 80.06% and an F1
score of 89.86% on the 7 testing samples. Generally, DELs show the best results among the
four types by the different metrics. DELs have the highest ACC, 91.52%, while the lowest
ACC, 78.64%, is for DUPs. INVs have the highest TNR, 88.87%, and INSs have the highest
PRE, 97.50%. The benchmarking results illustrate that Devas produces robust results
comparable to TT-Mars. We further divide each SV type into three size groups (small,
medium, large), as shown in Figure 4.3, to examine the benchmarking results across SV
sizes. All four types maintain stable results across the five metrics, except that large INVs
have slightly smaller PRE and F1.
Compared with Samplot-ML [84], which can automatically validate DELs with a
convolutional neural network (CNN), our model achieves better accuracy, precision and F1
score (Table 4.1). When we removed false SVs from the testing variants output by the two
methods, Devas reduced the false discovery rate (FDR) from 21.75% to 3.91%, while
Samplot-ML could only reduce the FDR to 11.74% for DELs. Our model achieved a 3-fold
reduction in FDR compared to Samplot-ML, while retaining over 93% of the true variants.
Samplot-ML has only a marginal improvement in recall, and this comes at the cost of a much
higher FPR, as shown in Table 4.1. Devas also has a better F1 score, 94.73%, than
Samplot-ML. Overall, we conclude that Devas performs better than Samplot-ML on the
testing sets. While Samplot-ML can only validate DELs, Devas validates different types of
SVs including DEL, DUP, INV and INS, all with robust accuracy, F1 score and false positive
rate (FPR).
Table 4.2: The FDR of different types of SVs with and without Devas, as well as Recall
values (all evaluated by TT-Mars)
SV Types ALL DEL DUP INV INS
FDR 17.54% 21.75% 60.51% 70.94% 8.25%
FDR with Devas 3.86% 3.91% 24.61% 25.32% 2.50%
Devas Recall 84.35% 93.42% 73.26% 91.58% 82.38%
The accuracy for insertions (INS) is lower than for deletions (DEL) because the INS training
samples from short-read SV calling tools, including DELLY and Manta, are concentrated
at smaller sizes bounded by the read length (150 bp), as shown in Figure 4.2b. Consequently,
INS validation is less generalizable than DEL validation. As for duplications (DUP),
they are harder to detect with short-read sequencing data than DELs and INSs because
DUPs often involve more complex read-mapping patterns [5]. When we evaluated the FDR
for each SV type (Table 4.2), Devas reduced the overall FDR from 17.54%
to 3.86% while keeping 84.35% of the original SVs.
For DUPs and INVs, the FDR evaluated by TT-Mars is reduced by a large fraction,
although it remains relatively high. Considering the validation results on DELLY and Manta
separately, the F1 scores for DEL and DUP are similar (92.75% and 76.72% for DELLY;
94.83% and 73.78% for Manta). For INS, DELLY has an F1 score of 94.12% and Manta
88.93%, because the sizes of DELLY INSs are limited. There are few INV samples from
Manta. As for the FDR, Devas reduces the DEL FDR to 3.90% and 3.95% for
DELLY and Manta, respectively; for DUPs, it achieves 31.25% and 20.72%; and
for INSs, the FDR after filtering by Devas is 1.23% and 2.78%. These benchmarking results
demonstrate that the model can outperform existing models and that it produces
validation results comparable to TT-Mars on unseen sample genomes with only short-read
sequencing data as input. Devas is also useful for filtering false SVs out of a candidate callset
and outputting a refined callset retaining most of the true SVs. Devas exhibits strong
generalization and can be effectively applied to larger sample genome datasets, as shown in the
following section.
4.3.3 Evaluation of Structural Variation from global diversity sequencing
The 1000-genomes cohort (1kGP) has been used as a benchmark for defining global genetic
diversity in multiple phases at increasing variant resolution, from pooled variant discovery in
phase 1 [2] to variant discovery in individuals in phase 3 [4]. The original 1kGP studies
used low-coverage sequencing. In a recent study, each of the 2,504 genomes from phases 1-3,
along with 602 parent-child trios, was sequenced to a mean coverage of 30× [80]. That
study used two different approaches: ensemble variant discovery with GATK-SV [65] and
joint population analysis with svtools [104]. A single unified callset was created by merging
calls based on a machine learning model [105] trained in part on variant calls from long-read
callsets [73]. The high-coverage sequencing enabled more sensitive structural variant
discovery, increasing the average number of calls per genome from 3,431 in phase 3 to 9,655 [80],
for a total of 173,366 SV loci. Importantly, this callset reflects a fundamentally different
set of algorithms than those used to train Devas.
To establish an independent estimate of the FDR for this callset, we validated SV calls
on 40 samples that have quality assemblies from the HPRC [60] using TT-Mars. Using our
established approach, we were able to validate 89.5% of DEL, 85.7% of DUP, 41.15% of INV,
and 86.08% of INS calls. In contrast, validation rates of 95.8%, 15.8%, 81.9%, and 95.6%
were reported using overlap with PacBio-based callsets [80].
We used Devas to validate the integrated SV callset from [80]. In total, 98,908 out of
112,263 SVs are validated by Devas, among which 90.30% of DELs and 88.20% of INSs are
validated, in close agreement with the TT-Mars validation on a subset of calls. The validation
rates of DUPs and INVs are lower (Figure 4.5), in agreement with the pattern observed in
both the TT-Mars and PacBio call overlaps.
The validation rate increased from 90.56% for singletons (AC = 1) to 97.29% for AC
= 10 SVs (Figure 4.6a), following the expectation that validation rate increases with allele
Figure 4.5: Validation results by Devas for SVs with allele count 1 to 10 in the 1kGP
samples. Overall, 98,908 out of 112,263 SVs are validated by Devas. All SV types achieve
a high validation rate except INVs.
frequency [106]. Figure 4.7 shows the distributions of the validation rate of each SV with
allele count smaller than 10, where the validation rate is the number of sample genomes
supporting the SV (validated by Devas) divided by the allele count. Most SVs have a high
validation rate. The fraction of SVs with a validation rate of 0 decreases as AC increases,
confirming the result of Figure 4.6a. TT-Mars was compared with Devas on SVs from the
sample genomes that overlap HPRC samples, as shown in Figure 4.6c. The two methods
output matching validation results on more than 86% of SVs, including 92.77% of DELs.
This again demonstrates that Devas produces results comparable to TT-Mars and,
additionally, that Devas can extend this capability to a much wider range of datasets.
Furthermore, the validation rates of SVs validated by TT-Mars were calculated (Figure
4.6d); the rates are higher compared to Figure 4.6a, as expected. The study includes
602 complete trios, allowing validation through inheritance patterns [80]. We leveraged the
trio data to compare validation between trio-violation and no-violation SVs in Figure 4.6b.
On autosomes, if both the maternal and paternal genotypes are 0/0, then a child genotype
of 0/1 or 1/1 is a violation; if the maternal or paternal genotype is 0/0, then a child genotype
of 1/1 is a violation; and if both the maternal and paternal genotypes are 1/1, then a child
genotype of 0/1 is a violation. All other cases are counted as no violation. The validation
rate of SVs with no violation is higher than that of SVs with violations by 0.2.
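This rule can be stated compactly in code; a sketch with genotypes encoded as "0/0"-style strings, which is an assumption of this illustration rather than the study's representation:

```python
# Autosomal Mendelian-violation rule described above; all other genotype
# combinations count as no violation.
def is_trio_violation(mother: str, father: str, child: str) -> bool:
    if mother == "0/0" and father == "0/0":
        return child in ("0/1", "1/1")  # alt allele appears from nowhere
    if mother == "0/0" or father == "0/0":
        return child == "1/1"           # homozygous alt needs both parents
    if mother == "1/1" and father == "1/1":
        return child == "0/1"           # both parents can only transmit alt
    return False
```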
Genomic non-coding constraint of haploinsufficient variation (Gnocchi) is a measure
developed to quantify genomic constraint based on the analysis of variant calls from 76,156
human genomes [107]. A larger Gnocchi value indicates that a region is more constrained,
and therefore mutations in the region are more likely to be deleterious. It is well established
that deleterious variation suffers negative selection and is expected to have a lower allele
frequency [108]. However, we hypothesized that erroneous variants are not biased against
affecting constrained sequences, so the constraint score for erroneous variation would be
Figure 4.6: Validation results of Devas on the 1kGP SV callsets. a: The total number
of SVs and validated SVs for AC 1-10. As AC increases, the total SV count decreases and
the validation rate trends upward from 90.56% to 97.29%. b: Validation rates of
trio-violation SVs and no-violation SVs; the validation rate of no-violation SVs is noticeably
higher. c: Comparison between Devas and TT-Mars for the sample genomes overlapping
HPRC assembly samples. Overall, the two methods output 86.00% matching results; the
percentages of matching results are 92.77%, 89.80%, and 82.41% for DELs, DUPs, and
INSs respectively. Only 4 INVs are included in the comparison: 3 matched and 1
mismatched. d: For SVs validated by TT-Mars, the total number of SVs and validated SVs
for AC 1-10. The validation rate is higher compared to a, which plots the validation rate for
all SVs. e: Distribution of Gnocchi z-scores for valid and invalid SVs by Devas. Scores
for both valid and invalid SVs exhibit roughly normal distributions. Welch's t-test shows
that the mean of the valid scores (-0.074) is significantly smaller than the mean of the
invalid scores (0.012), with P-value 0.00084. f: Loci distribution of valid and invalid SVs on
chromosome 1. No obvious patterns are shown.
Figure 4.7: Distributions of the validation rate of each SV, for allele counts from 1 to 10. For
each SV, the validation rate is calculated as the number of sample genomes supporting it
(validated) divided by the total number of sample genomes carrying that SV. Most SVs have
a high validation rate.
higher than that of correctly called variants. Because rare variation, in particular singletons,
is enriched for erroneous variation, this can conflate evolutionary signal with methodological
noise.
To test this, we compared Gnocchi scores for valid and invalid SVs with AC ≤ 10 (Figure
4.6e). Gnocchi z-scores are provided as one value for each 1-kb region of the chromosomes.
We first calculated the Gnocchi score for each SV as the average of the scores of all regions
the SV overlaps. The valid SVs have an average Gnocchi score of -0.074 and the invalid SVs
an average of 0.012 (P = 0.0008, Welch's t-test). Because a large SV may affect a small
region that is highly constrained, we also tested the maximum Gnocchi score over the entire
region overlapped by an SV. This showed a similar difference shifted to greater values, with
an average Gnocchi score of 0.149 for invalid calls and 0.079 for valid calls (P = 0.007,
Welch's t-test). When limiting the analysis to only singleton calls (AC = 1), there remains
a small yet significant difference in Gnocchi scores between invalid (score = -0.01) and valid
Figure 4.8: Distributions of Gnocchi z-scores of valid SVs for each allele count from 1 to 10.
There is no obvious bias in the plots, and all the mean values of the scores in each allele
count category are negative.
(score = -0.06) (P = 0.037, Welch's t-test). The results aligned with our expectation that
incorrectly called rare SVs are more likely to affect more constrained regions.
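A sketch of the per-SV Gnocchi summary used above, assuming the z-scores are supplied per 1-kb bin keyed by (chromosome, bin start); names are illustrative:

```python
# Mean (or max) Gnocchi z-score over all 1-kb bins an SV overlaps.
def sv_gnocchi(chrom, start, end, bin_scores, bin_size=1000, use_max=False):
    scores = [bin_scores[(chrom, b)]
              for b in range(start // bin_size * bin_size, end, bin_size)
              if (chrom, b) in bin_scores]
    if not scores:
        return None  # no scored bins overlap this SV
    return max(scores) if use_max else sum(scores) / len(scores)
```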
Finally, Figure 4.6f shows that valid and invalid loci are distributed along chromosome 1 with no obvious pattern. Figure 4.10 contains examples of valid and invalid SVs from the 1kGP sample SV callsets, with valid and invalid deletions, duplications, and insertions from top to bottom. Figure 4.10a shows a clear deletion pattern, whereas Figure 4.10b does not. Figure 4.10c shows a duplication with reads in incorrect orientation (green), whereas Figure 4.10d does not. Figure 4.10e shows an insertion in the assembly that matches the SV, whereas the insertion in the assembly in Figure 4.10f is much smaller than the call.
In summary, on the 1kGP cohort our model produced validation results for SVs with AC 1-10 that matched the majority of TT-Mars results and that were consistent with both the trio data and the Gnocchi score expectations.
Figure 4.9: Distributions of Gnocchi z-scores of invalid SVs for each allele count from 1 to 10. The plots show no obvious bias. The means for AC = 5 and AC = 8 invalid SVs are -0.16 and -0.19, respectively, likely because of the small sample size of each AC category of invalid SVs. Removing these two groups, the remaining invalid Gnocchi scores still have a positive mean value close to 0.42.
4.4 Discussion
In this chapter we proposed a deep learning model, Devas, to evaluate structural variations with short-read sequencing data. The model proved robust in reducing the false discovery rate across different types of SVs while retaining the majority of true SVs. Devas is based on the Transformer architecture with a sparse attention matrix, and it treats DNA sequencing data as a sequence-processing task in which tokens are built base-wise. We demonstrated that sequencing data can be modeled and learned with natural language processing models. Devas was trained on high-confidence SV sets generated by our previous validation method, TT-Mars. Training was conducted on a limited number of sample genomes with ground-truth assemblies, and the results generalize both to a large cohort of short-read sequencing data and to variant calls made by methods other than those Devas was trained on. Devas was trained on an NVIDIA A40 GPU for more than 500 epochs within eight hours on 45,000 samples; the pre-processing steps took a similar amount of time.
Figure 4.10: Examples of valid and invalid SVs from the 1kGP sample SV callsets. The left column shows valid SVs and the right column shows invalid SVs, with deletions, duplications, and insertions from top to bottom.
We aimed to keep the model building and training time modest so that the method remains flexible and easily accessible.
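The sketch below illustrates the two ingredients named above, base-wise tokenization and a sparse attention pattern, using a simple local-window mask; the vocabulary, window size, and function names are illustrative assumptions rather than the actual Devas implementation:

```python
import torch

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def base_tokens(seq):
    """One token per base: the base-wise tokenization idea."""
    return torch.tensor([VOCAB.get(b, VOCAB["N"]) for b in seq.upper()])

def local_attention_mask(n, window=128):
    """Boolean n-by-n mask allowing attention only within a local
    window, one simple way to sparsify the attention matrix."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= window

tokens = base_tokens("ACGTN" * 200)       # 1,000 base-wise tokens
mask = local_attention_mask(len(tokens))  # sparse attention pattern
```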
Devas proved robust across a variety of SV types, although its performance on duplications and inversions does not match that on deletions and insertions. This suggests that generating more high-confidence duplication and inversion training samples could improve the generalization of Devas. Complex SVs have been detected by several methods; TT-Mars can validate combinations of SVs over a region rather than only individual calls. To achieve the same with a deep learning model, more training data capturing complex SV structures are needed.
Deep learning and its applications have become a prominent research topic in recent years, particularly with large language models such as ChatGPT [109], which have demonstrated exceptional capabilities across numerous tasks. This opens the possibility of transferring large pretrained models to tasks in sequencing. As increasingly reliable datasets become available, deep learning shows a promising future in genomics.
Chapter 5
Conclusions and Discussions
In the previous chapters, we explored the validation of structural variations using high-quality haplotype-resolved de novo assemblies, integrating both traditional algorithms and novel deep learning approaches to improve the robustness and accuracy of SV detection across multiple genomes.
Our tool, TT-Mars, relies on accurate genome assemblies, enabling detailed validation while reducing the disruption caused by rearrangements. The majority of SV classifications by TT-Mars align closely with those assessed by GIAB + Truvari, and TT-Mars evaluates significantly more calls than other methods such as the long-read validation tool VaPoR. The method proves particularly effective in repetitive regions, where assembly-content approaches provide more reliable variant matching than traditional methods. Additionally, we introduced a deep learning model, Devas, developed to extend the validation of SVs from short-read sequencing data to large scale. Utilizing the Transformer architecture, Devas processes DNA sequencing data through a sequence-based approach rather than traditional image-based methods, effectively lowering the false discovery rate while maintaining a high detection rate of true SVs. The model was trained on high-confidence SV sets generated by TT-Mars, demonstrating both its ability to generalize across a large cohort of sequencing data and the promise of deep learning in genomic analysis. Despite these successes, challenges remain, particularly in the validation of duplications and inversions, where Devas shows varying levels of performance. Enhancing the training dataset with more complex and varied
SV examples could improve its efficacy in these areas. To summarize, the methodologies presented, TT-Mars and Devas, represent significant advances in the validation of SVs using both traditional and deep learning approaches. By integrating these methods, we offer a comprehensive toolset that improves the accuracy of SV detection across multiple genomes, contributes to the understanding of genomic complexity, and leverages recent advances in machine learning to address longstanding challenges in genomics research. As more robust datasets become available, the potential for deep learning in genomics continues to expand, promising further innovations in the field.
Appendix A
Chapter 2 Supplementary Materials
A.1 Curation of dipcall+truvari and TT-Mars SV annotation
On the pbsv call set of sample HG00096, we randomly selected 100 calls that were given different results by dipcall+truvari and TT-Mars (50 calls classified FP by dipcall+truvari and TP by TT-Mars, and 50 calls classified TP by dipcall+truvari and FP by TT-Mars) for manual inspection in IGV. In each figure, the first track is the pbsv call set and the second track is the dipcall call set. The following two tracks are the alignments used to generate the orthology map in TT-Mars (lra), and the next two tracks are the alignments used by dipcall (pbmm2). The bottom track shows tandem duplication regions. The annotation of each figure is given in its caption.
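A minimal sketch of this sampling, assuming each call is represented as a (call_id, truvari_label, ttmars_label) tuple with labels "TP" or "FP"; the tuple layout and fixed seed are illustrative assumptions:

```python
import random

def sample_discordant(calls, n_per_group=50, seed=0):
    """Randomly pick n calls from each discordant category:
    dipcall+truvari FP / TT-Mars TP, and dipcall+truvari TP / TT-Mars FP."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    fp_tp = [c for c in calls if c[1] == "FP" and c[2] == "TP"]
    tp_fp = [c for c in calls if c[1] == "TP" and c[2] == "FP"]
    return rng.sample(fp_tp, n_per_group) + rng.sample(tp_fp, n_per_group)
```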
A.1.1 Dipcall+Truvari FP and TT-Mars TP results
(IGV screenshots of the 50 calls in this category; annotations are given in the figure captions.)
A.1.2 Dipcall+Truvari TP and TT-Mars FP results
(IGV screenshots of the 50 calls in this category; annotations are given in the figure captions.)
Bibliography
[1] J Craig Venter et al. “The sequence of the human genome”. In: science 291.5507
(2001), pp. 1304–1351.
[2] 1000 Genomes Project Consortium et al. “A map of human genome variation from
population scale sequencing”. In: Nature 467.7319 (2010), p. 1061.
[3] Haley J Abel et al. “Mapping and characterization of structural variation in 17,795
human genomes”. In: Nature 583.7814 (2020), pp. 83–89.
[4] Peter H Sudmant et al. “An integrated map of structural variation in 2,504 human
genomes”. In: Nature 526.7571 (2015), pp. 75–81.
[5] Medhat Mahmoud et al. “Structural variant calling: the long and the short of it”. In:
Genome biology 20.1 (2019), p. 246.
[6] Steve S Ho, Alexander E Urban, and Ryan E Mills. “Structural variation in the
sequencing era”. In: Nature Reviews Genetics 21.3 (2020), pp. 171–189.
[7] UK10K consortium et al. “The UK10K project identifies rare variants in health and
disease”. In: Nature 526.7571 (2015), p. 82.
[8] Jason D Merker et al. “Long-read genome sequencing identifies causal structural variation in a Mendelian disease”. In: Genetics in Medicine 20.1 (2018), pp. 159–163.
[9] Alba Sanchis-Juan et al. “Complex structural variants in Mendelian disorders: identification and breakpoint resolution using short- and long-read genome sequencing”.
In: Genome medicine 10.1 (2018), pp. 1–10.
[10] Yong-hui Jiang et al. “Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing”. In: The American Journal of Human
Genetics 93.2 (2013), pp. 249–263.
[11] Mari EK Niemi et al. “Common genetic variants contribute to risk of rare severe
neurodevelopmental disorders”. In: Nature 562.7726 (2018), pp. 268–271.
[12] Geoff Macintyre, Bauke Ylstra, and James D Brenton. “Sequencing structural variants
in cancer for precision therapeutics”. In: Trends in Genetics 32.9 (2016), pp. 530–542.
[13] William M Brandler et al. “Frequency and complexity of de novo structural mutation
in autism”. In: The American Journal of Human Genetics 98.4 (2016), pp. 667–679.
[14] Donna M Werling et al. “An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder”. In: Nature genetics
50.5 (2018), pp. 727–736.
[15] EE Eichler. “Copy number variation and human disease”. In: Nat Educ 1.3 (2008),
p. 1.
[16] Evan Udine, Angita Jain, and Marka van Blitterswijk. “Advances in sequencing technologies for amyotrophic lateral sclerosis research”. In: Molecular neurodegeneration
18.1 (2023), p. 4.
[17] Xuefang Zhao et al. “Expectations and blind spots for structural variation detection
from long-read assemblies and short-read genome sequencing technologies”. In: The
American Journal of Human Genetics (2021).
[18] Frances Theunissen et al. “Structural variants may be a source of missing heritability
in sALS”. In: Frontiers in Neuroscience 14 (2020), p. 47.
[19] Frederick Sanger, Steven Nicklen, and Alan R Coulson. “DNA sequencing with chain-terminating inhibitors”. In: Proceedings of the national academy of sciences 74.12
(1977), pp. 5463–5467.
[20] Jay Shendure and Hanlee Ji. “Next-generation DNA sequencing”. In: Nature biotechnology 26.10 (2008), pp. 1135–1145.
[21] Michael A Quail et al. “A large genome center’s improvements to the Illumina sequencing system”. In: Nature methods 5.12 (2008), pp. 1005–1010.
[22] Elaine R Mardis. “DNA sequencing technologies: 2006–2016”. In: Nature protocols
12.2 (2017), pp. 213–218.
[23] Iraad F Bronner et al. “Improved protocols for illumina sequencing”. In: Current
protocols in human genetics 79.1 (2013), pp. 18–2.
[24] Adam Ameur, Wigard P Kloosterman, and Matthew S Hestand. “Single-molecule
sequencing: towards clinical applications”. In: Trends in biotechnology 37.1 (2019),
pp. 72–85.
[25] Paul W Hook and Winston Timp. “Beyond assembly: the increasing flexibility of
single-molecule sequencing technology”. In: Nature Reviews Genetics 24.9 (2023),
pp. 627–641.
[26] Anthony Rhoads and Kin Fai Au. “PacBio sequencing and its applications”. In: Genomics, Proteomics and Bioinformatics 13.5 (2015), pp. 278–289.
[27] Christopher P Stefan et al. “Comparison of illumina and Oxford nanopore sequencing
technologies for pathogen detection from clinical matrices using molecular inversion
probes”. In: The Journal of Molecular Diagnostics 24.4 (2022), pp. 395–405.
[28] Mark JP Chaisson, Richard K Wilson, and Evan E Eichler. “Genetic variation and
the de novo assembly of human genomes”. In: Nature Reviews Genetics 16.11 (2015),
pp. 627–640.
[29] Matthew Pendleton et al. “Assembly and diploid architecture of an individual human
genome via single-molecule technologies”. In: Nature methods 12.8 (2015), pp. 780–
786.
[30] Fritz J Sedlazeck et al. “Accurate detection of complex structural variations using
single-molecule sequencing”. In: Nature methods 15.6 (2018), pp. 461–468.
[31] John Huddleston et al. “Discovery and genotyping of structural variation from long-read haploid genome sequence data”. In: Genome research 27.5 (2017), pp. 677–685.
[32] Mircea Cretu Stancu et al. “Mapping and phasing of structural variation in patient
genomes using nanopore sequencing”. In: Nature communications 8.1 (2017), pp. 1–
13.
[33] Peter Ebert et al. “Haplotype-resolved diverse human genomes and integrated analysis
of structural variation”. In: Science (2021).
[34] Sai Chen et al. “Paragraph: a graph-based structural variant genotyper for short-read
sequence data”. In: Genome biology 20.1 (2019), pp. 1–13.
[35] Jang-il Sohn and Jin-Wu Nam. “The present and future of de novo whole-genome
assembly”. In: Briefings in bioinformatics 19.1 (2018), pp. 23–40.
[36] Chen-Shan Chin et al. “A diploid assembly-based benchmark for variants in the major
histocompatibility complex”. In: Nature communications 11.1 (2020), pp. 1–9.
[37] Adam C English et al. “Assessing structural variation in a personal genome—towards
a human reference diploid genome”. In: BMC genomics 16.1 (2015), p. 286.
[38] Haoyu Cheng et al. “Haplotype-resolved de novo assembly using phased assembly
graphs with hifiasm”. In: Nature Methods 18.2 (2021), pp. 170–175.
[39] Heng Li and Richard Durbin. “Genome assembly in the telomere-to-telomere era”.
In: Nature Reviews Genetics (2024), pp. 1–13.
[40] Ken Chen et al. “BreakDancer: an algorithm for high-resolution mapping of genomic
structural variation”. In: Nature methods 6.9 (2009), pp. 677–681.
[41] Tobias Rausch et al. “DELLY: structural variant discovery by integrated paired-end
and split-read analysis”. In: Bioinformatics 28.18 (2012), pp. i333–i339.
[42] Kai Ye et al. “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads”. In: Bioinformatics
25.21 (2009), pp. 2865–2871.
[43] Ryan M Layer et al. “LUMPY: a probabilistic framework for structural variant discovery”. In: Genome biology 15.6 (2014), R84.
[44] Xiaoyu Chen et al. “Manta: rapid detection of structural variants and indels for
germline and cancer sequencing applications”. In: Bioinformatics 32.8 (2016), pp. 1220–
1222.
[45] Liang Gong et al. “Picky comprehensively detects high-resolution structural variants
in nanopore long reads”. In: Nature methods 15.6 (2018), pp. 455–460.
[46] Adam C English, William J Salerno, and Jeffrey G Reid. “PBHoney: identifying genomic variants via long-read discordance and interrupted mapping”. In: BMC bioinformatics 15 (2014), pp. 1–7.
[47] Zev N Kronenberg et al. “High-resolution comparative analysis of great ape genomes”. In: Science 360 (2018), eaar6343.
[48] Mircea Cretu Stancu et al. “Mapping and phasing of structural variation in patient
genomes using nanopore sequencing”. In: Nature communications 8.1 (2017), p. 1326.
[49] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553
(2015), pp. 436–444.
[50] Jianxiao Liu et al. “Application of deep learning in genomics”. In: Science China Life
Sciences 63 (2020), pp. 1860–1878.
[51] Ryan Poplin et al. “A universal SNP and small-indel variant caller using deep neural
networks”. In: Nature biotechnology 36.10 (2018), pp. 983–987.
[52] Victoria Popic et al. “Cue: a deep-learning framework for structural variant discovery
and genotyping”. In: Nature Methods (2023), pp. 1–10.
[53] Mei-Wei Luan et al. “Evaluating structural variation detection tools for long-read
sequencing datasets in Saccharomyces cerevisiae”. In: Frontiers in Genetics 11 (2020),
p. 159.
[54] Moritz Smolka et al. “Detection of mosaic and population-level structural variants
with Sniffles2”. In: Nature biotechnology (2024), pp. 1–10.
[55] Shunichi Kosugi et al. “Comprehensive evaluation of structural variation detection
algorithms for whole genome sequencing”. In: Genome biology 20 (2019), pp. 1–18.
[56] Goo Jun et al. “Structural variation across 138,134 samples in the TOPMed consortium”. In: bioRxiv (2023), pp. 2023–01.
[57] Varuni Sarwal et al. “VISTA: An integrated framework for structural variant discovery”. In: bioRxiv (2023), pp. 2023–08.
[58] Adam C English et al. “Truvari: refined structural variant comparison preserves allelic
diversity”. In: Genome Biology 23.1 (2022), pp. 1–20.
[59] Shu Mei Teo et al. “Statistical challenges associated with detecting copy number variations with next-generation sequencing”. In: Bioinformatics 28.21 (2012), pp. 2711–
2718.
[60] Wen-Wei Liao et al. “A draft human pangenome reference”. In: Nature 617.7960
(2023), pp. 312–324.
[61] Bjarni V Halldorsson et al. “The sequences of 150,119 genomes in the UK Biobank”.
In: Nature 607.7920 (2022), pp. 732–740.
[62] Gareth Hawkes et al. “Whole genome association testing in 333,100 individuals across
three biobanks identifies rare non-coding single variant and genomic aggregate associations with height”. In: bioRxiv (2023), pp. 2023–11.
[63] Daniel Taliun et al. “Sequencing of 53,831 diverse genomes from the NHLBI TOPMed
Program”. In: Nature 590.7845 (2021), pp. 290–299.
[64] All of Us Research Program Genomics Investigators et al. “Genomic data in the All
of Us Research Program”. In: Nature 627.8003 (2024), p. 340.
[65] Ryan L Collins et al. “A structural variation reference for medical and population
genetics”. In: Nature 581.7809 (2020), pp. 444–451.
[66] Arda Soylev et al. “Discovery of tandem and interspersed segmental duplications
using high-throughput sequencing”. In: Bioinformatics 35.20 (2019), pp. 3923–3930.
[67] Can Alkan, Bradley P Coe, and Evan E Eichler. “Genome structural variation discovery and genotyping”. In: Nature Reviews Genetics 12.5 (2011), pp. 363–376.
[68] Zev N Kronenberg et al. “Wham: identifying structural variants of biological consequence”. In: PLoS computational biology 11.12 (2015), e1004572.
[69] Jianzhi Yang and Mark JP Chaisson. “TT-Mars: structural variants assessment based
on haplotype-resolved assemblies”. In: Genome Biology 23.1 (2022), p. 110.
[70] Shanika L Amarasinghe et al. “Opportunities and challenges in long-read sequencing
data analysis”. In: Genome biology 21.1 (2020), p. 30.
[71] Hemang Parikh et al. “svclassify: a method to establish benchmark structural variant
calls”. In: BMC genomics 17.1 (2016), p. 64.
[72] Justin M Zook et al. “A robust benchmark for detection of germline large deletions
and insertions”. In: Nature biotechnology (2020), pp. 1–9.
[73] Mark JP Chaisson et al. “Multi-platform discovery of haplotype-resolved structural
variation in human genomes”. In: Nature communications 10.1 (2019), pp. 1–16.
[74] Mikhail Kolmogorov et al. “Scalable Nanopore sequencing of human genomes provides
a comprehensive view of haplotype-resolved variation and methylation”. In: bioRxiv
(2023), pp. 2023–01.
[75] Xuefang Zhao, Alexandra M Weber, and Ryan E Mills. “A recurrence-based approach
for validating structural variation using long-read sequencing technology”. In: GigaScience 6.8 (2017), gix061.
[76] Heng Li et al. “A synthetic-diploid benchmark for accurate variant-calling evaluation”. In: Nature methods 15.8 (2018), pp. 595–597.
[77] Aaron M Wenger et al. “Accurate circular consensus long-read sequencing improves
variant detection and assembly of a human genome”. In: Nature biotechnology 37.10
(2019), pp. 1155–1162.
[78] Fritz J Sedlazeck et al. “Tools for annotation and comparison of structural variation”.
In: F1000Research 6 (2017).
[79] William M Brandler et al. “Paternally inherited cis-regulatory structural variants are
associated with autism”. In: Science 360.6386 (2018), pp. 327–331.
[80] Marta Byrska-Bishop et al. “High-coverage whole-genome sequencing of the expanded
1000 Genomes Project cohort including 602 trios”. In: Cell 185.18 (2022), pp. 3426–
3440.
[81] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[82] John Jumper et al. “Highly accurate protein structure prediction with AlphaFold”.
In: Nature 596.7873 (2021), pp. 583–589.
[83] Rewon Child et al. “Generating long sequences with sparse transformers”. In: arXiv
preprint arXiv:1904.10509 (2019).
[84] Jonathan R Belyeu et al. “Samplot: a platform for structural variant visual validation
and automated filtering”. In: Genome biology 22.1 (2021), pp. 1–13.
[85] John Eid et al. “Real-time DNA sequencing from single polymerase molecules”. In:
Science 323.5910 (2009), pp. 133–138.
[86] James Clarke et al. “Continuous base identification for single-molecule nanopore DNA
sequencing”. In: Nature nanotechnology 4.4 (2009), pp. 265–270.
[87] Jingwen Ren and Mark JP Chaisson. “lra: A long read aligner for sequences and
contigs”. In: PLOS Computational Biology 17.6 (2021), e1009078.
[88] Heng Li. “Minimap2: pairwise alignment for nucleotide sequences”. In: Bioinformatics
34.18 (2018), pp. 3094–3100.
[89] James T Robinson et al. “Integrative genomics viewer”. In: Nature biotechnology 29.1
(2011), pp. 24–26.
[90] Andrew J Sharp et al. “Segmental duplications and copy-number variation in the
human genome”. In: The American Journal of Human Genetics 77.1 (2005), pp. 78–
88.
[91] Jiadong Lin et al. “Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants”. In: Genomics, Proteomics & Bioinformatics (2021). issn:
1672-0229. doi: https://doi.org/10.1016/j.gpb.2021.03.007. url: https://www.sciencedirect.com/science/article/pii/S1672022921001431.
[92] Kishwar Shafin et al. “Haplotype-aware variant calling with PEPPER-Margin-DeepVariant
enables high accuracy in nanopore long-reads”. In: Nature methods 18.11 (2021),
pp. 1322–1332.
[93] Danny Antaki, William M Brandler, and Jonathan Sebat. “SV2: accurate structural
variation genotyping and de novo mutation detection from whole genomes”. In: Bioinformatics 34.10 (2018), pp. 1774–1777.
[94] Jiadong Lin et al. “SVision: a deep learning approach to resolve complex structural
variants”. In: Nature methods 19.10 (2022), pp. 1230–1233.
[95] Diksha Khurana et al. “Natural language processing: State of the art, current trends
and challenges”. In: Multimedia tools and applications 82.3 (2023), pp. 3713–3744.
[96] Abdul Wahab et al. “DNA sequences performs as natural language processing by
exploiting deep learning algorithm for the identification of N4-methylcytosine”. In:
Scientific reports 11.1 (2021), p. 212.
[97] Dan Ofer, Nadav Brandes, and Michal Linial. “The language of proteins: NLP, machine learning & protein sequences”. In: Computational and Structural Biotechnology
Journal 19 (2021), pp. 1750–1758.
[98] Sanghyuk Roy Choi and Minhyeok Lee. “Transformer architecture and attention
mechanisms in genome data analysis: a comprehensive review”. In: Biology 12.7
(2023), p. 1033.
[99] Iz Beltagy, Matthew E Peters, and Arman Cohan. “Longformer: The long-document
transformer”. In: arXiv preprint arXiv:2004.05150 (2020).
[100] 1000 Genomes Project Consortium et al. “A global reference for human genetic variation”. In: Nature 526.7571 (2015), p. 68.
[101] Zhaochong Yu et al. “KMer-Node2Vec: Learning Vector Representations of K-mers
from the K-mer Graph”. In: bioRxiv (2022), pp. 2022–08.
[102] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”.
In: arXiv preprint arXiv:1301.3781 (2013).
[103] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”.
In: arXiv preprint arXiv:1412.6980 (2014).
[104] David E Larson et al. “svtools: population-scale analysis of structural variation”. In:
Bioinformatics 35.22 (2019), pp. 4782–4787.
[105] Guolin Ke et al. “Lightgbm: A highly efficient gradient boosting decision tree”. In:
Advances in neural information processing systems 30 (2017).
[106] Karri Kaivola et al. “Genome-wide structural variant analysis identifies risk loci for
non-Alzheimer’s dementias”. In: Cell genomics 3.6 (2023).
[107] Siwei Chen et al. “A genomic mutational constraint map using variation in 76,156
human genomes”. In: Nature 625.7993 (2024), pp. 92–100.
[108] Monkol Lek et al. “Analysis of protein-coding genetic variation in 60,706 humans”.
In: Nature 536.7616 (2016), pp. 285–291.
[109] Partha Pratim Ray. “ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope”. In: Internet of Things
and Cyber-Physical Systems (2023).
Abstract
Structural variation has been intensively studied in recent years. However, it remains challenging to detect and characterize structural variation with high accuracy.
To address the need for benchmarking structural variations, this dissertation approaches the validation of structural variations through traditional algorithms and deep learning natural language processing models. We developed TT-Mars and Devas to assess different types of structural variations from a variety of sources. Based on haplotype-resolved assemblies, TT-Mars validates structural variations by evaluating how well each call reflects the content of the ground-truth assembly, and it provides benchmarking results for structural variation detection tools on a number of genomes. Devas is a deep learning model with a Transformer architecture using sparse attention. This sequence-based approach reduces the false discovery rate while ensuring a high detection rate of true structural variations. With short-read sequencing data as input, Devas generalizes to callsets from large-scale cohorts and outputs robust results. Both methods have been demonstrated to be reliable on different datasets, and TT-Mars has been published and used in multiple studies. Together, TT-Mars and Devas present a robust framework that advances structural variation detection and contributes to our understanding of genomic complexity.