Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
danbing-tk: a computational genetics framework for studying variable number tandem repeats
(USC Thesis Other)
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
danbing-tk :
A Computational Genetics Framework for
Studying Variable Number Tandem Repeats
by
Tsung-Yu Lu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
August 2022
Copyright 2022 Tsung-Yu Lu
Acknowledgements
The five years of my PhD life are full of experiences I could never have imagined before coming
to the US. There are lots of highs and lows, which all become a part of me that I treasure dearly.
Without all the amazing people around me, I would not be able to overcome all the obstacles,
form new knowledge, share exciting research findings, create gazillions of happy memories, and
become a better person.
My advisor, Mark, has been one of the most important people to me for the past few
years. I am so fortunate and grateful for having an advisor like him. On the research side, he
always gives critical and constructive advice and makes my work more rigorous. When I have
too much self-doubt, he pushes me to go forward and have faith in myself. He connected me
with so many research groups both in academia and in industry and cares a lot about my career
after PhD. Beyond research, he is also a great mentor for cycling and life in general. I learned so
much from him about cycling. He even saved me a few hundred bucks by offering a quick fix to
my bike. When going through difficult life situations, I was also able to chat with him about my
thoughts and feelings. He was always very supportive and compassionate, which makes it much
easier to survive the tough times.
I am also very grateful for having Jingwen, Robel, Quentin, and Keon in the lab. We have
gone through so much and had tons of fun together. I learned a lot from you guys and appreciate
all the support from you all. Outside of my lab, I want to thank Brendon particularly for being
such a great friend and my emotional support. I am glad that you survived all my venting
sessions. Naturally, I must thank all the other people in the MCB squad – Yingfei, Ming and
George. We were so productive and happy when working together. I also want to thank the other
ii
people in the board game group – Obadiah and Yibei. Our after-hours gatherings would not be as
much fun without you guys.
I am also immensely grateful to all the teammates from the USC triathlon. You all have
been my greatest companions throughout the PhD, especially Bryant and Stefan. I will forever
miss all the fun workouts with you all, from Monday/Wednesday swims to Tuesday evening
tracks, from weekend workouts to collegiate races and eventually the Nationals!
Pursuing a PhD in the US makes me feel that I am losing the connections with Taiwan
sometimes. I appreciate all the precious memories created with my dear Taiwanese friends –
Tsu-Pei and Roy from QCB; Cho-Ying, Chou-Chun, Te-Lin, Jeremy, Chien-Sheng, Chen-Yu,
and Wang Kai from Viterbi; Yen-Ching, Michelle, Tim, Huang Chun, Ya-Wen, Yi-Jheng,
Camille, and Alisha from HSC.
I want to thank Nilson and Aaron for being the special people in my life. May we all find
happiness eventually. Lastly, I would like to send endless love to my family – Dad, Mom, and
my sister. Thanks for always being there for me even though sometimes I did not check in on
you often enough. I am eternally grateful to you all.
iii
Table of Contents
Acknowledgements ii
List of Tables vi
List of Figures vii
Abstract ix
Chapter 1: Introduction 1
1.1 From human genetics to VNTR genotyping ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1
1.2 Genotyping VNTRs from short reads ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 4
1.3 Mapping functional motifs in VNTRs ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 7
Chapter 2: Profiling Variable Number Tandem Repeat Variation across Populations
Using Repeat-Pangenome Graphs 10
2.1 Abstract ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 11
2.2 Introduction ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 12
2.3 Materials and methods ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 15
2.3.1 Repeat-pangenome graph construction ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 15
2.3.1.1 Initial discovery of tandem repeats ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 15
2.3.1.2 Boundary expansion of VNTRs ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 16
2.3.1.3 Read-to-graph alignment ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 17
2.3.1.4 Graph pruning and merging ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 18
2.3.1.5 Alignment quality analysis ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 18
2.3.1.6 Data filtering ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 18
2.3.1.7 Predicting VNTR lengths ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 19
2.3.1.8 Comparing with GraphAligner ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 20
2.3.2 Population analysis ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 21
2.3.2.1 VST calculation ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 21
2.3.2.2 Properties of VST ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 21
2.3.2.3 Identifying unstable loci ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 22
2.3.2.4 Identifying differential motif usage and expansion ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 23
2.3.3 eQTL mapping ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 24
2.3.3.1 Retrieving datasets ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 24
2.3.3.2 Genotype data preprocessing ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 24
2.3.3.3 Expression data preprocessing ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 25
2.3.3.4 Association test ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 26
iv
2.4 Results ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 26
2.4.1 Repeat pan-genome graph construction ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 26
2.4.2 Read-to-graph alignment in VNTR regions ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 29
2.4.3 Profiling VNTR length and motif diversity ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 32
2.4.4 Association of VNTR with nearby gene expression ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 35
2.5 Discussion ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 43
Chapter 3: The Motif Composition of Variable-Number Tandem Repeats Impacts
Gene Expression 45
3.1 Abstract ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 46
3.2 Introduction ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 47
3.3 Materials and Methods ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 51
3.3.1 Repeat pangenome graph construction ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 51
3.3.2 VNTR genotyping ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 51
3.3.3 Alignment quality analysis ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 51
3.3.4 Graph compaction and motif dosage computation ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 52
3.3.5 Quality control of motifs ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 52
3.3.6 eQTL mapping ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 53
3.3.7 Fine-mapping ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 54
3.4 Results ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 56
3.4.1 Repeat pangenome graphs enable accurate profiling of motif composition ⋅ ⋅ ⋅ ⋅ 56
3.4.2 VNTR motif composition has pervasive cis-effects on gene expression ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 57
3.4.3 Disease relevance of eMotifs ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 59
3.5 Discussion ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 69
Chapter 4: Conclusions and Future Work 72
References 75
Appendix A: Supplementary Information for Chapter 2 87
A.1 Supplemental Figures ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 87
A.2 Supplemental Tables ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 129
Appendix B: Supplementary Information for Chapter 3 133
B.1 Supplemental Figures ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 133
B.2 Supplemental Tables ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 144
v
List of Tables
A.1: Initial VNTR discoveries ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 129
A.2: False mapping of reads by danbing-tk over the initial 73,582 loci. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 129
A.3: eVNTRs discovered in this work that overlap with other studies ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 130
A.4: Data source ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 131
A.5: Augmenting database with disease-related tandem repeats ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 132
A.6: Comparison of alignment statistics between danbing-tk and GraphAligner. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 132
A.7: Realignment statistics of misaligned VNTR reads from bwa. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 133
B.1: Additional disease-relevant tandem repeats included in our reference set. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 144
B.2: eMotif discoveries for disease-relevant genes. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 146
B.3: eQTL mapping using a genome-wide P-value cutoff. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 147
vi
List of Figures
A.1: Comparison between the tandem repeat database in GangSTR and this work. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 87
A.2: An example of multiple STR annotations within a VNTR. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 88
A.3: An example VNTR annotation split by adVNTR-NN. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 89
A.4: Completeness of VNTR annotations in individual genomes. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 90
A.5: Classes of VNTRs removed by alignment quality filtering. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 91
A.6: LSB at repetitive regions. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 92
A.7: LSB at non-repetitive regions of all genotyped samples. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 93
A.8: LSB at non-repetitive regions preserves the relation between samples at repetitive
regions. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 94
A.9: Nearest neighbor search for LSB at VNTR regions using LSB at nonrepetitive regions
as a proxy. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 95
A.10: Profile of prediction accuracy for each sample. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 96
A.11: Performance of per-locus length prediction accuracy relative to GRCh38. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 97
A.12: Correlation between the estimation error in VNTR length and in LSB. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 97
A.13: Example of deviation in LSB across samples. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 98
A.14: Distribution of length estimation error for loci with or without a missing haplotype. ⋅ ⋅ 98
A.15: Correlation between length estimation error and fraction of novel k-mers. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 99
A.16: Relationship between GC content and length prediction error. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 99
A.17: Effect of GC content change on bias and length estimation. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 100
A.18: Examples of unstable loci with individuals > 10 standard deviations above the mean. ⋅ ⋅ 101
A.19: Null and observed distributions of kmcd and rd2 between the EAS and AFR
populations. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 102
A.20: Distance of TRs and eTRs to telomere. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 103
A.21: Association between the top 50 pairs of eVNTR and eGene. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 104
A.22: Conditional association of chr5:96896863-96896963 VNTR with ERAP2 expression
over chr5_96916885_T_C_b38. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 113
A.23: Linkage disequilibrium (LD) between chr5:96896863-96896963 VNTR and nearby
SNPs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 113
A.24: Spurious alignment of Illumina reads to GRCh38 at a VNTR locus. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 114
A.25: Boundary expansion recovers the proper boundary of VNTR alleles. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 115
A.26: Distribution of number of genes overlapping shuffled high VST loci. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 116
A.27: Distribution of genes and UTR regions overlapping shuffled unstable loci. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 117
A.28: Number of eVNTRs shared between or specific to each tissue. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 118
A.29: Length distribution of VNTRs and eVNTRs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 119
A.30: Sample QC on VNTR genotypes of the 1000 Genomes. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 120
vii
A.31: Sample QC on VNTR genotypes the GTEx Genomes. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 121
A.32: Growth of relative VNTR-graph size. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 122
A.33: Example of under-alignment of orthologous VNTR sequences by pggb. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 122
A.34: Misalignment of simulated VNTR reads by bwa. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 123
A.35: Misalignment of VNTR reads to GRCh38 rescued by danbing-tk. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 124
A.36: Relationship between VNTR length and prediction error. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 124
A.37: Relationship between eVNTR P-value and prediction error. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 125
A.38: Comparing the alignment accuracy with and without threading. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 126
A.39: Replication of Vst on the 698 genomes related to the 1KGP samples. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 127
A.40: Incremental RPGG construction and change in boundary annotations. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 128
B.1: Distribution of VNTR size in GRCh38 and in the 70 HGSVC assemblies. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 133
B.2: Distribution of VNTR size versus allele properties. Private alleles are those
observed only once across the 70 haplotypes. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 134
B.3: Distribution of aln-r2 in this work and Lu 2021. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 135
B.4: Distribution of eVNTR discoveries from Lu 2021 (Lu and Chaisson 2021) missing
in this work. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 135
B.5: Size of locus-RPGG before and after compaction. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 136
B.6: Distribution of the standard deviation of absolute percentage error for all motifs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 136
B.7: Distribution of the number of outlier samples across all motifs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 137
B.8: Distribution of eMotif P-values from Geuvadis missing in GTEx. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 137
B.9: Dot plot analysis of AVPR1A RS3 VNTR. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 138
B.10: Frequency of CACNA1C risk motifs across the 35 HGSVC assemblies. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 139
B.11: Number of eGenes with likely causal eMotifs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 140
B.12: Number of eVNTRs with likely causal eMotifs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 141
B.13: Number of likely causal eMotifs. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 142
B.14: Frequency of likely causal eMotifs across the 35 HGSVC assemblies. ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 143
viii
Abstract
Variable number tandem repeats (VNTRs) are consecutive repetitive elements in the human
genome. They are subject to expansion and contraction during replication and exhibit
considerable variability between individuals. Most genetics studies exclude these regions due to
unreliable read alignments and variant calls. However, VNTRs contribute to abundant
phenotypic variations in the human population and have extensive impact on disease risk. While
methods exist to genotype VNTRs, no tools consider the following key components at the same
time. First, genotyping VNTRs from short reads can benefit from the vast amount of population
or cohort studies based on short-read sequencing. Second, multiallelic variant calls are preferred
for the copy number varying nature of VNTRs. Third, motif variations are a common mode of
mutation besides expansion/contraction, and their functional impacts have not been
systematically evaluated. In this dissertation, I present our efforts to genotype VNTRs from short
reads using a novel data structure – the repeat-pangenome graph – which provides a unified
framework for profiling the functional impact of VNTRs at both copy number and motif levels.
ix
Chapter 1
Introduction
1.1 From human genetics to VNTR genotyping
Human genetics aims to understand the relationship between genetic and phenotypic variations.
It allows us to describe the modes of heritability for a phenotype such as recessive and dominant.
It illuminates the fact that quantitative traits can be explained by allelic dosage. It provides us
with an explanation for the complex interplay between genetic loci and phenotypes such as
epistasis and pleiotropy. Building upon decades of research, scientists were finally able to start
answering one of the fundamental questions in this field – what are the causal genetic variants
for human disease? Most of the diseases that could be addressed in the early ages are monogenic,
or caused by a single mutation in the human genome, such as sickle cell anemia (Pauling and
Itano 1949) , cystic fibrosis (Kerem et al. 1989) and Huntington's disease (MacDonald et al.
1993) . Linkage analysis of affected pedigrees is a useful approach to narrow down genetic loci
for follow-up studies, in order to pinpoint the exact causal variants and establish disease
mechanisms. The fruitful results provide insight to human disease and enable translational
research and disease prevention. That said, there are several limitations in conventional linkage
analysis (Visscher et al. 2012) . First, the mapping resolution for causal loci is low, typically at
the megabase scale, due to homologous recombination events within a pedigree. Second, the
method performs best with monogenic diseases since they have higher penetrance and fewer
factors are required for modeling. Common diseases and complex traits that involve multiple
causal variants with minor effects would be intractable with this approach.
1
For the last two decades, the field has shifted towards using genome-wide association
studies (GWASs) as the major approach to map causal variants (Dewan et al. 2006; Klein et al.
2005; Wellcome Trust Case Control Consortium 2007) . Several advances made GWAS possible
and more powerful along the way. First, the development of genotyping arrays for the whole
genome allows causal variants to be tagged by nearby single nucleotide variants (SNVs).
Second, the population linkage disequilibrium (LD) structure from the HapMap project (The
International HapMap Consortium 2005) increases the mapping resolution by imputing more
SNVs from a genotyping array. The outcomes were uplifting for risk locus identification and
translational research of several diseases such as type 2 diabetes and schizophrenia. For the
former case, population-specific variants with relatively large effect sizes were found in P AX4 for
East Asians (Fuchsberger et al. 2016) and in TBC1D4 for Inuit people (Moltke et al. 2014) . A
candidate drug was developed based on the finding of loss-of-function mutations in SLC30A8
(Flannick et al. 2014) . As for schizophrenia, the C4A gene was implicated in the etiology (Sekar
et al. 2016) based on extensive GWAS signals in the huge MHC locus (Schizophrenia Working
Group of the Psychiatric Genomics Consortium 2014) . The polygenic nature of schizophrenia
also led to the proposal of drug repurposing for multi-target drug development (de Jong et al.
2016) .
Despite efforts to unravel the genetic architecture of complex traits or common diseases
through GWA studies, the fraction of phenotypic variance explained by reported signals is
generally low (Tam et al. 2019; Eichler et al. 2010; Manolio et al. 2009) , and the effect sizes are
typically small for the majority of the reported signals and moderate for a few (Visscher et al.
2012; Eichler et al. 2010) . Using pedigree or twin studies, researchers were able to estimate the
upper bound for the fraction of variance ( ) that could be explained by additive genetic effects,
2
referred to as narrow sense heritability (Coventry and Keller 2005; Eaves et al. 1978) . The gap
between and the fraction of the variance explained by GWAS ( h
2
) is therefore called the
missing heritability. This could be attributed to the influence of several complex factors such as
environment (Felson 2014; James 2008; Rundle et al. 2007) , genotype-by-environment (GxE)
interactions (Brown et al. 2014; Glass et al. 2013; Lee et al. 2014; Fairfax et al. 2014) ,
dominance (Reynolds et al. 2021; López-Cortegano and Caballero 2019; Sanjak, Long, and
Thornton 2017) , epistasis (Brown et al. 2014; Buil et al. 2014) and structural variation
(Jakubosky et al. 2020; Manolio et al. 2009; Payer et al. 2017) .
In this thesis, we limit the scope of study to structural variants (SVs). They are genetic
variants with sizes greater than 50 bp that can be combinations of insertions, deletions, and
inversions. The majority of SVs are derived from repetitive regions in the human genome
(Chaisson et al. 2019) and are referred to as copy number variants (CNVs), which can be further
divided into interspersed or tandem duplications. There are mainly two classes of tandem
duplications – short tandem repeats (STRs) and variable number tandem repeats (VNTRs).
Computational algorithms for STR genotyping (Gymrek et al. 2012; Mousavi et al. 2019;
Willems et al. 2017; Dashnow et al. 2018; Tankard et al. 2018; Tang et al. 2017; Dolzhenko et al.
2017, 2020) have reached a level that is both accurate and scalable for incorporation into clinical
pipelines (Stranneheim et al. 2021; Kristina et al. 2022) . In contrast, software packages for
VNTR genotyping were nonexistent until the release of adVNTR in 2018 (Bakhtiari et al. 2018) ,
which was limited to targeted genotyping of only ~3000 loci, leaving the majority of the ~100k
VNTRs unexplored. While VNTR sequences could be directly resolved using Pacbio sequencing
instead of indirect estimation from short-read sequencing (SRS), Pacbio sequencing remains
expensive and is neither scalable for population studies nor practical for most personalized
3
medical decisions. Furthermore, most of the population or case-control studies use SRS (1000
Genomes Project Consortium et al. 2015; GTEx Consortium 2020; Laakso et al. 2017) .
Therefore, methods that can exploit the vast amount of existing short-read data would be
valuable.
The overarching goal of my thesis is to develop a graph-based framework to genotype
VNTR length and content, and to evaluate the contribution of VNTR variations to the phenotypic
diversity in the human populations. In the following sections of this chapter, I will be discussing
the significance, challenges, and opportunities for the scientific problems discussed in chapter 2
and 3.
1.2 Genotyping VNTRs from short reads
Variable number tandem repeats (VNTRs) constitute a significant portion of sequence variation
(Chaisson et al. 2019) and are implicated in a diverse range of human disease such as bipolar
disorder (Benedetti et al. 2008) , type I diabetes (Pritchard et al. 2007) , and Alzeheimer’s disease
(Pritchard et al. 2007) . They are defined as sequences consisting of consecutive copies of a
motif, each greater than 6 bp. Reliable read alignment and variant calling at these regions are
usually hampered by their repetitive nature and high mutation rate (Kirby et al. 2013;
Hajirasouliha et al. 2010) , which increases the divergence between a reference and a query,
known as allelic bias. Most functional genomic studies had to remove these regions for proper
data interpretation (Amemiya, Kundaje, and Boyle 2019) . It is therefore hypothesized that
genetic variants within VNTR regions contribute to a significant portion of the missing
heritability of complex traits (Chaisson et al. 2019; Eichler et al. 2010) .
4
The first step to address this is to produce accurate variant calls in these regions. One
approach is to generate de novo diploid assemblies of human genomes from long-read
sequencing (LRS) data to directly obtain the ground truth VNTR sequences (Ebert et al. 2021;
Chaisson et al. 2019) . Variants within these regions can be called by comparing diploid
assemblies with the reference (Li et al. 2018) . Tandem repeats are found to be the major source
of SV insertions/deletions (>50%) and are better represented in diploid assemblies compared to
GRCh38 (Chaisson et al. 2019) . This approach also has higher sensitivity in SV detection.
Assembling nine genomes from three trios produces an increase in discovered SVs more than
three-fold compared to short-read based SV calling (Chaisson et al. 2019) . Furthermore, SV
events are more enriched in the African population (Chaisson et al. 2019) , warranting the
inclusion of genomes with diverse ancestry background for population-based research. One
common disadvantage for the biallelic variant calls adopted by these studies is that the
quantitative nature of repeat expansion and contraction is not captured. Multiallelic variant calls
are preferred for investigating the functional roles of copy number variations in VNTR regions.
By representing VNTR variation with multiallelic calls, recent studies have revealed
abundant associations between VNTR length and biological traits e.g. gene expression level
(Garg et al. 2021; Bakhtiari et al. 2021) , height (Mukamel et al. 2021; Beyter et al. 2021) , and
disease status (De Roeck et al. 2018; Course et al. 2020; Trost et al. 2020) . Only one study
discussed here (Beyter et al. 2021) uses LRS to obtain VNTR genotype directly. Most studies,
however, use SRS data and rely on information from a single reference genome. Either the read
depth over GRCh38 (De Roeck et al. 2018; Mukamel et al. 2021; Garg et al. 2021) or a model
built from GRCh38 alone (Bakhtiari et al. 2021) is used to infer repeat length. Reference bias
remains largely unsolved.
5
One solution to counter reference bias is to perform sequence analysis on a pangenome
graph (PGG) that encodes sequence variations across populations in a single data structure while
“collapsing” regions that are identical between individuals into single sequences. Several
pangenome graph models have demonstrated the utility of PGGs to increase the accuracy of
variant calls in complex regions of the human genome (Iqbal et al. 2012; Hickey et al. 2020; Li,
Feng, and Chu 2020) . Despite this, multiple problems with VNTR genotyping remain unsolved.
First, the multiple sequence alignment (MSA) step in PGG construction can be unreliable for
VNTRs, for which multiple solutions with similar quality are likely, and the graph structure
derived from the MSA output can introduce artifacts. Second, the boundaries of a VNTR locus
remain undefined even after MSA. Third, PGG-based variant callers are generally designed for
biallelic variants. Several key components for comparing VNTR genotypes between individuals
are missing. For example, consistent annotations of VNTR boundaries across individuals are
required to produce correct length estimates. Multiallelic genotypes such as repeat length or the
occurrence of a motif are preferred for follow-up functional analyses.
The rise of de novo assembly, the advantage of PGG, and the multiallelic representation
of VNTR calls offer a great opportunity to solve the VNTR genotyping problem. In chapter 2, I
present our recent development in genotyping VNTRs with danbing-tk (Lu and Chaisson 2021) .
We establish a pipeline to annotate VNTRs, map orthology, encode variations with a
repeat-pangenome graph (RPGG), and align short reads to the RPGG. Leveraging the rich
information in graph alignments, we reveal abundant loci with differential repeat expansion and
motif usage across the human populations. Unstable loci that are potentially susceptible to repeat
expansion are highlighted. We also discover hundreds of VNTR loci for which the length is
6
associated with nearby gene expression. Our toolkit allows researchers to profile VNTRs that
were largely missed or ill-defined in typical analyses.
1.3 Mapping functional motifs in VNTRs
VNTRs are the major type of SVs in the human genome (Chaisson et al. 2019) . They are also
more likely to impact phenotypes compared to SNVs and indels (Chaisson et al. 2019) . Indeed,
VNTRs have been reported to have extensive effects on human complex traits and diseases such
as psoriasis (Hollox et al. 2008) , lipoprotein A levels (Kraft et al. 1992) , Alzheimer’s (De Roeck
et al. 2018) , and ALS (Course et al. 2020) . Most studies do not survey the genome-wide
association of VNTRs with specific traits of interest and require prior knowledge to pinpoint
disease loci. Only recently, systematic evaluation of the functional impact of VNTRs became
possible with advances in sequencing technologies and computational methods. Association
studies such as eQTL mapping (Lu and Chaisson 2021; Bakhtiari et al. 2021; Eslami Rasekh et
al. 2021) and GWAS (Garg et al. 2022; Mukamel et al. 2021) lead to the discovery of pervasive
effects of VNTRs on complex traits. Among the abundant signals, fine-mapping suggests a
significant fraction of them are poorly tagged by nearby SNVs (Mukamel et al. 2021; Bakhtiari
et al. 2021; Garg et al. 2022) , implicating VNTRs as one of the major sources of missing
heritability. Putative causal variants identified with these approaches provide valuable
information for follow-up studies to unravel disease mechanisms and initiate translational
research.
When examining the methodology of each study, either biallelic variant calls or
multiallelic VNTR lengths are used to test for associations with human traits. As discussed in the
7
previous section, multiallelic calls are more ideal for testing the functional impact of VNTRs due
to their susceptibility to copy number variation through expansion or contraction. Additionally,
repeated motifs that contain transcription factor binding sites might act together to perturb gene
expression. This suggests an advantage to using the occurrence of each unique motif within a
VNTR locus as the multiallelic genotype. Nonetheless, existing genotyping tools only output the
copy number or length of a VNTR, which essentially assumes 100% identity across all repeat
units. Depending on the problem formulation, either a probabilistic model built from the single
allele in GRCh38 (Bakhtiari et al. 2021) or the read depth over GRCh38 (De Roeck et al. 2018;
Mukamel et al. 2021; Garg et al. 2021) is used to estimate VNTR length. Copy number variation,
however, is not the only type of variation across VNTRs. SNVs and indels within each repeat
unit are also prevalent (Course et al. 2021, 2020) and essentially create a novel “motif” through
each mutation. The impact of the majority of these motif variations is unknown except for a few
case studies. For example, a single nucleotide insertion in the MUC1 VNTR creates a premature
stop codon and causes medullary cystic kidney disease type 1 (Kirby et al. 2013). In another
case, certain motifs in the CACNA1C VNTR are found to decrease gene expression while other
motifs have the opposite effect and protect against schizophrenia (Song, Lowe, and Kingsley
2018). Currently, no tools are aware of VNTR motif variations, impeding the detection of
functional motifs in VNTRs.
In our previous work (Lu and Chaisson 2021) , we show that motifs or paths differentially
represented across populations can be detected from graph alignments output by danbing-tk . It is
therefore possible to perform association studies to identify motifs for which the variation
correlates with changes in traits such as gene expression. Moreover, with the release of 35 HiFi
8
assemblies by HGSVC (Ebert et al. 2021) , we have the opportunity to improve the quality of
RPGG and extend the application of danbing-tk to detect functional motifs.
In chapter 3, I present our recent efforts in mapping functional VNTR motifs (Lu and
Chaisson 2022) . We update the RPGG using the 35 high-quality long-read assemblies released
by HGSVC and expand the genotypable set of VNTRs. We exploit the full graph alignment
information produced by danbing-tk and develop a framework to perform motif-trait associations
based on compact graph alignments. We discover abundant eMotifs, or motifs for which the copy
number variation is associated with nearby gene expression. We replicate the discovery of a
known risk eMotif and report a novel risk eMotif for schizophrenia in the CACNA1C VNTR.
Fine-mapping is performed to assess the causality of eMotifs and reveal abundant putative causal
eMotifs carrying independent effects from GTEx variants. We prioritize four likely causal
eMotifs with strong evidence of regulatory roles in disease-related genes as an additional
reference for experimental biologists.
9
Chapter 2
Profiling Variable Number Tandem Repeat Variation across
Populations Using Repeat-Pangenome Graphs
Understanding the impact of genetic variants on human phenotypes is one of the fundamental
questions in human genetics studies. Association studies such as eQTL mapping or GWAS are
efficient methods to assess the functional impact of variants at genome-wide scale, and provide
valuable information for prioritizing genetic loci for follow-up experimental validation. A key
step for an association study is to acquire accurate genotypes for the variants of interest in order
to perform hypothesis testing. Most early studies only consider SNVs, and to a lesser extent for
short indels and SVs due to limitations in experiments and algorithms. As of the date this thesis
was written, incorporating SVs into association studies has gradually become a routine,
algorithms for STR genotyping have just become full-fledged, and methods for VNTR
genotyping are still under active development.
In this chapter, I present how we address the problem of genotyping VNTRs from
Illumina short reads. VNTRs are known to have a higher mutation rate than other regions of the
human genome. As a result, GRCh38 imposes a strong bias in sequence composition and suffers
from reduced mapping accuracy in VNTR regions when we intend to align short reads to this
reference sequence. We will show how this can be resolved by a new data structure,
repeat-pangenome graph (RPGG), along with a toolkit danbing-tk that constructs and performs
operations on this data structure. We will also demonstrate a few analyses and applications once
we have accurate genotypes for VNTRs.
10
2.1 Abstract
Variable number tandem repeat sequences (VNTR) are composed of consecutive repeats of short
segments of DNA with hypervariable repeat count and composition. They include protein coding
sequences and associations with clinical disorders. It has been difficult to incorporate VNTR
analysis in disease studies that use short-read sequencing because the traditional approach of
mapping to the human reference is less effective for repetitive and divergent sequences. We solve
VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that
encodes both the population diversity and repeat structure of VNTR loci from multiple
haplotype-resolved assemblies. We developed software to build a RPGG, and use the RPGG to
estimate VNTR composition with short reads. We used this to discover VNTRs with length
stratified by continental population, and novel expression quantitative trait loci, indicating that
RPGG analysis of VNTRs will be critical for future studies of diversity and disease.
11
2.2 Introduction
The human genome is composed of roughly 3% simple sequence repeats (SSRs) (I. H. G. S.
Consortium and International Human Genome Sequencing Consortium 2001) , loci composed of
short, tandemly repeated motifs. These sequences are classified by motif length into short
tandem repeats (STRs) with a motif length of six nucleotides or fewer, and variable-number
tandem repeats (VNTRs) for repeats of longer motifs. SSRs are prone to hyper-mutability
through motif copy number changes due to polymerase slippage during DNA replication
(Viguera, Canceill, and Ehrlich 2001) . Variation in SSRs are associated with tandem repeat
disorders including amyotrophic lateral sclerosis and Huntington’s disease (Gatchel and Zoghbi
2005) , and VNTRs are associated with a wide spectrum of complex traits and diseases including
attention-deficit disorder, Type 1 Diabetes and schizophrenia (Hannan 2018) . While STR
variation has been profiled in human populations (Mallick et al. 2016) and to find expression
quantitative trait loci (eQTL) (Fotsing et al. 2019; Gymrek et al. 2016) , and variation at VNTR
sequences may be detected for targeted loci (Bakhtiari et al. 2018; Dolzhenko et al. 2019) , the
landscape of VNTR variation in populations and effects on human phenotypes are not yet
examined genome-wide. Large scale sequencing studies including the 1000 Genomes Project
(1000 Genomes Project Consortium et al. 2015) , TOPMed (Taliun et al. 2021) and DNA
sequencing by the Genotype-Tissue Expression (GTEx) project (G. Consortium and GTEx
Consortium 2017) rely on high-throughput sequencing (SRS) characterized by SRS reads up to
150 bases. Alignment and standard approaches for detecting single-nucleotide variant (SNV) and
indel variation (insertions and deletions less than 50 bases) using SRS are unreliable in SSR loci
(Li et al. 2018) , and the majority of VNTR SVs are missed using general SV detection
algorithms with SRS (Chaisson et al. 2019) .
12
A number of tools have been developed specifically to detect or genotype tandem repeat
variation with short reads. Most existing tools, however, support a limited description of the
complexity of tandem repeats using a single motif, such as in GangSTR (Mousavi et al. 2019)
and adVNTR (Bakhtiari et al. 2018) , leaving the variation in motif sequences unexplored. While
ExpansionHunter (Dolzhenko et al. 2019) allows the repeat structure to be defined by a regular
expression, it is mostly restricted to STR genotyping and has not been extended to VNTRs. The
full extent to which VNTR loci differ has been made more clear by long-read sequencing (LRS)
and assembly. LRS assemblies have megabase scale contiguity and accurate consensus
sequences (Koren et al. 2017; Chin et al. 2016) that may be used to detect VNTR variation.
Nearly 70% of insertions and deletions discovered by LRS assemblies greater than 50 bases are
in STR and VNTR loci (Chaisson et al. 2019) , accounting for up to 4 Mbp per genome.
Furthermore, LRS assemblies reveal how VNTR sequences differ by kilobases in length and by
motif composition (Song, Lowe, and Kingsley 2018) . LRS assemblies have been used to
improve VNTR analysis with SRS when used as population-specific references that add
sequences missing from the reference and improve alignment (Du et al. 2019; Shi et al. 2016) .
Additionally, VNTR variation discovered by LRS assemblies may be genotyped using SRS,
although with lower accuracy than other SVs (Hickey et al. 2020; Audano et al. 2019) .
Furthermore, the genotype represents the presence of a known variant, and does not reveal the
spectrum of copy number variation that exists in tandem repeat sequences (Chen et al. 2019) .
Repeat length estimation in tools specialized for tandem repeat genotyping allows more
biological meaningful analyses (Gymrek et al. 2016; Saini et al. 2018; Gymrek et al. 2017) .
The hypervariability of VNTRs prevents a single assembly from serving as an optimal
reference. Instead, to improve both alignment and genotyping, multiple assemblies may be
13
combined into a pangenome graph (Hickey et al. 2020; Garrison et al. 2018; Chen et al. 2019;
Eggertsson et al. 2019) composed of sequence-labeled vertices connected by edges such that
haplotypes correspond to paths in the graph. Sequences shared between haplotypes are stored in
the same vertex, and genetic variation is represented by the structure of the graph. A
conceptually similar construct is the repeat graph (Pevzner, Tang, and Tesler 2004) , with
sequences repeated multiple times in a genome represented by the same vertex. Graph analysis
has been used to encode the elementary duplication structure of a genome (Jiang et al. 2007) and
for multiple sequence alignment of repetitive sequences with shuffled domains (Raphael et al.
2004) , making them well-suited to represent VNTRs that differ in both repeat count and
composition.
Here we propose the representation of human VNTRs as a repeat-pangenome graph
(RPGG), that encodes both the repeat structure and sequence diversity of VNTR loci. The most
straight-forward approach that combines a pangenome graph and a repeat graph is a de Bruijn
graph, and was the basis of one of the earliest representations of a pangenome by the Cortex
method (Iqbal, Turner, and McVean 2013; Iqbal et al. 2012) . The de Bruijn graph has a vertex for
every distinct sequence of length k in a genome ( k- mer), and an edge connecting every two
consecutive k -mers, thus k -mers occurring in multiple genomes or multiple times in the same
genome are stored by the same vertex. While the Cortex method stores entire genomes in a de
Bruijn graph, we construct a separate locus-RPGG for each VNTR and store a genome as the
collection of locus-RPGGs, which deviates from the definition of a de Bruijn graph because the
same k -mer may be stored in multiple vertices.
We develop a toolkit, Tan d em Repe a t Ge n otyping b ased on Haplotype-der i ved
Pange n ome G raphs (danbing-tk) to identify VNTR boundaries in assemblies, construct RPGGs,
14
align SRS reads to the RPGG, and infer VNTR motif composition and length in SRS samples.
We generate a RPGG from 19 haplotype-resolved LRS assemblies sequenced for population
references and diversity panels (Chaisson et al. 2019; Audano et al. 2019; Seo et al. 2016; Shi et
al. 2016) , showing that while ~85% of the composition of repeats is discovered after 3 genomes,
the genetic diversity stored in the RPGG sequentially increases as all 19 genomes are included in
the RPGG. Alignment to the RPGG improves the mean absolute percentage error 28-63% over
mapping to the standard human reference as a linear sequence or a repeat graph. This enables the
alignment of SRS datasets into an RPGG to discover population genetics of VNTR loci, and to
associate expression with VNTR variation. We find 785 loci that demonstrate population
structure with respect to the inferred lengths of VNTR sequences, and importantly to discover
8,216 loci that show differential motif usage between populations. Finally, we apply danbing-tk
to the SRS genomes from the GTEx consortium to discover 346 eQTL where the VNTR length
is associated with gene expression.
2.3 Materials and methods
2.3.1 Repeat-pangenome graph construction
2.3.1.1 Initial discovery of tandem repeats
TRF v4.09 (option: 2 7 7 80 10 50 500 -f -d -h) (Benson 1999) was used to roughly annotate the
SSR regions of five PacBio assemblies (AK1, HG00514, HG00733, NA19240, NA24385). The
scope of this work focuses on VNTRs that cannot be resolved by typical short read sequencing
methods. We selected the set of SSR loci with a motif size greater than 6 bp and a total length
greater than 150 bp and less than 10 kbp. For each haplotype, the selected VNTR loci were
15
mapped to GRCh38 reference genome to identify homologous VNTR loci. To maintain data
quality, VNTR loci that could not be assigned homology were removed from datasets.
2.3.1.2 Boundary expansion of VNTRs
The biological boundaries of a VNTR are ill-defined; VNTRs with sparse recurring motifs or
transition between different motifs or a nested motif structure often fail to be fully annotated by
TRF. A misannotation of VNTR boundaries can cause erroneous length estimates. To avoid the
propagation of this error to downstream analysis, we developed a multiple boundary expansion
algorithm to recover the proper boundary for each VNTR across all haplotypes, including the 14
remaining genomes (HG00268, HG00512, HG00513, HG00731, HG00732, HG01352,
HG02059, HG02106, HG02818, HG04217, NA12878, NA19238, NA19239 and NA19434). The
algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share
k -mers with the VNTR regions from all haplotypes. VNTR boundaries in each haplotype are
iteratively expanded until the invariant is true or if expansion exceeds 10 kbp in either 5’ or 3’
direction. The size of the flanking regions is chosen to be 700 bp, which is approximately the
upper bound of the insert size of typical SRS reads. The following QC step removes a haplotype
if its VNTR annotation is within 700 bp to breakpoints or if the orthology mapping location to
GRCh38 is different from the majority of haplotypes. A VNTR locus with the number of
supporting haplotypes less than 90% of the total number of haplotypes is also removed. Adjacent
VNTR loci within 700 bp to each other in any of the haplotypes will induce a merging step over
all haplotypes. Haplotypes with distance between adjacent loci inconsistent with the majority of
haplotypes are removed. Finally, VNTR loci with the number of supporting haplotypes less than
80% of the total number of haplotypes are removed, leaving 73,582 of the initial 84,411 loci.
16
2.3.1.3 Read-to-graph alignment
For the two haplotypes of an individual, three data structures are used to encode the information
of all VNTR loci, including VNTRs and their 700 bp flanking sequences. The first data structure
allows fast locus lookup for each k -mer ( k =21) by hashing each canonical k -mer in the VNTRs
and the flanking sequences to the index of the original locus. The second data structure enables
graph threading by storing a bi-directional de Bruijn graph for each locus. The third data
structure is used for counting k -mers originating from VNTRs. The read mapping algorithm
maps each pair of Illumina paired-end reads to a unique VNTR locus in three phases: (1) In the
k -mer set mapping phase, the read pair is converted to a pair of canonical k -mer multisets. The
VNTR locus with the highest count of intersected k -mers is detected with the first data structure.
(2) In the threading phase, the algorithm tries to map the k -mers in the read pair to the
bi-directional de Bruijn graph such that the mapping forms a continuous path/cycle. To account
for sequencing and assembly errors, the algorithm is allowed to edit a limited number of
nucleotides in a read if no matching k -mer is found in the graph. The read pair is determined
feasible to map to a VNTR locus if the number of mapped k -mers is above an empirical
threshold. (3) In the k -mer counting phase, canonical k -mers of the feasible read pair are counted
if they existed in the VNTR locus. Finally, the read mapping algorithm returns the k -mer counts
for all loci as mapped by SRS reads. Alignment timing was conducted on an Intel Xeon
E5-2650v2 2.60GHz node.
17
2.3.1.4 Graph pruning and merging
Pan-genome representation provides a more thorough description of VNTR diversity and reduces
reference allele bias, which effectively improves the quality of read mapping and downstream
analysis. Considering the fact that haplotypes assembled from long read datasets are error prone
in VNTR regions, it is necessary to prune the graphs/ k -mers before merging them as a
pan-genome. We ran the read mapping algorithm with error correction disabled so as to detect
k -mers unsupported by SRS reads. The three data structures were updated by deleting all
unsupported k -mers for each locus. By pooling and merging the reference regions corresponding
to the VNTR regions in all individuals, we obtained a set of “pan-reference” regions, each
indicating a location in GRCh38 that is likely to map to a VNTR region in any other unseen
haplotype. By referencing the mapping relation of VNTR loci across individuals, we encoded the
variability of each VNTR locus by merging the three data structures across individuals.
2.3.1.5 Alignment quality analysis
To evaluate the quality of the haplotype assemblies and the performance of the read mapping
algorithm, VNTR k -mer counts in the original assemblies were regressed against those mapped
from SRS reads. The of the linear fit was used as the alignment quality score (referred to as
aln- ). To measure alignment quality in the pan-genome setting, only the k -mer set derived
from the genotyped individual was retained as the input for regression.
2.3.1.6 Data filtering
A final set of 32,138 VNTR regions was called by filtering based on aln- . The quality of a
locus was measured by the mean aln- across individuals. Loci with mean aln- below 0.96
18
were removed from the final call set. The final pan-genome graphs were used to genotype large
Illumina datasets, measure length prediction accuracy, analyze population structures and map
eQTL.
2.3.1.7 Predicting VNTR lengths
Read depths at VNTR regions usually vary considerably from locus to locus. Furthermore, the
sampling bias of different sequencing runs are also different, which limits our ability to genotype
the accurate length of VNTRs. To account for this, we compute locus-specific biases (LSBs)
for each sample , a tuple of (genome , sequencing run) as follows:
,where is the ground truth VNTR lengths of 32,138 loci in genome ; is the sum of
k -mer counts in each locus mapped by samples ; is the global read depth of sample
estimated by averaging the read depths of 397 unique regions without any types of repeats or
duplications.
The ground truth VNTR length of a locus in genome is averaged across haplotypes:
,where is the number of haplotype(s) in genome , i.e. 2 for normal individuals and 1 for
complete hydatidiform mole (CHM) samples.
With the above bias terms, the VNTR length of locus in sample can be computed by:
19
,where is same as described above; is the estimated LSBs computed from sample with
ground truth VNTR lengths; is the sum of k -mer counts of locus mapped by sample .
We assume the LSBs that best approximates come from samples within the same sequencing
run. Without prior knowledge on the ground truth VNTR lengths of and therefore , we
determine the “closest” sample w.r.t. based on between the read depths, RD , of the 397
unique regions as follows:
, where is the set of samples with ground truths and within the same sequencing run as .
We cross-validate our approach by leaving one sample out of the pan-genome database and
evaluating the prediction accuracy on the excluded sample.
For comparison, VNTR lengths were also estimated by a read depth method. For each
VNTR region, the read depth, computed with samtools bedcov -j, was divided by the global read
depth, computed from the 397 nonrepetitive regions, to give the length estimate.
2.3.1.8 Comparing with GraphAligner
The compact de Bruijn graph of each VNTR locus was generated with bcalm v2.2.3 (option:
-kmer-size 21 -abundance-min 1) using the VNTR sequences from all assemblies as input.
GFA files were then reindexed and concatenated to generate the RPGGs for 32,138 loci.
Error-free paired-end reads were simulated from all VNTR regions at 2x coverage with 150 bp
read length and 600 bp insert size (300 bp gap between each end). Reads were aligned to the
RPGG using GraphAligner v1.0.11 with option -x dbg --seeds-minimizer-length 21. Reads with
alignment identity > 90% were counted from the output gam file. To compare in a similar setting,
20
danbing-tk was run with option -gc -thcth 117 -k 21 -cth 45 -rth 0.5 to assert >90% identity for
all reads aligned, given that .
2.3.2 Population analysis
2.3.2.1 V
ST
calculation
V
ST
was calculated according to (Redon et al. 2006) :
Top V
ST
loci were considered as the sites with V
ST
at least three standard deviations above the
mean.
2.3.2.2 Properties of V
ST
The dosage of a VNTR for an individual is . Consider individuals consisting of
populations, with individuals each, with population mean and variance, and , and a
global mean and variance and for all individuals. Population stratification is calculated as
.
The mean across populations, is calculated as . The variance is
for all individuals, and this may be separated out by population as
, using to denote the k th individual in population . The value
may be computed as:
21
The total population variance relative to the population mean, variance, size, and global
mean is:
Replacing this in the calculation of V
ST
gives:
2.3.2.3 Identifying unstable loci
A locus was annotated as a candidate for being unstable if at least one individual had outlying
k -mer dosage ≥ six standard deviations above the mean, using population and locus specific
summary statistics on data discarding individuals with dosage less than 10 or a bimodal
distribution was not detected (diptest v0.75-7, p > 0.9). Among this set, the number of times each
genome appeared as an outlier was used to select a set of genomes with an over abundant
contribution to fragile loci. Any candidate locus with an individual that was an outlier in at least
four other loci was removed from the candidate list. The loci were compared to gencode v34,
excluding readthrough, pseudogenes, noncoding RNA, and nonsense transcripts.
22
2.3.2.4 Identifying differential motif usage and expansion
Sample outliers in the 1000 Genomes were detected from the LSBs over 397 control regions and
the VNTR dosages over 32,138 loci using DBSCAN. A total of 119/2,504 samples were
removed from downstream analysis. We use the EAS population as the reference for measuring
differential motif usage and expansion. Initially, a lasso fit using the statsmodel.api.OLS function
in python statsmodel v0.10.1 (Seabold and Perktold 2010) was performed for each locus to
identify the k -mer with the most variance explained (VEX) in VNTR lengths using the following
formula: , where is the VNTR length of individuals in the EAS
population; is the k -mer dosage matrix for individuals with k -mers;
is the model coefficient, and is the error term. The lasso penalty weight
was scanned starting at 0.9 with at a step size of −0.1 until at least one covariate has a positive
weight or is below 0.1. The k -mer with the highest weight is denoted as the most informative
k -mer (mi-kmer) for the locus.
To identify loci with differential motif usage between populations, we subtracted the
median count of the mi-kmer of the AFR from the EAS population for each locus, denoted as
. The null distribution of was estimated by bootstrap. Specifically, EAS individuals
were sampled with replacement times, matching the sample sizes of the EAS
and AFR populations, respectively. The bootstrap statistics, , were computed by
subtracting the median count of the mi-kmer of the last from the first bootstrap
samples for each locus. The estimated null distribution is then used to determine the threshold for
calling a locus having significant differential motif usage between populations (two-sided P <
0.01).
23
To identify loci with differential motif expansion between populations, we subtracted the
proportion of VEX by mi-kmer in the AFR from the EAS population, denoted as . The null
distribution of was estimated by bootstrap in a similar sampling procedure as , except
for subtracting the proportion of VEX by the mi-kmer in the last from the first
bootstrap samples for each locus. The estimated null distribution is used to determine the
threshold for calling a locus having significant differential motif expansion between populations
(two-sided P < 0.01).
2.3.3 eQTL mapping
2.3.3.1 Retrieving datasets
WGS datasets of 879 individuals, normalized gene expression matrices and covariates of all
tissues are accessed from the GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2).
2.3.3.2 Genotype data preprocessing
VNTR lengths are genotyped using daunting-tk with options: -gc -thcth 50 -cth 45 -rth 0.5. All
the k -mer counts of a locus are summed and adjusted by global read depth and ploidy to
represent the approximate length of a locus. Sample outliers were detected from the LSBs over
397 control regions and the VNTR dosages over 32,138 loci using DBSCAN. A total of 26/838
samples were removed from downstream analysis. Adjusted values are then z-score normalized
as input for eQTL mapping.
24
2.3.3.3 Expression data preprocessing
The downloaded expression matrices are already preprocessed such that outliers are rejected and
expression counts are quantile normalized as standard normal distribution. Confounding factors
such as sex, sequencing platform, amplification method, technical variations and population
structure are removed prior to eQTL mapping to avoid spurious associations. Technical
variations are corrected with the covariates, including PEER factors, provided by the GTEx
Consortium. Population structures are corrected with the top 10 principal components (PCs)
from the SNP matrix of all samples. Particularly, principal component analysis (PCA) was
performed jointly on the intersection of the SNP sets from GTEx samples and 1KGP Omni 2.5
SNP genotyping arrays
(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.c
hip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz). This is done by first using
CrossMap v0.4.0 to liftover the SNP sites from Omni 2.5 arrays to GRCh38, followed by
extracting the intersection of the two SNP sets using vcftools isec. The SNP set is further
reduced by LD-pruning with plink v1.90b6.12 using the options: --indep 50 5 2, leaving a total
of 757,000 sites. Finally, PCA on the joint SNP matrix was done by smartpca v13050. The
normalized expression matrix are residualized with the above covariates using the following
formula:
, where is the residualized expression matrix; is the normalized expression matrix; is
the projection matrix; is the identity matrix; is the covariate matrix where each column
25
corresponds to a covariate mentioned above. The residualized expression values are z-score
normalized as the input of eQTL mapping.
2.3.3.4 Association test
VNTRs within 100 kb to a gene are included for eQTL mapping. Linear regression was done
using the statsmodel.api.OLS function in python statsmodel v0.10.1 (Seabold and Perktold 2010)
with expression values as the dependent variable and genotype values as the independent
variable. Nominal P values are computed by performing t tests on slope. Adjusted P values are
computed by Bonferroni correction on nominal P values. Under the assumption of at most one
causal VNTR per gene, we control gene-level false discovery rate at 5%. Specifically, the
adjusted P values of the lead VNTR for each gene are taken as input for Benjamini-Hochberg
procedure using statsmodels.stats.multitest.fdrcorrection v0.10.1. Lead VNTRs that passed the
procedure are identified as eVNTRs.
2.4 Results
2.4.1 Repeat pan-genome graph construction
Our approach to build RPGGs is to de novo assemble LRS genomes, and build de Bruijn graphs
on the assembled sequences at VNTR loci, using SRS genomes to ensure graph quality. We used
public LRS data for 19 individuals with diverse genetic backgrounds, including genomes from
individual genome projects (Seo et al. 2016; Zook et al. 2020) , structural variation studies
(Chaisson et al. 2019) , and diversity panel sequencing (Audano et al. 2019) (Figure 2.1a, Table
A.1). Each genome was sequenced by either PacBio contiguous long read between, or
high-fidelity sequencing between 16 and 76-fold coverage along with matched 22-82-fold
26
Illumina sequencing (Table 2.1). This data reflects a wide range of technology revisions,
sequencing depth, and data type, however subsequent steps were taken to ensure accuracy of
RPGG through locus redundancy and SRS alignments. We developed a pipeline that partitions
LRS reads by haplotype based on phased heterozygous SNVs and assembles haplotypes
separately by chromosome. When available, we used existing telomere-to-telomere SNV and
phase data provided by Strand-Seq and/or 10X Genomics (Porubsky et al. 2017; Chaisson et al.
2019) with phase-block N50 size between 13.4-18.8 Mb. For other datasets, long-read data were
used to phase SNVs. While this data has lower phase-block N50 (<0.5 - 6 Mb), the individual
locus-RPGG do not use long-range haplotype information and are not affected by phasing switch
error. Reads from each chromosome and haplotype were independently assembled using the flye
assembler (Kolmogorov et al. 2019) for a diploid of 0.88-14.5Mb N50, with the range of
assembly contiguity reflected by the diversity of input data.
In this study, the number of resolved VNTR loci is a more accurate measurement of
useful assembly contiguity than N50 because a disjoint RPGG is generated for each VNTR
locus. An initial set of 84,411 VNTR intervals with motif size >6 bp, minimal length >150 bp
and <10k bp (mean length=420 bp in GRCh38, Methods, Table A.1) were annotated by Tandem
Repeats Finder (TRF) (Benson 1999) , and then mapped onto contig coordinates using pairwise
contig alignments. This filtering criterion corresponds to an empirical cutoff of 56% purity and
can retain VNTRs (n=2,715, Figure A.1) that have nested STR annotations (Figure A.2). Long
VNTR loci tended to have fragmented TRF annotation, which can cause erroneous length
estimates in downstream analysis and fail to properly interpret repeat structures as a whole such
as in adVNTR-NN (Figure A.3). During locus assignment, danbing-tk expands boundaries and
merges loci to ensure boundaries of all VNTRs are well-defined and harmonized across genomes
27
(Methods) (Figure 2.1b). In practice, we found that 43,869/84,411 (52%) of the VNTR loci are
subject to boundary expansion, with an average expansion size of 539 bp. The set of VNTRs that
can be properly annotated ranges from 19,800-73,212 depending on the assembly quality, with a
final set of 73,582 loci (mean length=652 bp) across 19 genomes (Figure A.4).
The RPGGs are a collection of independently constructed bi-directional de Bruijn graphs
of each VNTR locus and flanking 700 bases from the haplotype-resolved assemblies. In a
bi-directional de Bruijn graph, each distinct sequence of length k ( k -mer) and its reverse
complement map to a vertex, and each sequence of length k +1 connects the vertices to which the
two composite k -mers map. The RPGG differs from a standard bi-directional de Bruijn graph
because a k -mer may be repeated in multiple subgraphs. There was little effect on downstream
analysis for values of k between 17 and 25, and so k =21 was used for all applications. To remove
spurious vertices and edges from assembly consensus errors, SRS from genomes matching the
LRS samples were mapped to the RPGG, and k -mers not mapped by SRS were removed from
the graph (average of 264 per locus). Using the number of vertices as a proxy for sampled
genetic diversity, we find that 27% (2,102,270 new nodes) of the sequences not contained in
GRCh38 (7,672,357 nodes) are discovered after the inclusion of 19 genomes, with diversity
linearly increasing per genome after the first four genomes are added to the RPGG (8,958,361
nodes, Figure 2.1c).
The alignment of a read to an RPGG may be defined by the path in the RPGG with a
sequence label that has the minimum edit distance to the read among all possible paths. We used
~5.88 ⨉10
8
error-free 150bp paired end reads simulated from six genomes (HG00512, HG00513,
HG00731, HG00732, NA19238 and NA19239) to evaluate how reads are aligned to the RPGG.
While methods exist to find alignments that do not reuse cycles (Rakocevic et al. 2019) , others
28
allow alignment to cyclic graphs but with high computational costs when applied to RPGG
(Garrison et al. 2018) or are limited to alignment in STR regions (Dolzhenko et al. 2019) .
Efficient alignment with cycles is a more challenging problem recently solved by GraphAligner
(Rautiainen, Mäkinen, and Marschall 2019) to map long reads to pangenome graphs. Although
>99.99% of the reads simulated from VNTR loci were aligned, 6.03% of reads matched with less
than 90% identity, indicating misalignment. We developed an alternative approach tuned for
RPGG alignments in danbing-tk (Figure 2.1d) to realign all SRS reads within a bam/fastq file to
the RPGG in two passes, first by finding locus-RPGGs with a high number (>45 in each end)
shared k -mers with reads, and next by threading the paired-end reads through the locus-RPGG,
allowing for up to two edits (mismatch, insertion, or deletion) and at least 50 matched k-mers per
read against the threaded path (Methods). Using danbing-tk, 99.997% of VNTR-simulated reads
were aligned with >90% identity. The RPGG is only built on VNTRs and their flanking
sequences, excluding the rest of the genome. When reads from the entire genome are considered,
for 96.6% of the loci (71,080/73,582), danbing-tk can map >90% of the reads back to their
original VNTR regions. Misaligned reads from either other VNTR loci included in the RPGG or
the remainder of the genome not included in the RPGG target relatively few loci; 3.6%
(2,635/73,582) loci have at least one read misaligned from outside the locus. The graph pruning
step is the primary cause of missed alignments, and affects on average 2,772 loci per assembly.
On real data, danbing-tk required 18.5 GB of memory to map 150 base paired-end reads at 10.1
Mb/sec on 16 cores.
2.4.2 Read-to-graph alignment in VNTR regions
Alignment of SRS reads to the RPGG enables estimation of VNTR length and motif
composition. The count of k -mers in SRS reads mapped to the RPGG are reported by danbing-tk
29
for each locus. For samples and VNTR loci, the result of an alignment is count matrices
of dimension , where is the number of vertices in the de Bruijn graph on the locus ,
excluding flanking sequences. If SRS reads from a genome were sequenced without bias,
sampled uniformly, and mapped without error to the RPGG, the count of a k -mer in a locus
mapped by an SRS sample should scale by a factor of read depth with the sum of the count of the
k -mer from the locus of both assembled haplotypes for the same genome. The quality of
alignment (aln- ) and sequencing bias were measured by comparing the k -mer counts from the
19 matched Illumina and LRS genomes (Figure 2.2a). In total, 44% (32,138/73,582) loci had a
mean aln- ≥ 0.96 between SRS and assembly k -mer counts, and were marked as “valid” loci to
carry forward for downstream diversity and expression analysis (Figure 2.2b). Valid had an
average length of 341 bp, compared to 657 bp in the entire database (Figure 2.2c). VNTR loci
that did not align well (invalid) were enriched for sequences that map within Alu (21,820), SV A
(1,762), and other 26,752 mobile elements (Figure A.3); loci with false mapping in the
simulation experiment are also enriched in the invalid set (Table A.2) . Specifically, 71.6%
(4,297/5,999) of loci with false positive mapping, 84.7% (8,065/9,525) of loci with false
negative mapping are not marked as valid. Loci with false mapping but retained in the final set
have lower but still decent length prediction accuracy (0.79 versus 0.82). The complete RPGG
on valid loci contains 8,958,361 vertices, in contrast to the corresponding RPGG on GRCh38
only (repeat-GRCh38), which has 7,672,357 vertices. We validate that the additional vertices in
the RPGG are indeed important for accurately recruiting reads pertinent to a VNTR locus, using
the CACNA1C VNTR as an example (Figure 2.2d). It is known that the reference sequence at
this locus is truncated compared to the majority of the populations (319 bp in GRCh38 versus
5,669 bp averaged across 19 genomes). The limited sequence diversity provided by
30
repeat-GRCh38 at this locus failed to recruit reads that map to paths existing in the RPGG but
missing or only partially represented in repeat-GRCh38. A linear fit between the k -mers from
mapped reads and the ground truth assemblies shows that there is a 13-fold gain in slope, or
measured read depth, when using RPGG compared to repeat-GRCh38 (Figure 2.2e). The k -mer
counts in the RPGGs also correlate better with the assembly k -mer counts compared to the
repeat-GRCh38 (aln- = 0.992 versus 0.858).
New genomes with arbitrary combinations of motifs and copy numbers in VNTRs should
still align to an RPGG as long as the motifs are represented in the graph. We used leave-one-out
analysis to evaluate alignment of unseen genomes to RPGGs and estimation of VNTR length. In
each experiment, an RPGG was constructed with one LRS genome missing. SRS reads from the
missing genome were mapped into the RPGG, and the estimated locus lengths were compared to
the average diploid lengths of corresponding loci in the missing LRS assembly. The locus length
is estimated as the adjusted sum of k -mer counts mapped from SRS sample :
, where is sequencing depth of , is a correction for locus-specific
sampling bias (LSB). LSB measures the deviation of an observed read depth from the expected
value within an interval (see Methods for formal definition). Because the SRS datasets used in
this study during pangenome construction were collected from a wide variety of studies with
different biases, there was no consistent LSB in either repetitive or nonrepetitive regions for
samples from different sequencing runs (Figure A.6-7). However, principal component analysis
(PCA) of repetitive and nonrepetitive regions showed highly similar projection patterns (Figure
A.8), which enabled using LSB in nonrepetitive regions as a proxy for finding the nearest
neighbor of LSB in VNTR regions (Figure A.9). Leveraging this finding, a set of 397
nonrepetitive control regions were used to estimate the LSB of an unseen SRS sample
31
(Methods), giving a median length-prediction accuracy of 0.82 for 16 unrelated genomes (Figure
2.3a left, A.10). The read depth of a repetitive region correlates to the locus length when aligning
short reads to a linear reference genome. However, estimation of VNTR length from read depth
has an accuracy of 0.75 (Figure 2.3a left). We also compared the performance for length
prediction using the RPGG versus repeat-GRCh38, and observed a 58% improvement in
accuracy (0.82 versus 0.52, Figure 2.3a left, A.11). The overall error rate, measured with mean
absolute percentage error (MAPE), of all loci (n=32,138) are also significantly lower when using
RPGGs (MAPE=0.18, Figure 2.3a right) compared with the repeat-GRCh38 (0.23, paired t -test P
= 4.2⨉10
-32
) or reference-aligned read depth (0.20, paired t -test P = 2.4⨉10
-33
). Furthermore, a
62% reduction in error size is observed for the 6,383 loci poorly genotyped (MAPE > 0.4) using
repeat-GRCh38 (Figure 2.3b, MAPE=0.235 versus 0.610). Loci with low accuracy in length
estimates from RPGG can be mostly explained by the estimation error in LSB due to varying
data quality ( r
2
=0.89, Figure A.12; example given in Figure A.13), and to a slight degree by the
presence of a missing haplotype (Figure A.14), the fraction of k -mers in a locus unique to a
sample (Figure A.15), GC bias (Figure A.16), and the difference in the VNTR GC content across
samples (Figure A.17).
2.4.3 Profiling VNTR length and motif diversity
To explore global diversity of VNTR sequences and potential functional impact, we aligned
reads from 2,504 individuals from diverse populations sequenced at 30-fold coverage sequenced
by the 1000-Genomes project (1KGP) (Fairley et al. 2020; 1000 Genomes Project Consortium et
al. 2015) , and 879 GTEx genomes (G. Consortium and GTEx Consortium 2017) to the RPGG.
The fraction of reads from these datasets that align to the RPGG ranges from 1.11%-1.37%,
similar to the matched LRS/SRS data (1.23%). PCA on the LSB of both datasets showed the
32
1KGP and GTEx genomes as separate clusters in both repetitive and nonrepetitive regions
(Figure A.7), indicating experiment-specific bias that prevents cross data set comparisons.
Consistent with the finding in previous leave-one-out analysis, genomes from the same study
cluster together in the PCA plot of LSB, and so within each dataset and locus, k -mer counts from
SRS reads normalized by sequencing depth were used to compare VNTR content across
genomes.
The k -mer dosage: , was used as a proxy for locus length to compare tandem
repeat variation across populations in the 1KGP genomes. The 1KGP samples contain
individuals from African (26.5%), East Asian (20.1%), European (19.9%), Admixed American
(13.9%), and South Asian (19.5%) populations. When comparing the average population length
to the global average length, 60.8% (19,530/32,138) have differential length between populations
(FDR=0.05 on ANOV A P values), with similar distributions of differential length when loci are
stratified by the accuracy of length prediction (Figure 2.4a). Population stratification was
calculated using the V
ST
statistic (Redon et al. 2006) on VNTR length (Figure 2.4b). Previous
studies have used >3 standard deviations above the mean to define for highly stratified copy
number variants (Sudmant, Mallick, et al. 2015) . Under this measure, 785 variants are highly
stratified, including 266 that overlap genes, however this is not significantly enriched (P=0.079,
one-sided permutation test). Two of the top five loci ranked by V
ST
are intronic: a 72 base VNTR
in PLCL1 (V
ST
=0.37), and a 148 base locus in SP ATA18 (V
ST
=0.35) (Figure 2.4c,d). These values
for V
ST
are lower than what are observed for large copy number variants (Redon et al. 2006) and
may be the result of neutral variation, however this may be affected by the high variance of the
length estimate, as V
ST
decreases as the variance of the copy number/dosage values increase
(Methods).
33
VNTR loci that are unstable may undergo hyper-expansion and are implicated as a
mechanism of multiple diseases (Hannan 2018) . To discover potentially unstable loci, we
searched the 1kg genomes for evidence of rare VNTR hyper-expansion. Loci were screened for
individuals with extreme (>6 standard deviations) variation, and then filtered for deletions or
unreliable samples (Methods) to characterize 477 loci as potentially unstable. These loci are
inside 115 genes and are significantly reduced from the number expected by chance (p<1 ⨉10
-5
,
one-sided permutation test; n=10,000). Of these loci, 64 have an individual with > 10 standard
deviations above the mean, of which two overlap genes, KCNA2 , and GRM4 (Figure A.18).
Alignment to an RPGG provides information about motif usage in addition to estimates
of VNTR length because genomes with different motif composition will align to different
vertices in the graph. To detect differential motif usage, we searched for loci with a k -mer that
was more frequent in one population than another and not simply explained by a difference in
locus length, comparing African (AFR) and East Asian populations for maximal genetic
diversity. Lasso regression against locus length was used to find the k -mer with the most variance
explained (VEX) in EAS genomes, denoted as the most informative k -mer (mi-kmer). Two
statistics are of interest when comparing the two populations: the difference in the count of
mi-kmers ( ) and the difference between proportion of VEX ( ) by mi-kmers.
describes the usage of an mi-kmer in one population relative to another, while indicates the
degree that the mi-kmer is involved in repeat contraction or expansion in one population relative
to another. We observe that 8,216 loci have significant differences in the usage of mi-kmers
between the two populations (two-sided P < 0.01, bootstrap, Figure A.19). Among these, the
mi-kmers of 1,913 loci are crucial to length variation in the EAS but not in the AFR population
(two-sided P < 0.01, bootstrap) (Figure 2.4e, A.19). A top example of these loci with at least
34
0.9 in the EAS population was visualized with a heatmap of relative k -mer count from both
populations, and clearly showed differential usage of cycles in the RPGG (Figure 2.4f).
2.4.4 Association of VNTR with nearby gene expression
Because the danbing-tk length estimates showed population genetic patterns expected for human
diversity, we assessed whether danbing-tk alignments could detect VNTR variation with
functional impact. Genomes from the GTEx project were mapped into the RPGG to discover loci
that have an effect on nearby gene expression in a length-dependent manner. A total of 812/838
genomes with matching expression data passed quality filtering (Methods). Similar to the
population analysis, the k -mer dosage was used as a proxy for locus length. Methods previously
used to discover eQTL using STR genotyping (Fotsing et al. 2019) were applied to the
danbing-tk alignments. In sum, 30,362 VNTRs within 100 kb to 45,720 GTEx gene-annotations
(including genes, lncRNA, and other transcripts) were tested for association, with a total of
149,057 tests and approximately 3.3 VNTRs tested per gene. Using a gene-level FDR cutoff of
5%, we find 346 eQTL (eVNTRs) (Figure 2.5a), among which 344 (99.4%) discoveries are
previously unreported (Table A.3), indicating that the spectrum of association between tandem
repeat variation and expression extends beyond the lengths and the types of SSR considered in
previous STR (Mousavi et al. 2019) and VNTR (Bakhtiari et al. 2021) studies. This likely
represents a lower bound on the total eVNTRs as analysis on the complete set of loci that are not
filtered by mapping accuracy include 658 eVNTRs. Both positive and negative effects were
observed among eVNTRs (Figure 2.5b). More eVNTRs with positive effect size were found than
with a negative effect size (200 versus 146, binomial test P = 0.0043), with an average effect of
+0.261 (from +0.139 to +0.720) versus −0.247 (from −0.524 to −0.159), respectively. eVNTRs
tend to be closer to telomeres relative to all VNTRs (Mann–Whitney U test P = 5.2⨉10
-5
, Figure
35
A.20). Because many exons contain VNTR sequences, expression measured by read depth
should increase with length of the VNTR, and there is an 2.5-fold enrichment of eVNTRs in
coding regions as expected.
The eVNTRs have the potential to yield insight to disease. In one example, an intronic
eVNTR at chr5:96,896,863-96,896,963 flanks exon 9 of ERAP2 (Figure 2.5d, A.21). The
eVNTR has a -0.52 effect size and was reported across 27 tissues. Although the effect is not
independent of the lead eSNP (Figure A.22-23), the variant is missing from the GTEx cis -eQTL
catalogue and colocalizes with a regulatory hotspot with peaks of histone markers, DNase and 40
different ChIP signals. The protein product of ERAP2 , or endoplasmic reticulum aminopeptidase
2, is a zinc metalloaminopeptidase involving in the process of Class I MHC mediated antigen
presentation and innate immune response. It has been reported to be associated with several
diseases including ankylosing spondylitis (Wellcome Trust Case Control Consortium et al. 2007)
and Crohn’s disease (Franke et al. 2010) . Abnormal expansion of the VNTR might increase
autoimmune disease risk through reducing ERAP2 expression, leaving longer and more antigenic
peptides, yet potentially higher fitness against virus infection (Ye et al. 2018) . This VNTR is a
unique sequence in GRCh38 that is a 101 bp tandem duplication in 17/38 of the haplotypes.
Another example is an intergenic VNTR at chr17:46,265,245-46,265,480 that associates with the
expression of KANSL1 ~40kb upstream (Figure 2.5c, A.21). The eVNTR has a maximal effect
size of +0.45 and is significant across 40 tissues. The protein product of KANSL1 , or KAT8
regulatory NSL complex subunit 1, is a part of the histone acetylation machinery. Deletion of
this gene is linked to Koolen-de Vries syndrome (Koolen et al. 2008) , and the locus is associated
with Parkinson disease (Witoelar et al. 2017) . The eVNTR colocalizes with strong ChIP signals
the association of this VNTR with the epigenetic landscape warrants further investigation.
36
Figure 2.1: Sequence diversity of VNTRs in human populations. a, Global diversity of SMS
assemblies. b, Dot-plot analysis of the VNTR locus chr1:2280569-2282538 (SKI intron 1
VNTR) in genomes that demonstrate varying motif usage and length c , Diversity of RPGG as
genomes are incorporated, measured by the number of k -mers in the 32,138 VNTR graphs. Total
graph size built from GRCh38 and an average genome are also shown. d, danbing-tk workflow
analysis. (top) VNTR loci defined from the reference are used to map haplotype loci. Each locus
is converted to a de Bruijn graph, from which the collection of graphs is the RPGG. The de
Bruijn graphs shown illustrate sequences missing from the RPGG built only on GRCh38. The
alignments may be either used to select which loci may be accurately mapped in the RPGG using
SRS that match the assemblies (red), or may be used to estimate lengths on sample datasets
(blue).
37
Genome
Continental
population Study
Assembly
N50 (Mb)
Fraction of
VNTR
annotated Ancestry Cov
AK1 EAS KG 54 0.88 0.840 Korean
HG00268 EUR DP 67 3.51 0.967 Finnish
HG00512 EAS HGSVG 28 8.83 0.995 Han Chinese
HG00513 EAS HGSVG 30 1.57 0.993 Han Chinese
HG00514 EAS HGSVG 31 1.32 0.948 Han Chinese
HG00731 AMR HGSVG 31 2.18 0.995 Puerto Rican
HG00732 AMR HGSVG 16 1.3 0.992 Puerto Rican
HG00733 AMR HGSVG 46 6.88 0.992 Puerto Rican
HG01352 AMR DP 68 5.97 0.992 Colombian
HG02059 EAS DP 76 19.5 0.992 Vietnamese
HG02106 AMR DP 57 0.88 0.640 Peruvian
HG02818 AFR DP 56 0.66 0.802 Gambian
HG04217 SAS DP 60 0.86 0.269 Telugu
NA12878 EUR DP 54 4.67 0.971 Central European
NA19238 AFR HGSVG 23 2.64 0.991 Yoruba
NA19239 AFR HGSVG 35 4.87 0.994 Yoruba
NA19240 AFR HGSVG 49 3.4 0.989 Yoruba
NA19434 AFR DP 62 11 0.980 Luhya
NA24385 EUR GIAB 54 1.32 0.981 Ashkenazim
Table 2.1: Source genomes for RPGG. Continental populations represented are East Asian
(EAS), European (EUR), Admixed Amerindian (AMR), South Asian (SAS), and African (AFR).
Coverage is estimated diploid coverage based on alignment to GRCh38. Assembly N50 is of
haplotype-resolved assemblies. The fraction of VNTR annotated are all VNTR with at least 700
flanking bases assembled.
38
Figure 2.2: Mapping short reads to repeat-pangenome graphs. a, An example of evaluating the
alignment quality of a locus mapped by SRS reads. The alignment quality is measured by the
of a linear fit between the k -mer counts from the ground truth assemblies and from the mapped
reads (Methods). b, Distribution of the alignment quality scores of 73,582 loci. Loci with
alignment quality less than 0.96 when averaged across samples are removed from downstream
analysis (Methods). c, Distribution of VNTR lengths in GRCh38 removed or retained for
downstream analysis. d-e , Comparing the read mapping results of the CACNA1C VNTR using
RPGG or repeat-GRCh38. The k -mer counts in each graph and the differences are visualized
with edge width and color saturation ( d ). The k -mer counts from the ground truth assemblies are
regressed against the counts from reads mapped to the RPGG (red) and repeat-GRCh38 (blue),
respectively ( e ).
39
Figure 2.3: VNTR length prediction. a , Accuracies of VNTR length prediction measured for
each genome (left) and each locus (right). Mean absolute percentage error (MAPE) in VNTR
length is averaged across loci and genomes, respectively. Lengths were predicted based on
repeat-pangenome graphs (RPGG), repeat-GRCh38 (RHG) or naive read depth method (RD),
respectively. Boxes span from the lower quartile to the upper quartile, with horizontal lines
indicating the median. Whiskers extend to points that are within 1.5 interquartile range (IQR)
from the upper or the lower quartiles. b, Relative performance of RPGG versus repeat-GRCh38.
Loci are ordered along the x-axis by genotyping accuracy in repeat-GRCh38. The y-axis shows
the decrease in MAPE using RPGG versus repeat-GRCh38. The subplot shows loci poorly
genotyped (MAPE>0.4) in repeat-GRCh38. The red dotted line indicates the baseline without
any improvement.
40
Figure 2.4: Population properties of VNTR loci. a , Ratios of median length between populations
for loci with significant differences in average length. Loci are stratified by accuracy prediction
(<0.8), medium (0.8-0.9), and high (0.9+). b , Manhattan plot of V
ST
values. c-d , The distribution
of estimated length via k -mer dosage in continental populations for PLCL1 and SP ATA18 VNTR
loci, selected to visualize the distribution of dosage in different populations. Each point is an
individual. e, Differential usage and expansion of motifs between the EAS and AFR populations.
For each locus, the proportion of variance explained by the most informative k -mer in the EAS is
shown for the EAS and AFR populations on the x and y axes, respectively. Points are colored by
the difference in normalized k -mer counts, with red and blue indicating k -mers more abundant in
EAS and AFR populations, respectively. f, An example VNTR with differential motif usage.
Edges are colored if the k -mer count is biased toward a certain population. The black arrow
indicates the location of the k -mer that explains the most variance of VNTR length in the EAS
population.
41
Figure 2.5: cis -eQTL mapping of VNTRs. a, eVNTR discoveries in 20 human tissues. The
quantile-quantile plot shows the observed P value of each association test versus the P value
drawn from the expected uniform distribution. Black dots indicate the permutation results from
the top 5% associated (gene, VNTR) pairs in each tissue. The regression plots for ERAP2 and
KANSL1 are shown in c and d. b, Effect size distribution of significant associations from all
tissues. c-d , Genomic view of disease-related (eGene,eVNTR) pairs ( ERAP2 ,
chr5:96896863-96896963) (c) and ( KANSL1 , chr17:46265245-46265480) (d) are shown. Red
boxes indicate the location of eGenes and eVNTRs.
42
2.5 Discussion
Previous commentaries have proposed that variation in VNTR loci may represent a component
of undiagnosed disease and missing heritability (Hannan 2010) , which has remained difficult to
profile even with whole genome sequencing (Mousavi et al. 2019) . To address this, we have
proposed an approach that combines multiple genomes into a pangenome graph that represents
the repeat structure of a population. This is supported by the software, danbing-tk and associated
RPGG. We used danbing-tk to generate a pangenome from 19 haplotype-resolved assemblies,
and applied it to detect VNTR variation across populations and to discover eQTL.
The structure of the RPGG can help to organize the diversity of assembled VNTR
sequences with respect to the standard reference. In particular, the RPGG on 19 genomes is 27%
larger than repeat-GRCh38. Combined with the observation that using the 19-genome RPGG
gives a 63% decrease in length prediction error, this indicates that the pan-genomes add detail for
the missing variation. With the availability of additional genomes sequenced through the
Pangenome Reference Consortium ( https://humanpangenome.org/ ) and the HGSVC
( https://www.internationalgenome.org ), combined with advanced haplotype-resolve assembly
methods (Porubsky et al. 2020) , the spectrum of this variation will be revealed in the near future.
While we anticipate that eventually the full spectrum of VNTR diversity will be revealed through
LRS of large cohorts, the RPGG analysis will help organize analysis by characterizing repeat
domains. For example, with our approach, we are able to detect 1,913 loci with differential motif
usage between populations, which could be difficult to characterize using an approach such as
multiple-sequence alignment of VNTR sequences from assembled genomes.
There are several caveats to our approach. Datasets combined from disparate sequencing
runs with batch effects will affect dosage estimates. In contrast to other pangenome approaches
43
(Garrison et al. 2018; Rakocevic et al. 2019) , danbing-tk does not keep track of a reference (e.g.
GRCh38) coordinate system. Furthermore, because it is often not possible to reconstruct a
unique path in an RPGG, only counts of mapped reads are reported rather than the order of
traversal of the RPGG. An additional caveat of our approach is that genotype is calculated as a
continuum of k -mer dosage rather than discrete units, prohibiting direct calculation of
linkage-disequilibrium for fine-scale mapping (LaPierre et al., n.d.) . Finally this approach only
profiles loci where k -mer counts from reads and assemblies are correlated; loci for which every
k -mer appears the same number of times are excluded from analysis (on average 8,058/73,582
per genome).
The rich data provided by danbing-tk and pangenome analysis provide the basis for
additional association studies. While most analysis in this study focused on the diversity of
VNTR length or association of length and expression, it is possible to query differential motif
usage using the RPGG. The ability to detect motifs that have differential usage between
populations brings the possibility of detecting differential motif usage between cases and
controls in association studies. This can help distinguish stabilizing versus fragile motifs (Braida
et al. 2010) , or resolve some of the problem of missing heritability by discovering new
associations between motif and disease (Song, Lowe, and Kingsley 2018) . Finally, this work is a
part of ongoing pangenome graph analysis (Paten et al. 2017; Li, Feng, and Chu 2020) , and
represents an approach to generating pangenome graphs in loci that have difficult multiple
sequence alignments or degenerate graph topologies. Additional methods may be developed to
harmonize danbing-tk RPGGs with genome-wide pangenome graphs constructed from other
methods.
44
Chapter 3
The Motif Composition of Variable-Number Tandem Repeats
Impacts Gene Expression
In the last chapter, we addressed the genotyping problem for VNTR and solved it with a novel
data structure – repeat-pangenome graph. We developed a toolkit, danbing-tk , for graph
construction and read-to-graph alignment designed specifically for VNTR regions. We showed
that the approach is efficient and accurate. The output is essentially the mapping dosage for the
graph and contains all k -mer counts for each VNTR locus. This is useful for analyses such as
VNTR length estimation and eQTL mapping after summing up the k -mer counts for each VNTR
locus. We also took advantage of the full graph information and identified k -mers or paths that
occur differentially between populations. This opens the possibility to answer another question
that has been mostly undiscussed – what is the impact of the motif compositions of VNTRs on
gene expression at genome-wide scale? We will discuss why this is an important question and
how we can leverage the genotyping outputs of danbing-tk to approach this.
45
3.1 Abstract
Understanding the impact of DNA variation on human traits is a fundamental question in human
genetics. Variable number tandem repeats (VNTRs) make up roughly 3% of the human genome
but are often excluded from association analysis due to poor read mappability or divergent repeat
content. While methods exist to estimate VNTR length from short-read data, it is known that
VNTRs vary in both length and repeat (motif) composition. Here, we use a repeat-pangenome
graph (RPGG) constructed on 35 haplotype-resolved assemblies to detect variation in both
VNTR length and repeat composition. We align population scale data from the Genotype-Tissue
Expression (GTEx) Consortium to examine how variations in sequence composition may be
linked to expression, including cases independent of overall VNTR length. We find that 20,834
VNTRs are associated with nearby gene expression through motif variations, of which only 5.1%
associations are accessible from length. Fine-mapping identifies 273 genes to be likely driven by
variation in certain VNTR motifs and not overall length. To demonstrate the utility of association
using VNTR motifs, we examine the intronic VNTR of CACNA1C , which has been reported to
be associated with schizophrenia risk through motif variation. We show that in healthy
populations a previously identified schizophrenia risk motif is associated with decreased
expression of CACNA1C , and detect an additional novel motif with similar effect.
46
3.2 Introduction
Variable number tandem repeats (VNTRs) are repetitive DNA sequences with the size of a repeat
unit greater than six nucleotides. The copy number of a repeat unit is hypervariable due to its
susceptibility to replication slippage caused by strand mispairing between the same (Viguera,
Canceill, and Ehrlich 2001) or across haplotypes (Jeffreys et al. 1994) . At the sequence level,
single nucleotide variations or short indels are also prevalent along a repeat sequence and can
greatly expand the number of identified alleles relative to classification by length (Novroski et al.
2016) . Altogether, copy number variations, SNVs and short indels contribute to the full spectrum
of VNTR polymorphism. Missing heritability (Eichler et al. 2010) that cannot be explained by
SNVs can be partially attributed to VNTR polymorphisms (Hannan 2018; Mitra et al. 2021; Lu
and Chaisson 2021) . Accumulating evidence indicates that VNTRs are associated with a diverse
array of human traits and are casual to several diseases at the copy number level or sequence
level (Beyter et al. 2021; Bakhtiari et al. 2021; Mukamel et al. 2021) . Furthermore, significant
enrichment of VNTRs in subtelomeric genes that are mostly expressed in the brain suggests
further exploration of their roles in shaping behavioral/cognitive polymorphisms and modulating
neurological disease risks (Linthorst et al. 2020) .
VNTR length polymorphism can modulate human traits through several mechanisms,
including changing the number of protein domains (Desseyn et al. 2000) , the distance between
gene and gene regulators (Sun et al. 2018) , the number of regulator binding sites (Tsuge et al.
2005) , and the number of CpG sites (DeJesus-Hernandez et al. 2011; Renton et al. 2011) .
Abundant associations between repeat copy number and human traits have been widely reported
(Mukamel et al. 2021) and provide insights to functional annotations. However, it is impossible
to fully understand the biological functions of VNTRs without examining variation at the
47
sequence level. For example, a single cytosine insertion in MUC1 VNTR was identified to be
causal to medullary cystic kidney disease type 1 by adding a premature stop in translation (Kirby
et al. 2013) . In addition, certain repeat motifs in CACNA1C but not the total repeat copy number
were reported to be associated with schizophrenia risk by tuning gene expression activity (Song,
Lowe, and Kingsley 2018) . In both cases, long-read sequencing such as single-molecule
real-time sequencing or capillary sequencing has been useful to resolve the full sequence of
VNTRs and yield meaningful clinical interpretations.
Currently large-scale sequencing efforts use high-throughput short-read sequencing
(SRS) (Taliun et al. 2021) , however, VNTR analysis with SRS suffers from ambiguity in read
alignment, allelic bias of reference and the hypermutability of repeat sequences.
Single-nucleotide and small indel variant calls from VNTR regions using short-read alignments
are error-prone and blacklisted by ENCODE (Amemiya, Kundaje, and Boyle 2019) . Recently,
several methods have been developed specifically to estimate VNTR length from short-read data
using Hidden Markov Models (Bakhtiari et al. 2021) , read-depth (Garg et al. 2021) , and
repeat-pangenome graphs (Lu and Chaisson 2021) . These approaches have found an association
between estimated VNTR length and gene expression. In this study, we use a reference
pangenome graph (PGG) to reduce allelic bias when mapping short reads to a reference, and to
improve variant inference for motif composition. The PGG is a graph-based data structure that
summarizes sequence variations from a collection of samples by representing variants as
alternative paths or “bubbles” from the reference (Eizenga et al. 2020) . One of the most common
implementations of PGG is to use a sequence graph. In this model, each node represents an
allele; each edge points to a downstream allele; a traversal through the graph matches an
observed haplotype. This allows sequencing reads to be placed more accurately across the
48
genome and significantly improves variant calling accuracy in regions containing SVs (Chen et
al. 2019; Eggertsson et al. 2019; Li, Feng, and Chu 2020; Sirén et al. 2020) , with the majority of
which coming from indel events within VNTRs (Li, Feng, and Chu 2020) . However, variant
calling remains challenging for multiallelic VNTR regions as the position of calls varies (Li,
Feng, and Chu 2020) ; an extra processing step is needed to reveal the multiallelic property of a
locus.
Another commonly used graph model is the de Bruijn graph (dBG). The main distinction
is that each node is a unique k -mer derived from one or more k -base substrings present in the
input sequences. By augmenting with additional haplotype or distance information, dBG-based
models have been useful in genome assembly (Iqbal et al. 2012; Muggli et al. 2017) and variant
calling (Cameron et al. 2017; Narzisi et al. 2018) . Furthermore, this formulation groups all
occurrences of repetitive k-mers across input sequences by construction, which can be a
particularly desirable property when studying the biological implication of VNTR motifs.
By leveraging the advantages of PGG and dBG, genotyping VNTR from SRS samples at
a population scale has been made possible with danbing-tk (Lu and Chaisson 2021) . The method
constructs a repeat-pangenome graph (RPGG) that consists of disjoint locus-RPGGs, each
representing a single VNTR locus and encoding observed VNTR alleles with a dBG. Read
mapping to RPGG reveals the coverage of each k -mer and can be accumulated as a copy number
estimate, allowing associations between repeat copy number and human trait to be identified.
In this work, we extend the application of the danbing-tk to examine the association
between each path in the graph, or VNTR “motif”, and gene expression using the complete
read-mapping output i.e. the coverages of all k -mers. We show that motif usage/repeat count can
be accurately estimated in a RPGG, and that this may be used to compare motif composition
49
between individuals. We map genomes sequenced by GTEx to discover associations between
motif composition and gene expression that are independent of VNTR length. Fine-mapping of
these loci finds novel causal links where motif composition is likely to modulate changes in gene
expression.
50
3.3 Materials and Methods
3.3.1 Repeat pangenome graph construction
A set of 88,441 VNTR coordinates were retrieved from danbing-tk v1.3 (Lu and Chaisson 2021) .
The VNTR set was obtained by (1) detecting VNTRs over the five haplotype-resolved
assemblies (AK1, HG00514, HG00733, NA19240, NA24385) released by Lu and Chaisson (Lu
and Chaisson 2021) using Tandem Repeat Finder, (2) selecting for VNTRs with size between
100 bp and 10 kbp and motif size > 6 bp, and (3) applying danbing-tk to the VNTRs in the five
genomes to identify 88,441 loci with proper orthology mapping. Next, we downloaded the 35
haplotype-resolved assemblies released by HGSVC (Ebert et al. 2021) . VNTR annotations on the
35 assemblies and the corresponding RPGG were generated using the build module of
danbing-tk, giving a total of 80,518 loci.
3.3.2 VNTR genotyping
Whole-genome sequencing (WGS) samples for HGSVC assemblies were retrieved from 1000
Genomes Project phase 3 (1000 Genomes Project Consortium et al. 2015) . WGS samples for
eQTL mapping were retrieved from GTEx Analysis Release V8 (dbGaP Accession
phs000424.v8.p2) (GTEx Consortium 2020) and Geuvadis (Lappalainen et al. 2013) with a total
of 879 and 445 samples respectively. VNTRs were genotyped using danbing-tk v1.3 with options
“-ae -kf 4 1 -gc 85 -k 21 -cth 45”. The output k -mer counts were adjusted by the coverage of
each sample before subsequent analyses.
3.3.3 Alignment quality analysis
The aln- r
2
statistic was used to evaluate how well a VNTR can be genotyped. It is the r
2
computed by regressing the k -mer counts from assemblies against the counts from reads aligned
51
to the locus-RPGG. Since VNTRs were genotyped using the RPGG, any read k -mers not present
in the original assembly were ignored.
3.3.4 Graph compaction and motif dosage computation
Locus-RPGG built from k =21 contains abundant contiguous paths without branches. It is
desirable to reduce the number of nodes to be tested in eQTL mapping by merging nodes on this
type of path. This is essentially a problem of converting dBGs to compact dBGs where nodes on
a non-branching path are merged into a unitig, or referred to as a motif in this context. For each
motif, we recorded the mapping relation from its constituent nodes and computed the motif
dosage by averaging the k -mer counts from constituent nodes. In practice, when given a matrix
of VNTR genotype where each column represents a k -mer in a locus-RPGG and each row
represents a sample, the matrix of motif dosages can be simply computed by column operations
using the mapping relations.
3.3.5 Quality control of motifs
To ensure the quality of motifs tested in eQTL mapping, we applied three filters to remove
motifs (1) with mean absolute percentage error (MAPE) > 0.25, (2) with dosage invariant across
HGSVC haplotypes, or (3) that were derived from an inter-VNTR region (denoted as
“flank-like”) but were included in the repeat region of a locus-RPGG due to the distance between
the upstream and downstream VNTRs being < 700 bp. For the first filter, we computed mean
absolute percentage error (APE) for each motif by measuring the error size of each motif to the
linear fit for aln- r
2
. Formally, let x = ( x
1
, x
2
, …, x
P
) be the motif dosages from assemblies and y =
( y
1
, y
2
, …, y
P
) be the motif dosages from short reads, where P is the number of motifs in the
52
locus-RPGG. For each genome g , is the fitted value for the dosage of a motif from the linear 𝑦
𝑔
fit between x and y . The MAPE of a motif can be computed as follows:
𝑀 𝐴 𝑃 𝐸 =
1
𝑁
𝑔 = 1
𝑁
∑
| 𝑦
𝑔
− 𝑦
𝑔
|
𝑦
𝑔
, where N is the number of genomes with the motif. For the second filter, a motif was removed if
the dosage of a motif was the same across all 70 haplotypes. The dosage was set to zero if the
motif was not present in a haplotype. For the third filter, any k -mers derived from the
inter-VNTR regions before the VNTR merging step were extracted. Any motifs overlapping with
these k -mers were removed.
3.3.6 eQTL mapping
Gene expression data was processed as previously described (Lu and Chaisson 2021) unless
stated otherwise. Briefly, normalized gene expression matrices and covariates of all tissues were
retrieved from GTEx Analysis Release V8 (dbGaP Accession phs000424.v8.p2) (GTEx
Consortium 2020) . Confounding factors were removed using covariates including sex,
sequencing platform, amplification method, PEER factors, and top 10 principal components
(PCs) from the joint SNP matrix with 1KGP samples. Residualization of the gene expression
matrices was done with the following formula:
, 𝑌 = ( 𝐼 − 𝐶 ( 𝐶
𝑇
𝐶 )
− 1
𝐶
𝑇
) 𝑌 '
where Y is the residualized expression matrix; Y′ is the normalized expression matrix; I is the
identity matrix; C is the covariate matrix where each column corresponds to a covariate
mentioned above.
53
For eQTL mapping using motif dosage, samples with motif dosage being two standard
deviations away from the mean were removed for each motif. For eQTL mapping using VNTR
length, samples with VNTR length being three standard deviations away from the mean were
removed for each VNTR. The motif dosages, VNTR lengths, and the residualized expression
counts for the remaining samples were z-score normalized before testing for association.
For gene-level cis -eQTL discoveries, the P-value of each test was adjusted according to
the number of variants tested for each gene using Bonferroni correction. The minimal P-values
from all the tests against each gene were extracted and controlled at 5% false discovery rate
using the Benjamini–Hochberg procedure. Only one eMotif (if using motif dosage) or one
eVNTR (if using VNTR length) was reported for each eGene. For cis -eQTL discoveries using a
genome-wide P-value cutoff, all P-values from all tests were recorded and controlled at 5% false
discovery rate using the Benjamini–Hochberg procedure. The P-value cutoffs range from
1.2✕10
-5
(Kidney) to 1.4✕10
-3
(Thyroid), depending on the power in each tissue (Table B.3). A
VNTR that contains an eMotif was also regarded as an eVNTR. Consequently, an eVNTR can be
associated with gene expression through length or motif dosage depending on the type of tests
performed.
3.3.7 Fine-mapping
To evaluate whether the eMotifs are causal to gene expression, we used susieR (Wang et al.
2020) to fine-map the cis -window around the transcription start site of each gene. All variants in
GTEx’s catalog
(GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.vcf.gz),
including SNVs, indels or structural variants, were extracted if within the 1 Mb cis -windows. For
each tissue, all motifs that have the lowest P-value for each gene-VNTR pair were extracted if
54
within the 100 kb cis -windows. The extracted GTEx variants and motifs were taken as input for
fine-mapping. Susie was run using L=5 to allow up to five sets of causal variants within the
whole region. Motifs with posterior inclusion probability (PIP) ≥ 0.8 while passing the
genome-wide P-value cutoff as described in the previous section were reported as likely causal
eMotifs.
55
3.4 Results
3.4.1 Repeat pangenome graphs enable accurate profiling of motif composition
We developed an extended computational analysis pipeline based on the previously published
danbing-tk method to map read depth and identify eQTLs from individual paths in an RPGG
(Figure 3.1). The RPGG is constructed using 35 haplotype-resolved assemblies including three
trios released by the Human Genome Structural Variation Consortium (HGSVC) (Ebert et al.
2021) . Orthologous boundaries of 80,478 VNTR loci were annotated using danbing-tk (Lu and
Chaisson 2021) from a set of 84,411 VNTRs (Methods). We further augment the VNTR
annotations with additional 40 clinically relevant loci (Table B.1), giving a total of 80,518 loci
for subsequent analyses. The VNTRs annotated have a mean length of 685 bp across assemblies
versus 665 bp in GRCh38 (Figure 3.2a, B.1). Among the 70 haplotypes, a VNTR has an average
of 7.4 alleles per locus when defining an allele based on exact length (Figure B.2). Each locus
has an average of 3.0 alleles that are observed only once, denoted as private allele count, and 597
loci that have at least half the alleles (N≥35) as private (Figure B.2). The number of alleles per
locus is positively correlated with VNTR length (ρ=0.36). As VNTR length increases, the private
allele count also increases (Figure B.2), e.g. private allele count = 10.2 when VNTR length > 500
bp.
We used danbing-tk (Lu and Chaisson 2021) to encode the allele information across
haplotypes in an RPGG, consisting of 80,518 locus subgraphs, or locus-RPGGs. The graph has a
total of 398,576,090 k -mers ( k =21) or nodes, and 404,035,564 edges, with an average of 4,950
nodes in each locus-RPGG when including 700 bp flanking sequences on both sides. Each
locus-RPGG has an average of 104 nodes or 10.6% nodes that are observed only in one
assembly. The repeat region of RPGG (excluding flanking sequences) has 62,457,731 k -mers
56
(Figure 3.2b), which is 38% greater than the graph built from GRCh38 alone (n=45,148,309) and
38% greater than the smallest graph built from an assembly (HG00864, n=45,197,090). We
evaluated the quality of the alignments to each locus-RPGG by measuring the consistency of
k -mer counts from assemblies versus from short-read mapping, denoted as aln- r
2
(Methods).
Overall, there is a slightly high aln- r
2
of 0.94±0.08 (Figure 3.2c) compared to the aln- r
2
on the
previously published 19 haplotype-resolved assemblies (0.93±0.08), with enrichment of loci with
higher aln- r
2
(Figure B.3). We had previously applied an aln- r
2
cutoff to limit computational
requirements, however the danbing-tk alignment method had a large, but constant factor
improvement in runtime, and all 80,518 loci are genotyped in this study.
3.4.2 VNTR motif composition has pervasive cis -effects on gene expression
Using the short-read alignment module in danbing-tk, we estimated the VNTR content of 80,518
loci as graph genotypes and processed 838 GTEx genomes (GTEx Consortium 2020) using ~12
cpu hours and ~29 Gb memory per sample. The read alignments to each subgraph are
summarized as a vector of the number of reads mapped to each node/ k -mer. When normalized by
global read depth these represent mapping dosage used as input for cis -eQTL mapping.
In Lu 2021 (Lu and Chaisson 2021) , cis -eQTL mapping using an RPGG was reported
using the sum of the dosage vector for each locus-RPGG as an estimate of VNTR length.
Replicating this approach on the 35-genome RPGG, we discovered 1,355 VNTRs in association
with nearby gene expression (denoted as eVNTR), which is 3.9-fold the number (N=346)
reported from the previous RPGG built on 32,138 VNTRs (Lu and Chaisson 2021) . Of the
original eVNTRs, 84% (290/346) were reproduced in this version while the remaining 16% are
57
enriched in discoveries with lower significance, in particular when nominal P > 10
-5
(odds ratio =
0.819, Fisher’s exact P = 1.4 ✕10
-9
, Figure B.4).
In addition to VNTR length, our previous work (Lu and Chaisson 2021) identified motifs
enriched in certain populations sequenced by the 1000 Genomes Project Consortium (1000
Genomes Project Consortium et al. 2015) . We hypothesized that differential motif usage across
individuals, possibly independent of overall VNTR length variation, can modulate nearby gene
expression. To test this, we converted each locus-RPGG to a compact de Bruijn graph, and
considered each path as a locus to test in eQTL mapping. The original RPGG contains on
average 776 nodes in each locus-RPGG, which is reduced to 55 paths (referred to as motifs
hereafter) after compaction (Figure B.5), with a total of 4,456,881 motifs.
To ensure the quality of read-mapping to each motif in each locus-RPGG, we evaluate
the “mappability” of each motif by measuring the consistency between the dosage from short
reads and the dosage from the ground-truth assemblies using mean absolute percentage error
(MAPE, Methods). We removed 49.8% (2,219,780/4,456,881) of the motifs with MAPE ≥ 0.25
(Figure 3.3a). The number of motifs with zero variance in absolute percentage error (n=764,354)
is equivalent to the number paths private to one genome among the 35 HGSVC assemblies
(Figure B.6) and were retained for subsequent analyses. Similar to eQTL mapping on SNVs
where homozygous variants are removed, we avoid testing “invariant” motifs that appear the
same number of times in a repeat across all assemblies. This further removes 25.5%
(571,537/2,237,101) of motifs. By construction, our RPGG could contain loci with multiple
VNTRs in close proximity but spaced apart by short flanking sequences. We removed additional
0.1% (2,005/1,665,564) motifs that are derived from those sequences and could be simply
explained by SNVs, leaving a total of 1,663,559 motifs for eQTL mapping. For each motif, we
58
consider only samples whose dosages are within two standard deviations from the mean,
avoiding the discoveries of associations whose effects are mainly driven by outliers. This on
average removes 34 from 813 of the samples per locus (Figure B.7, Methods).
We ran cis -eQTL mapping for each VNTR motif and discovered 20,834 eVNTRs,
including 53,209 motifs associated with nearby gene expression, denoted as eMotifs (Figure
3.3b). While 78% (1,053/1,355) of the eVNTRs discovered using length estimates were also
reported using motif dosage, 95% (19,781 / 20,834) of the eVNTRs discovered from motif were
undetectable with length-based eQTL mapping. To assess the reproducibility of our methods, we
performed eQTL mapping on the Geuvadis dataset (Lappalainen et al. 2013) and compared the
discoveries with the GTEx results. We found that 68.4% (1,923/2,811) of the eMotifs and 92.9%
(2,397/2,579) of the eVNTRs from Geuvadis were also observed in at least one GTEx tissue.
Unreplicated eGene-eVNTR pairs (ePairs) tend to have lower significance in Geuvadis,
especially when P > 2.5✕10
-9
(odds ratio = 3.2, Fisher’s exact P = 4.2✕10
-39
, Figure B.8). When
comparing the whole blood tissue from GTEx and the lymphoblastoid from Geuvadis, the effect
sizes from the two datasets had a correlation coefficient of 0.80 (Figure 3.3c). Among the
replicated ePairs, 91% (800/878) had the same sign of effect (Figure 3.3d), suggesting motif
variations as a common explanatory variable in gene expression.
3.4.3 Disease relevance of eMotifs
Among the 40 additional disease-relevant tandem repeats (matched with 36 genes) included in
the RPGG, 18 of them including C9orf72 , CACNA1C , CSTB , DRD4 and MUC21 (Table B.2)
were identified as eQTLs and were associated with their original disease-linked genes.
Additionally, at least one eVNTR was detected for 17 of the 18 remaining genes and was
different from the originally annotated disease-relevant tandem repeat.
59
We investigated two examples of motifs associated with disease to see if they had
associations with expression in healthy individuals. Landefeld et al. (Landefeld et al. 2018)
report that the “RS1” VNTR at the 5’UTR of AVPR1A is associated with externalizing behaviors
while V ollebregt et al. (V ollebregt et al. 2021) report that the “RS3” VNTR but not RS1 is
associated with childhood onset aggression. No association between VNTR length of RS1 nor
RS3 with AVPR1A expression was found, however we found eMotifs for AVPR1A in healthy
individuals that correspond to the RS1 VNTR (CTAT)
5
TTAT(CTAT)
4
(b=−0.26, P=1.0 ✕10
-4
)
and the RS3 VNTR C(AG)
10
(b=0.21, P=2.1 ✕10
-4
). The RS3 VNTR
(chr12:63,156,354-63,156,429) has a nested repeat structure with an average size of 701 bp in
assemblies. It consists of two slightly divergent copies that each carries a highly repetitive
(CT)nTT(CT)n(GT)n core motif at chr12:63,156,354-63,156,394 and
chr12:63,156,701-63,156,751 (Figure B.9). Other VNTR annotation approaches might consider
this region as two separate VNTRs of which the length of the core motif is associated with
AVPR1A expression.
The decreased expression of CACNA1C was known to be a risk factor for schizophrenia
and has been reported to be associated with several 30 bp risk motifs at
chr12:2,255,789-2,256,090 based on a case-control study (Song, Lowe, and Kingsley 2018) .
Here, we found that the expression of CACNA1C in brain cerebellar hemisphere was associated
with a risk eMotif CAACCACACGATCCTGACCTT (denoted as motif 1, b=−0.44, P=1.2 ✕10
-8
,
Figure 3.4a). The eMotif covers five of the six mutation sites (Figure 3.4b) and is novel to all of
the risk motifs reported previously (Song, Lowe, and Kingsley 2018) . In addition, we were able
to replicate findings from the case-control study (Song, Lowe, and Kingsley 2018) in healthy
populations. The risk eMotif CCTGACCTTACTAGTTTACGA (b=−0.40, P=1.5 ✕10
-7
, denoted
60
as motif 2) that covers two of the mutation sites (Figure 3.4b) matches two of the risk motifs and
none of the protective motifs, indicating the prevalence of risk-modulating motifs even among
healthy populations. When examining the frequency of the two risk eMotifs in the 35 HGSVC
assemblies, 28 haplotypes carry motif 1 while 38 haplotypes carry motif 2 (Figure B.10). Most of
the individuals carry only few copies of the risk motifs except for the haplotype 2 of HG02818
which has 98 copies of motif 1. Annotating the locus-RPGG with the eQTL effect sizes also
reveals that motifs with minor risk and protective effects are pervasive within the locus, e.g.
CTGACTAGTTTACGATCACACGA (b=−0.29, P=1.6 ✕10
-4
) and
CAACCACACGATCCTGACCTGA (b=0.29, P=2.7 ✕10
-4
) (Figure 3.4c). The size of this VNTR
locus in GRCh38 is underrepresented with only 301 bp compared to an average size of 6,247 bp
across assemblies. In addition, immense histone modification, DNase clusters and TF clusters
signals can also be found in this locus (Figure 3.4d), necessitating future investigations to fully
understand the regulation mechanism of CACNA1C .
To further narrow down eMotifs that are likely causal to gene expression, we used susieR
(Wang et al. 2020) to fine-map the 1Mb cis -window of each eGene. We discovered 273 out of
19,965 eGenes of which the highest eMotif posterior inclusion probability (PIP) is greater than
0.8, suggesting the likely causal roles of these eMotifs (Figure 3.4e-f). The expression of these
273 eGenes are likely modulated by 560 eMotifs, or equivalently 229 eVNTRs (Figure 3.3e). On
average, 81.4% of the eGenes are shared across tissues while 82.4% of the eVNTRs and 53.1%
of the eMotifs are observed across multiple tissues (Figure B.11-13).
In search of eGene versus likely causal eMotif pairs that could have potential clinical
implications, we first ranked eGenes by their fold change in disease enrichment using GS2D
(Fontaine and Andrade-Navarro 2016) . eGenes with associated disease matching the
61
fine-mapped tissues and of which at least one eVNTR located within the gene body were
retained. The top four example genes (Table 3.1) have important roles in immune response
( ITGB2 , FCGR3B and FCGR3A ), cell-cell interactions ( ITGB2 ), or histone modification
( KANSL1 ). Most haplotypes in the HGSVC assemblies carry only up to two copies of the listed
motifs (Figure B.14), except for the motif of ITGB2 , which has 1-7 copies in observed
haplotypes. Most eVNTRs in the HGSVC assemblies have similar size to GRCh38 except for the
KANSL1 VNTR at chr17:46,217,604-46,217,944, which is approximately 15 times of the sizes in
GRCh38.
62
Figure 3.1: Methods overview. a, Estimating the dosages of VNTR motifs using a locus-RPGG.
A locus-RPGG is built from haplotype-resolved assemblies by first annotating the orthology
mapping of VNTR boundaries and then encoding the VNTR alleles with a de Bruijn graph
(dBG), or locus-RPGG. A compact dBG is constructed by merging nodes on a non-branching
path into a unitig, denoted as a motif in this context. Motif dosages of a VNTR can be computed
by aligning short reads to an RPGG and averaging the coverage of nodes corresponding to the
same motif. b, Identifying likely causal eMotifs. The dosage of each motif is tested against the
expression level of a nearby gene. Genes in significant association with at least one motif
(denoted as eMotif) are fine-mapped using susieR (Wang et al. 2020) in order to identify eMotifs
that are likely causal to nearby gene expression. All GTEx variants (Methods) and lead motifs
from each gene-VNTR pair are included in the fine-mapping model.
63
Figure 3.2: Characteristics of VNTRs and the RPGG. a, Size distribution of VNTR alleles across
35 HGSVC assemblies. Each dot represents the size of a VNTR locus in an assembly. The order
of 80,518 VNTR loci were sorted according to size in GRCh38. b, Cumulative graph sizes. A
total of 35 repeat-genome graphs were incrementally added to the RPGG in the order of their
respective graph size. The red dash line denotes the size of the repeat graph built from GRCh38.
The blue dash line denotes the average size of the graphs built from assemblies. c, Distribution of
the aln- r
2
for all locus-RPGGs. The aln- r
2
of each locus was computed by regressing the
assembly k -mer counts against the read k -mer counts from graph alignments.
64
Figure 3.3: cis -eQTL mapping of VNTR motifs. a , Distribution of motif-MAPE. For each motif,
the mean absolute percentage error (MAPE) was computed by averaging the absolute percentage
errors (APEs, Methods) across 35 genomes. A total of 4,456,881 motifs were shown in the plot.
b, Quantile-quantile plot of gene-level eMotif discoveries across 20 human tissues from GTEx
datasets. The expected P-values (x-axis) were drawn from Unif(0,1) and plotted against observed
nominal P-values obtained from t -test on the slope of each linear model consisting of expression
(response variable) versus motif dosage (explanatory variable). c-d. Replication on Geuvadis
dataset. c, The quantile-quantile plot shows the observed P-value of each association test
(two-sided t-test) versus the P-value drawn from the expected uniform distribution. Black dots
indicate the permutation results from the top 5% associated (gene, motif) pairs. d, Correlation of
65
eMotif effect sizes between Geuvadis and GTEx whole blood tissue. Only eGenes significant in
both datasets were shown. Each pair of gene and motif that has the same/opposite sign across
datasets were colored in red/black. e. VNTR-centric view of gene-level eQTL discoveries and
fine-mapping. Lead eMotif denotes any VNTRs that are associated with gene expression through
at least one eMotif under gene-level discoveries. Length denotes any VNTRs that are associated
with gene expression through length under gene-level discoveries. Causal eMotif denotes any
VNTRs of which a motif passes the fine-mapping procedure with a posterior inclusion
probability ≥ 0.8 while being a significant eQTL under genome-wide P-value cutoff. Motifs with
the lowest P-value for each VNTR-gene pair are included in the fine-mapping model.
66
Figure 3.4: Disease-relevant genes and fine-mapping. a-d, Identification of risk motifs for
CACNA1C expression. The motifs in CACNA1C VNTR at chr12:2,255,789-2,256,088 were
analyzed. a, Association of CACNA1C VNTR motif CAACCACACGATCCTGACCTT with
gene expression. b, Genome browser view of CACNA1C VNTR. c, Graph visualization of motif
effect sizes from the CACNA1C VNTR. Each edge denotes a motif and is colored blue/red if
67
having a negative/positive effect on gene expression. Color saturation and edge width both scale
with the absolute value of effect size. The sequence of a motif is shown parallel to the edge and
colored in dark blue if having a significant effect or colored in light red/blue if borderline
significant. d, Multiple sequence alignment of known CACNA1C risk motifs (Song, Lowe, and
Kingsley 2018) and the likely causal eMotifs reported in this study. e, Distribution of posterior
inclusion probability (PIP). eMotifs with PIP greater than 0.8 were called likely causal. Each dot
is from a fine-mapped gene in a tissue. The maximal PIPs of all eMotifs/GTEx variants nearby a
gene are shown on the x-axis/y-axis. f, Ideograms of selected eGenes and eGenes with a likely
causal eMotif. The locations of all 510 eGenes with a likely causal eMotif are shown. The
locations of AVPR1A , CACNA1C , ITGB2 , FCGR3B , FCGR3A and KANSL1 are shown in the
selected example genes track.
Table 3.1: Selected examples of disease-related genes with likely causal eMotifs.
[1] The one with the lowest q-value in cis -eQTL mapping if there are likely causal eMotifs in multiple tissues.
[2] Denotes whether the VNTR coordinate on GRCh38 overlaps with the histone H3K4Me1 H3K4Me3, H3K27Ac mark, the
DNase clusters, or the TF clusters track in Genome Browser.
[3] The VNTR size in GRCh38 (ref) versus the average size in assemblies (asm).
68
3.5 Discussion
Genomic variant discovery serves to link genetic and phenotypic variation. Using gene
expression as a phenotypic measure, diverse classes of variation have been found to have an
effect on gene expression including single-nucleotide variants (GTEx Consortium 2020) ,
structural variation (Chiang et al. 2017; Ebert et al. 2021; Sudmant, Rausch, et al. 2015) , STRs
(Gymrek et al. 2016) , and VNTRs (Bakhtiari et al. 2021; Eslami Rasekh et al. 2021; Garg et al.
2021) . Here we show that in addition to association of VNTR length with expression, a more
nuanced measurement of VNTR variation that takes into account sequence composition reveals
eMotifs that influence gene expression.
Overall, we find 20,834 VNTR loci containing at least one eMotif. In contrast, previous
studies that used associations based on length estimate alone ranged between 163-2,980 eVNTRs
(Bakhtiari et al. 2021; Lu and Chaisson 2021; Garg et al. 2021) , with the number roughly
correlating with the number of loci each study profiled. Although more tests per VNTR locus are
performed, the fine-mapping analysis finds that the majority of variants are linked with nearby
eQTLs. After applying fine-mapping, 229 (1.1%) eVNTRs contain motifs determined as causal.
In contrast, 0.18% of the 4.3M eQTL variants discovered in the GTEx (v8) are fine-mapped
(GTEx Consortium 2020) .
We observe that most eVNTRs have different motifs positively and negatively associated
with expression of the same nearby gene. Most eQTL mapping pipelines are based on biallelic
variants. When encoding a variant, the reference allele is usually treated as zero while the
alternative allele is treated as one. Alternatively, this can be viewed as encoding the alternative
allele as its copy number, which is simply one for a biallelic variant, while keeping the reference
allele as zero. When the same encoding method is applied to a VNTR locus consisting of a
69
reference motif and an alternative motif, the only difference is that the alternative allele becomes
a continuous value representing the adjusted motif depth, and may take on a positive or negative
association depending on the relation to the reference motif.
This study profiles 80,518 VNTR loci, a 2.5-fold increase over our previous analysis of
VNTR variation using repeat-pangenome graphs (Lu and Chaisson 2021) that is largely
attributable to the high-quality haplotype-resolved assemblies used to construct the pangenome.
The size of the graph increases sequentially with the number of assemblies included in the graph,
and is consistent with the increasing number of structural variants discovered in VNTRs by
whole-genome alignment (Ebert et al. 2021) . The inclusion of additional genomes from
large-scale sequencing projects such as the Human Pangenome Reference Consortium will yield
an improved estimate of saturation of VNTR variation.
The use of repeat-pangenome graphs in this study differs from other implementations of
pangenom-graphs including those constructed by progressive whole-genome alignment (Li,
Feng, and Chu 2020) and variant inclusion (Sirén et al. 2021) , both of which preserve haplotype
information from the genomes or variants used to construct the pangenome graph. While
systematic analysis of variation in VNTRs and association with expression has not yet been
conducted using these approaches, we anticipate the repeat-pangenome graph will provide
complementary analysis. In particular, variant genotyping in graphs that preserve haplotype as
implemented by Giraffe (Sirén et al. 2021) and PanGenie (Ebler et al., n.d.) corresponds to
associating read data with haplotypes (paths in a pangenome graph) covering variants. These
approaches provide highly accurate genotyping of variants shared with the graph, however
hypervariable VNTR sequences are more likely to have differences from genomes represented in
the graph, and additional analysis is required to quantify motif usage in addition to genotype.
70
The implementation of the pangenome as a de Bruijn graph is an elegant approach to
identifying the composition of identical motif repeats, however small differences in motif
composition can make the graph complex, and additional development is necessary to identify
graph topologies that naturally reflect VNTR repeat composition. One result of this complexity is
our number of eMotifs that are deemed likely causal using fine mapping is possibly an
underestimate. Many motifs have highly correlated read dosage, however we use a conservative
approach of considering each motif as an independent variable for fine mapping. Future
development that merges similar motifs to the same edge both aggregate depth otherwise split on
several edges, and reduce the correlated motifs tested during fine mapping.
In summary, this study demonstrates how VNTR composition has a pervasive influence
on gene expression, and highlights the need to profile variation in complex, repetitive regions of
the genome. We anticipate this approach will be useful for future expression and association
studies.
71
Chapter 4
Conclusions and Future Work
In this dissertation, I described the implications of our work on human genetics research. In brief,
human phenotypic variations are usually explained with a GxE model that considers both genetic
(G) and environmental (E) components. Conventional computational genetics studies focused
entirely on SNVs and short indels, and ignored other types of genetic variants. As sequencing
technologies and computational methods have advanced, the models have been expanded to
include SVs and STRs. Only very recently have researchers started to examine the functional
impact of VNTRs. While VNTR lengths could be faithfully estimated with different approaches,
none of the current methods consider the sequence variations in repeat units.
In chapter 2, I presented our novel approach to genotype VNTRs using danbing-tk . It
builds an RPGG from diploid assemblies for 32,138 loci, provides accurate estimates of VNTR
length from Illumina short-read sequencing data, and facilitates the identification of VNTRs as
eQTLs through length polymorphisms. We reported 346 eVNTRs through length association,
identified 477 potentially unstable loci, revealed 785 loci with length stratified across
populations, and detected 8,216 loci with differential motif usage between populations. The last
analysis also opens the possibility to analyze VNTR motifs besides length.
In chapter 3, I presented our follow-up study on the unanswered question – what is the
impact of VNTR motif variation on gene expression at the genome-wide scale? We updated the
RPGG and expanded the set of genotypable VNTRs to 80,518 loci. Leveraging the rich
information in the RPGG, we tested the association of each VNTR motif in the graph with gene
expression. We reported 15x more signals compared to eQTL mapping using repeat length.
72
Using the CACNA1C VNTR as an example, we demonstrated the ability of our framework to
replicate a known risk eMotif for schizophrenia while also reporting a novel risk motif. We
fine-mapped all eGenes and reported 229 eMotifs that cannot be explained by nearby GTEx
variants. We prioritized four genes carrying likely causal eMotifs with large effect sizes and
overlapping abundant regulatory signals. Our findings provide another piece of the puzzle to the
missing heritability problem and serve as a reference for future studies.
One major direction for the future is to apply the danbing-tk framework to detect
functional elements using case-control studies. One of the several datasets we are interested in is
the Metabolic Syndrome in Men study (METSIM) (Stančáková et al. 2009; Laakso et al. 2017) .
The METSIM study is a resourceful dataset for understanding the roles of genetic components in
type 2 diabetes, cardiovascular diseases, and insulin resistance-related traits. With the large
sample size and unique properties of the Finnish populations, studies were able to identify
multiple trait-associated biomarkers (Mardinoglu et al. 2017; Orozco et al. 2018) . By integrating
multiple data types such as WGS, RNA-seq, and phenotype data, causal inference using
mendelian randomization was feasible for pinpointing the variants and genes that are most
disease-relevant (Civelek et al. 2017; Ganel et al. 2020) .
Another aspect of our future work is to develop a new module in danbing-tk to detect rare
VNTR alleles not present in the graph. The current danbing-tk implementation simply skips the
parts of reads that are highly divergent from the RPGG, leaving the corresponding paths with
lower k -mer counts. As we move towards datasets with larger sample sizes to increase the power
of variant-trait associations, it is more likely for rare alleles with weak effects to be reported.
This requires our program to be capable of reporting novel rare alleles from recurrent read
73
divergence. With the increased robustness of our algorithm, we expect to discover more
functionally relevant variants in the future.
74
References
1000 Genomes Project Consortium, Adam Auton, Lisa D. Brooks, Richard M. Durbin, Erik P.
Garrison, Hyun Min Kang, Jan O. Korbel, et al. 2015. “A Global Reference for Human
Genetic Variation.” Nature 526 (7571): 68–74.
Amemiya, Haley M., Anshul Kundaje, and Alan P. Boyle. 2019. “The ENCODE Blacklist:
Identification of Problematic Regions of the Genome.” Scientific Reports 9 (1): 9354.
Audano, Peter A., Arvis Sulovari, Tina A. Graves-Lindsay, Stuart Cantsilieris, Melanie
Sorensen, Annemarie E. Welch, Max L. Dougherty, et al. 2019. “Characterizing the Major
Structural Variant Alleles of the Human Genome.” Cell 176 (3): 663–75.e19.
Bakhtiari, Mehrdad, Jonghun Park, Yuan-Chun Ding, Sharona Shleizer-Burko, Susan L.
Neuhausen, Bjarni V. Halldórsson, Kári Stefánsson, Melissa Gymrek, and Vineet Bafna.
2021. “Variable Number Tandem Repeats Mediate the Expression of Proximal Genes.”
Nature Communications 12 (1): 1–12.
Bakhtiari, Mehrdad, Sharona Shleizer-Burko, Melissa Gymrek, Vikas Bansal, and Vineet Bafna.
2018. “Targeted Genotyping of Variable Number Tandem Repeats with adVNTR.” Genome
Research 28 (11): 1709–19.
Benedetti, Francesco, Sara Dallaspezia, Cristina Colombo, Adele Pirovano, Elena Marino, and
Enrico Smeraldi. 2008. “A Length Polymorphism in the Circadian Clock Gene Per3
Influences Age at Onset of Bipolar Disorder.” Neuroscience Letters 445 (2): 184–87.
Benson, G. 1999. “Tandem Repeats Finder: A Program to Analyze DNA Sequences.” Nucleic
Acids Research . https://doi.org/ 10.1093/nar/27.2.573 .
Beyter, Doruk, Helga Ingimundardottir, Asmundur Oddsson, Hannes P. Eggertsson, Eythor
Bjornsson, Hakon Jonsson, Bjarni A. Atlason, et al. 2021. “Long-Read Sequencing of 3,622
Icelanders Provides Insight into the Role of Structural Variants in Human Diseases and
Other Traits.” Nature Genetics 53 (6): 779–86.
Braida, Claudia, Rhoda K. A. Stefanatos, Berit Adam, Navdeep Mahajan, Hubert J. M. Smeets,
Florence Niel, Cyril Goizet, et al. 2010. “Variant CCG and GGC Repeats within the CTG
Expansion Dramatically Modify Mutational Dynamics and Likely Contribute toward
Unusual Symptoms in Some Myotonic Dystrophy Type 1 Patients.” Human Molecular
Genetics 19 (8): 1399–1412.
Brown, Andrew Anand, Alfonso Buil, Ana Viñuela, Tuuli Lappalainen, Hou-Feng Zheng, J.
Brent Richards, Kerrin S. Small, Timothy D. Spector, Emmanouil T. Dermitzakis, and
Richard Durbin. 2014. “Genetic Interactions Affecting Human Gene Expression Identified
by Variance Association Mapping.” eLife 3 (April): e01381.
Buil, Alfonso, Andrew Anand Brown, Tuuli Lappalainen, Ana Viñuela, Matthew N. Davies,
Hou-Feng Zheng, J. Brent Richards, et al. 2014. “Gene-Gene and Gene-Environment
Interactions Detected by Transcriptome Sequence Analysis in Twins.” Nature Genetics 47
75
(1): 88–91.
Cameron, Daniel L., Jan Schröder, Jocelyn Sietsma Penington, Hongdo Do, Ramyar Molania,
Alexander Dobrovic, Terence P. Speed, and Anthony T. Papenfuss. 2017. “GRIDSS:
Sensitive and Specific Genomic Rearrangement Detection Using Positional de Bruijn Graph
Assembly.” Genome Research 27 (12): 2050–60.
Chaisson, Mark J. P., Ashley D. Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky,
Tobias Rausch, Eugene J. Gardner, et al. 2019. “Multi-Platform Discovery of
Haplotype-Resolved Structural Variation in Human Genomes.” Nature Communications 10
(1): 1784.
Chen, Sai, Peter Krusche, Egor Dolzhenko, Rachel M. Sherman, Roman Petrovski, Felix
Schlesinger, Melanie Kirsche, et al. 2019. “Paragraph: A Graph-Based Structural Variant
Genotyper for Short-Read Sequence Data.” Genome Biology 20 (1): 291.
Chiang, Colby, Alexandra J. Scott, Joe R. Davis, Emily K. Tsang, Xin Li, Yungil Kim, Tarik
Hadzic, et al. 2017. “The Impact of Structural Variation on Human Gene Expression.”
Nature Genetics 49 (5): 692–99.
Chin, Chen-Shan, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion,
Alicia Clum, Christopher Dunn, et al. 2016. “Phased Diploid Genome Assembly with
Single-Molecule Real-Time Sequencing.” Nature Methods 13 (12): 1050–54.
Consortium, Gtex, and GTEx Consortium. 2017. “Genetic Effects on Gene Expression across
Human Tissues.” Nature . https://doi.org/ 10.1038/nature24277 .
Consortium, International Human Genome Sequencing, and International Human Genome
Sequencing Consortium. 2001. “Initial Sequencing and Analysis of the Human Genome.”
Nature . https://doi.org/ 10.1038/35057062 .
Course, Meredith M., Kathryn Gudsnuk, Samuel N. Smukowski, Kosuke Winston, Nitin Desai,
Jay P. Ross, Arvis Sulovari, et al. 2020. “Evolution of a Human-Specific Tandem Repeat
Associated with ALS.” American Journal of Human Genetics 107 (3): 445–60.
Course, Meredith M., Arvis Sulovari, Kathryn Gudsnuk, Evan E. Eichler, and Paul N.
Valdmanis. 2021. “Characterizing Nucleotide Variation and Expansion Dynamics in
Human-Specific Variable Number Tandem Repeats.” Genome Research 31 (8): 1313–24.
Coventry, William L., and Matthew C. Keller. 2005. “Estimating the Extent of Parameter Bias in
the Classical Twin Design: A Comparison of Parameter Estimates From Extended
Twin-Family and Classical Twin Designs.” Twin Research and Human Genetics .
https://doi.org/ 10.1375/twin.8.3.214 .
Dashnow, Harriet, Monkol Lek, Belinda Phipson, Andreas Halman, Simon Sadedin, Andrew
Lonsdale, Mark Davis, et al. 2018. “STRetch: Detecting and Discovering Pathogenic Short
Tandem Repeat Expansions.” Genome Biology 19 (1): 121.
DeJesus-Hernandez, Mariely, Ian R. Mackenzie, Bradley F. Boeve, Adam L. Boxer, Matt Baker,
Nicola J. Rutherford, Alexandra M. Nicholson, et al. 2011. “Expanded GGGGCC
76
Hexanucleotide Repeat in Noncoding Region of C9ORF72 Causes Chromosome 9p-Linked
FTD and ALS.” Neuron 72 (2): 245–56.
De Roeck, Arne, Lena Duchateau, Jasper Van Dongen, Rita Cacace, Maria Bjerke, Tobi Van den
Bossche, Patrick Cras, et al. 2018. “An Intronic VNTR Affects Splicing of ABCA7 and
Increases Risk of Alzheimer’s Disease.” Acta Neuropathologica 135 (6): 827–37.
Desseyn, Jean-Luc, Jean-Pierre Aubert, Nicole Porchet, and Anne Laine. 2000. “Evolution of the
Large Secreted Gel-Forming Mucins.” Molecular Biology and Evolution 17 (8): 1175–84.
Dewan, Andrew, Mugen Liu, Stephen Hartman, Samuel Shao-Min Zhang, David T. L. Liu,
Connie Zhao, Pancy O. S. Tam, et al. 2006. “HTRA1 Promoter Polymorphism in Wet
Age-Related Macular Degeneration.” Science 314 (5801): 989–92.
Dolzhenko, Egor, Mark F. Bennett, Phillip A. Richmond, Brett Trost, Sai Chen, Joke J. F. A. van
Vugt, Charlotte Nguyen, et al. 2020. “ExpansionHunter Denovo: A Computational Method
for Locating Known and Novel Repeat Expansions in Short-Read Sequencing Data.”
Genome Biology 21 (1): 102.
Dolzhenko, Egor, Viraj Deshpande, Felix Schlesinger, Peter Krusche, Roman Petrovski, Sai
Chen, Dorothea Emig-Agius, et al. 2019. “ExpansionHunter: A Sequence-Graph-Based
Tool to Analyze Variation in Short Tandem Repeat Regions.” Bioinformatics 35 (22):
4754–56.
Dolzhenko, Egor, Joke J. F. A. van Vugt, Richard J. Shaw, Mitchell A. Bekritsky, Marka van
Blitterswijk, Giuseppe Narzisi, Subramanian S. Ajay, et al. 2017. “Detection of Long
Repeat Expansions from PCR-Free Whole-Genome Sequence Data.” Genome Research 27
(11): 1895–1903.
Du, Zhenglin, Liang Ma, Hongzhu Qu, Wei Chen, Bing Zhang, Xi Lu, Weibo Zhai, et al. 2019.
“Whole Genome Analyses of Chinese Population and De Novo Assembly of A Northern
Han Genome.” Genomics, Proteomics & Bioinformatics 17 (3): 229–47.
Eaves, L. J., Krystyna A. Last, P. A. Young, and N. G. Martin. 1978. “Model-Fitting Approaches
to the Analysis of Human Behaviour.” Heredity . https://doi.org/ 10.1038/hdy.1978.101 .
Ebert, Peter, Peter A. Audano, Qihui Zhu, Bernardo Rodriguez-Martin, David Porubsky, Marc
Jan Bonder, Arvis Sulovari, et al. 2021. “Haplotype-Resolved Diverse Human Genomes and
Integrated Analysis of Structural Variation.” Science 372 (6537).
https://doi.org/ 10.1126/science.abf7117 .
Ebler, Jana, Wayne E. Clarke, Tobias Rausch, Peter A. Audano, Torsten Houwaart, Jan Korbel,
Evan E. Eichler, Michael C. Zody, Alexander T. Dilthey, and Tobias Marschall. n.d.
“Pangenome-Based Genome Inference.” https://doi.org/ 10.1101/2020.11.11.378133 .
Eggertsson, Hannes P., Snaedis Kristmundsdottir, Doruk Beyter, Hakon Jonsson, Astros
Skuladottir, Marteinn T. Hardarson, Daniel F. Gudbjartsson, Kari Stefansson, Bjarni V.
Halldorsson, and Pall Melsted. 2019. “GraphTyper2 Enables Population-Scale Genotyping
of Structural Variation Using Pangenome Graphs.” Nature Communications 10 (1): 5402.
77
Eichler, Evan E., Jonathan Flint, Greg Gibson, Augustine Kong, Suzanne M. Leal, Jason H.
Moore, and Joseph H. Nadeau. 2010. “Missing Heritability and Strategies for Finding the
Underlying Causes of Complex Disease.” Nature Reviews. Genetics 11 (6): 446–50.
Eizenga, Jordan M., Adam M. Novak, Jonas A. Sibbesen, Simon Heumos, Ali Ghaffaari, Glenn
Hickey, Xian Chang, et al. 2020. “Pangenome Graphs.” Annual Review of Genomics and
Human Genetics 21 (August): 139–62.
Eslami Rasekh, Marzieh, Yözen Hernández, Samantha D. Drinan, Juan I. Fuxman Bass, and
Gary Benson. 2021. “Genome-Wide Characterization of Human Minisatellite VNTRs:
Population-Specific Alleles and Gene Expression Differences.” Nucleic Acids Research 49
(8): 4308–24.
Fairfax, Benjamin P., Peter Humburg, Seiko Makino, Vivek Naranbhai, Daniel Wong, Evelyn
Lau, Luke Jostins, et al. 2014. “Innate Immune Activity Conditions the Effect of Regulatory
Variants upon Monocyte Gene Expression.” Science 343 (6175): 1246949.
Fairley, Susan, Ernesto Lowy-Gallego, Emily Perry, and Paul Flicek. 2020. “The International
Genome Sample Resource (IGSR) Collection of Open Human Genomic Variation
Resources.” Nucleic Acids Research 48 (D1): D941–47.
Felson, Jacob. 2014. “What Can We Learn from Twin Studies? A Comprehensive Evaluation of
the Equal Environments Assumption.” Social Science Research 43 (January): 184–99.
Flannick, Jason, Gudmar Thorleifsson, Nicola L. Beer, Suzanne B. R. Jacobs, Niels Grarup, Noël
P. Burtt, Anubha Mahajan, et al. 2014. “Loss-of-Function Mutations in SLC30A8 Protect
against Type 2 Diabetes.” Nature Genetics 46 (4): 357–63.
Fontaine, Jean Fred, and Miguel A. Andrade-Navarro. 2016. “Gene Set to Diseases (GS2D):
Disease Enrichment Analysis on Human Gene Sets with Literature Data.” Genomics and
Computational Biology . https://doi.org/ 10.18547/gcb.2016.vol2.iss1.e33 .
Fotsing, Stephanie Feupe, Jonathan Margoliash, Catherine Wang, Shubham Saini, Richard
Yanicky, Sharona Shleizer-Burko, Alon Goren, and Melissa Gymrek. 2019. “The Impact of
Short Tandem Repeat Variation on Gene Expression.” Nature Genetics 51 (11): 1652–59.
Franke, Andre, Dermot P. B. McGovern, Jeffrey C. Barrett, Kai Wang, Graham L.
Radford-Smith, Tariq Ahmad, Charlie W. Lees, et al. 2010. “Genome-Wide Meta-Analysis
Increases to 71 the Number of Confirmed Crohn’s Disease Susceptibility Loci.” Nature
Genetics 42 (12): 1118–25.
Fuchsberger, Christian, Jason Flannick, Tanya M. Teslovich, Anubha Mahajan, Vineeta
Agarwala, Kyle J. Gaulton, Clement Ma, et al. 2016. “The Genetic Architecture of Type 2
Diabetes.” Nature 536 (7614): 41–47.
Garg, Paras, Bharati Jadhav, William Lee, Oscar L. Rodriguez, Alejandro Martin-Trujillo, and
Andrew J. Sharp. 2022. “A Phenome-Wide Association Study Identifies Effects of
Copy-Number Variation of VNTRs and Multicopy Genes on Multiple Human Traits.”
American Journal of Human Genetics , May. https://doi.org/ 10.1016/j.ajhg.2022.04.016 .
78
Garg, Paras, Alejandro Martin-Trujillo, Oscar L. Rodriguez, Scott J. Gies, Elina Hadelia, Bharati
Jadhav, Miten Jain, Benedict Paten, and Andrew J. Sharp. 2021. “Pervasive Cis Effects of
Variation in Copy Number of Large Tandem Repeats on Local DNA Methylation and Gene
Expression.” American Journal of Human Genetics 108 (5): 809–24.
Garrison, Erik, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T.
Dawson, William Jones, et al. 2018. “Variation Graph Toolkit Improves Read Mapping by
Representing Genetic Variation in the Reference.” Nature Biotechnology 36 (9): 875–79.
Gatchel, Jennifer R., and Huda Y. Zoghbi. 2005. “Diseases of Unstable Repeat Expansion:
Mechanisms and Common Principles.” Nature Reviews. Genetics 6 (10): 743–55.
Glass, Daniel, Ana Viñuela, Matthew N. Davies, Adaikalavan Ramasamy, Leopold Parts, David
Knowles, Andrew A. Brown, et al. 2013. “Gene Expression Changes with Age in Skin,
Adipose Tissue, Blood and Brain.” Genome Biology 14 (7): R75.
GTEx Consortium. 2020. “The GTEx Consortium Atlas of Genetic Regulatory Effects across
Human Tissues.” Science 369 (6509): 1318–30.
Gymrek, Melissa, David Golan, Saharon Rosset, and Yaniv Erlich. 2012. “lobSTR: A Short
Tandem Repeat Profiler for Personal Genomes.” Genome Research 22 (6): 1154–62.
Gymrek, Melissa, Thomas Willems, Audrey Guilmatre, Haoyang Zeng, Barak Markus, Stoyan
Georgiev, Mark J. Daly, et al. 2016. “Abundant Contribution of Short Tandem Repeats to
Gene Expression Variation in Humans.” Nature Genetics 48 (1): 22–29.
Gymrek, Melissa, Thomas Willems, David Reich, and Yaniv Erlich. 2017. “Interpreting Short
Tandem Repeat Variations in Humans Using Mutational Constraint.” Nature Genetics .
https://doi.org/ 10.1038/ng.3952 .
Hajirasouliha, Iman, Fereydoun Hormozdiari, Can Alkan, Jeffrey M. Kidd, Inanc Birol, Evan E.
Eichler, and S. Cenk Sahinalp. 2010. “Detection and Characterization of Novel Sequence
Insertions Using Paired-End next-Generation Sequencing.” Bioinformatics 26 (10):
1277–83.
Hannan, Anthony J. 2010. “Tandem Repeat Polymorphisms: Modulators of Disease
Susceptibility and Candidates for ‘missing Heritability.’” Trends in Genetics .
https://doi.org/ 10.1016/j.tig.2009.11.008 .
Hannan, Anthony J. 2018. “Tandem Repeats Mediating Genetic Plasticity in Health and
Disease.” Nature Reviews. Genetics 19 (5): 286–98.
Hickey, Glenn, David Heller, Jean Monlong, Jonas A. Sibbesen, Jouni Sirén, Jordan Eizenga,
Eric T. Dawson, Erik Garrison, Adam M. Novak, and Benedict Paten. 2020. “Genotyping
Structural Variants in Pangenome Graphs Using the vg Toolkit.” Genome Biology 21 (1):
35.
Hollox, Edward J., Ulrike Huffmeier, Patrick L. J. Zeeuwen, Raquel Palla, Jesús Lascorz, Diana
Rodijk-Olthuis, Peter C. M. van de Kerkhof, et al. 2008. “Psoriasis Is Associated with
Increased β-Defensin Genomic Copy Number.” Nature Genetics .
79
https://doi.org/ 10.1038/ng.2007.48 .
Iqbal, Zamin, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. 2012. “De Novo
Assembly and Genotyping of Variants Using Colored de Bruijn Graphs.” Nature Genetics
44 (2): 226–32.
Iqbal, Zamin, Isaac Turner, and Gil McVean. 2013. “High-Throughput Microbial Population
Genomics Using the Cortex Variation Assembler.” Bioinformatics .
https://doi.org/ 10.1093/bioinformatics/bts673 .
Jakubosky, David, Matteo D’Antonio, Marc Jan Bonder, Craig Smail, Margaret K. R. Donovan,
William W. Young Greenwald, Hiroko Matsui, et al. 2020. “Properties of Structural Variants
and Short Tandem Repeats Associated with Gene Expression and Complex Traits.” Nature
Communications 11 (1): 1–15.
James, W. P. T. 2008. “The Epidemiology of Obesity: The Size of the Problem.” Journal of
Internal Medicine . https://doi.org/ 10.1111/j.1365-2796.2008.01922.x .
Jeffreys, A. J., K. Tamaki, A. MacLeod, D. G. Monckton, D. L. Neil, and J. A. Armour. 1994.
“Complex Gene Conversion Events in Germline Mutation at Human Minisatellites.” Nature
Genetics 6 (2): 136–45.
Jiang, Zhaoshi, Haixu Tang, Mario Ventura, Maria Francesca Cardone, Tomas Marques-Bonet,
Xinwei She, Pavel A. Pevzner, and Evan E. Eichler. 2007. “Ancestral Reconstruction of
Segmental Duplications Reveals Punctuated Cores of Human Genome Evolution.” Nature
Genetics 39 (11): 1361–68.
Jong, Simone de, Lewis R. Vidler, Younes Mokrab, David A. Collier, and Gerome Breen. 2016.
“Gene-Set Analysis Based on the Pharmacological Profiles of Drugs to Identify
Repurposing Opportunities in Schizophrenia.” Journal of Psychopharmacology 30 (8):
826–30.
Kerem, B., J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox, A. Chakravarti, M.
Buchwald, and L. C. Tsui. 1989. “Identification of the Cystic Fibrosis Gene: Genetic
Analysis.” Science 245 (4922): 1073–80.
Kirby, Andrew, Andreas Gnirke, David B. Jaffe, Veronika Barešová, Nathalie Pochet, Brendan
Blumenstiel, Chun Ye, et al. 2013. “Mutations Causing Medullary Cystic Kidney Disease
Type 1 Lie in a Large VNTR in MUC1 Missed by Massively Parallel Sequencing.” Nature
Genetics 45 (3): 299–303.
Klein, Robert J., Caroline Zeiss, Emily Y. Chew, Jen-Yue Tsai, Richard S. Sackler, Chad Haynes,
Alice K. Henning, et al. 2005. “Complement Factor H Polymorphism in Age-Related
Macular Degeneration.” Science 308 (5720): 385–89.
Kolmogorov, Mikhail, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner. 2019. “Assembly of Long,
Error-Prone Reads Using Repeat Graphs.” Nature Biotechnology 37 (5): 540–46.
Koolen, D. A., A. J. Sharp, J. A. Hurst, H. V. Firth, S. J. L. Knight, A. Goldenberg, P.
Saugier-Veber, et al. 2008. “Clinical and Molecular Delineation of the 17q21.31
80
Microdeletion Syndrome.” Journal of Medical Genetics 45 (11): 710–20.
Koren, Sergey, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and
Adam M. Phillippy. 2017. “Canu: Scalable and Accurate Long-Read Assembly via
Adaptive K-Mer Weighting and Repeat Separation.” Genome Research 27 (5): 722–36.
Kraft, H. G., S. Köchl, H. J. Menzel, C. Sandholzer, and G. Utermann. 1992. “The
Apolipoprotein (a) Gene: A Transcribed Hypervariable Locus Controlling Plasma
Lipoprotein (a) Concentration.” Human Genetics 90 (3): 220–30.
Kristina, Ibañez, James Polke, R. Tanner Hagelstrom, Egor Dolzhenko, Dorota Pasko, Ellen
Rachel Amy Thomas, and Louise Daugherty et al. 2022. “Whole Genome Sequencing for
the Diagnosis of Neurological Repeat Expansion Disorders in the UK: A Retrospective
Diagnostic Accuracy and Prospective Clinical Validation Study.” The Lancet Neurology 21
(3): 234–45.
Laakso, Markku, Johanna Kuusisto, Alena Stančáková, Teemu Kuulasmaa, Päivi Pajukanta,
Aldons J. Lusis, Francis S. Collins, Karen L. Mohlke, and Michael Boehnke. 2017. “The
Metabolic Syndrome in Men Study: A Resource for Studies of Metabolic and
Cardiovascular Diseases.” Journal of Lipid Research 58 (3): 481–93.
Landefeld, Clare C., Colin A. Hodgkinson, Primavera A. Spagnolo, Cheryl A. Marietta,
Pei-Hong Shen, Hui Sun, Zhifeng Zhou, Barbara K. Lipska, and David Goldman. 2018.
“Effects on Gene Expression and Behavior of Untagged Short Tandem Repeats: The Case
of Arginine Vasopressin Receptor 1a (AVPR1a) and Externalizing Behaviors.”
Translational Psychiatry 8 (1): 1–10.
LaPierre, Nathan, Kodi Taraszka, Helen Huang, Rosemary He, Farhad Hormozdiari, and Eleazar
Eskin. n.d. “Identifying Causal Variants by Fine Mapping Across Multiple Studies.”
https://doi.org/ 10.1101/2020.01.15.908517 .
Lappalainen, Tuuli, Michael Sammeth, Marc R. Friedländer, Peter A. C. ‘t Hoen, Jean Monlong,
Manuel A. Rivas, Mar Gonzàlez-Porta, et al. 2013. “Transcriptome and Genome
Sequencing Uncovers Functional Variation in Humans.” Nature 501 (7468): 506–11.
Lee, Mark N., Chun Ye, Alexandra-Chloé Villani, Towfique Raj, Weibo Li, Thomas M.
Eisenhaure, Selina H. Imboywa, et al. 2014. “Common Genetic Variants Modulate
Pathogen-Sensing Responses in Human Dendritic Cells.” Science 343 (6175): 1246980.
Li, Heng, Jonathan M. Bloom, Yossi Farjoun, Mark Fleharty, Laura Gauthier, Benjamin Neale,
and Daniel MacArthur. 2018. “A Synthetic-Diploid Benchmark for Accurate
Variant-Calling Evaluation.” Nature Methods 15 (8): 595–97.
Li, Heng, Xiaowen Feng, and Chong Chu. 2020. “The Design and Construction of Reference
Pangenome Graphs with Minigraph.” Genome Biology 21 (1): 265.
Linthorst, Jasper, Wim Meert, Matthew S. Hestand, Jonas Korlach, Joris Robert Vermeesch,
Marcel J. T. Reinders, and Henne Holstege. 2020. “Extreme Enrichment of
VNTR-Associated Polymorphicity in Human Subtelomeres: Genes with Most VNTRs Are
81
Predominantly Expressed in the Brain.” Translational Psychiatry 10 (1): 369.
López-Cortegano, Eugenio, and Armando Caballero. 2019. “Inferring the Nature of Missing
Heritability in Human Traits Using Data from the GWAS Catalog.” Genetics 212 (3):
891–904.
Lu, Tsung-Yu, and Mark J. P. Chaisson. 2021. “Profiling Variable-Number Tandem Repeat
Variation across Populations Using Repeat-Pangenome Graphs.” Nature Communications
12 (1): 1–12.
Lu, Tsung-Yu, and Mark J. P. Chaisson. 2022. “The Motif Composition of Variable-Number
Tandem Repeats Impacts Gene Expression.” BioRxiv , March.
https://doi.org/ 10.1101/2022.03.17.484784 .
MacDonald, M. E., C. M. Ambrose, M. P. Duyao, R. H. Myers, C. Lin, L. Srinidhi, and P. S. . . .
& Harper. 1993. “A Novel Gene Containing a Trinucleotide Repeat That Is Expanded and
Unstable on Huntington’s Disease Chromosomes.” Cell 72 (6): 971–83.
Mallick, Swapan, Heng Li, Mark Lipson, Iain Mathieson, Melissa Gymrek, Fernando Racimo,
Mengyao Zhao, et al. 2016. “The Simons Genome Diversity Project: 300 Genomes from
142 Diverse Populations.” Nature 538 (7624): 201–6.
Manolio, Teri A., Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff,
David J. Hunter, Mark I. McCarthy, et al. 2009. “Finding the Missing Heritability of
Complex Diseases.” Nature 461 (7265): 747–53.
Mitra, Ileena, Bonnie Huang, Nima Mousavi, Nichole Ma, Michael Lamkin, Richard Yanicky,
Sharona Shleizer-Burko, Kirk E. Lohmueller, and Melissa Gymrek. 2021. “Patterns of de
Novo Tandem Repeat Mutations and Their Role in Autism.” Nature 589 (7841): 246–50.
Moltke, Ida, Niels Grarup, Marit E. Jørgensen, Peter Bjerregaard, Jonas T. Treebak, Matteo
Fumagalli, Thorfinn S. Korneliussen, et al. 2014. “A Common Greenlandic TBC1D4
Variant Confers Muscle Insulin Resistance and Type 2 Diabetes.” Nature 512 (7513):
190–93.
Mousavi, Nima, Sharona Shleizer-Burko, Richard Yanicky, and Melissa Gymrek. 2019.
“Profiling the Genome-Wide Landscape of Tandem Repeat Expansions.” Nucleic Acids
Research 47 (15): e90.
Muggli, Martin D., Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert
Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. 2017. “Succinct Colored
de Bruijn Graphs.” Bioinformatics 33 (20): 3181–87.
Mukamel, Ronen E., Robert E. Handsaker, Maxwell A. Sherman, Alison R. Barton, Yiming
Zheng, Steven A. McCarroll, and Po-Ru Loh. 2021. “Protein-Coding Repeat
Polymorphisms Strongly Shape Diverse Human Phenotypes.” Science 373 (6562):
1499–1505.
Narzisi, Giuseppe, André Corvelo, Kanika Arora, Ewa A. Bergmann, Minita Shah, Rajeeva
Musunuri, Anne-Katrin Emde, Nicolas Robine, Vladimir Vacic, and Michael C. Zody. 2018.
82
“Genome-Wide Somatic Variant Calling Using Localized Colored de Bruijn Graphs.”
Communications Biology 1 (March): 20.
Novroski, Nicole M. M., Jonathan L. King, Jennifer D. Churchill, Lay Hong Seah, and Bruce
Budowle. 2016. “Characterization of Genetic Sequence Variation of 58 STR Loci in Four
Major Population Groups.” Forensic Science International. Genetics 25 (November):
214–26.
Paten, Benedict, Adam M. Novak, Jordan M. Eizenga, and Erik Garrison. 2017. “Genome
Graphs and the Evolution of Genome Inference.” Genome Research 27 (5): 665–76.
Pauling, L., and H. A. Itano. 1949. “Sickle Cell Anemia a Molecular Disease.” Science 110
(2865): 543–48.
Payer, Lindsay M., Jared P. Steranka, Wan Rou Yang, Maria Kryatova, Sibyl Medabalimi, Daniel
Ardeljan, Chunhong Liu, Jef D. Boeke, Dimitri Avramopoulos, and Kathleen H. Burns.
2017. “Structural Variants Caused by Insertions Are Associated with Risks for Many
Human Diseases.” Proceedings of the National Academy of Sciences of the United States of
America 114 (20): E3984–92.
Pevzner, Pavel A., Haixu Tang, and Glenn Tesler. 2004. “De Novo Repeat Classification and
Fragment Assembly.” Genome Research 14 (9): 1786–96.
Porubsky, David, Shilpa Garg, Ashley D. Sanders, Jan O. Korbel, Victor Guryev, Peter M.
Lansdorp, and Tobias Marschall. 2017. “Dense and Accurate Whole-Chromosome
Haplotyping of Individual Genomes.” Nature Communications 8 (1): 1293.
Porubsky, David, Human Genome Structural Variation Consortium, Peter Ebert, Peter A.
Audano, Mitchell R. Vollger, William T. Harvey, Pierre Marijon, et al. 2020. “Fully Phased
Human Genome Assembly without Parental Data Using Single-Cell Strand Sequencing and
Long Reads.” Nature Biotechnology . https://doi.org/ 10.1038/s41587-020-0719-5 .
Pritchard, Antonia L., Colin W. Pritchard, Peter Bentham, and Corinne L. Lendon. 2007. “Role
of Serotonin Transporter Polymorphisms in the Behavioural and Psychological Symptoms
in Probable Alzheimer Disease Patients.” Dementia and Geriatric Cognitive Disorders 24
(3): 201–6.
Rakocevic, Goran, Vladimir Semenyuk, Wan-Ping Lee, James Spencer, John Browning, Ivan J.
Johnson, Vladan Arsenijevic, et al. 2019. “Fast and Accurate Genomic Analyses Using
Genome Graphs.” Nature Genetics . https://doi.org/ 10.1038/s41588-018-0316-4 .
Raphael, Benjamin, Degui Zhi, Haixu Tang, and Pavel Pevzner. 2004. “A Novel Method for
Multiple Alignment of Sequences with Repeated and Shuffled Elements.” Genome
Research 14 (11): 2336–46.
Rautiainen, Mikko, Veli Mäkinen, and Tobias Marschall. 2019. “Bit-Parallel Sequence-to-Graph
Alignment.” Bioinformatics 35 (19): 3599–3607.
Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel
Andrews, Heike Fiegler, et al. 2006. “Global Variation in Copy Number in the Human
83
Genome.” Nature 444 (7118): 444–54.
Renton, Alan E., Elisa Majounie, Adrian Waite, Javier Simón-Sánchez, Sara Rollinson, J.
Raphael Gibbs, Jennifer C. Schymick, et al. 2011. “A Hexanucleotide Repeat Expansion in
C9ORF72 Is the Cause of Chromosome 9p21-Linked ALS-FTD.” Neuron 72 (2): 257–68.
Reynolds, Edwardo G. M., Catherine Neeley, Thomas J. Lopdell, Michael Keehan, Keren
Dittmer, Chad S. Harland, Christine Couldrey, et al. 2021. “Non-Additive Association
Analysis Using Proxy Phenotypes Identifies Novel Cattle Syndromes.” Nature Genetics 53
(7): 949–54.
Rundle, Andrew, Ana V . Diez Roux, Lance M. Freeman, Douglas Miller, Kathryn M.
Neckerman, and Christopher C. Weiss. 2007. “The Urban Built Environment and Obesity in
New York City: A Multilevel Analysis.” American Journal of Health Promotion .
https://doi.org/ 10.4278/0890-1171-21.4s.326 .
Saini, Shubham, Ileena Mitra, Nima Mousavi, Stephanie Feupe Fotsing, and Melissa Gymrek.
2018. “A Reference Haplotype Panel for Genome-Wide Imputation of Short Tandem
Repeats.” Nature Communications 9 (1): 4397.
Sanjak, Jaleal S., Anthony D. Long, and Kevin R. Thornton. 2017. “A Model of Compound
Heterozygous, Loss-of-Function Alleles Is Broadly Consistent with Observations from
Complex-Disease GWAS Datasets.” PLOS Genetics .
https://doi.org/ 10.1371/journal.pgen.1006573 .
Schizophrenia Working Group of the Psychiatric Genomics Consortium. 2014. “Biological
Insights from 108 Schizophrenia-Associated Genetic Loci.” Nature 511 (7510): 421–27.
Seabold, Skipper, and Josef Perktold. 2010. “Statsmodels: Econometric and Statistical Modeling
with Python.” In Proceedings of the 9th Python in Science Conference , 57:61. Austin, TX.
Sekar, Aswin, Allison R. Bialas, Heather de Rivera, Avery Davis, Timothy R. Hammond, Nolan
Kamitaki, Katherine Tooley, et al. 2016. “Schizophrenia Risk from Complex Variation of
Complement Component 4.” Nature 530 (7589): 177–83.
Seo, Jeong-Sun, Arang Rhie, Junsoo Kim, Sangjin Lee, Min-Hwan Sohn, Chang-Uk Kim, Alex
Hastie, et al. 2016. “De Novo Assembly and Phasing of a Korean Human Genome.” Nature
538 (7624): 243–47.
Shi, Lingling, Yunfei Guo, Chengliang Dong, John Huddleston, Hui Yang, Xiaolu Han, Aisi Fu,
et al. 2016. “Long-Read Sequencing and de Novo Assembly of a Chinese Genome.” Nature
Communications 7 (June): 12065.
Sirén, Jouni, Jean Monlong, Xian Chang, Adam M. Novak, Jordan M. Eizenga, Charles
Markello, Jonas Sibbesen, et al. 2020. “Genotyping Common, Large Structural Variations in
5,202 Genomes Using Pangenomes, the Giraffe Mapper, and the vg Toolkit.” Cold Spring
Harbor Laboratory . https://doi.org/ 10.1101/2020.12.04.412486 .
Sirén, Jouni, Jean Monlong, Xian Chang, Adam M. Novak, Jordan M. Eizenga, Charles
Markello, Jonas A. Sibbesen, et al. 2021. “Pangenomics Enables Genotyping of Known
84
Structural Variants in 5202 Diverse Genomes.” Science 374 (6574): abg8871.
Song, Janet H. T., Craig B. Lowe, and David M. Kingsley. 2018. “Characterization of a
Human-Specific Tandem Repeat Associated with Bipolar Disorder and Schizophrenia.”
American Journal of Human Genetics 103 (3): 421–30.
Stranneheim, Henrik, Kristina Lagerstedt-Robinson, Måns Magnusson, Malin Kvarnung, Daniel
Nilsson, Nicole Lesko, Martin Engvall, et al. 2021. “Integration of Whole Genome
Sequencing into a Healthcare Setting: High Diagnostic Rates across Multiple Clinical
Entities in 3219 Rare Disease Patients.” Genome Medicine 13 (1): 40.
Sudmant, Peter H., Swapan Mallick, Bradley J. Nelson, Fereydoun Hormozdiari, Niklas Krumm,
John Huddleston, Bradley P. Coe, et al. 2015. “Global Diversity, Population Stratification,
and Selection of Human Copy-Number Variation.” Science 349 (6253): aab3761.
Sudmant, Peter H., Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov,
John Huddleston, Yan Zhang, et al. 2015. “An Integrated Map of Structural Variation in
2,504 Human Genomes.” Nature 526 (7571): 75–81.
Sun, James H., Linda Zhou, Daniel J. Emerson, Sai A. Phyo, Titus Katelyn R., Wanfeng Gong,
and Thomas G. et al Gilgenast. 2018. “Disease-Associated Short Tandem Repeats
Co-Localize with Chromatin Domain Boundaries.” Cell 175 (1): 224–38.e15.
Taliun, Daniel, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, Zachary A. Szpiech,
Raul Torres, Sarah A. Gagliano Taliun, et al. 2021. “Sequencing of 53,831 Diverse
Genomes from the NHLBI TOPMed Program.” Nature 590 (7845): 290–99.
Tam, Vivian, Nikunj Patel, Michelle Turcotte, Yohan Bossé, Guillaume Paré, and David Meyre.
2019. “Benefits and Limitations of Genome-Wide Association Studies.” Nature Reviews.
Genetics 20 (8): 467–84.
Tang, Haibao, Ewen F. Kirkness, Christoph Lippert, William H. Biggs, Martin Fabani, Ernesto
Guzman, Smriti Ramakrishnan, et al. 2017. “Profiling of Short-Tandem-Repeat Disease
Alleles in 12,632 Human Whole Genomes.” American Journal of Human Genetics 101 (5):
700–715.
Tankard, Rick M., Mark F. Bennett, Peter Degorski, Martin B. Delatycki, Paul J. Lockhart, and
Melanie Bahlo. 2018. “Detecting Expansions of Tandem Repeats in Cohorts Sequenced
with Short-Read Sequencing Data.” American Journal of Human Genetics 103 (6): 858–73.
The International HapMap Consortium. 2005. “A Haplotype Map of the Human Genome.”
Nature 437 (7063): 1299–1320.
Trost, Brett, Worrawat Engchuan, Charlotte M. Nguyen, Bhooma Thiruvahindrapuram, Egor
Dolzhenko, Ian Backstrom, Mila Mirceta, et al. 2020. “Genome-Wide Detection of Tandem
DNA Repeats That Are Expanded in Autism.” Nature 586 (7827): 80–86.
Tsuge, Masataka, Ryuji Hamamoto, Fabio Pittella Silva, Yozo Ohnishi, Kazuaki Chayama,
Naoyuki Kamatani, Yoichi Furukawa, and Yusuke Nakamura. 2005. “A Variable Number of
Tandem Repeats Polymorphism in an E2F-1 Binding Element in the 5′ Flanking Region of
85
SMYD3 Is a Risk Factor for Human Cancers.” Nature Genetics 37 (10): 1104–7.
Viguera, E., D. Canceill, and S. D. Ehrlich. 2001. “Replication Slippage Involves DNA
Polymerase Pausing and Dissociation.” The EMBO Journal 20 (10): 2587–95.
Visscher, Peter M., Matthew A. Brown, Mark I. McCarthy, and Jian Yang. 2012. “Five Years of
GWAS Discovery.” The American Journal of Human Genetics .
https://doi.org/ 10.1016/j.ajhg.2011.11.029 .
V ollebregt, O., E. Koyama, C. C. Zai, S. A. Shaikh, A. J. Lisoway, J. L. Kennedy, and J. H.
Beitchman. 2021. “Evidence for Association of Vasopressin Receptor 1A Promoter Region
Repeat with Childhood Onset Aggression.” Journal of Psychiatric Research 140 (August):
522–28.
Wang, Gao, Abhishek Sarkar, Peter Carbonetto, and Matthew Stephens. 2020. “A Simple New
Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping.”
Journal of the Royal Statistical Society: Series B (Statistical Methodology) .
https://doi.org/ 10.1111/rssb.12388 .
Wellcome Trust Case Control Consortium. 2007. “Genome-Wide Association Study of 14,000
Cases of Seven Common Diseases and 3,000 Shared Controls.” Nature 447 (7145): 661–78.
Wellcome Trust Case Control Consortium, Australo-Anglo-American Spondylitis Consortium
(TASC), Paul R. Burton, David G. Clayton, Lon R. Cardon, Nick Craddock, Panos
Deloukas, et al. 2007. “Association Scan of 14,500 Nonsynonymous SNPs in Four Diseases
Identifies Autoimmunity Variants.” Nature Genetics 39 (11): 1329–37.
Willems, Thomas, Dina Zielinski, Jie Yuan, Assaf Gordon, Melissa Gymrek, and Yaniv Erlich.
2017. “Genome-Wide Profiling of Heritable and de Novo STR Variations.” Nature Methods
14 (6): 590–92.
Witoelar, Aree, Iris E. Jansen, Yunpeng Wang, Rahul S. Desikan, J. Raphael Gibbs, Cornelis
Blauwendraat, Wesley K. Thompson, et al. 2017. “Genome-Wide Pleiotropy Between
Parkinson Disease and Autoimmune Diseases.” JAMA Neurology 74 (7): 780–92.
Ye, Chun Jimmie, Jenny Chen, Alexandra-Chloé Villani, Rachel E. Gate, Meena Subramaniam,
Tushar Bhangale, Mark N. Lee, et al. 2018. “Genetic Analysis of Isoform Usage in the
Human Anti-Viral Response Reveals Influenza-Specific Regulation of Transcripts under
Balancing Selection.” Genome Research 28 (12): 1812–25.
Zook, Justin M., Nancy F. Hansen, Nathan D. Olson, Lesley Chapman, James C. Mullikin,
Chunlin Xiao, Stephen Sherry, et al. 2020. “A Robust Benchmark for Detection of Germline
Large Deletions and Insertions.” Nature Biotechnology , June.
https://doi.org/ 10.1038/s41587-020-0538-8 .
86
Appendix A
Supplementary Information for Chapter 2
A.1 Supplemental Figures
Figure A.1: Comparison between the tandem repeat database in GangSTR and this work.
a, Size distribution of the TRs annotated in each study. TRs with size greater than 150 bp in at
least one assembly and with size greater than 50 bp in hg38 are annotated in this study. Tandem
repeat sizes above 1000 bp, above 50 bp, and below 50 bp are not shown for this study (left),
GangSTR (middle) and comparison (right), respectively. b, Percentage of overlapping TRs
between databases. The number of overlapping loci changes across databases since multiple loci
in GangSTR’s database could correspond to only one locus in our database. Source data are
provided as a Source Data file.
87
Figure A.2: An example of multiple STR annotations within a VNTR.
Dot plot was generated using exact matching between 9-mers along chr1:861277-862683.
Annotations of four STRs (red box; chr1:861863-861874, chr1:862001-862016,
chr1:862077-862088 and chr1:862133-862144) and one VNTR (blue box; chr1:861777-862183;
before boundary expansion) are highlighted.
88
Figure A.3: An example VNTR annotation split by adVNTR-NN.
Dot plots of VNTR sequences at chr14:104941587-104953440 from four assemblies and
GRCh38 are shown. Note that this region is split into 39 sub-regions in adVNTR-NN with an
average VNTR size of 54 bp.
89
Figure A.4: Completeness of VNTR annotations in individual genomes.
90
X-axis indicates relative genomic order of each missing VNTR locus which is marked by a blue
stripe. A locus is called missing if both VNTR haplotypes in the genome are missing. Percentage
of missing loci is the number of missing loci divided by 32,138, the total number of loci
annotated.
Figure A.5: Classes of VNTRs removed by alignment quality filtering.
Annotations of VNTR classes are retrieved from the RepeatMasker track in UCSC Genome
Browser. VNTRs that span multiple repeat annotations will be counted once for each class.
Repeat classes are shown only for those with at least 200 repeats called. Class “other” indicates
repeats not annotated in the RepeatMasker track. Labels on x-axis are sorted by the number of
removed (top) or retained (bottom) loci. Source data are provided as a Source Data file.
91
Figure A.6: LSB at repetitive regions.
The distribution of biases (n=32,138) at the 32,138 genotyped loci are shown for each sample.
The box within each density estimate spans from the lower quartile to the upper quartile, with the
white dot indicating the median. Whiskers extend to points that are within 1.5 interquartile range
(IQR) from the upper or the lower quartiles. Samples are retrieved from HGSVC, Human
Genome Structural Variation Consortium datasets; 1KGP, 1000 Genomes Project datasets;
WUDP, Washington University Diversity Project datasets; and IDV, individual studies. Source
data are provided as a Source Data file.
92
Figure A.7: LSB at non-repetitive regions of all genotyped samples.
Principal component analysis was done on a N × L matrix, where N is the number of samples,
and L is the number of unique regions. Each row of the matrix is a vector of LSB in 397 unique
regions from a single sample. Each sample is a tuple of (genome, sequencing run). Samples are
retrieved from HGSVC, Human Genome Structural Variation Consortium datasets; 1KGP, 1000
Genomes Project datasets; WUDP, Washington University Diversity Project datasets; and IDV,
individual studies; asterisks indicate samples with haplotype-resolved assemblies available.
Source data are provided as a Source Data file.
93
Figure A.8: LSB at non-repetitive regions preserves the relation between samples at repetitive
regions.
a b
Principal component analysis of LSB in VNTR ( a ) and unique ( b ) regions. PCA was done on an
N × L matrix, where N is the number of samples, and L is the number of VNTR loci. Each row of
the matrix is a vector of LSBs in 32,138 VNTR regions from a single sample. Each sample is a
tuple of (genome, sequencing run). Samples are retrieved from HGSVC, Human Genome
Structural Variation Consortium datasets; 1KGP, 1000 Genomes Project datasets; WUDP,
Washington University Diversity Project datasets; and IDV, individual studies. Source data are
provided as a Source Data file.
94
Figure A.9: Nearest neighbor search for LSB at VNTR regions using LSB at nonrepetitive
regions as a proxy.
The heat map shows the mean absolute error between each pair of LSB at VNTR regions. For the
sample denoted in each column, each cross indicates the nearest neighbor for that sample based
on the LSB in nonrepetitive regions. HGSVC, Human Genome Structural Variation Consortium
datasets; 1KGP, 1000 Genomes Project datasets; WUDP, Washington University Diversity
Project datasets; IDV, individual studies. Source data are provided as a Source Data file.
95
Figure A.10: Profile of prediction accuracy for each sample.
True and predicted lengths are plotted against each other for each sample. Each subtitle shows
the sample name followed by its nearest sample, mean absolute percentage error and the number
of loci. Loci not annotated in either the sample or its nearest sample are considered missing in
the prediction step. The red dotted line shows where 100% accuracy lies. Source data are
provided as a Source Data file.
96
Figure A.11: Performance of per-locus length prediction accuracy relative to GRCh38.
a, Fraction of loci with improved accuracy in each genome. b, Distribution of per-locus
accuracy. Loci with MAPE greater than 1.0 are not shown. c, Per-locus MAPE of pangenome
graphs versus hg38 graphs. Accuracy is measured by the mean absolute percentage error
(MAPE) in VNTR lengths across all genomes ( b-c ). Source data are provided as a Source Data
file.
Figure A.12: Correlation between the estimation error in VNTR length and in LSB.
Estimation error in length was computed using absolute percentage error, i.e. |1− gt/est |, where gt
is the length in assembly and est is the length estimated from leave-one-out analysis. Similarly,
estimation error in LSB was computed as |1− gt/est |, where gt is the ground truth of the LSB for
the VNTR locus (Methods) and est is the estimated LSB from the nearest neighbor (Methods).
Data points were accumulated from 32,138 VNTR loci across 16 genomes. Source data are
provided as a Source Data file.
97
Figure A.13: Example of deviation in LSB across samples.
An example locus with high alignment quality but low concordance in LSB between samples.
NA19238 has the most similar LSB to HG00731 based on the estimation from 397 control
regions and is used to estimate the LSB of this VNTR locus in HG00731. Length prediction error
is measured with mean absolute percentage error (MAPE). Source data are provided as a Source
Data file.
Figure A.14: Distribution of length estimation error for loci with or without a missing haplotype.
Density curves were accumulated from 32,138 VNTR loci across 16 genomes and each
normalized with area 1. Source data are provided as a Source Data file.
98
Figure A.15: Correlation between length estimation error and fraction of novel k-mers.
Fraction of novel k -mers for each locus in each genome was computed as the percentage of
k -mers missing from the leave-one-out locus-RPGG. Data points were accumulated from 32,138
VNTR loci across 16 genomes. The P-value was derived from two-sided t test. Source data are
provided as a Source Data file.
Figure A.16: Relationship between GC content and length prediction error.
GC contents of the 32,138 VNTRs were measured on GRCh38 using bedtools nuc. Length
prediction errors were measured using mean absolute percentage error in the leave-one-out
analysis. The r squared, effect size and P-value (two-sided t test) for GC<0.5 (left) and GC>0.5
99
(right) are shown in the titles. Source data are provided as a Source Data file.
Figure A.17: Effect of GC content change on bias and length estimation.
Left panel: The correlation between GC content and LSB in VNTR regions. Middle & right
panels: Correlation between GC content change and length estimation error. GC content change
(delta GC%) was computed from the VNTR sequence of a locus and the sequence of its nearest
neighbor (same locus in another genome) in leave-one-out analysis. The analysis was restricted
to HGSVC samples (HG00514, HG00733 and NA19240 trios). P-values were derived from
two-sided t test. Source data are provided as a Source Data file.
100
Figure A.18: Examples of unstable loci with individuals > 10 standard deviations above the
mean.
Swarm plots demonstrating highly unstable loci, determined as having an individual with
coverage at least ten standard deviations above the mean. The locus on the left overlaps KCNA2 ,
and the locus on the right overlaps GRM4 .
101
Figure A.19: Null and observed distributions of kmc
d
and r
d
2
between the EAS and AFR
populations.
The Null distribution of difference in the count of the most informative k -mer (mi-kmc, left) and
difference in variance explained by the most informative k -mer ( r
2
, right) at each locus was
simulated using bootstrap from the EAS population with sample size matching the sum of both
samples ( N
EAS
=502, N
AFR
=661). Observed values within the two-tailed P<0.01 regions were
called significant, with cutoff=(-4.702×10
-2
, 4.834×10
-2
) and (-1.028×10
-1
, 1.039×10
-1
) for
mi-kmc and r
2
, respectively. Source data are provided as a Source Data file.
102
Figure A.20: Distance of TRs and eTRs to telomere.
Telomere annotations were retrieved from UCSC Genome Browser and used to find the distance
of a tandem repeats to its closest telomere. Distribution (left) and the q-q plot (right) of the
statistics from TRs and eTRs were compared. Source data are provided as a Source Data file.
103
Figure A.21: Association between the top 50 pairs of eVNTR and eGene.
104
105
106
107
108
109
110
111
Plots are shown in order of q-value. The format of plot titles is tissue, VNTR_region,
gene_name, nominal_p_val and effect_size. The linear fit is shown as a dashed red line. Nominal
P-values were derived from two-sided t tests.
112
Figure A.22: Conditional association of chr5:96896863-96896963 VNTR with ERAP2
expression over chr5_96916885_T_C_b38.
Marginal association between VNTR and expression was performed by subsetting on samples
with the indicated genotype (subtitle) at the SNP site. The effect size ( b ) and P-value ( P ) for each
association test (two-sided t test) was shown in each subpanel. The red dashed line indicates the
regression line. HOM_REF, homozygous reference; HET, heterozygous; HOM_HET,
homozygous alternative.
Figure A.23: Linkage disequilibrium (LD) between chr5:96896863-96896963 VNTR and nearby
SNPs.
The LD between the VNTR and each nearby SNP was computed as the r
2
between genotype
values. The y-axis indicates the association P-value (two-sided t test) with ERAP2 expression
level. The location of VNTR (blue asterisk) and ERAP2 gene (blue line) are highlighted.
113
Figure A.24: Spurious alignment of Illumina reads to GRCh38 at a VNTR locus.
Alignment of Illumina datasets at 60x coverage from the HG00514 individual to
chr1:1075852-1079425 of hg38 is visualized by Integrative Genomics Viewer (IGV)
114
Figure A.25: Boundary expansion recovers the proper boundary of VNTR alleles.
For every two VNTR alleles, the boundary expansion algorithm operates in three steps:
individual expansion, joint expansion and quality check (Methods). The red boxes indicate the
regions where k -mer matching is subject to inspection. Any matches (red dots) occurring outside
of the central red box indicate the presence of shared k -mers between the VNTR and the flanking
sequence.
115
Figure A.26: Distribution of number of genes overlapping shuffled high V
ST
loci.
The frequency for 10,000 iterations of the number of genes overlapping high V
ST
loci that are
shuffled across the euchromatic genome. High V
ST
are defined by a minimal number of standard
deviations above the mean (3-5) (N=785, 470, and 235). The number of genes overlapping high
V
ST
loci in the original dataset are shown by the full-height vertical lines.
116
Figure A.27: Distribution of genes and UTR regions overlapping shuffled unstable loci.
The number of genes overlapping VNTRs defined as unstable with different cutoff values: at
least one individual with dosage > 6 standard deviations above the mean (N=19), and with > 10
standard deviations above the mean (N=2). The number of genes/UTRs overlapping unstable loci
in the original dataset are shown by the full-height vertical lines.
117
Figure A.28: Number of eVNTRs shared between or specific to each tissue.
eQTL discoveries for the 32,138 VNTR loci were controlled at 5% FDR.
118
Figure A.29: Length distribution of VNTRs and eVNTRs.
Length distribution of eVNTRs and VNTRs. eQTL discoveries for the 32,138 VNTR loci were
controlled at 5% FDR. Source data are provided as a Source Data file.
119
Figure A.30: Sample QC on VNTR genotypes of the 1000 Genomes.
a, Joint PCA plot of samples using the k -mer dosage adjusted by coverage. b-c, Outlier
detection, shown in gray, using DBSCAN with eps=0.5 on male (b) and female individuals (c).
d-e Joint PCA plot of samples using the LSBs from 397 control regions (c) and the outliers
detected using DBSCAN with eps=0.5 (d).
120
Figure A.31: Sample QC on VNTR genotypes the GTEx Genomes.
a-b Joint PCA plot of samples using the k -mer dosage adjusted by coverage and allelic dosage.
(a) and the outliers detected, shown in gray, using DBSCAN with eps=0.5 (b). c-d Joint PCA
plot of samples using the LSBs from 397 control regions (c) and the outliers detected using
DBSCAN with eps=0.3 (d).
121
Figure A.32: Growth of relative VNTR-graph size.
The growth curve ( a ) and the distribution of graph size ( b ) if adding genomes in an incremental
manner are shown for the 32,138 VNTR loci. Relative graph size is the ratio between the number
of nodes, or k -mers, in the RPGG and the median number of nodes in a single genome. Source
data are provided as a Source Data file.
Figure A.33: Example of under-alignment of orthologous VNTR sequences by pggb.
(Top) The multiple sequence alignment result of pggb for 34 VNTR haplotypes at
chr12:37898555-37928455 plus 700 bp flanking sequences on each side. (Bottom) The dot plots
of all haplotypes against GRCh38.
122
Figure A.34: Misalignment of simulated VNTR reads by bwa.
(Left) Number of misaligned VNTR reads averaged across samples. (Right) Fraction of
misaligned reads averaged across samples. 32,138 VNTR loci over six genomes, including
HG00512, HG00513, HG00731, HG00732, NA19238 and NA19239 were included in this
experiment. Loci without misalignments are not shown for clarity. 30x error-free paired-end
reads were simulated from the six genomes and each mapped to GRCh38+ALT+decoy+HLA
(the hs38DH in bwa) using bwa-mem2 to follow the alignment procedures in the 1KGP and the
GTEx project. We define that a read is misaligned if its location is beyond 1 kbp to the boundary
of its original VNTR locus. Source data are provided as a Source Data file.
123
Figure A.35: Misalignment of VNTR reads to GRCh38 rescued by danbing-tk.
Read pairs misaligned by bwa were extracted and aligned to RPGGs using danbing-tk. A
misalignment is called if the distance of any end of the read pair to its original VNTR locus is
greater than the threshold. Options “-thcth 50 -cth 45 -rth 0.5” were used for danbing-tk align,
same as the setting for genotyping the 1000 and the GTEx genomes. Source data are provided as
a Source Data file.
Figure A.36: Relationship between VNTR length and prediction error.
124
VNTR lengths of 32,138 loci were averaged across 19 genomes. Length prediction errors were
measured using mean absolute percentage error in the leave-one-out analysis. The r squared,
effect size and P-value (two-sided t test) are shown in the title. Source data are provided as a
Source Data file.
Figure A.37: Relationship between eVNTR P-value and prediction error.
Nominal P-values (two-sided t test) of eVNTRs were Bonferroni-corrected. Length prediction
errors were measured using mean absolute percentage error in the leave-one-out analysis. Source
data are provided as a Source Data file.
125
Figure A.38: Comparing the alignment accuracy with and without threading.
Paired-end 150 bp reads were simulated with or without SNVs and mapped to unpruned RPGG.
A read is considered correctly mapped if its VNTR k -mers are assigned to the correct VNTR
locus. Each curve is parameterized by percent identity threshold (linspace distributed between
35% and 90%). For runs with threading enabled (solid lines in both panels), cth was set to 30,
and four nucleotide corrections were allowed. TPR, true positive rate; FPR, false positive rate.
Source data are provided as a Source Data file.
126
Figure A.39: Replication of V st on the 698 genomes related to the 1KGP samples.
The 2,504 1KGP samples were retrieved from ENA project PRJEB31736. The 698 genomes
were retrieved from ENA project PRJEB36890. V st was computed over the 32,138 VNTR loci
using the total kmer dosage as proxy for length. The P-value was derived from two-sided t test.
Source data are provided as a Source Data file.
127
Figure A.40: Incremental RPGG construction and change in boundary annotations.
Left panel: Distribution of boundary change relative to the previous iteration of RPGG
construction. Right panel: Number of loci with expansion size passing each threshold (legend) in
each iteration. Δboundary is computed by summing the change in boundaries relative to the
previous iteration and dividing the value by the number of supporting haplotypes. Boundary
expansion was applied to the initial set of 84,411 loci annotated using TRF. Source data are
provided as a Source Data file.
128
A.2 Supplemental Tables
Table A.1: Initial VNTR discoveries
AK1 HG00514 HG00733 NA19240 NA24385 Pangenome
TRF 137,939 138,328 144,364 143,315 127,156 -
Boundary
expansion
54,870 57,505 64,711 65,027 53,867 -
Merging 84,411
Table A.2: False mapping of reads by danbing-tk over the initial 73,582 loci.
FP from untracked regions Inter-locus FP Total FP FN*** Union of loci
HG00512 2,407 329 2,465 2,690 4,705
HG00513 2,574 336 2,643 2,614 4,827
HG00731 2,540 330 2,595 2,328 4,500
HG00732 2,781 320 2,841 3,113 5,476
NA19238 2,678 340 2,744 3,054 5,282
NA91239 2,452 342 2,520 2,832 4,855
Union of loci 5,919 497 5,999 9,525 13,800
Fraction of loci* 8.04% 0.68% 8.15% 12.94% 18.75%
Fraction removed** 71.50% 95.20% 71.60% 84.70% 78.10%
* Union of loci divided by 73,582.
** Fraction of loci in the union set with genotyping quality r2<0.96.
*** Unaligned reads due to graph pruning of nodes not supported by short reads.
129
Table A.3: eVNTRs discovered in this work that overlap with other studies
Case 1 2
This work
eVNTR.chrom chr16 chr16
eVNTR.start 89429084 69325358
eVNTR.end 89430599 69325494
eVNTR.length 1515 136
eVNTR.eGene(s) RP11-104N10.2
PDF,SNTB2,TERF2,NIP
7
eVNTR.beta(s) -0.24
2.85E-14,1.66E-11,5.51E
-10,2.74E-08
Fotsing et al. 2019
eSTR.chrom chr16,chr16
eSTR.start 89429890,89430476
eSTR.end 89429901,89430493
eSTR.length 11,17
eSTR.eGene(s) ANKRD11,ANKRD11
eSTR.beta(s) -0.194,0.270
number of overlapping eGenes(s) 0
Bakhtiari et al. 2018
eVNTR.chrom chr16
eVNTR.start 69325359
eVNTR.end 69325495
eVNTR.length 136
eVNTR.eGene(s) VPS4A
eVNTR.pval(s) 5.43E-05
number of overlapping eGenes(s) 0
130
Table A.4: Data source
Genome
Long read
sequencing
Short read sequencing Assembly
AK1 N/A SRR3602738 , SRR3602759 GCA_002009925.1
HG00268 SRX4382104 ERR251041 , ERR251042 danbing-tk
HG00512 IGSR LRS PRJEB9396 IGSR asm
HG00513 IGSR LRS PRJEB9396 IGSR asm
HG00514 PRJNA300843 PRJEB9396 danbing-tk
HG00731 IGSR LRS PRJEB9396 IGSR asm
HG00732 IGSR LRS PRJEB9396 IGSR asm
HG00733 IGSR LRS PRJEB9396 danbing-tk
HG01352 SRX2095531 SRR5571302, SRR5571303, SRR5571304, SRR5571305 danbing-tk
HG02059 SRX2537696 SRR5571333, SRR5571336, SRR5571337, SRR5571338 danbing-tk
HG02106 SRX4385796 Nationwide
1
danbing-tk
HG02818 SRX3203304 SRR5571310, SRR5571311 , SRR5571338 danbing-tk
HG04217 SRX4406292 ERR3239756 , Nationwide
1
danbing-tk
NA12878 SRX1837653 SRR3397076 danbing-tk
NA19238 IGSR LRS PRJEB9396 IGSR asm
NA19239 IGSR LRS PRJEB9396 IGSR asm
NA19240 IGSR LRS PRJEB9396 danbing-tk
NA19434 SRX4118367 SRR5571360, SRR5571361 danbing-tk
NA24385 PacBio
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA
24385_son/NIST_Illumina_2x250bps/reads/
danbing-tk
1
Nationwide sequences are available through James.Fitch@NationwideChildrens.org
131
Table A.5: Augmenting database with disease-related tandem repeats
Chr Start End
Associated
gene
Associated disease Motif Type
Alternative
name
chr12 2255791 2256090 CACNA1C
bipolar
schizophrenia
(GACCCTGACCTGACTAGTT
TACAATCACAC)n
intron
chr12 63149772 63149849 A VPR1A
externalizing
behavior
(GA)n(GT)n(A)n intron A VR
chr12 63153304 63153366 A VPR1A
externalizing
behavior
(GATA)n 5UTR RS1
chr12 63156354 63156429 A VPR1A
externalizing
behavior
(CT)nTT(CT)n(GT)n 5UTR RS3
chr3 129172568 129172736 CNBP
myotonic
dystrophy 2
(CCTG)n intron
chr9 27573485 27573546 C9ORF72
amyotrophic lateral
sclerosis
(GGGGCC)n intron
Table A.6: Comparison of alignment statistics between danbing-tk and GraphAligner.
danbing-tk GraphAligner
Read pairs mapped 258516 (99.96%) 247930 (95.9%)
Read pairs correctly mapped 257638 (99.62%) 211919 (81.9%)
Read pairs mismapped 878 (0.34%) 532 (0.21%)
Read pairs with low identity in at least one end 0 (0%) 27259 (10.5%)
Read pairs split 0 (0%) 8220 (3.2%)
Singletons 0 (0%) 8629 (3.3%)
Loci with correct read pairs 28468 (98.5%) 28405 (98.3%)
132
Table A.7: Realignment statistics of misaligned VNTR reads from bwa.
Threshold* 1000 2000 3000 5000 10000 20000 50000
Genome N1** N2*** N1 N2 N1 N2 N1 N2 N1 N2 N1 N2 N1 N2
HG00512 766 64064 766 62406 766 61890 766 61505 766 61214 766 61014 718 60570
HG00513 709 63473 696 61890 690 61307 690 60869 690 60578 690 60399 690 60020
HG00731 644 64076 637 62518 637 62017 637 61701 637 61413 637 61220 634 60814
HG00732 805 62659 805 61248 805 60732 805 60347 805 60064 805 59901 793 59533
NA19238 1066 66977 1066 65524 1066 64917 1066 64568 1057 64282 1057 64105 1022 63558
NA19239 685 66077 683 64563 683 63955 683 63640 683 63353 683 63111 575 62660
*The minimum to call misalignment for a read, i.e. the distance between actual read interval and aligned read interval
**Number of read pairs not rescued by danbing-tk
***Number of read pairs misaligned by bwa
Appendix B
Supplementary Information for Chapter 3
B.1 Supplemental Figures
Figure B.1: Distribution of VNTR size in GRCh38 and in the 70 HGSVC assemblies.
VNTR sizes greater than 3000 bp are not shown, contributing to 3.5% (2,814/80,158) of loci.
133
Figure B.2: Distribution of VNTR size versus allele properties. Private alleles are those observed
only once across the 70 haplotypes.
VNTR sizes greater than 3000 bp are not shown in both subpanels, contributing to 3.5%
(2810/80,158) of loci. Density greater than the limit shown in the color bar is clipped.
134
Figure B.3: Distribution of aln-r
2
in this work and Lu 2021.
a, Comparing the distribution of aln- r
2
in both studies. b, Enrichment of high aln- r
2
loci in this
work. Contingency tables with two variables, “≥ aln- r
2
cutoff” and “is from Lu 2021”, were
constructed to compute odds ratio.
Figure B.4: Distribution of eVNTR discoveries from Lu 2021 (Lu and Chaisson 2021) missing in
this work.
a, Distribution of the eVNTR P-values from Lu 2021. The x-axis denotes whether the
discoveries are missing in this work. b, Enrichment of missing eVNTRs with lower significance.
Contingency tables with two variables, “< P
eQTL
cutoff” and “is missing”, were constructed.
Significance was computed with one-sided Fisher’s exact test.
135
Figure B.5: Size of locus-RPGG before and after compaction.
The number of nodes before compaction (y-axis) and the number of paths after compaction
(x-axis) were compared for all 80,518 loci in the RPGG, both illustrated on a log scale.
Figure B.6: Distribution of the standard deviation of absolute percentage error for all motifs.
Ground truth is measured from motif count in assemblies, and absolute percentage error is
measured by regression from read depth on edges (motif) to assembly count. The spike at std=0
corresponds to motifs with only one observation across all haplotypes.
136
Figure B.7: Distribution of the number of outlier samples across all motifs.
For each motif, samples with motif dosage greater than two standard deviations away from the
mean were removed. Data points were collected from the final set of 1,663,559 motifs and 838
GTEx samples.
Figure B.8: Distribution of eMotif P-values from Geuvadis missing in GTEx.
a, Distribution of the eMotif P-values from Geuvadis. The x-axis denotes whether the eMotifs
are missing from the GTEx discoveries. b, Enrichment of missing eMotifs with lower
significance. Contingency tables with two variables, “< P
eQTL
cutoff” and “is missing”, were
constructed. Significance was computed with one-sided Fisher’s exact test.
137
Figure B.9: Dot plot analysis of AVPR1A RS3 VNTR.
GRCh38 sequence at chr12:63,156,354-63,156,429 was plotted against itself. Each dot
represents an exact matching of a 13-mer. The boundaries of 700 bp flanking sequences were
indicated with red lines.
138
Figure B.10: Frequency of CACNA1C risk motifs across the 35 HGSVC assemblies.
The occurrence of risk motif 1 CAACCACACGATCCTGACCTT and risk motif 2
CCTGACCTTACTAGTTTACGA were counted considering both the motif and its reverse
complement. Bin sizes for both marginal distributions are one.
139
Figure B.11: Number of eGenes with likely causal eMotifs.
Each eGene reported by eQTL mapping was fine-mapped using all SNPs and eMotifs among the
100 kb window. eGenes of which the highest eMotif posterior inclusion probability (PIP) was
greater than 0.8 were reported as containing likely causal eMotifs. An eGene was considered
shared across tissues (magenta) if it was reported in at least one other tissue.
140
Figure B.12: Number of eVNTRs with likely causal eMotifs.
Each eGene reported by eQTL mapping was fine-mapped using all SNPs and eMotifs among the
100 kb window. eGenes of which the highest eMotif posterior inclusion probability (PIP) was
greater than 0.8 were reported as containing likely causal eMotifs. An eVNTR was considered
shared across tissues (magenta) if any eMotif in the VNTR was reported in at least one other
tissue.
141
Figure B.13: Number of likely causal eMotifs.
Each eGene reported by eQTL mapping was fine-mapped using all SNPs and eMotifs among the
100 kb window. eGenes of which the highest eMotif posterior inclusion probability (PIP) is
greater than 0.8 were reported as containing likely causal eMotifs. An eMotif was considered
shared across tissues (magenta) if it was reported in at least one other tissue.
142
Figure B.14: Frequency of likely causal eMotifs across the 35 HGSVC assemblies.
a, The x-axis denotes the dosage of
CGTAGGGCGAGCCTGGGTGACGAGATTCAGAGCGTTAGACG at
chr21:44,909,466-44,909,642. The y-axis denotest the dosage of
ACCCTGGATGCCTGTGGGCTGCCTTCCTCACC at chr21:44,928,811-44,929,048. b, The
x-axis denotes the dosage of
AGTCCCAGCTACTCGGGAGGTTGAGGCAGGAGAATGGCGTG at
chr1:161,538,348-161,538,697. The y-axis denotes the dosage of
TACTTGGTGACATGATTGTGAGAATAAGCTCTGGCGA at chr1:161,545,578-161,545,729.
c, The x-axis denotes the dosage of GGAGAAGAAGAAGAAGAAGAA at
chr17:46,265,245-46,265,480. The y-axis denotest the dosage of
AAGTAGCTGGGATTACAGGCGCATGCCAGCATGCCCGGCTTATTTTTCTATTTTTAGTA
GAGACGGGG at chr17:46,217,604-46,217,944. The occurrence of each motif was counted
considering both the motif and its reverse complement. Bin sizes for all marginal distributions
are one.
143
B.2 Supplemental Tables
Table B.1: Additional disease-relevant tandem repeats included in our reference set.
Gene
Associated/Causal
TR Region
Disease Note
ACAN chr15:88855423-88857301 OSTD coding
AR chrX:67545317-67545419 SBMA coding
ATN1 chr12:6936717-6936775 DRPLA coding
ATXN1 chr6:16327634-16327724 SCA1 coding
ATXN2 chr12:111598950-111599019 SCA2 coding
ATXN3 chr14:92071011-92071052 SCA3 coding
ATXN7 chr3:63912685-63912716 SCA7 coding
ATXN8 chr13:70,139,352-70,139,429 SCA8 noncoding
A VPR1A chr12:63149772-63149849 EXTB intron_A VR
A VPR1A chr12:63153304-63153366 EXTB 5UTR_RS1
A VPR1A chr12:63156354-63156429 EXTB 5UTR_RS3
C9orf72 chr9:27573485-27573546 ALS intron 1
CACNA1A chr19:13207859-13207898 SCA6 coding
CACNA1C chr12:2255791-2256090 bipolar, SCZ intron 3
CEL chr9:133071170-133071694 Monogenic diabetes coding
CNBP chr3:129172568-129172736 DM2 intron 1
CSTB chr21:43776293-43776322 PMP1A 5' UTR
DMPK chr19:45770205-45770266 DM1 3' UTR
DRD4 chr11:639989-640194 OCD, ADHD coding
FMR1 chrX:147912037-147912111 FXS, FXTAS 5' UTR
FXN chr9:69087917-69087957 FA noncoding
GP1BA chr17:4933823-4933963 ATF in stroke coding
HIC1 chr17:2052378-2053079 MCC promoter
HTT chr4:3074877-3074940 HD coding
IL1RN chr2:113130529-113130872 stroke, CAD intron
INS chr11:2161570-2161976 T1D, T2D, Obesity promoter
JPH3 chr16:87604283-87604329 HDL2 intron
MAOA chrX:43655101-43655205 bipolar promoter
MMP9 chr20:46013363-46013426 Kawasaki coding
MUC1 chr1:155188676-155192051 MCKD1 coding
MUC21 chr6:30986289-30987497 DPB coding
NACA chr12:56717496-56718739 atrial fibrillation coding
PER3 chr1:7829888-7830126 bipolar coding
PPP2R2B chr5:146878728-146878759 SCA12 noncoding
SLC6A3 chr5:1393582-1393985 ADHD, PK 3' UTR
144
SLC6A3 chr5:1414387-1414518 ADHD, PK intron 8
SLC6A4 chr17:30221385-30221590 BPSD, AZ intron
SLC6A4 chr17:30237128-30237450 OCD, anxiety, SCZ promoter
TBP chr6:170561907-170562017 SCA17 coding
TCHH chr1:152111363-152112285 Male pattern baldness score coding
ALS, amyotrophic lateral sclerosis. DM, myotonic dystrophy. DRPLA, dentatorubral-pallidoluysian atrophy. FTD, frontotemporal dementia.
FXS, fragile X syndrome. FXTAS, fragile X tremor-ataxia syndrome. HD, Huntington disease. HDL2, Huntington disease-like 2. SBMA,
spinobulbar muscular atrophy. SCA, spinocerebellar ataxia. SCZ, schizophrenia. FA, Friedreich ataxia. DPB, diffuse panbronchiolitis . AZ,
Alzheimer’s disease. MCC, metastatic colorectal cancer. PMP1A, progressive myoclonic epilepsy 1A. PK, Parkinson’s disease. OSTD,
osteochondritis dissecans. EXTB, externalizing behaviors. CAD, coronary artery disease.
145
Table B.2: eMotif discoveries for disease-relevant genes.
eGene eGene:eVNTR pair
Gene Rep.
1
# of
tissues
# of
eVNTRs
Rep.
1
# of
tissues
Tissue
matching
disease
ontology
eMotif ( positive / negative effect)
ACAN v 2 1
AR v 1 1
ATN1 v 29 3 v 23 v CAGCAGCAGCAGCAGCAGCAGCA
ATXN1 v 1 1
ATXN2 v 4 4
ATXN3 v 21 11 v 15 v
AGCAGCAGCAGCAGCAGCAGGG,
AGTTTGACACCAGCCTGGACAACACGGTGAA,
CACTGCAATCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGC,
CATCGCCTGGCTAATTTTTTGTATTTTTAGTAGAG,
CCCTGTCTCTACTAAAAATACAAAA,
CCTCTTTAATCATATTAAGACTCTTAAGTAAATTTGTAATC,
CGGGAAGCGAAGGTTGCAGTGAGCTGAGATTGCACCATTGC,
CTGACCTCAGGTGATCCATCAGCCTTGGCCTCCCAAAGTGC,
GCAAAAATCACATGGAGCTCGTATGTCAGATAAAGTGTGAA,
GCAAAAATCACATGGAGCTCTTATGTCAGATAAAGTGTGAA,
GGTGGATGGCTTGAGGTCAGAAGTTTGACAC
ATXN7 v 6 1
ATXN8
A VPR1A
v 6 3 A VPR1A v 2 v
A VPR1A v 3 v
C9orf72 v 36 4 v 7 v
CACGCCCCGGCCCCGGCCCCGGC,
CCGCCCCGGGCCCGCCCCCGACCACGCCCCCGGCCCCGGCCCCGG
CC,
CCGGCCCCGGCCCCGGCCCCGGC
CACNA1A v 19 14
CACNA1C v 10 9 v 4 v
AAACTAGTCAGGTCAGGATCGTG,
AAGGTCAGGATCGTGTGGTTG,
ATCGTGTGGTTGTAAACCAGCAAGGTCAG,
CCCTGACCTTACTAGTTTACGA
CEL v 5 5
CNBP v 4 3
CSTB v 41 13 v 27 v
AAAAAAAAAAAAAAAGAAAAAGAAAAGAAAAGGAAA,
ACTGCAGGCTCCGCCTCCCGGGTTCAC,
CGCGATCTCGGCTTACTGCAGGC,
CGGGAGGCGGAGCCTGCAGTAAGCC
DMPK v 30 16
DRD4 v 27 17 v 5 v
CAGGGGTCCTGGGGGAGGCCGGGCGCGG,
GCAGGGGTCCTGGGGGAGGCC,
GCCTCCCCCAGGACCCCTGTGGCC
FMR1 v 3 2
FXN v 10 7 v 4 v
ACCTCGTGATCCGGCCGCCTTGGCCTTCCAAAGTGCTGGGA,
ATATAAAAATTAGTCGCCGGGTG
CCCCGTCTCTACTAAAAATATAAA
GP1BA v 16 7
146
HIC1 v 6 7 v 1
HTT v 8 11
IL1RN v 9 5 v 2
INS
JPH3 v 7 10
MAOA v 4 5 v 3
MMP9 v 4 3
MUC1 v 14 10 v 4
MUC21 v 7 7 v 4 v
ATCCCACTGGACACTGTGCTAGACTCAGAGTTGG,
CAGAGTTGGTGACTGTGCTGGCCCCACTGGAG,
GGGGCCAGCACTGCCACCAAC
NACA v 14 7
PER3 v 2 4 v 2
PPP2R2B v 12 7
SLC6A3
v 18 11
SLC6A3
SLC6A4
v 10 4
SLC6A4
TBP v 6 3
TCHH v 5 3
1
eGene or eGene-eVNTR pair replicated in eMotif discoveries using genome-wide P-value cutoffs.
Table B.3: eQTL mapping using a genome-wide P-value cutoff.
Tissue P-value cutoff
Kidney_Cortex 1.23E-05
Brain_Substantia_nigra 5.68E-05
Vagina 8.29E-05
Uterus 8.58E-05
Brain_Amygdala 1.05E-04
Brain_Spinal_cord_cervical_c-1 1.17E-04
Cells_EBV-transformed_lymphocytes 1.36E-04
Minor_Salivary_Gland 1.42E-04
Ovary 1.68E-04
Brain_Hippocampus 1.70E-04
Small_Intestine_Terminal_Ileum 1.84E-04
Brain_Hypothalamus 1.84E-04
Brain_Anterior_cingulate_cortex_BA24 1.87E-04
Liver 2.38E-04
Brain_Putamen_basal_ganglia 2.42E-04
147
Artery_Coronary 2.53E-04
Brain_Frontal_Cortex_BA9 2.54E-04
Prostate 2.76E-04
Brain_Caudate_basal_ganglia 3.15E-04
Brain_Nucleus_accumbens_basal_ganglia 3.17E-04
Adrenal_Gland 3.47E-04
Brain_Cortex 3.89E-04
Brain_Cerebellar_Hemisphere 3.90E-04
Spleen 4.24E-04
Pituitary 4.30E-04
Stomach 4.55E-04
Brain_Cerebellum 5.52E-04
Pancreas 5.64E-04
Colon_Sigmoid 5.75E-04
Colon_Transverse 6.15E-04
Esophagus_Gastroesophageal_Junction 6.20E-04
Heart_Left_Ventricle 6.27E-04
Breast_Mammary_Tissue 6.33E-04
Heart_Atrial_Appendage 6.63E-04
Testis 7.96E-04
Artery_Aorta 8.12E-04
Adipose_Visceral_Omentum 8.45E-04
Lung 9.05E-04
Esophagus_Muscularis 1.04E-03
Esophagus_Mucosa 1.07E-03
Skin_Not_Sun_Exposed_Suprapubic 1.07E-03
Adipose_Subcutaneous 1.15E-03
Artery_Tibial 1.21E-03
Muscle_Skeletal 1.24E-03
Cells_Cultured_fibroblasts 1.24E-03
Skin_Sun_Exposed_Lower_leg 1.26E-03
Whole_Blood 1.27E-03
Nerve_Tibial 1.33E-03
Thyroid 1.42E-03
148
Abstract (if available)
Abstract
Variable number tandem repeats (VNTRs) are consecutive repetitive elements in the human genome. They are subject to expansion and contraction during replication and exhibit considerable variability between individuals. Most genetics studies exclude these regions due to unreliable read alignments and variant calls. However, VNTRs contribute to abundant phenotypic variations in the human population and have extensive impact on disease risk. While methods exist to genotype VNTRs, no tools consider the following key components at the same time. First, genotyping VNTRs from short reads can benefit from the vast amount of population or cohort studies based on short-read sequencing. Second, multiallelic variant calls are preferred for the copy number varying nature of VNTRs. Third, motif variations are a common mode of mutation besides expansion/contraction, and their functional impacts have not been systematically evaluated. In this dissertation, I present our efforts to genotype VNTRs from short reads using a novel data structure – the repeat-pangenome graph – which provides a unified framework for profiling the functional impact of VNTRs at both copy number and motif levels.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Characterizing synonymous variants by leveraging gene expression and GWAS datasets
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Scalable latent factor models for inferring genetic regulatory networks
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Breaking the plateau in de novo genome scaffolding
PDF
Unlocking capacities of genomics datasets through effective computational methods
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Genetic and molecular insights into the genotype-phenotype relationship
PDF
Efficient short-read sequencing on long-read sequencers
PDF
A global view of disparity in imputation resources for conducting genetic studies in diverse populations
PDF
Understanding the characteristic of single nucleotide variants
PDF
Exploring the genetic basis of quantitative traits
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
A diagrammatic analysis of the secondary structural ensemble of CNG trinucleotide repeat
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Understanding ancestry-specific disease allelic effect sizes by leveraging multi-ancestry single-cell RNA-seq data
PDF
Ecologically responsible domestication of kelp facilitated by genomic tools
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
Asset Metadata
Creator
Lu, Tsung-Yu
(author)
Core Title
danbing-tk: a computational genetics framework for studying variable number tandem repeats
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2022-08
Publication Date
07/05/2022
Defense Date
06/17/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
copy number variation,eQTL mapping,fine-mapping,genome graph,genotyping,graph aligner,motif variation,OAI-PMH Harvest,tandem repeat,VNTR
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chaisson, Mark J.P. (
committee chair
), Chen, Liang (
committee member
), Gazal, Steven (
committee member
), Mancuso, Nicholas (
committee member
)
Creator Email
joyeuxnoel8@gmail.com,tsungyul8@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111352077
Unique identifier
UC111352077
Legacy Identifier
etd-LuTsungYu-10807
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Lu, Tsung-Yu
Type
texts
Source
20220706-usctheses-batch-950
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
copy number variation
eQTL mapping
fine-mapping
genome graph
genotyping
graph aligner
motif variation
tandem repeat
VNTR