Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Application of machine learning methods in genomic data analysis
(USC Thesis Other)
Application of machine learning methods in genomic data analysis
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
APPLICATION OF MACHINE LEARNING METHODS IN GENOMIC DATA ANALYSIS by Han Li A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) December 2019 Copyright 2019 Han Li Acknowledgments First, I would like to express my sincere gratitude to my advisors Professor Fengzhu Sun and Professor Peter Ralph for their support and mentoring. It's my greatest luck in the past six years to have them as my advisors. Their influences on me are far beyond research. I'm also grateful to Professor Liang Chen, Professor Ian Ehrenreich and Pro fessor Paul Marjoram for being committee members for my qualifying exam or dissertation defense. I would like to thank Professor Michael Waterman for his encouragement to my study and setting a model of being a researcher. My thanks also go to my lab mates and friends in our program for their help and care: Mengge Zhang, Wangshu Zhang, Weili Wang, Beibei Xin, Liz Ji, Jie Ren, Chao Deng, Xiaofei Wang, Lin Yang, Wenzheng Li, Junsong Zhao, Long Pei and many others. Last, I want to sincerely thank my mom and dad for their unconditional love, trust and support for me. 11 Contents Acknowledgments ii List of Tables v List of Figures vi Abstract xi 1 Introduction 1 1.1 Genomic variation . . . . . . . . . . . . . 1 1.1.1 DNA mutations: types and effects . 1 1.1. 2 Genetic polymorphisms . . . . . . . 3 1. 2 Detection of genomic variation . . . . . . . 4 1.2.1 Experimental approaches to detect genomic variation 4 1.2.2 Variant calling from NGS data. . . . . . . . . . . . . 5 1.3 Machine learning applications in bioinformatics . . . . . . . 6 1.3.1 Unsupervised learning methods in sequencing data investi- gation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.2 Supervised learning methods in sequencing data classification 8 2 Population Structure Variation Along the Genome 11 2.1 Introduction . . . . . . . . . . . 11 2.2 Materials and Methods . . . . . . 14 2.2.1 Choosing window length . 14 2.2.2 PCA in genomic windows 17 2.2.3 Similarity of patterns of relatedness between windows 18 2.2.4 Visualization of results 19 2.2.5 Testing . 20 2.2.6 Datasets . 24 2.3 Results . . . . . . 30 2.3.1 Validation 30 2.3.2 Drosophila m elanogaster 34 lll 2.3.3 Human ....... . 2.3.4 Medicago truncatula 2 .4 Discussion . . . . . . . . . . 3 Predicting the hosts of viruses based on viral sequences 3.1 Introduction ........................ . 3.2 Materials and Methods .................. . 3. 2.1 Calculate the pairwise distance matrix of viruses . 3.2.2 Visualize the distance matrix .......... . 3.2.3 Predict the host of a virus ............ . 38 38 59 64 64 65 65 66 66 3.2.4 Investigate the impact of sample sizes on prediction accuracy 67 3.2.5 Datasets . . 68 3.3 Results . . . . . . . 69 3.3.1 Rabies virus 70 3.3.2 Coronavirus 73 3.3.3 Influenza A virus 7 4 3.4 Discussion . . . . . . . . 89 4 Perspectives and Future Work Reference List 91 94 lV List of Tables 2.1 Measures of signal and noise, computed separately for each chromo- some arm in the Drosophila dataset, at different window sizes. 17 2.2 Descriptive statistics for each dataset used. . . . . . . . . . . 26 2.3 Correlations between MDS coordinates of genomic regions between runs with different parameter values. . . . . . . . . . . . . . . . . . 40 V List of Figures 1.1 2.1 Different beaks in Darwin's finches .......... . An illustration of the method; see Methods for details. 2.2 PCA plots for chromosome arms 2L, 2R, 3L, 3R and X of the 2 15 Drosophila melanogaster dataset. . . . . . . . . . . . . . . . . . 27 2.3 PCA plots for all 22 human autosomes from the POPRES data. 28 2.4 PCA plots for all 8 chromosomes in the Medicago truncatula dataset. 29 2.5 MDS visualizations of the Gaussian genotypes described in Gaussian simulation, for 50 individuals from each of three populations. . . . . 31 2.6 MDS visualizations of the results of individual-based neutral simu lations using SLiM. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2. 7 MDS visualizations of the results of individual-based simulations with linked selection using SLiM. . . . . . . . . . . . . . . . . . . . 33 2.8 Variation in patterns of relatedness for windows across Drosophila melanogaster chromosome arms. . . . . . . . . . . . . . . . . . . . . 35 2.9 MDS visualizations for each chromosome arm of Drosophila melanogaster, as in Figure 2.8, except that the method was run using five PCs (k = 5) instead of two. . . . . . . . . . . . . . . . . . . . . . . . . . 41 Vl 2.10 The proportion of data in each window that are missing, com pared to the value of the first MDS coordinate for the Drosophila melanogaster data from Figure 2.8. . . . . . . . . . . . . . . . . . . 42 2.11 PCA plots for the three sets of genomic windows colored in Figure 2.8, on each chromosome arm of Drosophila melanogaster. . . . . . . 43 2.12 PCA plots for the three sets of genomic windows colored in Figure 2.8, on each chromosome arm of Drosophila melanogaster. . . . . . . 44 2.13 The effects of population structure without inversions is correlated to recombination rate in Drosophila melanogaster. . . . . . . . . . . 45 2.14 Variation in structure for windows of 1,000 SNPs across Drosophila melanogaster chromosome arms: without inversions. . . . . . . . . . 46 2.15 Recombination rate, and the effects of population structure for Drosophila melanogaster 47 2.16 Variation in structure between windows on human chromosomes 8, 15, and 17. . . . . . . . . . . . . . . . . . 48 2.17 MDS plots for human chromosomes 1-8. 49 2.18 MDS plots for human chromosomes 9-16 50 2.19 MDS plots for human chromosomes 17-22 51 2.20 Comparison of PCA figures within outlying windows (center col- umn) and flanking non-outlying windows (left and right columns) for the two windows having outlying MDS scores on chromosome 8. 52 2.21 MDS visualization of variation in the effects of population structure amongst windows across all human autosomes simultaneously. . . . 53 2.22 MDS visualizations of the effects of population structure for all 8 chromosomes of the Medicago truncatula data, using windows of 10 4 SNPs. 54 Vll 2.23 PCA plots for regions colored in Figure 2.22 on all 8 chromosomes of Medicago truncatula: (A) green, (B) orange, and (C) purple. 55 2.24 MDS visualization of patterns of relatedness on M. truncatula chro mosome 3, with corresponding PCA plots. . . . . . . . . . . . . . . 56 2.25 MDS coordinate and gene density for each window in the Medicago genome, for chromosomes 1-8 . . . . . . . . . . . . . . . . . . . . . 5 7 2.26 First MDS coordinate against gene density for all 8 chromosomes of M. truncatula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.1 The prediction accuracy of the KNN algorithm for the rabies virus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K, horizontal axis). . . . . . . . . . 70 3.2 The prediction accuracy of the KNN algorithm for the coronavirus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K , horizontal axis). . . . . . . . . . 71 3.3 The prediction accuracy of the KNN algorithm for the influenza A virus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K, horizontal axis) . . . . . . . 72 3.4 MDS plots for the 148 rabies viruses with complete N gene sequences based on the Manhattan distance using 6-mers (left) and alignment (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.5 Hierarchical clustering of 148 rabies viruses with complete N gene sequences using the alignment-based distances of the virus sequences. 75 3.6 Hierarchical clustering of 148 rabies viruses with complete N gene sequences based on the Manhattan distance between the 6-mer fre quencies of the virus sequences. . . . . . . . . . . . . . . . . . . . . 76 Vlll 3. 7 The prediction accuracy for different sample sizes for the rabies virus dataset using alignment-based distance, Manhattan distance with 6-mers and one nearest neighbor, and SVM. . . . . . . . . . . 77 3.8 The prediction accuracy for different sample sizes for the rabies virus dataset using alignment-based distance, Manhattan distance with 6- mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross-validation. . . . . . . . . . . . . . . . . . . . . 78 3.9 MDS plots for the 707 coronaviruses with spike gene sequences based on the distances calculated by the Manhattan distance with 6-mers (left) and alignment (right). . . . . . . . . . . . . . . . . . . . . . . 79 3.10 Hierarchical clustering of the 707 coronaviruses with spike gene sequences using the alignment-based distances of the virus sequences. 80 3.11 Hierarchical clustering of the 707 coronaviruses with spike gene sequences using the Manhattan distance between the 6-mer frequen- cies of the virus sequences. . . . . . . . . . . . . . . . . . . . . . . . 81 3.12 The prediction accuracy for different sample sizes for coronaviruses "allspike" set using alignment-based distance, Manhattan distance with 6-mers and 7 nearest neighbor, and SVM. . . . . . . . . . . . . 82 3.13 The prediction accuracy for different sample sizes for the coronavirus dataset using alignment-based distance, Manhattan distance with 6- mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross-validation. . . . . . . . . . . . . . . . . . . . . 83 3.14 MDS plots for the influenza A viruses with N gene sequences based on the distances calculated using Manhattan distance of 6-mer fre quencies (left) and alignment method (right). . . . . . . . . . . . . 84 IX 3.15 Hierarchical clustering of the Influenza A viruses with N gene sequences using the alignment-based distances of the virus sequences. . . . . . 85 3.16 Hierarchical clustering of the Influenza A viruses with N gene sequences using the Manhattan distance between the 6mer frequencies of the virus sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.17 The prediction accuracy for different sample sizes for influenza A dataset using alignment-based distance, Manhattan distance with 6-mers and 7 nearest neighbor, and SVM. 87 3.18 The prediction accuracy for different sample sizes for the influenza A virus dataset using alignment-based distance, Manhattan dis- tance with 6-mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross-validation. . . . . . . . . . . . . . 88 X Abstract The Human Genome Project opened an era of deciphering the mystery of the genome. Then, Next Generation Sequencing (NGS) technologies enabled us to obtain massive high quality sequencing data and revolutionized genomic research. Meanwhile, the development and extensive application of machine learning shows its impressive competence in the big data era. In this dissertation, I implement statistical and machine learning tools to extract information from the genetic vari ation in genome sequences and make predictions for future outcomes of interest. I first present a study using unsupervised learning methods to investigate the variation of population structure along the genome and try to identify the under lying biological causes. In this work, I use principal component analysis (PCA) to see the effect of population structure on each segment of the genome, followed by multidimensional scaling (MDS) to describe the variation across the genome. Exploration in three species and validation from simulations reveal significant vari ation in the effect of population structure on the scale of only a few megabases. In modern humans, polymorphic chromosomal inversions are the main factors that cause the localized variation. In Medicago truncatula, the variation is correlated with local gene density, and is perhaps due to linked selection. In Drosophila melanogaster, both inversions and gene density contribute to the variation. Xl Next, I present another work comparing various supervised learning methods to predict the hosts of viruses based on viral sequences. In this work, I com pared alignment and alignment-free based distances followed by K nearest neigh bors (KNN), and k-mer frequency feature engineering followed by support vector machine (SVM) to predict hosts for rabies virus, coronavirus, and influenza A virus datasets. I showed that alignment-free methods could be a promising substitute for alignment-based methods and their performances are highly correlated with sample size, evenness of sampling and sequence conservation. Xll Chapter 1 Introduction 1.1 Genomic variation DNA stores genetic information as sequences of nucleotides. In the human genome, there are approximately 3 billion nucleotides of DNA [12]. As of 2015, it was found that around 99.4% of the genomes of any two humans are identical. The remaining 0.6%, together with environmental factors, make us all unique [23]. Even a subtle change in the DNA sequences may lead to a great difference in gene expression, and thus to variation in traits. The differences of DNA sequences among individuals are called genomic variation. This dissertation focuses on study ing the genomic variation uncovered by sequencing data. Investigation of the amount and pattern of genomic variation will help to deci pher the mystery of DNA and help us to better understand evolutionary history. Major types of genomic variations are mutations and polymorphisms. 1.1.1 DNA mutations: types and effects Mutations arise from errors during DNA replication or repair. Generally, muta tions happen randomly and at a low rate. In human, it was estimated as around 2.5 * 10-s mutations per nucleotide site per generation [79]. They can also be induced by environmental factors, such as exposure to radiation or certain chem icals ( eg. dioxins) [86]. Somatic mutations occurring in non-germline cells only affect the present individual itself. If mutations occur in germline cells and affect 1 J 1. Geospiza magnirostris. 3. Geuspiza parvula. 2. Geospiza fortis. 4. Certhidea c,liva1ea. Figure 1.1 Different beaks in Darwin's / Galapagos finches. Charles Darwin first observed and recorded the remarkable diversity of them during the second voyage of the Beagle [27]. the hereditary material, they can be passed on to the descendants thus affect the next generation. The accumulation of heritable mutations together with the nat ural selection by environmental pressures generate species diversity, which are the basis for evolution [ 68]. Figure 1.1 shows different sizes and shapes of beaks in Darwin's finches. They were evolved to adapt to different food resources on the Galapagos Islands [27]. Recent genomic work by Grant and Grant [36] has helped us to understand how this ongoing evolution is possible. The effect of mutation is uncertain. It can be advantageous, neutral or deleteri ous in term of fitness to the environment. Studies of the distribution of mutational 2 effects, eg. by Eyre-Walker and Keightley [31], show that most mutations have trivial effects, while most non-neutral mutations are harmful. Small-scale mutations affect one or several nucleotides, including insertions, deletions or substitution of one or a few nucleotides. Point mutations are the smallest mutations where a single nucleotide is changed. Mutations of coding sequence that do not alter the amino acid sequences are called synonymous muta tions, which can happen due to the degeneracy of codons. Other mutations that do not affect protein structure take place in noncoding regions such as introns and intergenic regions. In contrast, nonsynonymous mutations change the protein production, and may result in malfunction or premature termination of a protein. In terms of small-scale mutations, insertions or deletions of one or two nucleotides ( or more generally, of a number that is not divisible by 3) within coding regions have the greatest influence since they cause a reading frame shift and affect the whole amino acid sequence [12]. Large-scale mutations involve the alteration of relatively large segments of chro mosomes, including gene duplication and deletion, chromosomal translocations and inversions and so on. In Chapter 2, we'll discuss how chromosomal inversions have strong effects on genetic variation patterns. 1.1.2 Genetic polymorphisms Genetic polymorphism results from mutations. It refers to the presence of two or more alleles at a particular locus and each allele has a frequency of at least 1 % in a population [11 , 54]. They contribute to the normal differences of traits among individuals. For example, different people might have different hair color or eye color naturally. 3 The most common type of polymorphism is the single nucleotide polymorphism (SNP) , occurring roughly every 1000 bps in the human genome [23]. SNPs occur throughout the genome, though the distribution is not homogenous. Due to nat ural selection, SNPs occur more frequently in non-coding regions than in coding sequences. The differences in recombination rate in different genomic regions also contribute greatly to the variation of SNP density [78]. The study of SNPs has surged in recent years with next generation sequencing (NGS) approaches facil itating the tremendous discovery of SNPs, and plays a key role in personalized medicine. SNPs are extensively used in association studies as markers related to certain diseases or traits. In population genetics, the linkage disequilibrium phenomenon of SNPs being inherited coherently is exploited widely to understand evolutionary events [102]. In Chapter 2, we'll investigate SNPs data in three species using unsupervised learning methods. Another type of polymorphism, copy number variation, such as minisatellites and microsatellites, refer to the variations in the number of repeats of certain DNA sequences. They are usually located in non-coding region and are extensively used as molecular markers in DNA fingerprinting and genetic studies. Some are also found to be related to the regulation of gene expression [39]. 1. 2 Detection of genomic variation 1. 2 .1 Experimental approaches to detect genomic variation There are various approaches to detect genomic variation. Historically, the enzymatic method in wet lab experiments was the first widely used method to detect polymorphisms. It takes use of different restriction enzymes to digest the DNA sequences and visualizes the products with electrophoresis on a gel. 4 Sequences with different restriction sites will be cut into segments with differ ent lengths. Polymorphisms detected in this way are called restriction fragment length polymorphisms (RFLPs) [60]. A more modern method uses hybridization analysis on high-density oligonucleotide arrays, which can measure thousands of genes' quantitative expression levels. The hybridization patterns for each gene are specific and depend on the sequences (probes) that bind. This allows the identifica tion of potential sequence differences [37]. A chromatographic method, Denaturing High Performance Liquid Chromatography (DHPLC), takes use of the difference of retention time in homoduplexes and heteroduplexes, since heteroduplexes contain DNA mismatches and are less stable than homoduplexes at a certain tempera ture. DHPLC has high speed and resolution and is particularly useful to detect base substitutions and small indels [85]. Other physical methods such as mass spectrometry and fluorescence exchange-based methods are also exploited to find allelic variants [ 60] . 1.2.2 Variant calling from NGS data Traditional wet lab experimental approaches to detect the genetic variation could be quite expensive or with limited throughput. With the development of NGS technologies, unprecedented high quality sequencing data have been obtained in a wide range of species. Techniques that perform SNP genotyping from NGS data are becoming more and more popular. The general process of SNP calling from raw NGS data are as follows [82]. 1) Pre-process the raw data through quality control and length filtering to transform it into reads with high quality score. 5 2) Align the reads with the reference sequence. This step is crucial for the SNP calling accuracy. Alignment algorithms need to be able to deal with mismatches resulting from real variations as well as sequencing errors. 3) Use probabilistic methods to predict the genotype likelihood at each site, based on recalibrated quality scores and depth of coverage. 4) Post-process the predicted results to filter the SNPs set with high scores. 1.3 Machine learning applications in bioinfor matics Machine learning is a part of artificial intelligence that provides automated methods to detect patterns in data and predict outcomes of interest. In this big data era, it's been widely used in areas such as pattern recognition, financial services, and product recommendations. In bioinformatics, it also plays a very important role in revealing the mystery in sequences. The usual steps of machine learning workflow are: 1) Collecting and preprocessing data; 2) Choosing a model; 3)Model training and parameter tuning; 4) Model assessment; 5) Prediction and future application. Based on whether the original data has an outcome measurement, we divide machine learning methods into two big categories, unsupervised learning and super vised learning. 6 1.3.1 Unsupervised learning methods in sequencing data investigation Unsupervised learning methods such as dimension reduction and cluster anal ysis are widely used in population genetics to investigate population structure and evolution histories. Principal component analysis (PCA) is a dimension reduction method that is often used to describe overall "population structure", or patterns of relatedness between individuals in large genomic datasets. Mean relatedness is an average of the relationships across locus-specific genealogical trees. In population genetics, SNP data can be recoded into numeric data (0/1 for haplotypes, 0/1/2 for diplo types) to which we can apply PCA to summarize the patterns of genetic relatedness [89]. Another dimension reduction method, multidimensional scaling (MDS), works on the matrix of pairwise distances among the observations and is a great tool to intuitively check whether there are clusters for the observations. In Chapter 2, I use Euclidean distances among reconstructed covariance matrices from principal components to represent the dissimilarity of individuals' SNP genotypes. Those methods provide a flexible new way to discover biological drivers of genetic varia tion. I applied these methods to genomic data from three species, and found that the effect of population structure can vary substantially across only a few mega bases in each species. In modern human dataset, this variation is likely explained by poly morphic chromosomal inversions. In Medicago truncatula, the variation is found to be correlated with local gene density, and may be caused by linked selection, such as background selection or local adaptation. In Drosophila melanogaster, both 7 chromosomal inversions and linked selection may drive the variation of population structure along the genome. Common clustering algorithms such as K-means usually requires choosing the number of clusters beforehand, while hierarchical clustering, which is also imple mented on the distance matrix of observations, does not need that specification. It represents a hierarchy relation among the observations, and each level is obtained by merging clusters in the lower level. In Chapter 3, I try to obtain the genetic distances between individuals with next generation sequencing data. This can be achieved by the traditional way relying on sequence alignment, which requires assembly of short reads and is highly time-consuming. Or, the alignment-free based measurements, which takes in the information of k-mer counts, are also proven to be a promising substitute [97]. Then I visualize the clusters of the individuals through MDS and hierarchical clustering for helping the subsequent analysis. 1.3.2 Supervised learning methods in sequencing data clas sification The dissimilarities between sequences can be achieved from alignment method or alignment-free methods. Also, sequencing data can be converted to numeric data through feature engineering, for example, the transformation of k-mer frequencies. Based on whether the outcomes are quantitative or qualitative, we can divide supervised learning into two categories, classification and regression. In Chapter 3, I focus on predicting categorical variables, i.e. classification. There are many algorithms for classification, such as (regularized) logistic regression, K nearest neighbors (KNN), support vector machines (SVM), deci sion tree based models (Random Forest, Gradient Boosting Decision Tree) and so on. No one dominant algorithm beats all others for every problem. Each model 8 has its own characteristics, regarding aspects such as time and memory complexity, overfitting tendency, tolerance of features numbers, interpretability. For example, logistic regression is easy to implement and interpret and has low time and space requirments. However, for data with non-linear decision boundaries, it tends to underfit and thus have poor performance. SVM, on the other hand, can choose different kernels and better model for data with non-linear decision boundaries. However, it requires a large amount of memory, and thus is hard to apply to large datasets. After choosing an algorithm, we usually use cross-validation to do model selection (parameter tuning) and evaluation. An essential topic in epidemiology is to predict the hosts of newly discovered viruses for pandemic surveillance of infectious diseases. Viruses reproduce and evolve rapidly. Many virus infections are great threats to human health. Stud ies showed many emerging infectious diseases are caused by virus cross-species transmission, such as the outbreaks of SARS, MARS, and HlNl [16, 70]. Thus to precisely and quickly identify the viral origins is highly helpful for the control and prevention of the infectious diseases caused. In Chapter 3, I collected 3 sets of viruses with known hosts and using this qualitatively labeled data, I applied supervised learning (mainly KNN and SVM) to do classification. The predicted result for a virus would be its most probable host species. I obtained the dissimilarities among virus sequences using both alignment-based and alignment-free methods, as discussed in Chapter 1.3.1. Then I applied KNN to the pairwise distance matrices to do the classification. I also evaluated the use of mononucleotide frequency and dinucleotide bias of the sequences as features and apply SVM to the feature matrices to predict the hosts of viruses. I applied these approaches to three datasets: rabies virus, coronavirus, and influenza A virus. For coronavirus, I used the spike gene sequences, while for rabies and influenza A 9 viruses, I used the more conserved nucleoprotein gene sequences. I compared the three methods under different scenarios and showed that their performances are highly correlated with the variability of sequences and sample size. For conserved genes like the nucleoprotein gene, longer k-mers than mono- and dinucleotides are needed to better distinguish the sequences. I also showed that both alignment based and alignment-free methods can accurately predict the hosts of viruses. When alignment is difficult to achieve or highly time-consuming, alignment-free methods can be a promising substitute to predict the hosts of new viruses. 10 Chapter 2 Population Structure Variation Along the Genome 2 .1 lntrod uction Wright [114] defined population structure to encompass "such matters as num bers, composition by age and sex, and state of subdivision", where "subdivision" refers to restricted migration between subpopulations. The phrase is also com monly used to refer to the genetic patterns that result from this process, as for instance reduced mean relatedness between individuals from distinct populations. However, it is not necessarily clear what aspects of demography should be included in the concept. For instance, Blair [8] defines population structure to be the sum total of "such factors as size of breeding populations, periodic fluctuation of popu lation size, sex ratio, activity range and differential survival of progeny" ( emphasis added). The definition is similar to Wright 's, but differs in including the effects of natural selection. On closer examination, incorporating differential survival or fecundity makes the concept less clear: should a randomly mating population con sisting of two types that exhibit partial post-zygotic reproductive isolation from each other be said to show population structure or not? Whatever the definition, it is clear that due to natural selection, the effects of population structure - the realized patterns of genetic relatedness - differ depending on which portion of the genome is being considered. For instance, strongly locally adapted alleles of a gene 11 will be selected against in migrants to different habitats, increasing genetic differ entiation between populations near to this gene. Similarly, newly adaptive alleles spread first in local populations. These observations motivate many methods to search for genetic loci under selection, as for example in Huerta-Sanchez et al. [ 47], Martin et al. [75], and Duforet-Frebourg et al. [28]. These realized patterns of genetic relatedness summarize the shapes of the genealogical trees at each location along the genome. Since these trees vary along the genome, so does relatedness, but averaging over sufficiently many trees we hope to get a stable estimate that doesn't depend much on the genetic markers chosen. This is not guaranteed: for instance, relatedness on sex chromosomes is expected to differ from the autosomes; and positive or negative selection on particular loci can dramatically distort shapes of nearby genealogies [6, 17, 57]. Indeed, many species show chromosome-scale variation in diversity and divergence (e.g., Langley et al. [63]); species phylogenies can differ along the genome due to incomplete lineage sorting, adaptive introgression and/ or local adaptation (e.g., Ellegren et al. [30], Nadeau et al. [80], Pease and Hahn [90], Pool [93], Vernot and Akey [110]) ; and theoretical expectations predict that geographic patterns of relatedness should depend on selection [21]. Patterns in genome-wide relatedness are often summarized by applying princi pal components analysis (PCA, Patterson et al. [89]) to the genetic data matrix, as inspired by the pioneering work of Menozzi et al. [77]. The results of PCA can be related to the genealogical history of the samples, such as time to most recent common ancestor and migration rate between populations [76, 83], and sometimes produce "maps" of population structure that reflect the samples' geographic origin distorted by rates of gene flow [84]. 12 Modeling such "background" kinship between samples is essential to genome wide association studies (GWAS, Astle and Balding [4], Price et al. [95]) , and so understanding variation in kinship along the genome could lead to more generally powerful methods, and may be essential for doing GWAS in species with substantial heterogeneity in realized patterns of mean relatedness along the genome. Others have applied PCA to windows of the genome: Ma and Amos [72] used local PCA much as we do to identify putative chromosomal inversions. Brye et al. [13] and Brisbin et al. [10] used PCA to infer tracts of local ancestry in recently admixed populations, but by projecting each genomic window onto the axes of a single, globally-defined PCA rather than doing PCA separately on each window. A note on nomenclature: In this work we describe variation in patterns of relatedness using local PCA, where "local" refers to proximity along the genome. A number of general methods for dimensionality reduction also use a strategy of "local PCA" (e.g., Kambhatla and Leen [51], Manj6n et al. [74], Roweis and Saul [99], Weingessel and Hornik [113]), performing PCA not on the entire dataset but instead on subsets of observations, providing local pictures which are then stitched back together to give a global picture. At first sight, this differs from our method in that we restrict to subsets of variables instead of subsets of observations. However, if we flip perspectives and think of each genetic variant as an observation, our method shares common threads, although our method does not subsequently use adjacency along the genome, as we aim to identify similar regions that may be distant. It is common to describe variation along the genome of simple statistics such as Fsr and to interpret the results in terms of the action of selection (e.g., Ellegren et al. [30], Turner et al. [109]). However, a given pattern (e.g. , valleys of Fsr) can be caused by more than one biological process [14, 26], which in retrospect is 13 unsurprising given that we are using a single statistic to describe a complex process. It is also common to use methods such as PCA to visualize large-scale patterns in mean genome-wide relatedness. In this paper we show if and how patterns of mean relatedness vary systematically along the genome, in a way particularly suited to large samples from geographically distributed populations. Geographic population structure sets the stage by establishing "background" patterns of relatedness; our method then describes how this structure is affected by selection and other factors. The method is descriptive: it does not aim to identify outlier loci, but rather to describe larger-scale variation shared by many parts of the genome, and give clues about its source. 2.2 Materials and Methods As depicted in Figure 2.1 , the general steps to the method are: ( 1) divide the genome into windows, (2) summarize the patterns of relatedness in each window, (3) measure dissimilarity in relatedness between each pair of windows, ( 4) visualize the resulting dissimilarity matrix using multidimensional scaling (MDS), and (5) combine similar windows to more accurately visualize local effects of population structure using PCA. 2.2.1 Choosing window length To begin, we first recoded sampled genotypes as numeric matrices in the usual manner, by recording the number of nonreference alleles seen at each locus for each sample. We then divided the genome into contiguous segments ("windows"). The choice of window length entails a balance between signal and noise. In very short windows, genealogies of the samples will only be represented by a few trees, so 14 1) ACT AGT samples AGT ACT ACT 2 3 \ CGCCAACACAGATGTGAC CGCCTACACAGAGGTGCC CGTCTTCACAGATGTCAC CGCCTAGACAGAGGTGAC AGCCAACACAGATGTGCC 8 2) J 4 ~ ,------,----, r-::,;--'------, 0 PCA , . . on each ~ g j;r. . 1. window ,;.· · · g - 0.05 3) Find distances between D PCA maps 0.00 - - ,c, 1 2 3 4 5 • .. ·- (~ 1,,/ .L N PC, 5) 4) UseMDS to display variation between windows ACT AGT AGT ACT ACT Show extremes of structure with PCA on merged outlying windows 0 0 N ~ g 0 0 cf " .. - 0.15 - 0.1 0 - 0.05 0.00 0.05 PC1 Figure 2.1 An illustration of the method; see Methods for details. variation between windows represents demographic noise rather than meaningful variation in patterns of relatedness. Longer windows generally have more distinct trees ( and SNPs), allowing for less noisy estimation of local patterns of related ness. However, to better resolve meaningful signal, i.e. , differences in patterns of relatedness along the genome, we would like reasonably short windows. Since we summarize patterns of relatedness using relative positions in the prin cipal component maps, we quantify "noise" as the standard error of a sample's position on PCl in a particular window, averaged across windows and samples, and "signal" as the standard deviation of the sample's position on PCl over all 15 windows, averaged over samples. The definition of eigenvectors does not specify their sign, and so when comparing between windows we choose signs to best match each other: after choosing PC1 1 , for instance, if u is the first eigenvector obtained from the covariance matrix for window j, then we next choose PClj = ±u, where the sign is chosen according to which of IIPC1 1 - ull or IIPC1 1 + ull is smaller. After doing this, the mean variance across windows is where PClij is the position of the i th individual on PCl m window j, and PClj = (l/N) :Z:::f= 1 PClij· We estimate the standard error for each PClij using the block jackknife [15, 29]: we divide the lh window into 10 equal-sized pieces, and let PClij,k denote the first principal component of this region found after removing the k th piece; then the estimate of the squared standard error is (J'[j = {o :Z:::l~ 1 (PClij,k - / 0 :Z:::1~ 1 PClij,e)2. Averaging over samples and windows, For the main analysis, we defined windows to each consist of the same number of neighboring SNPs, and calculated (J';ignal and (J'~oise for a range of window sizes (i.e., numbers of SNPs). For our main results we chose the smallest window for which (J';ignal was consistently larger than (J'~oise (but checked other sizes); the values for various window sizes across Drosophila chromosomes are shown in Table 2.1. In the cases we examined, we found nearly identical results after varying window size, and choosing windows to be of the same physical length ( in bp) rather than in numbers of SNPs. 16 window length (SNPs) chrom. arm 100 500 1,000 10,000 100,000 2L 2 a noise 2.05 1.64 1.18 0.17 0.04 2 asignal 2.76 2.69 2.23 0.68 0.31 2R 2 a noise 2.18 1.92 1.63 0.58 0.13 2 asignal 2.78 2.70 2.65 2.31 1.82 3L 2 a noise 2.08 2.00 1.64 0.73 0.25 2 asignal 2.60 2.52 2.40 1.68 1.89 3R 2 a noise 1.95 1.76 1.44 0.59 0.20 2 asignal 2.58 2.51 2.44 1.96 1.40 X 2 a noise 2.48 2.04 1.54 1.62 0.17 2 asi nal 2.61 2.43 2.30 0.32 1.14 Table 2.1 Measures of signal and noise, computed separately for each chromosome arm in the Drosophila dataset, at different window sizes. All values are multiplied by 1, 000. Starting at windows of 1,000 SNPs, the signal (variation of PCl between windows) starts to be substantially larger than the noise (standard error of PCl for each window). 2.2.2 PCA in genomic windows We applied principal component analysis (PCA) as described in McVean [76] separately to the submatrices that corresponded to each window. Precisely, denote by Z the L x N recoded genotype matrix for a given window (Lis the number of SNPs and N is the sample size), and by Zs the mean of non-missing entries for alleles, so that Zs= ;s Lj Zsj , where the sum is over thens nonmissing genotypes. We first compute the mean-centered matrix X, as Xsi = Zsi - Zs , and preserving missingness. (This mean-centering makes the result not depend on the choice of reference allele, exactly if there is no missing data, and approximately otherwise.) Next, we find the covariance matrix of X, denoted C , as Cij = m 1 _ 1 Ls XsiXsj - '] miJ(~iJ-l) (Ls Xsi)(Ls Xsj), where all sums are over the mij sites where both sample i and sample j have nonmissing genotypes. The principal components are the eigenvectors of C , normalized to have Euclidean length equal to one, and ordered by magnitude of the eigenvalues. 17 The top 2- 5 principal components are generally good summaries of population structure; for ease of visualization we usually only use the first two (referred to as PCl and PC2), and check that results hold using more. The above procedure can be performed on any subset of the data; for future reference, denote by PClj and PC2j the result after applying to all SNPs in the lh window. (Note, however, that our measure of dissimilarity between windows does not depend on PC ordering.) 2.2.3 Similarity of patterns of relatedness between win dows We think of the local effects of population structure as being summarized by relative position of the samples in the space defined by the top principal compo nents. However, we do not compare patterns of relatedness of different genomic regions by directly comparing the PCs, since rotations or reflections of these imply identical patterns of relatedness. Instead, we compare the low-dimensional approx imations of the local covariance matrices obtained using the top k PCs, which is invariant under reordering of the PCs, reflections, and rotations and yet contains all other information about the PCs. (For results shown here, we use k = 2.) Furthermore, to remove the effect of artifacts such as mutation rate variation, we also rescale each approximate covariance matrix to be of similar size (precisely, so that the underlying data matrix has trace norm equal to one). To do this, define the N x k matrix V(i) so that V(i).e, the £ th column of V(i), is equal to the £ th principal component of the i th window, multiplied by 18 ( >.cd :Z:::~=l Ami) 1 / 2 , where Aci is the £ th eigenvalue of the genetic covariance matrix. Then, the rescaled, rank k approximate covariance matrix for the i th window is k M(i) = L V(i).cV(i)~- (2.1) C=l To measure the similarity of patterns of relatedness for the i th window and lh window, we then use Euclidean distance Dij between the matrices M(i) and M(j), defined by D;J = Lkc(M(i)k,c - M(j)k,c)2. The goal of comparing PC plots up to rotation and reflection turned out to be equivalent to comparing rank-k approximations to local covariance matrices. This suggests instead directly comparing entire local covariance matrices. How ever, with thousands of samples and tens of thousands of windows, computing the distance matrix would take months of CPU time, while as defined above, D can be computed in minutes using the following method. Since for square matrices A and B, Lij(Aij - BiJ) 2 = Lij(A;j + Et) - 2 tr(AT B), then due to the orthogonality of eigenvectors and the cyclic invariance of trace, Dij can be computed efficiently as ( k 2 k 2 k ) 1/2 D·. = :Z:::c=1 >.ci + :Z:::c=1 A cJ _ 2 '°' (V( ·)TV( ')) 2 iJ ("k , )2 ("k , )2 ~ i J Cm L....C=l ACi L....C=l ACj C ,m=l (2.2) 2.2.4 Visualization of results We use multidimensional scaling (MDS, Kruskal and Wish [61]) to visualize relationships between windows as summarized by the dissimilarity matrix D. MDS produces a set of m coordinates for each window that give the arrangement in m-dimensional space that best recapitulates the original distance matrix. For results here, we use m = 2 to produce one- or two-dimensional visualizations of relationships between windows' patterns of relatedness. 19 We then locate variation in patterns of relatedness along the genome by choos ing collections of windows that are nearby in MDS coordinates, and map their positions along the genome. A visualization of the effects of population struc ture across the entire collection is formed by extracting the corresponding genomic regions and performing PCA on all, aggregated, regions. 2.2.5 Testing We tested the method using two types of simulation. First, to verify expected behavior, we simulated "genomes" as an independent sequence of correlated Gaus sian "genotypes", using a different covariance matrix in the first quarter, middle half, and last quarter of the chromosome. To verify robustness to missing data, we ran the method after randomly drop ping 50% of the genotypes in the first half of the genome; if the method is misled by missing data, then it will distinguish the two halves of the chromosome rather than the segments having different covariance matrices. To provide a realistic test, we next used forwards-time, individual-based sim ulations, implemented using SLiM v3 [40]. To provide realistic population struc ture for PCA to identify, each simulation had at least 5,000 diploid individuals, living across a continuous square range, with Gaussian dispersal and local density dependent competition. Each genome was modeled on human chromosome 7, which is 1.54 x 10 8 bp long, with an overall recombination rate of 1.6785 crossovers per chromosome per generation. To improve speed, we used tskit [56] to record tree sequences in SLiM [41] and to add neutral mutations afterwards, at a rate of 10- 9 per bp per generation. Most simulations were neutral, but we also included linked selection, of two types. First, we introduced selected mutations into two regions, which extended from 1/3 to 1/2 and from 5/6 to the end of the genome 20 respectively. These had selection coefficients from a Gamma distribution with shape 2 and mean 0.005 at a rate of 10- 10 per bp, that were either beneficial (with probability 1/30) or deleterious (otherwise). Second, to roughly model a recent expansion followed by local adaptation, we introduced mutations in the same man ner as above, except that mutations were no longer unconditionally deleterious or beneficial: each selection coefficient was multiplied by a factor depending on the spatial location of the individual being evaluated, varying linearly from -1 at the left side of the range to + 1 at the right edge. In all simulations, genome-wide PCA displayed a map of the population range, as expected. Gaussian simulations We simulated genotypes at each locus independently, drawing each vector of genotypes from a multivariate Gaussian distribution with zero mean and covariance matrix I:. Sampled individuals came from three popula tions, and each I:ij depends on which populations the individuals i and j are in, as well as the location along the chromosome. There are three population-level mean 21 relatedness matrices along the genome, which apply to the first quarter (S( 1 )), the middle half (S( 2 )), and the last quarter (S( 3 )), respectively: 0.75 0.25 0.0 s(l) = 0.25 0.75 0.0 0.0 0.0 1.0 1.0 0.0 0.0 sc2) = 0.0 0.75 0.25 0.0 0.25 0.75 0.75 0.0 0.25 s(3) = 0.0 1.0 0.0 0.25 0.0 0.75 If indiviudals i and j are in populations p(i) and p(j) respectively, then the covari ance between their genotypes is Liij = Sp(i),p(j), using the appropriate S for that seg ment of the genome. The variance of individual i's genotype is Liii = Sp(i),p(i) + 0.1. We first created "genotypes" in this way with fifty individuals from each of the three populations; running our method on a genome with 99 windows of 400 loci each produced the first plot in Figure 2.5. These matrices are chosen so that the top two eigenvalues Li are the same (both 50.1), and so the ordering of the top two PCs is arbitrary. If our method was sensitive to PC ordering, then half the windows in each region that have one ordering would cluster with each other, separate from the other half. We then marked each genotype in the first half of the chromosome as missing, independently, with probability 1/2 and ran our method again, producing the 22 second plot of Figure 2.5. If our method was influenced by missing data, we would expect the first half of the chromosome to separate from the second in the MDS plot. SLiM simulations Our SLiM simulations were constructed as follows. Individ uals are diploid, and genomes have a length of 153,520,244 bp. Recombination was either (a) flat, with a constant rate of 10- 9 ; (b) according to the human female HapMap map for chromosome 7; or (c) constant in each of seven equal sized regions, beginning at 2.04 x 10-s, descending by a factor of four for three steps, and then ascending by a factor of four for three steps, so that the middle seventh has the lowest recombination rate, and the outer two sevenths has a rate 64 times higher. Selected mutations are introduced at a rate of 10- 10 per bp per individual per generation, and have selection coefficients drawn from a Gamma distribution with mean 0.005 and shape 2; each coefficient are either positive or negative with probabilities 1/30 and 29/30 respectively. Each simulation was run for 50,000 generations. Each individual has a spatial position in the two-dimensional square of width W = 8. Each time step, each individual chooses the nearest other to mate with, producing a random, Poisson distributed number of offspring with mean 1/3. Off spring are assigned random spatial locations displaced from their parent's by a bivariate Gaussian with mean zero and standard deviation a = 0.2, reflected to stay within the habitat range. Each individual survives to the next time step with probability equal to their fitness. Fitness values are determined multiplicatively by the effects of each muta tion, but are multiplied by an additional factor determined by the local density 23 of individuals. This factor is equal to p/(1 + C), where p = 21rK(J' 2 is the carry ing capacity per circle of radius (J'; K = 100 is the mean equilibrium population density; and C is the sum of a Gaussian kernel with standard deviation (J' = 0.1 between the focal individual and all other individuals within distance 3(]'. To avoid edge effects, fitnesses are further multiplied by min(l , z), where z is the distance to the nearest boundary. This produces populations that fluctuate at equilibrium around 6,000 individuals in total, fairly evenly spread across the square. In one additional simulation, we modified fitnesses by multiplying the selective effect of each allele in each individual by multiplying it by 2x/W - 1, where x is the x coordinate of the individual. This makes the effect of each allele opposite on the left and on the right, and neutral in the middle, and leads to a moderate number of balanced polymorphisms. 2.2.6 Datasets We applied the method to genomic datasets with good geographic sampling: 380 African Drosophila melanogaster from the Drosophila Genome Nexus [62], a worldwide dataset of humans, 3,965 humans from several locations worldwide from the POPRES dataset [81], and 263 Medicago truncatula from 24 countries around the Mediterranean basin a range-wide dataset of the partially selfing weedy annual plant from the Medicago truncatula Hapmap Project [107], as summarized in Table 2.2. Drosophila melanogaster: We used whole-genome sequencing data from the Drosophila Genome Nexus (http://www.johnpool.net/genomes.html, [62]), consisting of the Drosophila Population Genomics Project phases 1- 3 [63, 94], and additional African genomes [62]. After removing 20 genomes with more than 8% 24 missing data, we were left with 380 samples from 16 countries across Africa and Europe. Since the Drosophila samples are from inbred lines or haploid embryos, we treat the samples as haploid when recoding; regions with residual heterozygosity were marked as missing in the original dataset; we also removed positions with more than 20% missing data. Each chromosome arm we investigated (X, 21, 2R, 31, and 3R) has 2-3 million SNPs; PCA plots for each arm are shown in Figure 2.2. Human: We also used genomic data from the entire POPRES dataset [81], which has array-derived genotype information for 447,267 SNPs across the 22 autosomes of 3,965 samples in total: 346 African-Americans, 73 Asians, 3,187 Europeans and 359 Indian Asians. Since these data derive from genotyping arrays, the SNP density is much lower than the other datasets, which are each derived from whole genome sequencing. We excluded the sex chromosomes and the mitochondria. PCA plots for each chromosome, separately, are shown in Figure 2.3. M edicago truncatula: Finally, we used whole-genome sequencing data from the Medicago truncatula Hapmap Project [107], which has 263 samples from 24 countries, primarily distributed around the Mediterranean basin. Each of the 8 chromosomes has 3-5 million SNPs; PCA plots for these are shown in Figure 2.4. We did not use the mitochondria or chloroplasts. Data access The methods described here are implemented in an open-source R package available at https: / /gi thub. com/petrelharp/local_pca, as well as scripts to perform all analyses from VCF files at various parameter settings. Datasets are available as follows: human (POPRES) at dbGaP with acces sion number phs000145.v4.p2, Medicago at the Medicago Hapmap http://www. 25 species # SNPs mean window mean # windows mean % variance per length (bp) per chromosome ex- window plained by top 2 Drosophila 1,000 9,019 2,674 0.53 melanogaster Human 100 636,494 203 0.55 Medicago 10,000 102,580 467 0.50 truncatula Table 2.2 Descriptive statistics for each dataset used. medi cagohapmap. org/, and Drosophila at the Drosophila Genome Nexus, http: //www.johnpool.net/genomes.html. 26 N u c.. N u c.. N u c.. I.() 0 I.() 0 0 I.() C! 0 I 0 0 0 0 0 I 0 2L • • . . , •• . • ;;. * . ,. ... # . . . -0.05 0.00 0.05 0.10 PC1 3L •• . . . .,. ~ . .. ';•. -0.05 0.00 0.05 0.10 0.15 PC1 X ~ - .. . . ... . . • . "• .... , .... 0 - 0 .,. . ., .. - ~ ''\·"-.r·'• - 0 0 - I I I I I I -0.15 -0.10 -0.05 0.00 0.05 PC1 I.() 0 0 I.() N 0 u 0 c.. I I.() 0 I 0 0 N ~ g 0 0 0 I 2R ~ -:, .. •• • •• ...... •• • . .. ,, ... -0.10 0 0.00 0.05 0.10 0.15 PC1 3R ., .. ... • • ••• • l • . ... . . - ; • ••. -.;. If' - . .. ..... ..... -- ~ .. ... : , Je• • • - ·\Ii =· . . -0.05 0.00 PC1 0.05 o Cameroon • MW I!> Congo • Nigeria + Egypt ll Rwanda Ethiopia South_Africa O France • Tanzania v Gabon a Uganda Guinea Zambia * Kenya • Zimbabwe Figure 2.2 PCA plots for chromosome arms 21, 2R, 31, 3R and X of the Drosophila melanogaster dataset. 27 1 2 3 4 w ~ W u w ~ (0 C, • 0 0 0 NO NO NO NO () () . ~ ... () a.. • • a.. ., • a.. N • I • N • N • • N 0 • C, • 0 , 0 0 0 0 0 I I I I -0.06 -0.02 -0.06 -0.02 -0.06 -0.02 -0.06 -0.02 PC1 PC1 PC1 PC1 5 6 7 8 ~~~ 8~u ~~ w ~ 0 NO NO ~ . () . () . .. a.. -. a.. • • a.. C\J • • • N - • N • I ~ . ·i C, • C, 0 .. . 0 0 0 0 I I I I - 0.06 - 0.02 - 0.06 - 0.02 - 0.06 - 0.02 - 0.06 - 0.02 PC1 PC1 PC1 PC1 9 10 11 12 w ~ ~~ w g (0 0 • 0 0 Nci • NO NO NO () .. () · .i" () () a.. C\f .. ... a.. C\J • • • • a.. - a.. C\J • • • N 0 • 0 ~ .. 0 0 0 0 0 I I I I - 0.06 - 0.02 - 0.06 - 0.02 - 0.06 - 0.02 - 0.06 - 0.02 PC1 PC1 PC1 PC1 13 14 15 16 w ~ w ~ (0 (0 0 0 0 0 NO C\J C) • NO NO ~ ~ • ! •· () .. () () a..~ • •• Q..N a.. N 0 0 0 0 I I - 0.06 0.00 - 0.06 - 0.02 - 0.06 - 0.02 - 0.06 - 0.02 PC1 PC1 PC1 PC1 17 18 19 20 w ~ ~:~ w ~ N 0 0 Ne, NO No ()0 () . () . : , a.. a..~ J ;;,t,4• a. C\I • • • "<I" 0 C, • 0 0 I I - 0.06 0.00 - 0.06 - 0.02 - 0.06 0.00 - 0.06 - 0.02 PC1 PC1 PC1 PC1 21 22 w ~ ~~~~-. · 1 • African-American 0 No , N • • • Asian () I • • () ... • European 0.. .... II a.. (0 • lndianAsian N • 0 0 • Latin American 0 0 I I - 0.06 0.00 - 0.06 0.00 PC1 PC1 Figure 2.3 PCA plots for all 22 human autosomes from the POPRES data. 28 0 N ci N u"' (l_ 0 ci ;c ci I 0 N ci I -0.05 -0.05 1 0.05 PC1 5 :t. ~ -·· . ' .. , 0.05 PC1 0 N ci N ,n 0 0 (l_ ci ;c ci I "' N 0 u ci o._ I 0 N ci I -0.05 -0.05 2 0.05 PC1 6 + ., .Att . .... . ~ . . .. \, •°t 0 0.05 PC1 ';2 ci N ,n 0 0 (l_ ci ;c ci I -0.05 -0.05 3 0.05 PC1 7 0.05 PC1 0 N ci N ,n u 0 (l_ ci ;c ci I -005 -0.05 4 0.05 PC1 8 0.05 PC1 o Algeria Jordan 6 Australia Lebanon + blank • Libya X Canada • Morocco Cyprus A Poland France • Portugal France_Corsica • Spain • Germany • Sweden • Greece o Syria • Greece_Crete D Tunisia 0 Israel O Turkey Italy A U.S Figure 2.4 PCA plots for all 8 chromosomes m the Medicago truncatula dataset. 29 2.3 Results In all three datasets: a worldwide sample of humans, African Drosophila melanogaster, and a rangewide sample of M edicago truncatula, PCA plots vary along the genome in a systematic way, showing strong chromosome-scale correla tions. This implies that variation is due to meaningful heterogeneity in a biological process, since noise due to randomness in choice of local genealogical trees is not expected to show long distance correlations. Below, we discuss the results and likely underlying causes. 2.3.1 Validation Simple non-population-based simulations with Gaussian "genotypes" showed that the method performs as expected, clearly separating regions of the genome with different underlying covariance matrices without being affected by extreme differences in amount of missing data (Figure 2.5). This simulation also verifies insensitivity to ordering of top PCs, since it was performed using a covariance matrix with the top two eigenvalues equal, so that the order of empirical eigenvectors (PCs) switches randomly. Individual-based simulations using SLiM [40] allowed us to test the effects of recombination and mutation rate variation, as well as linked selection. As expected, varying recombination rate stepwise by a factor of 64 did not induce patterns in the MDS visualizations correlated with recombination rate (Figure 2.6). Since varying mutation rate with a fixed recombination map is equivalent to varying the recombination map and remapping windows, this also indicates that the method is not misled by variation in mutation rate. On the other hand, a 30 N <I) iii C: ~ 0 0 (.) Cf) 0 ~ N <I) iii C: E 0 0 (.) Cf) 0 ~ .. . . N MOS coordinate 1 N MOS coordinate 1 . . . . . . ... ,.-- . .... . ... ·.• ,t111t:•.• ,. . ·- . . ·, - - ·.---· .· .-.¥ ,,.,, ..••. "' • • .:,:,, .. Y e• , 111 • chromosome matrix . , . • .,.. • .,. '!'\,, ..... ,,. ... . ,-. ... ·-.. . • ., ... ·-· _ ,,,,,_ . . . • ··: , .. , _ . • chromosome matrix . . ...... ... , ...... . . . . . . ........... . ··-- Figure 2.5 MDS visualizations of the Gaussian genotypes described in Gaussian simulation, for 50 individuals from each of three populations. (top) The first quarter, middle half, and final quarter of the chromosome each have different population structure, as expected, despite the possibility for PC switching within each. (bottom) The same picture results even after marking a random 50% of the genotypes in the first half of the chromosome as missing. recombination map with hotspots ( the Hap Map human female map for chromo some 7 [ 49]) induced outliers at long regions of low recombination rate (also as expected) . Simulations with linked selection produced mixed results (Figure 2. 7). The method strongly identified the regions under spatially varying linked selec tion ( also our simulation with strongest positive selection). It also identified the 31 • . . . \ . . . ~-.,, . .. -.. . . ~ , :• " ".-.: . . , . ... , • --~ ~-.· • • ... ,,.. . ~;: : . .. . . : .,,. "- , .... ·. ·~ -:--:.,·: . . . - . . , MOS coordinate 1 MOS coordinate 1 1 • i,.. MOS coordinate 1 comer1 I $ Y, t I. £,,., - .. ~ -~~-~ . ~~ ~ . . .. , ;,,- • .:. '•J • ! : - -4, ,, ~ .. .. •"" chromosome 1 . .. : . . . ~; . :·;. . . . .. , .. :: ,: ,.·\/ 1 •·\ t ··tyf. 'f.U. ·. " :=it', ... : '!)l,:,;.·~;·.: ~~ fr._;,a.;~• ~ . ,;,_• ~~ t!.•.u._ :,.~~7l ··,: · ,1 .. • ,~;. r · ii~~~ · ~~ .•.. • ~'i • •• !• ! . ~ .. .. .. -: . .. . . !· :. • , chromosome 1 • ~ . ~~:A~ -is.. -~ ~ ;i_~_.___:i.. _ _,_ ~ .-;,~..... tv..,,,.,,'!"tl • • ,I . ,. ~~-!•.r.~~ V.£'f.¥.rA~ chromosome 1 comer3 Figure 2.6 MDS visualizations of the results of individual-based neutral simulations using SLiM. All simulations are neutral, and recombination is: (top) constant; (top middle) varies stepwise by factors of two in seven equal-length segments, with highest rates on the ends, so the middle segment has a recombination rate 64 times lower than the ends; (bottom middle) according to the HapMap human female chromosome 7 map. The bottom figure shows PCA maps corresponding to the three colored windows of the last (HapMap) situation; the outlying regions are long regions of low recombination rate, so that region can be dominated by a few correlated trees, similar to an inversion. The (inset) provides a key to the locations of the individuals on the spatial lanscape. 32 MOS coordinate 1 ,· .. . ... MOS coordinate 1 ... MOS coordinate 1 comer1 'I, • :,~··· . ., • . I- ,. . . . · . ~ : . . t . . J" .. · . .. .. ' . .. l>. • • " . "'• .. . . t IV ec, ~ • • : ; • , • ."!1 :! • · .• • i ·. 1."• ~~dt -...;.,,_ \.."..;..~ .• ' t~ • .. , ,~· •: . ,.,.; ::,,~. I • :J:O:• if&"I l""• ~ • -~~~ •~ ". ~~ ,1~ ); ¥-'!5• ~=--', ".~e-,,;• 1 .-t:1{- : ·: • -: ' ,. • ·• .... .. ! • : ; chromosome 1 . .. . . . ·. . . . ... . . . · f~~- . , ,, I• ;! ~~ • ,~ : ~ :__.i~, ,Z\ .~l'I r, ~ , . \",r t, ~~. . . ~. r ~ ... ... . . "' ~ ... corner2 • . 't:' ·: . . • h. . J •• • • ••.. .: ' :, • !, • 1 ... • ' f chromosome 1 chromosome 1 comer3 ..,.0 :r(I · t oo -i.• • 'l, : : - • I ! ~ o .. ~ .. ,/ .l . . . . . ' \ ·: Figure 2. 7 MDS visualizations of the results of individual-based simulations with linked selection using SLiM. All simulations incorporate linked selection by allow ing selected mutations to appear in the same two regions of the genome: the one sixth of the genome immediately before the halfway point, and the last one-sixth of the genome. (top) Constant recombination rate. ( top middle) Stepwise vary ing recombination rate (as described in Figure 2.6). (bottom middle) Constant recombination rate with spatially varying effects of selection. (bottom) PCA plots corresponding to the highlighted corners of the last MDS visualization, showing how spatially varying linked selection has affected patterns of relatedness. The (inset) provides a key to the locations of the individuals on the spatial lanscape. 33 regions ( although less unambiguously) with constant selection and stepwise varying recombination rate, but did not clearly identify them with constant recombination rate. This difference is likely because recombination rates are overall lower in the first case, leading to a stronger effect of linked selection. These tests are not meant to be comprehensive survey of linked selection, but only to demonstrate that linked selection can produce signals similar to what we see in real data. 2.3.2 Drosophila melanogaster We applied the method to windows of average length 9 Kbp, across chromosome arms 2L, 2R, 3L, 3R and X separately. The first column of Figure 2.8 is a mul tidimensional scaling (MDS) visualization of the matrix of dissimilarities between genomic windows: in other words, genomic windows that are closer to each other in the MDS plot show more similar patterns of relatedness. For each chromosome arm, the MDS visualization roughly resembles a triangle, sometimes with addi tional points. Since the relative position of each window in this plot shows the similarity between windows, this suggests that there are at least three extreme manifestations of population structure typified by windows found in the "corners" of the figure, and that other windows' patterns of relatedness may be a mixture of those extremes. The next two columns of Figure 2.8 respectively depict the two MDS coordinates of each window, plotted against the window's position along the genome, to show how the plot of the first column is laid out along the genome. The patterns did not depend on the number of PCs used (see Figure 2.9 for the same plot with k = 5 PCs), and are only weakly correlated with variation in missingness (see Figure 2.10). To help visualize how clustered windows with similar patterns of relatedness are along each chromosome arm, we selected three "extreme" windows in the MDS 34 N N * N Q) Q) N 0 1ii 1ii 0 C C C ...J ~ ~ 0 ~ C"II 0 0 0 0 0 0 0 0 0 0 0 (.) (.) (.) (/) (/) st (/) 0 0 0 :a: "' :a: 0 :a: "' 0 I 0 I I -0.4 0.0 0.2 0 5 10 15 20 0 5 10 15 20 N N * "' .'!l * "' 0 "' "' 0 C C 0 C cc ~ ~ ~ C"II 0 0 0 0 0 0 0 0 0 (.) (.) (.) (/) (/) (/) 0 0 0 0 :a: 0 :a: :a: 0 I I I -0.1 0.1 0.3 5 10 15 20 5 10 15 20 N N .'!l 0 Q) st Q) 0 "' 1ii 0 1ii C C C ...J ~ 0 ~ ~ 0 M 0 0 0 0 I 0 0 0 I (.) (.) (.) (/) "' (/) (/) "' 0 0 0 :a: 0 :a: N :a: 0 I 0 I I - 0.2 0.0 0.2 0.4 CD 0 5 10 15 20 0 5 10 15 20 N ~ 0 N * N * * N C 0 C C 0 cc ~ ~ N ~ M 0 0 0 0 0 0 0 0 \ 0 0 0 (.) (.) (.) (/) N (/) (/) N 0 0 0 :a: 0 :a: N :a: 0 I 0 I I -0.2 0.0 0.2 0.4 0.6 0 5 10 15 20 25 0 5 10 15 20 25 N ;:' N * 0 Q) * 0 1ii 0 C C 0 C >< ~ 0 ~ ~ 0 0 0 0 0 I 0 0 I (.) (.) (.) (/) "' (/) ": (/) "' 0 0 0 0 0 0 :a: I :a: I :a: I - 0.4 - 0.2 0.0 0.2 0 5 10 15 20 0 5 10 15 20 MDS coordinate 1 position (Mbp) position (Mbp) Figure 2.8 Variation in patterns of relatedness for windows across Drosophila melanogaster chromosome arms. In all plots, each point represents one window along the genome. The first column shows the MDS visualization of relationships between windows, and the second and t hird columns show t he two MDS coordi nates against t he midpoint of each window; rows correspond to chromosome arms. Colors are consistent for plots in each row. Vertical lines show t he breakpoints of known polymorphic inversions. Solid black lines are for t he inversions we used in Figure 2. 11 , while dotted grey lines are for other known inversions. 35 plot and the 5% of windows that are closest to it in the MDS coordinates, then highlighted these windows' positions along the genome, and created PCA plots for the windows, combined. Representative plots are shown for three groups of windows on each chromosome arm in Figure 2.8 (groups are shown in color), and in Figure 2.12 (PCA plots). The latter plots are quite different , showing that genomic windows in different regions of the MDS plot indeed show quite different patterns of relatedness. The most striking variation in patterns of relatedness turns out to be explained by several large inversions that are polymorphic in these samples, discussed in Corbett-Detig and Hartl [24] and Langley et al. [63]. To depict this, Figure 2.11 shows the PCA plots in Figure 2.12 recolored by the orientation of the inversion for each sample. Taking chromosome arm 2L as an example, the two regions of similar, extreme patterns of relatedness shown in green in the first row of Figure 2.8 lie directly around the breakpoints of the inversion In(2L)t, and the PCA plots in the first rows of Figure 2.11 shows that patterns of relatedness here are mostly determined by inversion orientation. The regions shown in purple on chromosome 2L lie near the centromere, and have patterns of relatedness reflective of two axes of variation, seen in Figure 2.11 and Figure 2.12, which correspond roughly to latitude within Africa and to degree of cosmopolitan admixture respectively (see Lack et al. [62] for more about admixture in this sample) . The regions shown in orange on chro mosome 2L mostly lie inside the inversion, and show patterns of relatedness that are a mixture between the other two, as expected due to recombination within the (long) inversion [38]. Similar results are found in other chromosome arms, albeit complicated by the coexistence of more than one polymorphic inversion; however, 36 each breakpoint visibly affects patterns in the MDS coordinates (see vertical lines in Figure 2.8). To see how patterns of relatedness vary in the absence of polymorphic inver sions, we performed the same analyses after removing, for each chromosome arm, any samples carrying inversions on that arm. In the result, shown in Figure 2.13 and Figure 2.14, the striking peaks associated with inversion breakpoints are gone, and previously smaller-scale variation now dominates the MDS visualization. For instance, the majority of the variation along 31 in Figure 2.8 is on the left end of the arm, dominated by two large peaks around the inversion breakpoints; there is also a relatively small dip on the right end of the arm ( near the centromere). In contrast, Figure 2.14 shows that after removing polymorphic inversions, remain ing structure is dominated by the dip near the centromere. Without inversions, variation in patterns of relatedness shown in the MDS plots follows similar pat terns to that previously seen in D. melanogaster recombination rate and diversity [63, 73]. Indeed, correlations between the recombination rate in each window and the position on the first MDS coordinate are highly significant (Spearman's p = 0.54, p < 2 x 10- 16 ; Figures 2.13 and 2.15). This is consistent with the hypothesis that variation is due to selection, since the strength of linked selection increases with local gene density, measured in units of recombination distance. The number of genes - measured as the number of transcription start and end sites within each window - was not significantly correlated with MDS coordinate (p = 0.22). 37 2.3.3 Human As we did for the Drosophila data, we applied our method separately to all 22 human autosomes. On each, variation in patterns of relatedness was domi nated by a small number of windows having similar patterns of relatedness to each other that differed dramatically from the rest of the chromosome. These may be primarily inversions: outlying windows coincide with three of the six large poly morphic inversions described in Antonacci et al. [3], notably a particularly large, polymorphic inversion on 8p23 (Figure 2.16). Similar plots for all chromosomes are shown in Figures 2.17, 2.18, and 2.19. PCA plots of many outlying windows show a characteristic trimodal shape (shown for chromosome 8 in Figure 2.20), presumably distinguishing samples hav ing each of the three diploid genotypes for each inversion orientation ( although we do not have data on orientation status). This trimodal shape has been proposed as a method to identify inversions [72] , but distinguishing this hypothesis from others, such as regions of low recombination rate, would require additional data. We also applied the method on all 22 autosomes together, and found that, remarkably, the inversion on chromosome 8 is still the most striking outlying signal (Figure 2.21). Further investigation with a denser set of SNPs, allowing a finer genomic resolution, may yield other patterns. 2.3.4 Medicago truncatula Unlike the other two species, the method applied separately on all eight chromo somes of Medicago truncatula showed similar patterns of gradual change in patterns of relatedness across each chromosome, with no indications of chromosome-specific 38 patterns. This consistency suggests that the factor affecting the population struc ture for each chromosome is the same, as might be caused by varying strengths of linked selection. To verify that variation in the effects of population structure is shared across chromosomes, we applied the method to all chromosomes together. Results for chromosome 3 are shown in Figure 2.24, and other chromosomes are similar: across chromosomes, the high values of the first MDS coordinate coin cide with the position of the heterochromatic regions surrounding the centromere, which often have lower gene density and may therefore be less subject to linked selection. To verify that this is a possible explanation, we counted the number of genes found in each window using gene models in Mt4.0 from j cvi. org [107] , which are shown juxtaposed with the first MDS coordinate of each window in Figure 2.25, and are significantly correlated, as shown in Figure 2.26. (Values shown are the number of start and end positions of each predicted mRNA transcript, divided by two, assigned to the nearest window.) However, other genomic features, such as distance to centromere show roughly the same patterns, so we cannot rule out alternative hypotheses. In particular, fine-scale recombination rate estimates are not available in a form mappable to Mt4.0 coor dinates ( although those in Paape et al. [87] appear visually similar) . The results were highly consistent across window sizes, window types (SNPs or bp) , and number of PCs, as shown in Table 2.3. 39 10000 SNPs 1000 SNPs 10000 SNPs lO0000bp lO000bp MDSl 2 PCs 2 PCs 5 PCs 2 PCs 2 PCs 10000 SNPs, 2 PCs 1.00 0.87 0.96 0.90 0.88 1000 SNPs, 2 PCs 0.68 1.00 0.73 0.68 0.94 10000 SNPs, 5 PCs 0.96 0.92 1.00 0.88 0.93 lO0000bp, 2 PCs 0.90 0.87 0.88 1.00 0.87 lO000bp, 2 PCs 0.68 0.93 0.72 0.67 1.00 MDS2 10000 SNPs, 2 PCs 1.00 0.54 0.93 0.87 0.56 1000 SNPs, 2 PCs 0.82 1.00 0.76 0.83 0.92 10000 SNPs, 5 PCs 0.93 0.50 1.00 0.83 0.52 lO0000bp, 2 PCs 0.87 0.59 0.84 1.00 0.58 lO000bp, 2 PCs 0.83 0.92 0.77 0.84 1.00 Table 2.3 Correlations between MDS coordinates of genomic regions between runs with different parameter values. To produce these, we first ran the algorithm with the specified window size and number of PCs (k in equation (2.1)) on the full Medicago truncatula dataset. Then to obtain the correlation between results obtained from parameters A in the row of the matrix above and parameters B in the column of the matrix above, we mapped the windows of B to those of A by averaging MDS coordinates of any windows of B whose midpoints lay in the corresponding window of A; we then computed the correlation between the MDS coordinates of A and the averaged MDS coordinates of B. This is not a symmetric operation, so these matrices are not symmetric. As expected, parameter values with smaller windows produce noisier estimates, but plots of MDS values along the genome are visually very similar. 40 (") (") N ci N ci 2 ., 2 N 2 ro ro ci ro C , C C ...J E ci E E ci N 0 0 ~ 0 0 0 0 (.) (.) 0 (.) (/) (/) (/) 0 ci 0 "! 0 ci 2 I 2 0 2 I I -0.2 0.0 0.2 0 5 10 15 20 0 5 10 15 20 N N 2 2 2 ro LO ro ro LO C 0 C C 0 et: E ci E E ci 0 I 0 ci 0 I N 0 0 0 (.) (.) (.) (/) LO (/) (/) LO 0 N 0 0 N 2 ci 2 ci 2 ci I I I -0.1 0.0 0.1 0.2 5 10 15 20 5 10 15 20 N 0 N 0 Q) N Q) N Q) N ro ci ro ci ro ci C C C ...J E LO E E LO ('I) 0 0 0 ~ 0 0 0 ci 0 0 ci (.) (.) 0 (.) (/) (/) (/) 0 0 0 0 0 2 2 N 2 ci ci ci I I I -0.2 0.0 0.1 0.2 0 5 10 15 20 0 5 10 15 20 0 0 N N Q) ci Q) Q) ci ro ro ~ ro C C 0 C et: E LO E E LO 0 -· 0 ('I) 0 ci 0 0 ci 0 0 0 (.) I (.) ci (.) I (/) (/) (/) 0 0 0 0 0 2 N 2 ci 2 N ci I ci I I -0.1 0.1 0.3 0 5 10 15 20 25 0 5 10 15 20 25 N N Q) Q) 0 Q) ro ro ro C C ci C E 0 E E 0 >< 0 ci 0 "! 0 ci 0 0 0 (.) (.) 0 (.) (/) (/) I (/) 0 N 0 st 0 N 2 ci 2 ci 2 ci I I I -0.4 -0.2 0.0 0 5 10 15 20 0 5 10 15 20 MOS coordinate 1 position (Mbp) position (Mbp) Figure 2.9 MDS visualizations for each chromosome arm of Drosophila melanogaster, as in Figure 2.8, except that the method was run using five PCs (k = 5) instead of two. 41 00 ~t;~~-~~J I Ci I I I I I df\· . · · I I I I I 0 5 10 15 20 5 10 15 20 0 5 10 15 20 <I) ~ ~ I ~ ci 0::: 6> • • • • M C M ••• • ~ : ~.til ·,,,.,.,_ Ci I I I I I I <I) <I) Q) C: "' >< . [ ~ <I) E 0 0 5 10 15 20 25 ~ ~~-----.-----,~-.-=-=---~~ 0 5 10 15 20 position (Mbp) 0.00 0.02 0.04 0.06 0.08 0.10 0.12 ; ~1· I ro M • • ~ ci • ' 8 ; .. . (/) .: ~ · ~ ? , .. .:..- .. I I I I I 0.00 0.02 0.04 0.06 0.08 0.10 0.12 <D 0.00 0.02 0.04 0.06 0.08 0.10 0.12 i: ~I:.·} . ····· § -;I_.•:. ~ - ..... :,; N • ? I I I I I 0.00 0.02 0.04 0.06 0.08 0.10 0.12 ~ -. * ~ - :t:!:. ~ 0 ••• •• § - • (j) "' • 0 ? - ••• :;; .-. ..... ,~-.,--.,-"'T,-"'T,---r,--r', 0.00 0.02 0.04 0.06 0.08 0.10 0.12 missingness Figure 2.10 The proportion of data in each window that are missing, compared to the value of the first MDS coordinate for the Drosophila m elanogaster data from Figure 2.8. 42 green orange purple LO [] 0 0 6 6 !6 0 6 C\J LO ~c:i C\J LO 0 ..J & ..... uC! N T (.) 0 "' C\I c.. . c.. .. ,. ~ c.. 9 "' 0 ... ._,. "' "' LO LO 0 0 0 0 0 C\J I I 0 I -0.04 0.02 0.06 0.10 -0.10 0.00 0.10 -0.06 0.00 0.04 PC11 2L PC21 2L PC31 2L 0 0 .., 00 0 0 a: C\J C\J 0 C\J LO (.) (.) (.) C! C\I c.. c.. c.. 0 LO LO 0 0 0 I 0 I 0 I -0.15 -0.05 -0.20 -0.10 0.00 -0.05 0.00 0.05 0 0 0 0 LO ..J C\J~ C\JO C\10 ~c:i (.) . (.) 0 C") c.. 0 c.. I - 0 LO 0 0 0 0 I C\J I 0 I -0.15 -0.05 -0.10 0.00 -0.05 0.00 0.05 0.10 LO 0 0 "'.._, C\J LO 0 a: C\J LO C\J 0 IJ. A (.) C! (.) C! (.) C") c.. 0 c.. 0 c.. .. "' ~ LO 0 A "' LO 0 0 0 0 0 I I I -0.05 0.00 0.05 -0.15 -0.05 0.05 -0.05 0.05 0.10 0 0 LO 0 LO C\J LO C\J LO C\l"-: X (.) C! 66 (.) 0 (.) 0 c.. 9 A c.. . c.. 46 0 0 0 i LO 0 C\J 0 0 0 0 0 I I -0.10 -0.04 0.02 -0.20 -0.10 0.00 -0.20 -0.10 0.00 PC1 PC1 PC1 Figure 2.11 PCA plots for the three sets of genomic windows colored in Figure 2.8, on each chromosome arm of Drosophila melanogaster. In all plots, each point represents a sample. The first column shows the combined PCA plot for windows whose points are colored green in Figure 2.8; the second is for orange windows; and third is for purple windows. In each, samples are colored by orientation of the polymorphic inversions In(2L)t, In(2R)NS, In(3L)OK, In(3R)K and In(l)A respectively (data from [62]). In each "INV" denotes an inverted genotype, "ST" denotes the standard orientation, and "N" denotes unknown. 43 ..J C\J N a: a: C\J N a: ..J C\J C") a: a: C\J C") a: 0 0 0 I!) 0 I 0 0 I!) ~ 0 I I!) 0 0 0 0 I 0 0 >< ~ 0 0 C\J 0 I green -0.04 0.02 0.08 -0.15 -0.05 • -0.15 -0.05 • • ... ~ .. • f • .. . . ,_.. ..... , ~ -0.05 0.05 ..... ~ ... . . ~ - •• • .. . ' .. • -0.10 -0.04 0.02 PC1 orange I!) .. 0 I!) 0 0 I!) ~ 0 I 0 0 I!) ~ 0 I 0 -0.20 0 'It.· -0.05 0.05 0 0 0 0 0 I I!) 0 I!) 0 0 I!) 0 0 I 0 0 I!) 0 0 I -0.10 0.00 ""· ··1 . :, .... v , ~ , ., .. . ~ . :\7~ -0.15 -0.05 0.05 .,. • tC -0.20 -0.05 PC1 I!) ~ 0 I 0 C\J 0 I I!) 0 0 0 0 I I!) 0 0 I 0 C\J 0 I 0 ~ 0 I!) 0 I I!) 0 0 0 0 purple • . . ..... . . -...- • • . . • -0.06 0.00 0.06 ,.. • . . v.. . V • .t.\• ..... ., ......... .. ... -~ :.r.· .. ~-, -0.05 • ... • II! , .. :. . .... • -0.05 0.05 0.05 ' -0.20 -0.10 0.00 PC1 o Cameroon t. Congo Egypt Ethiopia France v Gabon Guinea * Kenya Malawi • Nigeria Jill Rwanda South Africa Tanzania a Uganda • Zambia • Zimbabwe Figure 2.12 PCA plots for the three sets of genomic windows colored in Figure 2.8, on each chromosome arm of Drosophila melanogaster. In all plots, each point represents a sample. The first column shows the combined PCA plot for windows whose points are colored green in Figure 2.8; the second is for orange windows; and the third is for purple windows. 44 chr2L chr2R chr3L chr3R chrX - J;j,; ,: -.;~ ~.l~"; • : ,. ~ (w;~ ~~,.. ,,: .. J."='::,'i~-. 'llt~_· ... •. •.· .,~( ... ·,) . ' •• ~.··. • ...... ~~. Cl) 0 ... :, • •• , ~- .~~ ,.,,_ ... ~ ..... • • •;.n::·. o O - • • ., -.g:.. : . - 1 • • \ ; ·: • • ,. ' ... ~ •• •• , ~-i.· ::ie I i, tl 1•• • • • f• • •• " 1. .";.:. 1 ,•.'< • "'-"" " ~-. ¥, ;l.;.t : "£:. . J..· ~ . "•'!II ,?=- r ? -~---~ '_ r_ ·_ ·~ ~-· ~ ·----~ ~----,~~ ~- · ~ ·----~ ~ · ~ ·----~ 5 - -~: .. ~ 0 - ·:-·, • ..... :0 N (," ::,1:0::t:. 1 ."! E - ,~ •• , ;,_•• •~~- • 8 0 - ; . • • -.- f \ ... ~ ~ .... - • I • . . .. : . ·:--: . ·.:-: ;:~;_:.:j -··· . .. ·,•:'};- ~· ) "?t\; :;,;,1-. .. .. .. . • . : • __, : .. · .· .. :. :-· . ~ ' . § 55 .- ~.-... ~..... •.• : . . ..... ,,, ...... ,-• .-!""'- .- u o • •• ff • ••• • •••••• ··~· : -:'"t. ~: ... ·-,-1-··1:. .:,.•, •-;;: · .• ~ .,. ..... • <L> 0 , ..... ,... .... .. ... '\ ~--~ ... ; .. Ca, N ;,4•• ,•.•,,: ;•-j.,,.~'_( .. }, :•I\••: •• • •••• •'••• ,• • ""-" • .-~.:.-. e :t'\ •••-.•• Cl O -,,._~• ~• -• ~•- • .. ~•- • :~•-,......,..:» Y.•t 1\ • • •• I I I I I -: ..... . . .. . ,, . ... :-.:~ . .it,·:· .. , ~ '~"··- ·. -·'iL" ~,.r. 'Od · i ~,-. • ' .~ ...... ••• • I• :,. ~~-: •r-· ;.=. ~·~ •,1 I ••si:•',• .. ~,-,•: t,y.. /#OJ \",-. ..... \. • .~ .. : ..... ,....,,. .... q-.• •• ...... -.--•. ,, 2-•• : _.,. · .. :._.4 ':• .. " ·.• ... -:1:-:. I I I I I I I I I I I 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 25 0 5 10 20 0 5 10 15 20 position (Mbp) position (Mbp) position (Mbp) position (Mbp) position (Mbp) Figure 2.13 The effects of population structure without inversions is correlated to recombination rate in Drosophila melanogaster. The first plot (in red) shows the first MDS coordinate along the genome for windows of 10,000 SNPs, obtained after removing samples with inversions. (A plot analogous to Figure 2.8 is shown in Figure 2.14.) The second plot (in blue) shows local average recombination rates in cM/Mbp, obtained as midpoint estimates for lOOKbp windows from the Drosophila recombination rate calculator [33] release 5, using rates from Comeron et al. [22]. The third plot (in black) shows the number of genes' transcription start and end sites within each lOOKbp window, divided by two. Transcription start and end sites were obtained from the RefGene table from the UCSC browser. The histone gene cluster on chromosome arm 21 is excluded. 45 N N (I) ~ (I) (I) ~ <ii 0 <ii <ii 0 C: C: N C: ...J =e ~ =e c:i =e ~ N 0 c:i 0 0 c:i 0 8 8 u I I (/) (/) (/) Cl (') Cl N Cl (') :i: c:i :i: 0 :i: c:i I I I -0.2 0.0 0.2 0.4 0 5 10 15 20 0 5 10 15 20 N N * 0 * N * 0 C: C: 0 C: cc: ~ ~ ~ 0 ~ ~ N 0 0 0 0 0 0 0 0 0 u I u u I (/) (/) N (/) Cl (') Cl c:i Cl (') :i: 0 ' :i: I :i: 0 I I -0.2 0.0 0.2 5 10 15 20 5 10 15 20 ~ N 0 N * N * * N c:i c:i C: C: ~ C: ...J ~ ~ c:i ~ ('I) 0 0 0 I 0 0 0 0 0 0 0 u u u (/) (/) (') (/) Cl N Cl 0 Cl N :i: 0 :i: I :i: 0 I I -0.3 -0.1 0.1 0 5 10 15 20 25 0 5 10 15 20 25 N N (I) N (I) ""' (I) N <ii 0 <ii 0 <ii 0 C: C: C: cc: =e =e =e ('I) 0 0 0 ~ 0 0 0 c:i 0 c:i 0 c:i u u u (/) (/) (/) Cl N Cl N Cl N :i: c:i :i: :i: c:i I 0 I I -0.2 0.0 0.2 0.4 0 5 10 15 20 25 0 5 10 15 20 25 N N * N * * N 0 0 0 C: C: 0 C: >< ~ 0 ~ ~ 0 0 0 0 0 0 0 0 0 u u u (/) N (/) ""' (/) N Cl 0 Cl 0 Cl 0 :i: I :i: I :i: I -0.4 -0.2 0.0 0.2 0 5 10 15 20 0 5 10 15 20 MDS coordinate 1 position (Mbp) position (Mbp) Figure 2.14 Variation in structure for windows of 1,000 SNPs across Drosophila melanogaster chromosome arms: without inversions. As in Figure 2.8, but after omitting for each chromosome arm individuals carrying the less frequent orienta tion of any inversions on that chromosome arm. The values differ from those in 2.13 in the window size used and that some MDS values were inverted (but rela tive orientation is meaningless as chromosome arms were run separately, unlike for M edicago). In all plots, each point represents one window along the genome. The first column shows the MDS visualization of relationships between windows, and the second and third columns show the midpoint of each window against the two MDS coordinates; rows correspond to chromosome arms. Vertical lines show the breakpoints of known polymorphic inversions. 46 chr2L chr2R chr3L chr3R chrX N • 6 • Q) ro C: 0 ~ 6 • 0 0 u Cl) N ' Cl 0 ,. ~ I • 'I. • st • • 6 I 0 2 4 6 8 10 0 2 4 6 8 0 2 4 6 8 0 5 10 15 0 5 10 recomb. rate recomb. rate recomb. rate recomb. rate recomb. rate Figure 2.15 Recombination rate, and the effects of population structure for Drosophila melanogaster: this shows the first MDS coordinate and recombina tion rate (in cM/Mbp) , as in Figure 2.13, against each other. Since the windows underlying estimates of Figure 2.13 do not coincide, to obtain correlations we divided the genome into lO0Kbp bins, and for each variable (recombination rate and MDS coordinate 1) averaged the values of each overlapping bin with weight proportional to the proportion of overlap. The correlation coefficient and p-values for each linear regression are as follows: 21: correlation 0. 52, r 2 = 0.27; 2R: correlation = 0.43, r 2 = 0.18; 31: correlation = 0.47, r 2 = 0.21; 3R: correlation = 0.46, r 2 = 0.21; X: correlation = 0.50, r 2 = 0.24. 47 15 chromosome 8 N N Q) $ <D $ iii "' a "' - .£: C C °2 ..,. ~ "' ~ ..,. - 0 ci 0 ci 0 ci 0 0 0 (.) (.) (.) CJ) CJ) - CJ) Cl a • . ..... . Cl a - Cl a - 2 ci 2 a 2 ci I I I I I I 0.0 0.2 0.4 0.6 0 50 100 150 0 50 100 150 MDS coordinate1 position (Mbp) position (Mbp) chromosome 15 N N $ $ "' $ "' "' ci "' C C C ~ N ~ . : ~ N 0 ci .. 0 ci . . ... : . .. -;··· 0 ci . , 0 .. 0 ••• '· ••• # • 0 . (.) ~ ..... (.) ··•<!•'1,,;;. .. ..•.t'·.i.• (.) .t,'-""':/f' .,; ..... ~- ... . . ,. . ..., .. ., . . CJ) N CJ) CJ) N . . ' Cl ci - Cl N ' Cl ci 2 2 a 2 I I I I I I I I I -0.2 0.0 0.2 20 40 60 80 100 20 40 60 80 100 MDS coordinate1 position (Mbp) position (Mbp) chromosome 17 N ~ (X) N $ • $ a $ - .. - "' N "' "' N C ci C C ci - ~ I ~ ..,. ~ I 0 0 ci 0 - 0 0 0 (.) <D (.) (.) <D CJ) ci CJ) CJ) ci - . Cl I Cl ~ - Cl I 2 2 I I I 2 I I I 0.0 0.2 0.4 0.6 0.8 0 20 40 60 80 0 20 40 60 80 MDS coordinate1 position (Mbp) position (Mbp) Figure 2.16 Variation in structure between windows on human chromosomes 8, 15, and 17. Each point in each plot represents a window. The first column shows the MDS visualization of relationships between windows; the second and third columns show the two MDS coordinates of each window against its position (midpoint) along the chromosome. Rows, from top to bottom show chromosomes 8, 15, and 1 7. The vertical red lines show the breakpoints of known inversions from Antonacci et al. [3]. 48 ..... C\J CV) -.:I" LO co 'i' ~ 'i' 'i' ~ i 0 M I 0 - ~ 0 " 'i' w i 0 I N 0 "' 0 N " 0 ' i 0 I 0 "' N 0 'i' " i w 0 1 N 0 ~ " N 'i' I! M 1! 0 I :; ~ N " 'i' I! w ~ 0 co 'E ~ § N ~ 0 " 0 0 r···- I , .... .. I :; - ~fW'•1;,i, 0 - f~i ' 1 !~: 1!= 11,1• 1 , 0 ~\~h, • 1 • 1 11 ~;Iii:•"~'.,' .. r .. : . ... . .. - .. . I • .,• •.i•J• N ..... . . 'i' - :• I I I I I I I -0.2 -0.1 0.0 0.1 50 100 150 200 250 l .;1 0 i 0 I ~ 'i' ~ " ro I I 'i' -0.8 -0.6 -0.4 -0.2 0.0 50 100 150 200 250 ~- '· .. ] i N 0 I r,~•u~~ • i: 0 1~iif ~~~I, , ... 0 i • I~ •·• .... . ,~ · - "' 0 N .. . . . . ., " .. 'i' I I -0.2 -0.1 0.0 0.1 0.2 0.3 50 100 150 200 ~/ .. I ~ i 0 M I 0 "' :; 0 " 0 I I I ' -0.1 0.0 0.1 0.2 0.3 0.4 0.5 50 100 150 l .1 i M 0 1 :; . . . ... .•;,, ,;. ''•. ·~, ... ·~ .•. c . " ~ ,ll(l,,,1'1i-l~~a,;,,~;7<~:o,;: " 'i' I -0.1 0.0 0.1 0.2 0.3 50 100 150 ~- ·---~-ii 0 ;; 0 1! I M 'i' ~ w " 'i' I I I -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0 50 100 150 L (I I: ~ nr:-~1 -0.6 -0.4 -0.2 0.0 0 ~ 100 1~ ~~: ----.-------.--· .. __ .I Ii ~ ! ________ ] I I I ~ I I 0.0 0.1 0.2 0.3 0.4 0.5 0.6 MOS coordinate1 50 100 position (Mbp) 150 .. , p .... , ....... " ' l ......... llfl • 'i' ~ 'i' ~ 'i' 50 100 150 200 250 ~ i 0 M I 0 - ~ 0 " 'i' 50 100 150 200 250 w i 0 I N 0 • .,,...11 '" "' 0 N " 0 ' 50 100 150 200 i 0 I 0 "' N 0 'i' " 50 100 150 i w 0 1 N 0 ~ ·•--: ••--·•---'.~---- " N 'i' 50 100 150 I! 6 1! I : :; " 1111: !!~~· 1'1¥••" ';, 1 1,~1: ~ N I , • • • '• ... " 'i' 50 100 150 ,. , •.· · •.<.• l: . . : 1'\ ''Ji., .;'1•,,,~ ., 50 100 150 ................. ,, .... , . . , ... ,. ...... ,. 50 100 150 position (Mbp) Figure 2.17 MDS plots for human chromosomes 1-8. The first column shows the MDS visualization of relationships between windows, and the second and third columns show the midpoint of each window against the two MDS coordinates; rows correspond to chromosomes. Colorful vertical lines show the breakpoints of known valid inversions, while grey vertical lines show the breakpoints of predicted inversions. 49 ~ J .. .-:. .I I I I -0.2 -0.1 0.0 0.1 0.2 0.3 -0.4 -0.3 -0.2 -0.1 0.0 0.1 ~ P h -I ' "-'-r1-"T"""--ir----r--.-----.-~ 0.0 0.1 0.2 0.3 0.4 0.5 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 -0.6 -0.4 -0 .2 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0 0 N 0 ' - 0 20 40 60 80 100 120 140 . . . ' , , . l!V~~. ,,;ii, ~~-+-J,,ti$:.'.:11J1 ... ,:, ... ; •• • Iii, ·, ~ -~•,"-.I' • . . 20 40 60 80 100 120 0 20 40 60 80 100 120 20 40 60 80 100 20 40 60 80 100 I I I I -0.2 -0.1 0.0 0.1 0.2 0.3 20 40 60 80 100 ~ !: ~......-----l ----, · ,I ! ~,J I .J 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 MOS coordinate1 20 40 60 80 position (Mbp) i N 0 ~ ~ 8 "' Q " w 0 N ~ i ~ ;; 8 "' Q " ' 0 20 40 60 80 100 120 140 -~111r:;~,.-~111 0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 20 40 60 80 100 . . )1\ ... ,.,::-r:~. ; 60 80 100 . · ... . ••!1· • ·"·• ... .,,. i. ·:~.; :,. : .. ' )~: r . , I 20 40 60 80 position (Mbp) Figure 2.18 MDS plots for human chromosomes 9-16, as in Figure 2.17. 50 ~ : ~~~ ..-------.------ , -1 : U lliLl : rIT[[-l 0.0 0.2 0.4 0.6 0.8 0 20 40 60 80 20 40 60 80 ~ ! : l - .~I 1 : r . ---1 ! ; : -,--- , I I I I I -0.6 -0.4 -0.2 0.0 20 40 60 0 20 40 60 ~ ~ . ·1 i ci 76 --: ""' ,g ~ .g O • .--~· • ,,.::: • ••• ··.·. ,t . ," :::: ,§lgoo ~, · • • • g - •• ,, • • •• .. ,,, - -~ ... ._. ::, 'f • •• · • "• • • • 0 :, ~ ....,,~~~~-('--,............ 0 .,.___,..........,,.__~~~~ - ~ H I ~ 0 ' ~ 0! i; ·-~-·- ·-- ~ . ~:·- --· :--.~-, I I I I I I I -0.3 -0.2 -0.1 0.0 0.1 0.2 0 10 20 30 40 50 60 0 10 20 30 40 50 60 ~ Ii t .I! H-.-~I n ........ ,-~-~-~~ I I I I I I I N 1 1 r "' " " . . 1 H L ~-L~~-~] 1 ! r ~ r.~~-:~1 ~-•~,~~-~~~..,....,. . I I l~I I I - . I I 1-1 I I I. N ~ i ci N 'E"': N 8 ° "'., C, 0 :, 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ... 0.0 0.2 0.4 0.6 MOS coordinate1 15 20 25 30 35 40 45 N ~ i ci ~ ~ 8 "'., C, 0 -····•·..-.... . . . ----·•····· ~ ~ ' 15 20 25 30 35 40 45 position (Mbp) 15 20 25 30 35 40 45 .. ............... ·~· -•····.~ 15 20 25 30 35 40 45 position (Mbp) Figure 2.19 MDS plots for human chromosomes 17-22, as in Figure 2.17. 51 non-outliers(13-23) st 0 0 NO (.)O a. 0 c'; 0 I -0.06 -0.02 0.02 PC1 non-outliers( 46-56) c'; 0 ~ .... 0 I ••• • • -0.04 0.00 PC1 outliers(24-34) N ~:.,, N (.) a. -0.03 0.00 0.02 PC1 non-outliers(57-67) N 0 0 N 0 0 I -0.06 -0.02 0.02 PC1 non-outliers(35-45) st 0 0 N (.)0 a. 0 0 st 0 0 I - .,., I:.;"'. , - t, . . - ·. ,~ . !• • - - . I I I I I -0.06 -0.02 0.02 PC1 • African-American • Asian • European • lndianAsian • Latin American non-outliers(68-78) - 0 ,. a (j~- ii: a. c'; :·~ •• ? -~~~ I I I I I -0.06 -0.02 0.02 PC1 Figure 2.20 Comparison of PCA figures within outlying windows (center column) and flanking non-outlying windows (left and right columns) for the two windows having outlying MDS scores on chromosome 8. 52 MDS, all human autosomes LO 0 0 ...... LO (].) 0 _. ci ro C: I ,::, .... 0 0 LO <..) ...... (/) 0 0 I 2 ··..:--· ·. . ·,;. ·• ,.. • • ·1 lo •• .. 4! •• "'· .. . • • • • ••• • • • •• .. • •• •• • • • • • • •• • •• • • • • • • • "' • • • •• • • • • • •• • • • • • • • • :- • •• • • • LO N • • 0 I • 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 22 chromosome number Figure 2.21 MDS visualization of variation in the effects of population structure amongst windows across all human autosomes simultaneously. The small group of windows with positive outlying MDS values lie around the inversion at 8p23. 53 ,.... - 0.4 - 0.2 0.0 0.2 0.4 'll ~ CV) ~ ~ 8 (/) 0 <O ::; 9 -0.4 -0.2 0.0 0.2 N i d '<:j- ~ N d 8 I (/) 0 <O ::; d I -0.4 -0.2 0.0 0.2 0.4 'll ;; " - ~ - ~ N ~ ~ d LO - 8 d 8 (/) (/) N 0 N 0 9 ::; 9 ::; -0.2 0.0 0.2 0.4 I ;:; i .. d CD ~ N 'g ;:; 8 9 8 (/) (/) "' N 0 0 ::; d ::; d I I -0.2 0.0 0.2 0.4 'll i ~ d N ~ 'g d I'-- 8 ~ 8 (/) (/) ~ 0 "' 0 ::; d ::; I -0.2 0.0 0.2 0.4 <O N d " " ~ ,. - ~ ~ ~ ~ 0 CX) d 8 8 (/) (/) 0 0 ~ ::; N ::; 9 ...().4 -0.2 0.0 0.2 MO S coordinate1 10 20 30 40 50 . ' . . ~ . , ' . ..L. 10 20 30 40 ~ -,] -. - 10 20 30 40 50 10 20 30 40 50 10 20 30 40 50 10 20 30 40 posiOon (Mbp) 'll ~ ~ 8 (/) 0 ::; 'll ~ 'g 8 (/) 0 ::; 1 ~ § (/) 0 ::; (/) 0 ::; N " ~ ~ 8 (/) 0 ::; N d - 9 ;; I ~ <O 9 N d ~ <O d I "' 9 <O d N 9 0 10 20 30 40 50 ,.-- ..__,,,_ . I• • ~ ~ • • ' 10 20 30 40 ...._ -:•~ •• '. J. ' . .. 0 1020304050 . . . . ' :· ,, . . - 10 20 30 40 50 , .. . , ' ' .. - . ,, ,- . •.· - . ·.-: ·.·. ... ~ . 0 5 10 15 20 25 30 35 Jr;""'~ .. ., ., . ' ' ·- 0 10 20 30 40 50 . ., . - .. • • I ' - . ~ ' , . . ~ 0 10 20 30 40 position (Mbp) Figure 2.22 MDS visualizations of the effects of population structure for all 8 chromosomes of the Medicago truncatula data, using windows of 10 4 SNPs. 54 A B C o Algeria Jordan A Australia Lebanon + blank Libya ~ 0 Canada • Morocco Cyprus • Poland ~ France • Portugal N ~ g 0 France_Corsica • Spain ..- (.) 0 ci • Germany • Sweden Q. ci ci + • Greece o Syria 0 • Greece_Crete D Tunisia 0 ~ 0 a 1srael • Turkey Italy A U.S 0 ? ' ' ~ -0.05 0.05 0.10 -0.15 -0.05 0.05 -0.05 0.05 0.15 ~ ci ~ ci g ci N N ci 0 0 (.) 0 0 Q. ci ci ~ ~ ? ? ~ ? 0 -0.10 0.00 0.05 ~ -0.10 0.00 0.10 -0.05 0.05 0.15 ~ ci N g g (") (.) ci ci g Q. ci ~ ~ 0 + ? ~ ' ? -0.05 0.05 0.10 -0.15 -0.05 0.05 -0.05 0.05 0.15 ~ g ~ N 0 'Q" (.) 0 ci ci Q. 0 + ' ~ 0 fl 0 0 0 ' ' ' ~ -0.05 0.00 0.05 0.10 ~ -0.15 -0.05 0.05 -0.05 0.05 0.15 ~ ci I ""· 0 , .. ~ . ~ + , 0 N ~ · '· · ,~ ~ I.() (.) 0 0 g Q. ? ... ci ci fl ~ ~ ? ? 0 ' 0 -0.05 0.00 0.05 0.10 -0.15 -0.05 0.05 -0.05 0.05 0.15 N ~ -:~•'!. ci ·•. + • N 1,;.: ·,.... . .. . ~ . .,,· ..... ~ ~ 0 co (.) 0 •. , . , . .., .. · '!.-.i g ci Q. ci ci + ~ · - ~ ~ ~ ' l • : ? ? 0 ' 0 -0.05 0.05 0.10 -0.10 0.00 0.10 -0.05 0.05 0.15 0 ~ N 0 ~ t-- ci 0 0 (.) ci ci Q. ~ ~ 0 ? + 0 ? ' -0.05 0.05 0.10 -0.10 0.00 0.10 -0.05 0.05 0.15 ~ ~ ~ ci 0 ~ N ~ 0 0 CX) (.) 0 ci ci Q. ci ~ 0 0 0 0 ? ' ' -0.05 0.05 0.10 -0.10 0.00 0.10 -0.05 0.05 0.15 PC1 PC1 PC1 Figure 2.23 PCA plots for regions colored m Figure 2.22 on all 8 chromosomes of M edicago truncatula: (A) green, (B) orange, and (C) purple. 55 chromosome 3 (") ci N ,- N 2 2 2 ro ro ci ro C: C: C: E N E ,- E N 0 0 0 ci 0 0 0 I 0 0 I u u I u en en en 0 0 0 :i: .. :i: 'SI" :i: (1) .. ci (1) • 0 I 0 I I -0.4 -0.2 0.0 0.2 0 10 20 30 40 50 0 10 20 30 40 50 MOS coordinate1 position (Mbp) position (Mbp) A B C I!) o Algeria A Australia ci D + blank o , 0 ~ D D oe x Canada I!) Jl D Cyprus 0 0 D France ci • ,P: .... # I!) D D ~D~ France Corsica 0 + o \ • • o • ci ci :a a • Germari"y o a•• • • lJt; • Greece C\J I!) C\J C\J D • Greece_Crete 0 0 o l.d ~ ~. 0 I!) 0 0 a Israel a... ci 5° • • • a a... 0 a... C! + Italy I D ~ • • 0 e • e ci 0 \!;ii D e p I Jordan • ·+ Lebanon I!) di D q, Libya of!llJe I!) DD D 0 • Morocco ci D aD +" • Poland D ci I!) I DD D + I D • Portugal ci • Spain I • Sweden -0.05 0.05 -0.15 -0.05 0.05 -0.05 0.05 0.15 o Syria a Tunisia o: Turkey PC1 PC1 PC1 A U.S. Figure 2.24 MDS visualization of patterns of relatedness on M. truncatula chro mosome 3, with corresponding PCA plots. (upper panels) Each point in the plot represents a window; the structure revealed by the MDS plot is strongly clustered along the chromosome, with windows in the upper-right corner of the MDS plot (colored red) clustered around the centromere, windows in the upper-left corner (purple) furthest from the centromere, and the remaining corner (green) interme diate. Plots for remaining chromosomes are shown in Figure 2.22. PCA plots for the sets of genomic windows colored (A) green, (B) orange, and (C) purple in the upper panels. (lower panels) Each point corresponds to a sample, colored by country of origin. Plots for remaining chromosomes are shown in Figure 2.23. 56 2 rn C: E 0 8 Cl) 0 2 "E :, 8 Q) C: Q) 0) 2 rn C: ~ 0 1 0 10 30 position (Mbp) 5 50 •:•-ti\ ...... t . . #,.,.... . ... . . '..: .... - ·•,-,.:., .... : . ~ .... i :•· ,. . .,_ . •. ·"'•.t,=-' . 8 Cl) 0 2 - .• :..;·t.. <"a,.;,;~-:;,~ .... -1 •:-• ,.~~ . . . ~-1 N ~:p-:_ •• ",ti• ci - • •• ·•';•··, I • -. •• I •• • 1 •• • "E LO (") :, LO 8 N ~ LO Q) 0) LO 0 10 20 30 40 position (Mbp) 2 - .. • ,.\1!1,t • •• !l'· .... N • • • :\ • • N ci - •• "' ~l- -::.·. ·. .. ci •••• :a!:_~•~ • • •• '!"~-:-I~ ... -.. ~,:.-p~,=--~- N • ~~•• .,••Ii • ~ ci - • ·'~ii.• • •• ,, :;.41i- 0 I •• 14 • ."N I 0 (") 0 . . . 0 10 20 30 40 position (Mbp) 6 ;; - . {:•, ~"A~Jl'' .·:·-~--..,:, · ... · ci - .... t' -~t\• • '· ., - -~- . .,.,..t~:, .. . 0 ,':/ ..... .. , <-.... _ .... . 0 - ...... , ........... \ \ - ... -~ . .. . ~ 0 (") 0 0 N 0 0 0 ,.A .... ,... • , ... ,:( •• - .c r?.· • • . •• . C'0 •• ••.. • a•• C"?o· 0 -~--------~ I ~-------~ I 0 (") 0 N 0 . . ... 0 '<t ... .. . . . .. ~M· ; .. = -~~ 0 .• •• •hf.'la, N : -~. .. . ' . 0 . . . • • • • • 0 10 20 30 position (Mbp) 3 - ~,..~-: .,.,, •v , . . - ~<i\!&•:..:•. . . :., ... , :;l ·:. ... N 0 4 - .., 1~~ .;~:1.i,;.,l "-,; ... t.c.. v.-:- •.. , • ••• , ...... N - ~- . '· •. .. ? . . 0 . . 20 40 position (Mbp) 7 ~;.•a<.; . . . •• ~,. •1 • - • .: .. : :. • ,-., 'I•• . ,. . -: .. \ . •u¥±~ i'• ,• • . l''tl: :.·~~?1 ·=:· & :=•· .•.•· • • . - . ' . .. . .. . .. 0 10 30 50 position (Mbp) 0 20 40 position (Mbp) 8 - • ..,. b~•• .1 ~r: . N •~ : -floe~,\• • 0 - • • •• "t"•a• 1~_1 • ~ . :-., .. ~,~ .. : . •.• y .... •:•.,-•• 9: _,,. -~ ·~ .. ,j,,;:.;,.--: : .. ;.~~-;·r·· ci - • !JJ)!•' :H ~ I ••-'Ill • •! 0 (") 0 0 10 20 30 40 position (Mbp) Figure 2.25 MDS coordinate and gene density for each window in the Medicago genome, for chromosomes 1- 8 (numbered above each pair of figures). For each chromosome, the red plot above is first coordinate of MDS against the middle position of each window along each chromosome. The black plot below is gene count for each window against the middle position of each window. 57 N ...... 0 Q) +-' rn C "O 0 ,_ 0 0 0 u 0 (j') 0 ~ N 0 I 0 0 10 20 30 40 50 gene count Figure 2.26 First MDS coordinate against gene density for all 8 chromosomes of M. truncatula. The first MDS coordinate is significantly correlated with gene count (r = 0.149, p = 2.2 X 10- 16 ). 58 2.4 Discussion Our investigations have found substantial variation in the patterns of related ness formed by population structure across the genomes of three diverse species, revealing distinct biological processes driving this variation in each species. More investigation, particularly on more species and datasets, will help to uncover what aspects of species history can explain these differences. With growing appreciation of the heterogeneous effects of selection across the genome, especially the impor tance of adaptive introgression and hybrid speciation [9, 34, 48, 93, 105], local adaptation [66, 112], and inversion polymorphisms [58, 59], local PCA may prove to be a useful exploratory tool to discover important genomic features. We now discuss possible implications of this variation in the effects of pop ulation structure, the impact of various parameter choices in implementing the method, and possible additional applications. Chromosomal inversions A major driver of variation in patterns of relatedness in two datasets we examined seems to be inversions. This may be common, but the example of M edicago truncatula shows that polymorphic inversions are not ubiquitous. PCA has been proposed as a method for discovering inversions [72]; however, the signal left by inversions likely cannot be distinguished from long haplotypes under balancing selection or simply regions of reduced recombination without additional lines of evidence. Inversions show up in our method because across the inverted region, most gene trees share a common split that dates back to the origin of the inversion. However, in many applications, inversions are a nuisance. For instance, SMARTPCA [89] reduces their effect on PCA plots by regressing out the effect of linked SNPs on each other. Removing samples with the less common orientation of each inversion reduced, but did not eliminate, the signal 59 of inversions seen in the Drosophila melanogaster dataset, demonstrating that the genomic effects of transiently polymorphic inversions may outlast the inversions themselves. Genealogical noise? The field of phylogenetics has long had to deal with the fact that there can be a great number of different local phylogenies along the genome, even between species [5, 44, 88]. Loosely, we are studying the within species patterns that might lead to such incomplete lineage sorting in the future. The neutral distribution of variation in these patterns has been used to infer demo graphic history, both between species [101] and within species [7]. If these distinct phylogenies are merely a result of neutral stochasticity, there is not expected to be a correlation between local phylogeny and other genomic features. However, in some cases the local assortment of ancestral diversity and subsequent introgression between sister taxa shows clear signs of selection, as for instance in the wild tomato clade [91]. The effect of selection Neutral processes are not expected to produce the chromosome-scale correlations we see in patterns of relatedness in the Medicago truncatula and Drosophila melanogaster datasets, because correlations in patterns of relatedness induced by neutral processes should extend no further than does linkage disequilibrium (i.e., much less than a chromosome's length). This suggests that they are produced by linked selection, a hypothesis backed up by correlations with gene density and recombination rate. We have also shown with simulations that linked selection can, in at least some circumstances, produce the sorts of pat terns we observe. How might selection cause variation in patterns of relatedness? For instance, background selection ( the effect on linked sites of selection against deleterious mutations Charlesworth et al. [17], Charlesworth [20]), can informally 60 be thought of as reducing the number of potential contributors to the gene pool in regions of the genome with many possible deleterious mutations [46]. For this reason, if it acts in a spatial context, it is expected to induce samples from nearby locations to cluster together more frequently. Therefore, regions of the genome har boring many targets of local adaptation may show similar patterns, since migrant alleles in these regions will be selected against, and so locally gene trees will more closely reflect spatial proximity. Other forms of selection, such as hard sweeps on new mutations, repeated selection on standing variation, local adaptation, or tem porally fluctuating selection, could clearly lead to variation in geographic patterns of relatedness in a similar way. Another possible contributor is recent admixture between previously separated populations, the effects of which were not uniform across the genome due to selec tion. For instance, it has been hypothesized that large-scale variation in amount of introgressed Neanderthal DNA along the genome is due to selection against Neanderthal genes, leading to greater introgression in regions of lower gene den sity [42, 50]. African Drosophila melanogaster are thought to have a substantial amount of recently introgressed genome from "cosmopolitan" sources; if selection regularly favors genes from one origin, this could lead to substantial variation in patterns of relatedness correlated with local gene density. There has been substantial debate over the relative impacts of different forms of selection (e.g., Burri et al. [14], Charlesworth et al. [18], Charlesworth [19], Corbett Detig et al. [25], Harris and Nielsen [42], Hedrick [43], Martin et al. [75], Pease and Hahn [90], Phung et al. [92], Stankowski et al. [104]). These have been difficult to disentangle in part because for the most part theory makes predictions which are only strictly valid in randomly mating (i.e. , unstructured) populations, and it is unclear to what extent the spatial structure observed in most real populations will 61 affect these predictions. It may be possible to design more powerful statistics that make stronger use of spatial information, or even using the variation in relatedness that we observe here, similar to the method of Beeravolu et al. [7]. Parameter choices There are several choices in the method that may in prin ciple affect the results. As with whole-genome PCA, the choice of samples is important, as variation not strongly represented in the sample will not be discov ered. The effects of strongly imbalanced sampling schemes are often corrected by dropping samples in overrepresented groups; but downweighting may be a bet ter option that does not discard data. Next, the choice of window size may be important, although in our applications results were not sensitive to this. Finally, which collections of genomic regions are compared to each other (steps 3 and 4 in Figure 2.1), along with the method used to discover common structure, will affect results. We used MDS, applied to either each chromosome separately or to the entire genome; for instance, human inversions are clearly visible as outliers when compared to the rest of their chromosomes, but genome-wide, their signal is obscured by the numerous other signals of comparable strength. Besides window length, there is also the question of how to choose windows. In these applications we have used nonoverlapping windows with equal numbers of polymorphic sites. However, we found little change in results when using different window sizes or when measuring windows in physical distance (in bp). Finally, our software allows different choices for how many PCs to use in approx imating structure of each window (kin equation 2.1), and how many MDS coor dinates to use when describing the distance matrix between windows, but in our exploration, changing these has not produced dramatically different results. How ever, this choice could easily be important: for instance, if the k th and (k + l) st 62 PCs are sufficiently different but have similar eigenvalues, then small amounts of noise could cause these to switch, leading to spuriously inferred differences between windows in which one or the other was included in the top k PCs. This does not seem to be a problem in our applications, as changing the number of PCs did not affect the qualitative results. These choices are all part of more general techniques in dimension reduction and high-dimensional data visualization; we encourage the user to experiment. Applications So-called cryptic relatedness between samples has been one of the major sources of confounding in genome-wide association studies (GWAS) and so methods must account for it by modeling population structure or kinship [4, 115]. Modern "mixed model" methods [ e.g. 69] account for this with either a single, genome-wide kinship matrix or one constructed using only sites unlinked to the focal SNP. Since the effects of population structure is not constant along the genome, this could in principle lead to an inflation of false positives in parts of the genome with stronger population structure than the genome-wide average. A method such as ours might be used to estimate local kinship matrices, thus providing a more sensitive correction, although doing so without removing the signal itself could be challenging. Fortunately, in our human dataset this does not seem likely to have a strong effect: most variation is due to small, independent regions, possibly primarily inversions, and so may not have a major effect on GWAS. In the other species we examined, particularly Drosophila melanogaster, treating population structure as a single quantity would entail a substantial loss of power, and could potentially be misleading. 63 Chapter 3 Predicting the hosts of viruses based on viral sequences 3.1 Introduction Viruses are ubiquitous and can reproduce and evolve very fast. Virus infections in human can cause various diseases and are a big threat to human health. Many infectious disease studies showed that virus cross-species transmissions are highly prevalent resulting in emerging infectious diseases (EIDs) [16]. EIDs continue to pose significant public health problems as shown by the recent outbreaks of West Nile virus, SARS, MARS, and HlNl [70]. Rapidly identifying the reservoir of the new pathogenic bacterial or viral origins responsible for these diseases will help the containment, control, and prevention of the outbreaks [16, 65, 67]. Further, investigating the potential host of a virus can throw light on the evolutionary history of the virus, thus provide guidance on how to cut off the transmission path. The biological presumption for most of the host identification methods is that the more similar two viruses' DNA/RNA sequences are, they are more likely to share the same host [106]. With the availability of various databases containing different types of pathogenic microbial species, one of the most commonly used approaches for iden tifying the origin of the new pathogen responsible for an EID is to find similar 64 sequences in the pathogen databases using alignment by the Smith-Waterman algorithm [103], BLAST [2], or other alignment tools. Recently, several alignment-free methods have been developed for the identifi cation of the hosts of pathogenic species. Kapoor et al. [52] used relative dinu cleotide frequencies and discriminant analysis to infer the hosts of novel picorna-like viruses. Aguas and Ferguson [1] developed a feature selection method and used random forests (RF) based on the diverged nucleotide or amino acid bases among a set of aligned molecular sequences to predict the host species of pathogens. Tang et al. [108] developed a support vector machine (SVM) based method using mono and dinucleotide frequencies as features to detect the original hosts of coronaviruses with high accuracy. Kargarfard et al. [53] predicted the host range of the influenza virus using various machine learning approaches. Several new alignment-free statis tics including d; and df for molecular sequence comparison using k-mers (k-grams, words, etc.) were developed recently [97, 111]. It was shown that such measures are highly associated with the evolutionary distances estimated from alignment-based methods, thus validating the usefulness of alignment-free methods for the compar ison of molecular sequences [71 , 98]. In this study, we investigate the effectiveness of alignment, alignment-free and machine learning based methods for inferring the hosts of viruses responsible for emerging infectious diseases. 3.2 Materials and Methods 3.2.1 Calculate the pairwise distance matrix of viruses We compare the performance of alignment-based, alignment-free, and machine learning based approaches for inferring the hosts of viruses. For the alignment based method, we first use the software "Clustal Omega" [100] for multiple 65 sequence alignment using the default parameters and then use the software "Phylip" [32] and choose the "F84" evolutionary model to calculate the pair wise distance using the alignment results as input. We also investigate several alignment-free methods for calculating the distances/dissimilarities between viral sequences using the CAFE package [71] and these include Chebyshev, Euclidean, Manhattan, CVTree [96], d2 , d; , d~ [97], etc. The definitions of these dis tances/ dissimilarities were given in Lu et al. [71]. Our objective is to evaluate if alignment-free approaches have similar accuracies in predicting the hosts of virus sequences, but with much higher computational efficiency. 3.2.2 Visualize the distance matrix To empirically see if viruses from the same host tend to be more similar to each other than those from different hosts, we first use multidimensional scaling (MDS) [61] to project the virus sequences onto two-dimensional Euclidean space. We also use hierarchical clustering [55] with average linkage to visualize the relationship among the viruses, and intuitively assess whether the viruses infecting the same hosts are indeed closer than those infecting different hosts. 3.2.3 Predict the host of a virus We apply K-nearest neighbors (KNN) [64] method based on the pairwise dis tance matrix for both alignment-based and alignment-free distances for predicting the host of the virus. The alignment-free distance measures include Chebyshev (Ch), Euclidean (Eu), Manhattan (Ma), CVTree (CVT), d2 , d; and d~ with var ious k-mer sizes. For each virus, we choose the K viruses that are closest to the virus from the pairwise distance matrix, and then count the frequency of the hosts of the K viruses. We use the most frequent host as the predicted host of the 66 virus. For machine learning based prediction, we apply SVM on matrices with mono- and dinucleotide frequencies as features (3 mononucleotide frequencies and 16 dinucleotide biases [108]). R package e1071 was used for SVM analysis with "C-classification" as the model type and "Radial" as the SVM kernel [108]. We use leave-one-out cross-validation (LOOCV) [29] and N-fold cross validation to evaluate the prediction accuracy. We implement this process for all the viruses and then compare the predicted host with its true host to obtain the prediction accuracy. 3.2.4 Investigate the impact of sample sizes on prediction accuracy The number of corresponding virus sequences for each host category can sig nificantly affect the prediction accuracy. In order to quantify the effects of sample sizes on prediction accuracy, we randomly choose a certain number of sequences and then apply the KNN and SVM approaches to the set of sequences to obtain the prediction accuracy as described above. We repeat this process for a series of sample sizes to see how the prediction accuracy changes with sample size. We let the sample size change from 70 to 145 with a step size of 5 for the rabies virus dataset, from 100 to 700 with a step size of 50 for the coronavirus, and from 200 to 1100 with a step size of 100 for the influenza A virus dataset. For each sample size, we randomly choose 10 sets of sequences and calculate the prediction accuracy for each dataset. 67 3.2.5 Datasets We analyze three virus datasets with different characteristics: rabies virus, coronavirus, and influenza A virus, to see if consistent results related to the relative performance of alignment, alignment-free, and machine learning based approaches can be obtained. The rabies virus dataset from Streicker et al. [106] The rabies virus is a single-stranded RNA virus and has a wide host range. We first investigate the rabies virus dataset from Streicker et al. [106] consisting of 372 rabies virus samples from 23 bat host species. Among them, 148 viruses have complete N gene (1,353 bp) sequenced. In this paper, we concentrate on the study of these 148 viruses. The accession numbers of the complete genomes and Nucleoprotein (N) gene of the viruses were provided in Streicker et al. [106] and the corresponding gene sequences can be downloaded from NCBI genbank database using their accession numbers at https://www.ncbi.nlm.nih.gov/genbank/. The coronavirus dataset from Tang et al. [108] Tang et al. [108] devel oped a SVM based method using mono- and di-nucleotide sequences to predict the host of coronavirus. We use the same data as in Tang et al. [108] consist ing of 724 coronavirus samples from 6 host species (human, porcine, bovine, bat, murine and avian). Among them, 392 samples have complete genome sequenced, and 326 samples only have their spike gene sequenced. We refer the 392 samples as set "cgenome", and the 326 samples as "spike". We also extract the spike gene sequences from the complete genomes by checking the coding sequence annotation in NCBI and obtain additional 381 extracted spike gene sequences. Together with the original 326 sequences, we have a total of 707 all spike sequences and refer them as "allspike ". 68 Influenza A virus dataset from the Influenza Research Database [117] . Finally, we investigate the host of influenza A virus as in Kargarfard et al. [53] We collect the avian influenza A virus from the Influenza Research Database [117] and exclude those sequences with ambiguous host species such as chicken, duck, avian, and gull, and also those host species with less than 200 virus sequences in the database. We restrict the samples to the same taxonomic rank, and choose the level as "species" in the taxonomic hierarchy. The prediction can surely be easier for more general categories. There are six remaining avian host species: American Black Duck Anas rubripes, Blue-Winged Teal Anas dis cors, Green-Winged Teal Anas carolinensis, Northern Pintail Anas acuta, North ern Shoveler Anas clypeata, and Ruddy Turnstone Arenaria interpres, for further study. For each host species, we randomly choose 200 virus sequences in our study. 3.3 Results We initially calculated the prediction accuracies of the KNN algorithm based on the alignment method and the alignment-free distance/dissimilarity measures for k-mer length from 3 to 6 and the number of neighbors K from 1 to 10. The results for the rabies virus, coronavirus, and influenza A virus datasets are given as Figure 3.1, 3.2, and 3.3, respectively. These figures show that except for the Chebyshev divergence, all the other alignment-free distance/dissimilarity measures have similar prediction accuracy, very close to that of the alignment-based distance measure. The prediction accu racy is not markedly affected by the length of k-mers from 3 to 6. For clarity of presentation in the remaining of the paper, we let the k-mer size to be 6. Based on the sample size and distribution for each dataset, we choose K = l 69 ~ ~ 0 co 0 :::::io 8 t- ro o C: 0 '.§ "O 0 ~ <D a. . 0 0 L() 0 o k3.Ch A k3.D2 + k3.D2shepp x k3.D2star � k3.Eu V k3.Hao k3.Ma o k4.Ch A k4.D2 + k4.D2shepp x k4.D2star � k4.Eu V k4.Hao k4.Ma o kS.Ch A k5.D2 + k5.D2shepp x k5.D2star � kS.Eu V kS.Hao kS.Ma o k6.Ch A k6.D2 + k6.D2shepp x k6.D2star � k6.Eu v k6.Hao k6.Ma '-----.-------,-------r------r-----,......... • align 2 4 6 KNN 8 10 Figure 3.1 The prediction accuracy of the KNN algorithm for the rabies virus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K, horizontal axis) . as the number of neighbors in KNN for the rabies virus dataset , and K = 7 for the coronavirus dataset and influenza A virus dataset. For alignment-free distance measures, we use Manhattan distance as an representative as many of them have similar prediction accuracies. 3.3.1 Rabies virus Figure 3.4 shows the MDS plots of the 148 rabies viruses with complete N gene sequences based on the Manhattan distance using 6-mers (left) and alignment (right). 70 0 (J) G' ci ~ :::J 8 ro co §o '.§ "O Q) 0.. t-- 0 <D 0 - - - - - ( ' -!~ , ,= ~ ,~ f: ~ i O .. / 0, . ~. / ,. V ---.. • .;;, : O' _,/ O ·.; , 0 / 0 , ' . / · , 0 ;,, 0 0 , , , 0 I I 2 4 :::;; t -- i :::: == V I 6 KNN - ~ ~ ~ ::--: :::-:::= 3 ~ ' I 8 o k3.Ch ' •• · ~ ~ X I> k3.D2 ~ © + k3.D2shepp ' - ' x k3.D2star � k3.Eu V k3.Hao k3.Ma o k4.Ch I> k4.D2 + k4.D2shepp x k4.D2star � k4.Eu v k4.Hao k4.Ma o k5.Ch I> k5.D2 + k5.D2shepp x k5.D2star � k5.Eu v k5.Hao k5.Ma o kB.Ch I> kB.D2 + kB.D2shepp x kB.D2star � kB.Eu V kB.Hao kB.Ma I • align 10 Figure 3.2 The prediction accuracy of the KNN algorithm for the coronavirus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K, horizontal axis) . In addition, Figures 3.5 and 3.6 show the hierarchical clustering of the viruses using alignment-based distance and Manhattan distances using 6-mers, respec tively. The clustering results using the alignment-based method and the Manhattan distance are highly similar indicating that the alignment and alignment-free based methods give roughly similar results. Figure 3. 7 shows the LOO CV prediction accuracies of one nearest neighbor using alignment-based distance and Manhattan distance with 6-mers, and SVM for different sample sizes. It can be seen from the figure that the average prediction accuracies for the alignment based method and the Manhattan distance based method are similar. 71 · ~ ··. .. A . ·. ~ ~ , ;;;:,,.. =-=... 1 -_ .. .,. _ -,,. · ....... ~ ~i ;:.:::;,; g~ - ~ij ~ ~ ~ ~-~-- i~1 =1 ~ -" + ~ 0 . . .. -- ~; • • · - .. .. , ,... . . ---=:::: >--= 0 °'~. 0 0 ,.._ 0 ', ---.-, V ,.,_.__ ·~ '., . ~- V :.::..: '.., oe. .,,,,,: , ............ .. 0 ;;,,., 0 :-:- 0 -- - .:..: ·- · · , o ,,,,. o o_.... o ~ o 0 0-- g - >- g 2 4 6 KNN 8 10 o k3.Ch A k3.D2 + k3.D2shepp x k3.D2star � k3.Eu V k3.Hao k3.Ma o k4.Ch A k4.D2 + k4.D2shepp x k4.D2star � k4.Eu V k4.Hao k4.Ma o k5.Ch A k5.D2 + k5.D2shepp x k5.D2star � k5.Eu V k5.Hao k5.Ma o k6.Ch A k6.D2 + k6.D2shepp x k6.D2star � k6.Eu v k6.Hao k6.Ma • align Figure 3.3 The prediction accuracy of the KNN algorithm for the influenza A virus dataset using varying distance measures, k-mer length from 3 to 6, and the number of neighbors ( K, horizontal axis). Ma align -s:I" " • 6 -s:I" 0 Antrozous_pallidus (\J (\J 6 Corynorhinus_townsendii Q) (\J Q) 0 + Eptesicus_fuscus cu 6 cu 6 Lasionycteris_noctivagans Lasiurus_blossevillii C: C: V Lasiurus_borealis 'E 'E Lasiurus_cinereus 0 •a 0 a • Lasiurus_i ntermedius 0 0 0 0 ' •a + • Lasiurus_seminolus .. , • Lasiurus_xanthinus 0 0 0 • Myotis_austroriparius CJ .. , , CJ . .... Myotis_californicus (f) (f) Myotis_evo tis 0 • 0 Myoti s_lucifugus • Myotis_yumanensis ~ -ill- ~ • • Nycticeius_humera lis -s:I" "' Parastrellus_hesperus 6 c.o ... • Peri myotis_subflavus • 0 .. • Tadarida_brasiliensis I 0 I -0.4 0.0 0.2 0.4 -0.04 0.00 0.04 0.08 MOS coordinate1 MOS coordinate1 Figure 3.4 MDS plots for the 148 rabies viruses with complete N gene sequences based on the Manhattan distance using 6-mers (left) and alignment (right). Each point in the plots is a sample colored by the host species' name. 72 However, the prediction accuracies for the alignment-based method have an rel atively larger variance than that for the Manhattan distance based method for almost all the sample sizes considered. On the other hand, the average prediction accuracies for the SVM based method are lower than that for both the alignment and Manhattan distance based methods. The estimated prediction accuracy using 10-fold and 20-fold cross-validation are given in Figure 3.8. The 10-fold and 20-fold cross-validation results also support that alignment based method and the Manhattan distance based method have highly similar per formance. Comparing Figure 3. 7 with Figure 3.8, we can see that the prediction accuracies increase with "N" for N-fold cross-validation, which is reasonable since the proportion for the training dataset increases with "N". 3.3.2 Coronavirus Similar to the rabies virus dataset, we first visually check the pairwise distance matrix using MDS and hierarchical clustering. Figure 3.9 shows the MDS plots of the 707 coronaviruses with spike gene sequences based on alignment-free and alignment distances. Figure 3.10 and Figure 3.11 show the hierarchical clustering of the viruses using alignment-based distance and the Manhattan distance with k-mer length of 6, respectively. Figure 3.12 shows the LOOCV prediction accuracies of the coronavirus dataset using alignment-based distance and Manhattan distance, and SVM for different sample sizes. The prediction accuracies for all the methods are greater than 0.90 when the sample size is above 400. When the sample size is < 400, the prediction 73 accuracy of SVM is somewhat higher than the KNN methods. Similar results are observed based on 10-fold and 20-fold cross-validations as shown in Figure 3.13. 3.3.3 Influenza A virus Parallel to the investigations for the rabies virus and the coronavirus, Figure 3.14 shows the MDS plots for the 1,200 influenza A viruses with N gene sequences based on Manhattan distance using 6-mers and alignment method. Figure 3.15 and Figure 3.16 show the corresponding hierarchical clustering results. The MDS and hierarchical clustering plots do not show a clear clustering pat tern according to the hosts. A potential explanation is that the sources of the influenza A virus data are much more diverse and consist of several different virus clades. Figure 3.17 shows the LOOCV prediction accuracies of the influenza A virus dataset using alignment-based distance, Manhattan distance using 6-mers, and SVM for different sample sizes. The prediction accuracies for alignment and alignment-free method are similar, and are higher than that of SVM method for this dataset. Similar results are obtained based on 10-fold and 20-fold cross-validations as shown in Figure 3.18. 74 Figure 3.5 Hierarchical clustering of 148 rabies viruses with complete N gene sequences using the alignment-based distances of the virus sequences. Each leaf in the figure is a virus sample colored by the host species' name. 75 Figure 3.6 Hierarchical clustering of 148 rabies viruses with complete N gene sequences based on the Manhattan distance between the 6-mer frequencies of the virus sequences. Each leaf in the figure is a virus sample colored by the host species' name. 76 ~ 0 ~ • • f 0 >- (.) I'-- ell 0 ,.._ :::::) (.) (.) ell (!) C 0 0 :;::::; (.) '6 LO Q) 0 ,.._ a. -.::I" 0 CV) 0 80 100 size : .. ; 120 ,,. A j a I • Ma • svm align 140 Figure 3. 7 The prediction accuracy for different sample sizes for the rabies virus dataset using alignment-based distance, Manhattan distance with 6-mers and one nearest neighbor, and SVM. The smooth lines are the fitted curves for the mean prediction accuracy for different sample sizes. Ma: Manhattan distance; align: Alignment based method; SVM: support vector machine based method. 77 CX) C >, 0 0 :.;::::; (.) ro ~ t-- 0 "O ::J 0 II ro (.) II 0 ~ I i - ~ > (.) .. i,; • • * ro I ! • A c.o * ~ n • i • en 'CJ •·.•· ~ en C 0 _x_ '<7 n 0 0 ' i i ... I ~ ~ · --· ii ~) . :::: 0 ,i ill ~ ~ ,_ :.;::::; :,.; • � l!f i:, (.) (.) LO I ~ Q ~ "O "O 0 • (I) • • Ma .g ,_ • svm I a. ~ align 0 0 'I'""" 80 100 120 140 CX) 0 C 0 0 >, 0 :.;::::; (.) tr. ro ro t-- 0 t Iii ' i ! * • • "O ,_ � II ,, ::J 0 Iii I t ! • ! ; • ~ ro (.) ;~, "" tll - --, > (.) lil Ill'!' ro ~ i i ~ • 3 I~ .::. c.o ' II * en i • • ~ ~ en C 0 IR s s 0 0 ' ' � � ,_ :.;::::; ~ 8 (.) (.) LO "O "O 0 .g (I) • Ma ,_ I a. • svm ~ align 0 N 0 80 100 120 140 sample size Figure 3.8 The prediction accuracy for different sample sizes for the rabies virus dataset using alignment-based distance, Manhattan distance with 6-mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross validation. The smooth lines are the fitted curves for the mean prediction accu racy for different sample sizes. Ma: Manhattan distance; align: Alignment based method; SVM: support vector machine based method. 78 Ma align I co , 'SI' ci o avian ci tc (0 bat ~ ci + bovine C\I C\I C\I x human (I) (I) co ci co 'SI' � murine C 000 C ci v porcine 'E 0 'E 0 0 - o 0 4i~ 1 0 C\I ~ 0 ci 0 V 0 ci (.) ·- (.) (/) (/) 0 •'> 0 C\I 0 ci • .a. ~ ci ~ I V • .. 'SI: 0 • 'SI: I 0 I -0.4 -0.2 0.0 0.2 -0.5 0.0 0.5 MDS coordinate1 MDS coordinate1 Figure 3.9 MDS plots for the 707 coronaviruses with spike gene sequences based on the distances calculated by the Manhattan distance with 6-mers (left) and alignment (right). Each point in the plots is a sample colored by the host's name. 79 o avian bat + bovine x human � murine '\/ porcine Figure 3.10 Hierarchical clustering of the 707 coronaviruses with spike gene sequences using the alignment-based distances of the virus sequences. Each leaf in the figure is a virus sample colored by the host's name. 80 o avian bat + bovine x human � murine '\/ porcine Figure 3.11 Hierarchical clustering of the 707 coronaviruses with spike gene sequences using the Manhattan distance between the 6-mer frequencies of the virus sequences. Each leaf in the figure is a virus sample colored by the host's name. 81 0 T""" O'> 0 >- (.) ell ,.._ :::::) CX) (.) 0 (.) ell ., C • 0 :;::::; I'-- (.) 0 '6 Q) ,.._ a. <D 0 • Ma • svm LO align 0 100 200 300 400 500 600 700 size Figure 3.12 The prediction accuracy for different sample sizes for coronaviruses "allspike" set using alignment-based distance, Manhattan distance with 6-mers and 7 nearest neighbor, and SVM. The smooth lines are the fitted curves for the mean prediction accuracy for different sample sizes. Ma: Manhattan distance; align: Alignment based method; SVM: support vector machine based method. 82 0 C >, T""" 0 :.;::::; (.) ro ~ O') "O :J 0 ro (.) (.) > ro CX) en C 0 en 0 0 ,.._ :.;::::; ~ • Ma (.) (.) r-- • "O "O 0 • svm .g (I) ,.._ align I a. <D • 0 0 T""" 100 200 300 400 500 600 700 0 C T""" 0 >, :.;::::; (.) ro ro O') "O ,.._ :J 0 ro (.) (.) > ro CX) en C 0 en 0 0 ,.._ :.;::::; • Ma (.) (.) r-- • "O "O 0 • svm .g (I) ,.._ I a. align <D 0 0 N 100 200 300 400 500 600 700 sample size Figure 3.13 The prediction accuracy for different sample sizes for the coron avirus dataset using alignment-based distance, Manhattan distance with 6-mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross-validation. The smooth lines are the fitted curves for the mean prediction accuracy for different sample sizes. Ma: Manhattan distance; align: Alignment based method; SVM: support vector machine based method. 83 Ma align C\J <> ci . ~ v• ., .. 'I, vO ... C\J ci C\J s:t <> x o 0 + o Ame ricanBlackDuck Q.) . .. Q.) 0 e • Blue- WingedTeal co . ..... co + Green- Wingeo'Teal 0 0 + >C Northern Pintail C ci . .. ' «t' • t C . 0 NorthernShoveler '6 '6 . V RuddyTu rnstone 0 . ~~ 0 0 * 0 (.) (.) . (/) .... (/) 0 0 C\J +. <> 0 0 0 0 ~ I +x ~ . •o ~ '~ Ii . <> + <> ! .0 ..... 'SI: s:t ~ 0 0 ci I I -0.3 -0.1 0.1 0.3 -0.04 0.00 0.04 MDS coordinate1 MDS coordinate1 Figure 3.14 MDS plots for the influenza A viruses with N gene sequences based on the distances calculated using Manhattan distance of 6-mer frequencies (left) and alignment method (right) . Each point in the plots is a sample colored by the host species' name. 84 o AmericanBlackDuck Blue-WingedTeal + Green-WingedTeal x NorthernPintail o NorthernShoveler v RuddyTurnstone Figure 3.15 Hierarchical clustering of the Influenza A viruses with N gene sequences using the alignment-based distances of the virus sequences. Each leaf in the figure is a virus sample colored by the host species' name. 85 o AmericanBlackDuck Blue-WingedTeal + Green-WingedTeal x NorthernPintail o NorthernShoveler v RuddyTurnstone Figure 3.16 Hierarchical clustering of the Influenza A viruses with N gene sequences using the Manhattan distance between the 6mer frequencies of the virus sequences. Each leaf in the figure is a virus sample colored by the host species' name. 86 ~ 0 I'-- 0 >- (.) <D Ctl 0 ,._ :::::) (.) (.) Ctl LO C 0 0 I V :;::::; (.) '6 "": Q) 0 ,._ a. C') ! 0 • Ma • svm C\J align 0 200 400 600 800 1000 size Figure 3.17 The prediction accuracy for different sample sizes for influenza A dataset using alignment-based distance, Manhattan distance with 6-mers and 7 nearest neighbor, and SVM. The smooth lines are the fitted curves for mean pre diction accuracy for different sample sizes. 87 I.() C >, 0 0 :.;::::; (.) ro ~ -.::t" "O ::J 0 ro (.) (.) > ro C") en C 0 en 0 0 ,.._ :.;::::; (.) (.) N "O "O 0 * (I) V • Ma .g ,.._ • svm I a. T""" align 0 0 T""" I.() 200 400 600 800 1000 0 C >, 0 :.;::::; (.) ro ro -.::t" "O ,.._ ::J 0 ro (.) (.) > ro C") en C 0 en 0 0 ,.._ :.;::::; (.) (.) N "O "O .g (I) 0 ,.._ • Ma I a. • svm 0 T""" align N 0 200 400 600 800 1000 size Figure 3.18 The prediction accuracy for different sample sizes for the influenza A virus dataset using alignment-based distance, Manhattan distance with 6-mers and one nearest neighbor, and SVM using 10-fold (upper) and 20-fold (lower) cross-validation. The smooth lines are the fitted curves for the mean prediction accuracy for different sample sizes. Ma: Manhattan distance; align: Alignment based method; SVM: support vector machine based method. 88 3.4 Discussion We investigated the use of alignment-based and alignment-free distance meth ods and support vector machine to predict the host of viruses based on three virus data sets: rabies virus, coronavirus and influenza A. None of the three meth ods consistently performs the best than the other methods. For the rabies virus dataset, the alignment based and alignment-free methods performs similarly and both outperform SVM with a large margin. For the coronavirus data set, SVM outperforms alignment based method followed by alignment-free method when the sample size is low. When the number of samples is large, eg. over 400, all the three methods perform similarly. Finally, for the influenza A virus, none of the methods performs well with prediction accuracy below 0.6. Alignment-free method does a little bit better than the alignment based method and both outperform SVM. Thus, this study shows both alignment-based and alignment-free methods can be effectively used to predict the hosts of viruses. For both the rabies and the influenza A viruses, we used the nucleoprotein gene sequences, while for coronavirus we used the spike gene. It was shown before that spike gene evolve fast to better prevent the virus from detection by the host [116]. While nucleoprotein gene is much more conserved, especially at the non-synonymous sites [35, 45]. For conserved genes like nucleoprotein, longer kmers than mono- and dinucleotide are needed to dis tinguish the sequences. Therefore, SVM using only mono- and dinucleotide does not perform as well as alignment based or Manhattan distance using 6mers for the rabies and influenza viruses. On the other hand, for highly divergent genes like the spike gene, mono- and dinucleotide are enough to capture the differences among the sequences resulting in better performance of the SVM method. Our study has several limitations. First, since viruses can jump from one host to another, a virus can belong to multiple hosts. We use the host which the virus is discovered from 89 as its only host. However, discovered in species A, the virus may also use species B as host but unknown. This will influence the prediction accuracy for both align ment and alignment-free methods. We only investigated three types of viruses and there are many other types of viruses available. More studies are needed to see the general applicability of our prediction methods. In conclusion, our study showed that both alignment based and alignment-free methods can be successfully used to predict the hosts of viruses. Therefore, when alignment is difficult or too time consuming, we can use alignment-free methods to predict the hosts of new viruses. 90 Chapter 4 Perspectives and Future Work In this dissertation, I have discussed the applications of machine learning meth ods to next generation sequencing data in bioinformatics. This work has helped to uncover the information hidden in genomic variation and to find patterns in sequences for the prediction of outcomes of interest. The unsupervised learning investigation in Chapter 2 reveals substantial vari ation in patterns of relatedness and found different biological drivers for this vari ation in different species. The methods described in Chapter 2 are developed into an open source R package called "lostruct". Our software provides choices of window lengths ( in bps or in SNP numbers) of the chromosomal segments to be compared, the schemes to deal with imbalanced sample distribution, choices of principal components summarizing the variation and so on. With increasing awareness of the heterogeneity of the effects of selection across the genome, such as the study of adaptive introgression, local adaption, and chromosomal inver sions, we have confidence that our local PCA methods described in "lostruct" will be a useful exploratory tool to investigate genetic variation and discover important genomic features . As found in Chapter 2, major biological processes driving the differentiation in the effect of population structure seem to be chromosomal inversions and certain types of selection that affect long regions of the genome. More exploration in more diverse species and datasets using our tools will help to reveal the different aspects of species history that explain the variation. In genome-wide association 91 studies, one major confounding factor is cryptic relatedness, i.e., kinship among individuals unknown to the investigators, which results in increasing false positive rate. Our method can be used to estimate the local relatedness matrix for samples and provide a correction to the association studies with high resolution. The comparative studies of various supervised learning application in Chapter 3 showed that alignment-free based distance measurements of sequences could be a promising substitute for alignment based measurements when they are used to predict the hosts of virus using viral sequences. As the so-called "no free lunch" theorem in machine learning, no algorithm will perform the best in all situations. The choice of models greatly depends on the properties of datasets, such as data sanity, sample size and distribution. In our study of three virus datasets, using alignment and alignment-free based distances followed by KNN have quite similar performances. They outperform SVM in the rabies virus dataset, while SVM does a better job in the coronavirus dataset. In the influenza A virus dataset, alignment and alignment-free based methods have a little bit better prediction accuracies than SVM, though overall, they all do not perform well, potentially due to the relatively low data quality and consistency. In our study in Chapter 3, we also compare KNN and SVM for classification, including many dissimilarity measurements among sequences for KNN. For SVM, we used mononucleotide frequency and dinucleotide bias as features , to be consis tent with the description in Tang et al. [108]. This feature engineering methods only take 1-mer and 2-mer information in the sequences into consideration and potentially lead to more loss of information contained in the sequences compared to utilizing more k-mer frequencies. With increasing k, the number of k-mer fre quency features increases exponentially, and so the required sample size also grows 92 greatly. When much larger datasets are obtained, we can try incorporating more k mer frequencies as features. Other algorithms, such as random forest and gradient boosting decision trees are also promising. With the improvement of classification models, we can try to predict the possible cross species transmission pathway of a given virus and uncover its evolutionary history, thus helping to understand and control potential emerging infectious diseases. 93 Reference List [1] Ricardo Aguas and Neil M Ferguson. Feature selection methods for identi fying genetic determinants of host species in RNA viruses. PLOS Comput. Biol. , 9(10):e1003254, 2013. [2] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. J. Mal. Biol, 215(3): 403-410, 1990. [3] Francesca Antonacci, Jeffrey M Kidd, Tomas Marques-Bonet, Mario Ven tura, Priscillia Siswara, Zhaoshi Jiang, and Evan E Eichler. Characterization of six human disease-associated inversion polymorphisms. Human molecular genetics, 18(14):2555- 2566, 2009. [4] William Astle and David J. Balding. Population structure and cryptic relat edness in genetic association studies. Statistical Science, 24(4):451- 471, 11 2009. doi: 10.1214/09-STS307. URL http:/ /dx. doi. org/10. 1214/ 09-STS307. [5] J C Avise, J F Shapira, S W Daniel, C F Aquadro, and R A Lansman. Mito chondrial DNA differentiation during the speciation process in Peromyscus. Mal Biol Evol, 1(1):38- 56, December 1983. doi: 10.1093/oxfordjournals. molbev.a040301. URL https: / /www. ncbi. nlm. nih. gov /pubmed/640064 7. [6] Nicholas H. Barton. Genetic hitchhiking. Philos Trans R Soc Land B Biol Sci, 355(1403):1553-1562, November 2000. doi: 10.1098/rstb.2000.0716. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1692896/. [7] Champak R. Beeravolu, Michael J. Hickerson, Laurent A. F. Frantz, and Konrad Lohse. Able: blockwise site frequency spectra for inferring com plex population histories and recombination. Genome Biology, 19(1):145, September 2018. ISSN 1474-760X. doi: 10.1186/s13059-018-1517-y. URL https://doi.org/10.1186/s13059-018-1517-y. 94 [8] Albert P. Blair. Population structure in toads. The American Naturalist, 77 (773):563- 568, 1943. ISSN 00030147, 15375323. URL http://www. j stor. org/stable/2457848. [9] Yaniv Brandvain, Amanda M. Kenney, Lex Flagel, Graham Coop, and Andrea L. Sweigart. Speciation and introgression between Mimulus nasutus and Mimulus guttatus. PLoS Genet, 10(6):e1004410, 06 2014. doi: 10.1371/journal.pgen.1004410. URL http://dx.doi.org/10.1371% 2Fjournal.pgen.1004410. [10] A. Brisbin, K. Brye, J. Byrnes, F. Zakharia, L. Omberg, J. Degenhardt, A. Reynolds, H. Ostrer, J. G. Mezey, and C. D. Bustamante. PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Hum. Biol. , 84(4):343- 364, August 2012. [11] Anthony J Brookes. The essence of snps. Gene, 234(2):177- 186, 1999. [12] Terence A Brown. Genomes 4- Garland science, 2018. [13] K Brye, A Auton, M R Nelson, J R Oksenberg, S L Hauser, S Williams, A Froment, J M Bodo, C Wambebe, S A Tishkoff, and C D Busta mante. Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc Natl Acad Sci U S A, 107 (2):786- 791, January 2010. doi: 10.1073/pnas.0909559107. URL http: //www.ncbi.nlm.nih.gov/pubmed/20080753. [14] R Burri, A Nater, T Kawakami, C F Mugal, P I Olason, L Smeds, A Suh, L Dutoit, S Bures, L Z Garamszegi, S Hogner, J Moreno, A Qvarnstrom, M Ruzic, S A Smther, G P Smtre, J Torok, and H Ellegren. Linked selec tion and recombination rate variation drive the evolution of the genomic landscape of differentiation across the speciation continuum of Ficedula fly catchers. Genome Res, 25(11):1656- 1665, November 2015. doi: 10.1101/gr. 196485.115. URL https: / /www. ncbi. nlm. nih. gov /pubmed/26355005. [15] Frank M.T.A. Busing, Erik Meijer, and Rien Van Der Leeden. Delete-m jackknife for unequal m. Statistics and Computing, 9(1):3- 8, 1999. ISSN 0960-3174. doi: 10.1023/ A:1008800423698. URL http: //dx. doi. org/10. 1023/A%3A1008800423698. [16] Jasper Fuk Woo Chan, Kelvin Kai Wang To, Honglin Chen, and Kwok Yung Yuen. Cross-species transmission and emergence of novel viruses from birds. Curr Opin Viral., 10:63- 69, 2015. 95 [17] B Charlesworth, M T Morgan, and D Charlesworth. The effect of deleteri ous mutations on neutral molecular variation. Genetics, 134(4):1289- 1303, August 1993. URL http://www. genetics. org/content/134/4/1289. [18] B Charlesworth, M Nordborg, and D Charlesworth. The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations. Genet Res, 70(2): 155-174, October 1997. URL https://www.ncbi.nlm.nih.gov/pubmed/ 9449192. [19] Brian Charlesworth. The effects of deleterious mutations on evolution at linked sites. Genetics, 190(1):5-22, January 2012. doi: 10.1534/genetics.111. 134288. URL http://www. ncbi. nlm. nih. gov /pubmed/22219506. [20] Brian Charlesworth. Background selection 20 years on: The Wilhelmine E. Key 2012 invitational lecture. Journal of Heredity, 104(2):161-171, 2013. doi: 10.1093/jhered/ess136. URL http:/ /jhered. oxford journals. org/ content/104/2/161.abstract. [21] Brian Charlesworth, Deborah Charlesworth, and Nicholas H. Bar ton. The effects of genetic and geographic structure on neu tral variation. Annual Review of Ecology, Evolution, and System atics, 34(1):99- 125, 2003. doi: 10.1146/annurev.ecolsys.34.011802. 132359. URL http: //arjournals. annualreviews. org/doi/abs/10. 1146/annurev.ecolsys.34.011802.132359. [22] J M Comeron, R Ratnappan, and S Bailin. The many landscapes of recombination in Drosophila melanogaster. PLoS Genet, 8(10), 2012. doi: 10.1371/journal.pgen.1002905. URL http://www. ncbi. nlm. nih. gov/pmc/ articles/PMC3469467/. [23] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015. [24] Russell B Corbett-Detig and Daniel L Hartl. Population genomics of inversion polymorphisms in Drosophila melanogaster. PLoS Genet, 8(12): e1003056, 2012. [25] Russell B. Corbett-Detig, Daniel L. Hartl, and Timothy B. Sackton. Natural selection constrains neutral diversity across a wide range of species. P LoS Biol, 13(4):e1002112, 04 2015. doi: 10.1371/journal.pbio.1002112. URL http://dx.doi.org/10.1371%2Fjournal.pbio.1002112. [26] T E Cruickshank and M W Hahn. Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow. Mal Ecol, 96 23(13):3133- 3157, 07 2014. doi: 10.1111/mec.12796. URL https: //www. ncbi.nlm.nih.gov/pubmed/24845075. [27] Charles Darwin. Journal of Researches Into the Natural History and Geol ogy of the Countries Visited During the Voyage of HMS" Beagle" Round the World: Under the Command of Capt. Fitz Roy, RN, volume 1. D. Appleton, 1890. [28] Nicolas Duforet-Frebourg, Keurcien Luu, Guillaume Laval, Eric Bazin, and Michael G.B. Blum. Detecting genomic signatures of natural selec tion with principal component analysis: Application to the 1000 genomes data. Molecular Biology and Evolution, 2015. doi: 10.1093/molbev/ msv334. URL http:/ /mbe. oxfordj ournals. org/ content/ early /2016/ 01/12/molbev.msv334.abstract. [29] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. Society for Industrial and Applied Mathematics, 1982. doi: 10.1137 / 1.9781611970319. URL http:/ /epubs. siam. org/doi/abs/10. 1137 /1. 9781611970319. [30] Hans Ellegren, Linnea Smeds, Reto Burri, Pall I. Olason, Niclas Back strom, Takeshi Kawakami, Axel Kunstner, Hannu Makinen, Krystyna Nadachowska-Brzyska, Anna Qvarnstrom, Severin Uebbing, and Jochen B. W. Wolf. The genomic landscape of species divergence in Ficedula fly catchers. Nature, 491(7426):756-760, November 2012. ISSN 00280836. doi: 10.1038/nature11584. URL http: //dx. doi. org/10 .1038/nature11584. [31] Adam Eyre-Walker and Peter D Keightley. The distribution of fitness effects of new mutations. Nature Reviews Genetics, 8(8):610, 2007. [32] J Felsenstein. PHYLIP: phylogenetic inference package, version 3.5 c. 1993. [33] Anna-Sophie Fiston-Lavier, Nadia D. Singh, Mikhail Lipatov, and Dmitri A. Petrov. Drosophila melanogaster recombination rate calculator. Gene, 463(1-2):18 - 20, 2010. ISSN 0378-1119. doi: http://dx.doi.org/10. 1016/j.gene.2010.04.015. URL http://www. sciencedirect. com/science/ article/pii/S0378111910001769. [34] B M Fitzpatrick, J R Johnson, D K Kump, J J Smith, S R Voss, and H B Shaffer. Rapid spread of invasive genes into a threatened native species. Proc Natl Acad Sci US A, 107(8):3606-3610, February 2010. doi: 10.1073/pnas. 0911802107. URL http://www. ncbi. nlm. nih. gov /pubmed/20133596. 97 [35] Owen T Gorman, William J Bean, Yoshihiro Kawaoka, and ROBERT G Webster. Evolution of the nucleoprotein gene of influenza A virus. J. Viral., 64(4):1487- 1497, 1990. [36] Peter R Grant and B Rosemary Grant. Evolution of character displacement in darwin's finches. science, 313(5784):224- 226, 2006. [37] Jennifer A Greenhall, Matthew A Zapala, Mario Caceres, Ondrej Libiger, Carrolee Barlow, Nicholas J Schork, and David J Lockhart. Detecting genetic variation in microarray expression data. Genome research, 17(8):1228- 1235, 2007. [38] Rafael F. Guerrero, Frangois Rousset, and Mark Kirkpatrick. Coalescent patterns for chromosomal inversions in divergent populations. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1587):430- 438, 2011. ISSN 0962-8436. doi: 10.1098/rstb.2011.0246. URL http:/ /rstb. royalsocietypublishing.org/content/367/1587/430. [39] Melissa Gymrek, Thomas Willems, Audrey Guilmatre, Haoyang Zeng, Barak Markus, Stoyan Georgiev, Mark J Daly, Alkes L Price, Jonathan K Pritchard, Andrew J Sharp, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nature genetics, 48(1) :22, 2016. [40] Benjamin C. Haller and Philipp W. Messer. SLiM 2: Flexible, interactive forward genetic simulations. Molecular Biology and Evolution, 34(1):230- 240, 2017. doi: 10.1093/molbev /msw211 . URL /brokenurl#+http: / /dx. doi.org/10.1093/molbev/msw211. [ 41] Benjamin C. Haller, Jared Galloway, Jerome Kelleher, Philipp W. Messer, and Peter L. Ralph. Tree-sequence recording in SLiM opens new hori zons for forward-time simulation of whole genomes. bioRxiv, 2018. doi: 10.1101/407783. URL https: //www.biorxiv.org/content/early/2018/ 09/04/407783. [ 42] Kelley Harris and Rasmus Nielsen. The genetic cost of Neanderthal intro gression. Genetics, 203(2):881- 891, June 2016. URL http://www. genetics. org/content/203/2/881. [43] P W Hedrick. Adaptive introgression in animals: examples and comparison to new mutation and standing variation as sources of adaptive variation. Mol Ecol, 22(18):4606-4618, September 2013. doi: 10.1111/mec.12415. URL https://www.ncbi.nlm.nih.gov/pubmed/23906376. [ 44] Asger Hobolth, Ole F Christensen, Thomas Mailund, and Mikkel H Schierup. Genomic relationships and speciation times of human, chimpanzee, and 98 gorilla inferred from a coalescent hidden markov model. PLOS Genet ics, 3(2):1- 11, 02 2007. doi: 10.1371/journal.pgen.0030007. URL http: //dx.doi.org/10.1371%2Fjournal.pgen.0030007. [45] Edward C Holmes, Christopher H Woelk, Raid Kassis, and Herve Bourhy. Genetic constraints and the adaptive evolution of rabies virus in nature. Virology, 292(2):247-257, 2002. [46] R R Hudson and N L Kaplan. Deleterious background selection with recombination. Genetics, 141(4):1605- 1617, December 1995. URL http: //www.genetics.org/content/141/4/1605. [47] Emilia Huerta-Sanchez, Michael DeGiorgio, Luca Pagani, Ayele Tarekegn, Rosemary Ekong, Tiago Antao, Alexia Cardona, Hugh E. Montgomery, Gianpiero L. Cavalleri, Peter A. Robbins, Michael E. Weale, Neil Brad man, Endashaw Bekele, Toomas Kivisild, Chris Tyler-Smith, and Rasmus Nielsen. Genetic signatures reveal high-altitude adaptation in a set of Ethiopian populations. Molecular Biology and Evolution, 30(8):1877- 1888, 2013. doi: 10.1093/molbev/mst089. URL http://mbe.oxfordjournals. org/content/30/8/1877.abstract. [48] Matthew B. Hufford, Pesach Lubinksy, Tanja Pyhajarvi, Michael T. Deven genzo, Norman C. Ellstrand, and Jeffrey Ross-Ibarra. The genomic signa ture of crop-wild introgression in maize. PLoS Genet, 9(5):e1003477, 05 2013. doi: 10.1371/journal.pgen.1003477. URL http:/ /dx. doi. org/10. 1371%2Fjournal.pgen.1003477. [49] International HapMap Consortium, K A Frazer, D G Ballinger, D R Cox, DA Hinds, LL Stuve, RA Gibbs, J W Belmont, A Boudreau, P Hardenbol, S M Leal, S Pasternak, DA Wheeler, TD Willis, F Yu, H Yang, C Zeng, Y Gao, H Hu, W Hu, C Li, W Lin, S Liu, H Pan, X Tang, J Wang, W Wang, J Yu, B Zhang, Q Zhang, H Zhao, H Zhao, J Zhou, S B Gabriel, R Barry, B Blumenstiel, A Camargo, M Defelice, M Faggart, M Goyette, S Gupta, J Moore, H Nguyen, RC Onofrio, M Parkin, J Roy, E Stahl, E Winchester, L Ziaugra, D Altshuler, Y Shen, Z Yao, W Huang, X Chu, Y He, L Jin, Y Liu, Y Shen, W Sun, H Wang, Y Wang, Y Wang, X Xiong, L Xu, M M Waye, S K Tsui, H Xue, J T Wong, L M Galver, J B Fan, K Gunderson, S S Murray, AR Oliphant, MS Chee, A Montpetit, F Chagnon, V Ferretti, M Leboeuf, J F Olivier, M S Phillips, S Roumy, C Sallee, A Verner, T J Hudson, P Y Kwok, D Cai, D C Koboldt, R D Miller, L Pawlikowska, P Taillon-Miller, M Xiao, L C Tsui, W Mak, Y Q Song, P K Tam, Y Nakamura, T Kawaguchi, T Kitamoto, T Morizono, A Nagashima, Y Ohnishi, A Sekine, T Tanaka, T Tsunoda, P Deloukas, C P Bird, M Delgado, E T Dermitzakis, R Gwilliam, 99 S Hunt, J Morrison, D Powell, B E Stranger, P Whittaker, D R Bentley, M J Daly, P I de Bakker, J Barrett, Y R Chretien, J Maller, S McCarroll, N Patterson, I Pe'er, A Price, S Purcell, D J Richter, P Sabeti, R Saxena, SF Schaffner, PC Sham, P Varilly, D Altshuler, L D Stein, L Krishnan, AV Smith, MK Tello-Ruiz, GA Thorisson, A Chakravarti, PE Chen, DJ Cutler, C S Kashuk, S Lin, G R Abecasis, W Guan, Y Li, H M Munro, Z S Qin, D J Thomas, G McVean, A Auton, L Bottolo, N Cardin, S Eyheramendy, C Freeman, J Marchini, S Myers, C Spencer, M Stephens, P Donnelly, LR Cardon, G Clarke, D M Evans, A P Morris, B S Weir, T Tsunoda, J C Mullikin, S T Sherry, M Feolo, A Skol, H Zhang, C Zeng, H Zhao, I Matsuda, Y Fukushima, DR Macer, E Suda, C N Rotimi, CA Adebamowo, I Ajayi, T Aniagwu, P A Marshall, C Nkwodimmah, C D Royal, M F Leppert, M Dixon, A Peiffer, R Qiu, A Kent , K Kato, N Niikawa, I F Adewole, B M Knoppers, M W Foster, E W Clayton, J Watkin, R A Gibbs, J W Belmont, D Muzny, L Nazareth, E Sodergren, GM Weinstock, DA Wheeler, I Yakub, S B Gabriel, R C Onofrio, D J Richter, L Ziaugra, B W Birren, M J Daly, D Altshuler, R K Wilson, L L Fulton, J Rogers, J Burton, N P Carter, C M Clee, M Griffiths, M C Jones, K McLay, R W Plumb, M T Ross, S K Sims, D L Willey, Z Chen, H Han, L Kang, M Godbout, J C Wallenburg, P L' Archeveque, G Bellemare, K Saeki, H Wang, D An, H Fu, Q Li, Z Wang, R Wang, A L Holden, L D Brooks, J E McEwen, M S Guyer, VO Wang, J L Peterson, M Shi, J Spiegel, L M Sung, L F Zacharia, F S Collins, K Kennedy, R Jamieson, and J Stewart. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851- 861, October 2007. doi: 10.1038/ nature06258. URL http://www. ncbi. nlm. nih. gov/pubmed/17943122. [50] Ivan Juric, Simon Aeschbacher, and Graham Coop. The strength of selection against Neanderthal introgression. bioRxiv, 2016. doi: 10.1101/ 030148. URL http://biorxiv.org/content/early/2016/07/22/030148. [51] N. Kambhatla and T. K. Leen. Dimension reduction by local principal com ponent analysis. Neural Computation, 9(7):1493-1516, July 1997. ISSN 0899- 7667. doi: 10.1162/neco.1997.9.7.1493. URL http: //ieeexplore. ieee. org/xpl/freeabs_all.jsp?arnumber=6795533. [52] A Kapoor, P Simmonds, WI Lipkin, S Zaidi, and E Delwart. Use of nucleotide composition analysis to infer hosts for three novel picorna-like viruses. J. Viral. , 84(19):10322-10328, 2010. [53] Fatemeh Kargarfard, Ashkan Sarni, Manijeh Mohammadi-Dehcheshmeh, and Esmaeil Ebrahimie. Novel approach for identification of influenza virus 100 host range and zoonotic transmissible sequences by determination of host related associative positions in viral genome segments. BMC Genomics, 17 (1):925, 2016. [54] Roshan Karki, Deep Pandya, Robert C Elston, and Cristiano Ferlini. Defin ing "mutation" and "polymorphism" in the era of personal genomics. BMC m edical genomics, 8(1):37, 2015. [55] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an intro duction to cluster analysis, volume 344. John Wiley & Sons, 2009. [56] Jerome Kelleher, Kevin Thornton, Jaime Ashander, and Peter Ralph. Effi cient pedigree recording for fast population genetics simulation. bioRxiv, 2018. doi: 10.1101/248500. URL https: //www .biorxiv.org/content/ early/2018/06/07/248500. [57] Yuseob Kim and Wolfgang Stephan. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics, 160(2):765-777, 2002. URL http://www.genetics.org/cgi/content/abstract/160/2/ 765. [58] Mark Kirkpatrick. How and why chromosome inversions evolve. PLoS Biol, 8 (9) , 2010. doi: 10.1371/journal.pbio.1000501 . URL http://www.ncbi.nlm. nih.gov/pubmed/20927412. [59] Mark Kirkpatrick and Brian Barrett. Chromosome inversions, adaptive cas settes and the evolution of species' ranges. Molecular Ecology, 2015. ISSN 1365-294X. doi: 10.1111/mec.13074. URL http://dx.doi.org/10.1111/ mec. 13074. [60] Vessela Nedelcheva Kristensen, Dimitris Kelefiotis, Tom Kristensen, and Anne-Lise B0rresen-Dale. High-throughput methods for detection of genetic variation. Biotechniques, 30(2):318- 332, 2001. [61] Joseph B Kruskal and Myron Wish. Multidimensional Scaling, volume 11. Sage, 1978. [62] Justin B Lack, Charis M Cardeno, Marc W Crepeau, William Taylor, Rus sell B Corbett-Detig, Kristian A Stevens, Charles H Langley, and John E Pool. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ances tral range population. Genetics, 199(4):1229- 1241, 2015. [63] C H Langley, K Stevens, C Cardeno, Y C Lee, D R Schrider, J E Pool, S A Langley, C Suarez, R B Corbett-Detig, B Kolaczkowski, S Fang, P M 101 Nista, AK Holloway, AD Kern, C N Dewey, Y S Song, MW Hahn, and DJ Begun. Genomic variation in natural populations of Drosophila melanogaster. Genetics, 192(2):533- 598, October 2012. doi: 10.1534/genetics.112.142018. URL http://www. ncbi. nlm. nih. gov /pubmed/22673804. [64] Daniel T Larose. k-nearest neighbor algorithm. Discovering Knowledge in Data: An Introduction to Data Mining, pages 90-106, 2005. [65] Susanna KP Lau, Patrick CY Woo, Kenneth SM Li, Yi Huang, Hoi-Wah Tsoi, Beatrice HL Wong, Samson SY Wong, Suet-Yi Leung, Kwok-Hung Chan, and Kwok-Yung Yuen. Severe acute respiratory syndrome coronavirus like virus in Chinese horseshoe bats. Proc. Natl. Acad. Sci. US A , 102(39): 14040-14045, 2005. [66] Thomas Lenormand. Gene flow and the limits to natural selection. Trends in Ecology f3 Evolution, 17( 4):183 - 189, 2002. ISSN 0169-5347. doi: DOI:10.1016/S0169-5347(02)02497-7. URL http://www. sciencedirect. com/science/article/pii/S0169534702024977. [67] Wendong Li, Zhengli Shi, Meng Yu, Wuze Ren, Craig Smith, Jonathan H Epstein, Hanzhong Wang, Gary Crameri, Zhihong Hu, Huajun Zhang, et al. Bats are natural reservoirs of SARS-like coronaviruses. Science, 310(5748): 676-679, 2005. [68] Laurence Loewe. Genetic mutation. Nature education, 1(1):113, 2008. [69] P R Loh, G Tucker, B K Bulik-Sullivan, B J Vilhjalmsson, H K Finucane, RM Salem, DI Chasman, PM Ridker, BM Neale, B Berger, N Patterson, and AL Price. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet, 47(3):284-290, March 2015. doi: 10.1038/ ng.3190. URL https: / /www. ncbi. nlm. nih. gov /pubmed/25642633. [70] Ben Longdon, Michael A Brockhurst, Colin A Russell, John J Welch, and Francis M Jiggins. The evolution and genetics of virus host shifts. PLoS Pathog. , 10(11):e1004395, 2014. [71] Yang Young Lu, Kujin Tang, Jie Ren, Jed A Fuhrman, Michael S Water man, and Fengzhu Sun. CAFE: accelerated alignment-free sequence analysis. Nucleic Acids Res., 45(gkx351):W554- W559, 2017. [72] J Ma and C I Amos. Investigation of inversion polymorphisms in the human genome using principal components analysis. PLoS One, 7(7) , 2012. doi: 10. 1371/journal.pone.0040224. URL http://www. ncbi. nlm. nih. gov /pubmed/ 22808122. 102 [73] Trudy F. C. Mackay, Stephen Richards, Eric A. Stone, Antonio Barbadilla, Julien F. Ayroles, Dianhui Zhu, Sonia Casillas, Yi Han, Michael M. Mag wire, Julie M. Cridland, Mark F. Richardson, Robert R. H. Anholt, Maite Barron, Crystal Bess, Kerstin Petra Blankenburg, Mary Anna Carbone, David Castellano, Lesley Chaboub, Laura Duncan, Zeke Harris, Mehwish Javaid, Joy Christina Jayaseelan, Shalini N. Jhangiani, Katherine W. Jor dan, Fremiet Lara, Faye Lawrence, Sandra L. Lee, Pablo Librado, Raquel S. Linheiro, Richard F. Lyman, Aaron J. Mackey, Mala Munidasa, Donna Marie Muzny, Lynne Nazareth, Irene Newsham, Lora Perales, Ling-Ling Pu, Car son Qu, Miquel Ramia, Jeffrey G. Reid, Stephanie M. Rollmann, Julio Rozas, Nehad Saada, Lavanya Turlapati, Kim C. Worley, Yuan-Qing Wu, Akihiko Yamamoto, Yiming Zhu, Casey M. Bergman, Kevin R. Thornton, David Mittelman, and Richard A. Gibbs. The Drosophila melanogaster genetic ref erence panel. Nature, 482(7384) :173-178, February 2012. ISSN 00280836. URL http:/ /dx. doi . org/10. 1038/nature10811. [7 4] Jose V Manj6n, Pierrick Coupe, Luis Concha, Antonio Buades, D Louis Collins, and Montserrat Robles. Diffusion weighted image denoising using overcomplete local PCA. PloS one, 8(9):e73021, 2013. [75] Simon Henry Martin, Markus Moest, Wiliam J Palmer, Camilo Salazar, W. Owen McMillan, Francis M Jiggins, and Chris D Jiggins. Natural selec tion and genetic diversity in the butterfly Heliconius melpomene. Genet ics, 203(1):525-541, May 2016. doi: 10.1101/042796. URL http://www. genetics.org/content/203/1/525. [76] Gil McVean. A genealogical interpretation of principal components analysis. PLoS Genet, 5(10) :e1000686, 2009. [77] P Menozzi, A Piazza, and L Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans. Science, 201( 4358):786-792, September 1978. URL http://www.ncbi.nlm.nih.gov/pubmed/356262. [78] Michael W Nachman. Single nucleotide polymorphisms and recombination rate in humans. TRENDS in Genetics, 17(9):481- 485, 2001. [79] Michael W Nachman and Susan L Crowell. Estimate of the mutation rate per nucleotide in humans. Genetics, 156(1):297- 304, 2000. [80] N J Nadeau, A Whibley, RT Jones, J W Davey, K K Dasmahapatra, S W Baxter, MA Quail, M Joron, RH ffrench Constant, ML Blaxter, J Mallet, and C D Jiggins. Genomic islands of divergence in hybridizing Heliconius butterflies identified by large-scale targeted sequencing. Philos Trans R Soc Land B Biol Sci, 367(1587):343-353, February 2012. doi: 10.1098/rstb.2011. 0198. URL http://www . ncbi . nlm. nih. gov /pubmed/22201164. 103 [81] M R Nelson, K Brye, K S King, A Indap, A R Boyko, J Novembre, L P Briley, Y Maruyama, D M Waterworth, G Waeber, P Vollenweider, J R Oksenberg, SL Hauser, HA Stirnadel, JS Kooner, JC Chambers, B Jones, V Mooser , CD Bustamante, AD Roses, DK Burns, M G Ehm, and EH Lai. The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am J Hum Genet, 83(3): 347- 358, September 2008. doi: 10.1016/j.ajhg.2008.08.005. URL http:// www.ncbi.nlm.nih.gov/pmc/articles/PMC2556436. [82] Rasmus Nielsen, Joshua S Paul, Anders Albrechtsen, and Yun S Song. Geno type and snp calling from next-generation sequencing data. Nature Reviews Genetics, 12(6):443, 2011. [83] John Novembre and Matthew Stephens. Interpreting principal component analyses of spatial population genetic variation. Nature genetics, 40(5):646- 649, 2008. [84] John Novembre, Toby Johnson, Katarzyna Brye, Zoltan Kutalik, Adam R Boyko, Adam Auton, Amit Indap, Karen S King, Sven Bergmann, Matthew R Nelson, et al. Genes mirror geography within europe. Nature, 456(7218):98-101 , 2008. [85] Peter J Oefner and Peter A Underhill. Dna mutation detection using dena turing high-performance liquid chromatography ( dhplc). Current protocols in human genetics, 19(1):7-10, 1998. [86] US Department of Health, Human Services, et al. 14th report on carcinogens (roe) , 2016. [87] Timothy Paape, Peng Zhou, Antoine Branca, Roman Briskine, Nevin Young, and Peter Tiffin. Fine-scale population recombination rates, hotspots, and correlates of recombination in the M edicago truncatula genome. Genome Biology and Evolution, 4(5):726- 737, 2012. doi: 10.1093/gbe/ evs046. URL http://gbe.oxfordjournals.org/content/4/5/726.abstract. [88] P Pamilo and M Nei. Relationships between gene trees and species trees. Mal Biol Evol, 5(5):568- 583, September 1988. doi: 10.1093/ oxfordjournals. molbev.a040517. URL https: / /www. ncbi. nlm. nih. gov/pubmed/3193878. [89] Nick Patterson, Alkes L Price, and David Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 12 2006. doi: 10.1371/ journal.pgen.0020190. URL http:/ /dx. plos. org/10. 1371%2Fjournal. pgen. 0020190. 104 [90] J B Pease and M W Hahn. More accurate phylogenies inferred from low-recombination regions in the presence of incomplete lineage sorting. Evolution, 67(8):2376- 2384, August 2013. doi: 10.1111/evo.12118. URL http://www.ncbi.nlm.nih.gov/pubmed/23888858. [91] James B. Pease, David C. Haak, Matthew W. Hahn, and Leonie C. Moyle. Phylogenomics reveals three sources of adaptive variation during a rapid radiation. PLOS Biology, 14(2):1-24, 02 2016. doi: 10.1371/journal.pbio. 1002379. URL http:/ /dx. doi. org/10 .1371%2Fjournal. pbio .1002379. [92] Tanya N. Phung, Christian D. Huber, and Kirk E. Lohmueller. Determining the effect of natural selection on linked neutral divergence across species. PLOS Genetics, 12(8):1-27, 08 2016. doi: 10.1371/journal.pgen.1006199. URL https: //doi. org/10 .1371/journal .pgen.1006199. [93] John E Pool. The mosaic ancestry of the Drosophila Genetic Reference Panel and the D. melanogaster reference genome reveals a network of epistatic fit ness interactions. Molecular Biology and Evolution, 32(12) :3236- 3251, 2015. doi: 10.1101/014837. URL http: //mbe. oxfordjournals. org/content/ 32/12/3236.abstract. [94] John E. Pool, Russell B. Corbett-Detig, Ryuichi P. Sugino, Kristian A. Stevens, Charis M. Cardeno, Marc W. Crepeau, Pablo Duchen, J. J. Emerson, Perot Saelao, David J. Begun, and Charles H. Langley. Pop ulation genomics of sub-Saharan Drosophila melanogaster: African diver sity and non-African admixture. PLoS Genet, 8(12):1- 24, 12 2012. doi: 10.1371/journal.pgen.1003080. URL http: //dx. doi. org/10. 1371% 2Fjournal.pgen.1003080. [95] Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 38(8): 904-909, 2006. [96] Ji Qi, Hong Luo, and Bailin Hao. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res., 32(suppl_ 2):W45-W47, 2004. [97] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman. Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol. , 16(12):1615- 1634, 2009. [98] Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, and Fengzhu Sun. Inference of Markovian properties of molecular sequences from 105 NGS data and applications to comparative genomics. Bioinformatics, 32(7): 993- 1000, 2015. [99] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323- 2326, 2000. ISSN 0036-8075. doi: 10.1126/science.290.5500.2323. URL http: //science. sciencemag.org/content/290/5500/2323. [100] Fabian Sievers and Desmond G Higgins. Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol. Biol., pages 105- 116, 2014. [101] M Slatkin and J L Pollack. The concordance of gene trees and species trees at two linked loci. Genetics, 172(3):1979- 1984, March 2006. doi: 10. 1534/ genetics.105.049593. URL https: / /www. ncbi. nlm. nih. gov /pubmed/ 16361238. [102] Montgomery Slatkin. Linkage disequilibrium- understanding the evolution ary past and mapping the medical future. Nature Reviews Genetics, 9(6): 477, 2008. [103] Temple F Smith and Michael S Waterman. Identification of common molec ular subsequences. J. Mol. Biol, 147(1):195-197, 1981. [104] Sean Stankowski, Madeline A Chase, Allison M Fuiten, Peter L Ralph, and Matthew A Streisfeld. The tempo of linked selection: Rapid emergence of a heterogeneous genomic landscape during a radiation of monkeyflow ers. bioRxiv, 2018. doi: 10.1101/342352. URL https: //www.biorxiv.org/ content/early/2018/06/21/342352. [105] Fabian Staubach, Anna Lorenc, Philipp W. Messer, Kun Tang, Dmitri A. Petrov, and Diethard Tautz. Genome patterns of selection and introgression of haplotypes in natural populations of the house mouse (Mus musculus). PLoS Genet, 8(8):e1002891, 08 2012. doi: 10.1371/journal.pgen.1002891. URL http:/ /dx. doi. org/10. 1371%2Fjournal. pgen. 1002891. [106] Daniel G Streicker, Amy S Turmelle, Maarten J Vonhof, Ivan V Kuzmin, Gary F McCracken, and Charles E Rupprecht. Host phylogeny constrains cross-species emergence and establishment of rabies virus in bats. Science, 329(5992):676- 679, 2010. [107] Haibao Tang, Vivek Krishnakumar, Shelby Bidwell, Benjamin Rosen, Agnes Chan, Shiguo Zhou, Laurent Gentzbittel, Kevin L Childs, Mark Yandell, Heidrun Gundlach, et al. An improved genome release (version mt4. 0) for the model legume Medicago truncatula. BMC genomics, 15(1):1, 2014. 106 [108] Qin Tang, Yulong Song, Mijuan Shi, Yingyin Cheng, Wanting Zhang, and Xiao-Qin Xia. Inferring the hosts of coronavirus using dual statistical models based on nucleotide composition. Sci. Rep., 5 (1 7155), 2015. [109] TL Turner, MW Hahn, and S V Nuzhdin. Genomic islands of speciation in Anopheles gambiae. PLoS Biol, 3(9), September 2005. doi: 10.1371/journal. pbio.0030285. URL https: / /www. ncbi. nlm. nih. gov /pubmed/16076241. [110] Benjamin Vernot and Joshua M. Akey. Resurrecting surviving neandertal lineages from modern human genomes. Science, 2014. doi: 10.1126/science. 1245938. URL http://www. sciencemag. org/ content/ early /2014/01/ 28/science.1245938.abstract. [111] Lin Wan, Gesine Reinert, Fengzhu Sun, and Michael S Waterman. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J. Comput. Biol., 17(11):1467-1490, 2010. [112] Ian J. Wang and Gideon S. Bradburd. Isolation by environment. Molecular Ecology, 23(23):5649-5662, 2014. ISSN 1365-294X. doi: 10.1111/mec.12938. URL http:/ /dx. doi. org/10. 1111/mec. 12938. [113] Andreas Weingessel and Kurt Hornik. Local PCA algorithms. Neural Net works, IEEE Transactions on, 11(6):1242-1250, 2000. [114] Sewall Wright. The genetical structure of populations. Annals of Eugenics, 15 (1):323-354, 1949. ISSN 2050-1439. doi: 10.1111/j.1469-1809.1949.tb02451. x. URL http:/ /dx. doi. org/10. 1111/j. 1469-1809. 1949. tb02451. x. [115] Jian Yang, Noah A Zaitlen, Michael E Goddard, Peter M Visscher, and Alkes L Price. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet, 46(2):100- 106, February 2014. ISSN 10614036. URL http:/ /dx. doi. org/10. 1038/ng. 2876. [116] Chi Yu Zhang, Ji Fu Wei, and Shao Heng He. Adaptive evolution of the spike gene of SARS coronavirus: changes in positively selected sites in different epidemic groups. BMC Microbial. , 6(1):88, 2006. [117] Yun Zhang, Brian D Aevermann, Tavis K Anderson, David F Burke, Gwe naelle Dauphin, Zhiping Gu, Sherry He, Sanjeev Kumar, Christopher N Larsen, Alexandra J Lee, et al. Influenza research database: An integrated bioinformatics resource for influenza virus research. Nucleic Acids Res. , 45 (D1):D466- D474, 2016. 107
Abstract (if available)
Abstract
The Human Genome Project opened an era of deciphering the mystery of the genome. Then, Next Generation Sequencing (NGS) technologies enabled us to obtain massive high-quality sequencing data and revolutionized genomic research. Meanwhile, the development and extensive application of machine learning show its impressive competence in the big data era. In this dissertation, I implement statistical and machine learning tools to extract information from the genetic variation in genome sequences and make predictions for future outcomes of interest. ❧ I first present a study using unsupervised learning methods to investigate the variation of population structure along the genome and try to identify the underlying biological causes. In this work, I use principal component analysis (PCA) to see the effect of population structure on each segment of the genome, followed by multidimensional scaling (MDS) to describe the variation across the genome. Exploration in three species and validation from simulations reveal significant variation in the effect of population structure on the scale of only a few megabases. In modern humans, polymorphic chromosomal inversions are the main factors that cause the localized variation. In Medicago truncatula, the variation is correlated with local gene density, and is perhaps due to linked selection. In Drosophila melanogaster, both inversions and gene density contribute to the variation. ❧ Next, I present another work comparing various supervised learning methods to predict the hosts of viruses based on viral sequences. In this work, I compared alignment and alignment-free based distances followed by K nearest neighbors (KNN), and k-mer frequency feature engineering followed by support vector machine (SVM) to predict hosts for rabies virus, coronavirus, and influenza A virus datasets. I showed that alignment-free methods could be a promising substitute for alignment-based methods and their performances are highly correlated with sample size, evenness of sampling and sequence conservation.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Machine learning of DNA shape and spatial geometry
PDF
Applications and improvements of background adjusted alignment-free dissimilarity measures
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
3D modeling of eukaryotic genomes
PDF
Unlocking capacities of genomics datasets through effective computational methods
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Understanding the 3D genome organization in topological domain level
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Predicting virus-host interactions using genomic data and applications in metagenomics
Asset Metadata
Creator
Li, Han
(author)
Core Title
Application of machine learning methods in genomic data analysis
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/11/2019
Defense Date
08/26/2019
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
alignment-free distance measure,genomic variation,KNN,machine learning,MDS,OAI-PMH Harvest,PCA,population structure,SVM
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu (
committee chair
), Chen, Liang (
committee member
), Ehrenreich, Ian (
committee member
), Ralph, Peter (
committee member
)
Creator Email
hli465@usc.edu,lyriclh@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-225606
Unique identifier
UC11674955
Identifier
etd-LiHan-7859.pdf (filename),usctheses-c89-225606 (legacy record id)
Legacy Identifier
etd-LiHan-7859.pdf
Dmrecord
225606
Document Type
Dissertation
Rights
Li, Han
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
alignment-free distance measure
genomic variation
KNN
machine learning
MDS
PCA
population structure
SVM