Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Analysis of genomic polymorphism in Arabidopsis thaliana
(USC Thesis Other)
Analysis of genomic polymorphism in Arabidopsis thaliana
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
ANALYSIS OF GENOMIC POLYMORPHISM INARABIDOPSISTHALIANA by Yu Huang A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2011 Copyright 2011 Yu Huang Dedication tofamilyandfriends ii Acknowledgments I would like to thank my advisor, Prof. Magnus Nordborg for accepting me into the lab when I might drop out of the department (with the help of Prof. Michael Waterman and Prof. Simon Tavar´ e), guiding me through early clueless stage of research with his wisdom, knowledge and patience, and helping me learn the vital skills of both verbal and written communication; committee members, Prof. Simon Tavar´ e and Prof. David Conti, who have provided valuable insights during discussions; numerous lab colleagues who fostered a very happy environment for both research and social life. I would also like to thank my lovely girlfriend, Kongkong, who has drastically mitigated the pain and stress out of the ordeal towards final Ph.D. completion. iii Table of Contents Dedication ii Acknowledgments iii List of Tables vi List of Figures vii Abstract xii Chapter 1: Detecting Polymorphism via Microarrays inA.thaliana 1 1.1 SNP Genotype Calling . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Modifications to the Oligo SNP Calling Algorithm . . . . . . . 2 1.1.3 Datasets Used in Filtering . . . . . . . . . . . . . . . . . . . . 2 1.1.4 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.5 Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Deletion Calling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.1 Raw Data and Array Filtering . . . . . . . . . . . . . . . . . . 9 1.2.2 Quality control datasets . . . . . . . . . . . . . . . . . . . . . . 10 1.2.3 Normalization Procedure . . . . . . . . . . . . . . . . . . . . . 11 1.2.4 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.2.5 Deletion Calling by SVM . . . . . . . . . . . . . . . . . . . . 19 1.2.6 Post-calling Processing . . . . . . . . . . . . . . . . . . . . . . 20 1.2.7 Estimating False Dicovery Rate and False Negative Rate . . . . 20 1.2.8 Polarizing Deletions against OutgroupArabidopsisLyrata . . . 21 1.2.9 Deletion-GWAS Details . . . . . . . . . . . . . . . . . . . . . 21 1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Chapter 2: SNP-based Genome Wide Association 26 2.1 Confounding by Population Structure . . . . . . . . . . . . . . . . . . 26 2.2 Enrichment ofapriori Candidate Genes in GWAS . . . . . . . . . . . 28 iv 2.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Chapter 3: Genome-wide Insertion/Deletion Polymorphism and GWAS Using the Deletion Polymorphism 40 3.1 Overview of Deletion Data . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Genome-wide Deletion Frequency Correlated with SNP Polymorphism Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3 The Ongoing Genome Shrinkage inA.thaliana . . . . . . . . . . . . . 47 3.4 Deletion-based GWAS . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Chapter 4: Polymorphism-GWAS Database and Web Portal 53 4.1 Database Architecture and Content . . . . . . . . . . . . . . . . . . . . 54 4.1.1 Collection of Genetic Polymorphism Data . . . . . . . . . . . . 54 4.1.2 Meta-information ofArabidopsisthaliana Ecotypes in DB stock 59 4.1.3 Phenotype and Association Results in DB stock 250k . . . . . . 60 4.2 Functionality of the Web Portal . . . . . . . . . . . . . . . . . . . . . . 62 4.3 Implementation of the Web Portal and Improvements Over Existing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Bibliography 75 v List of Tables 3.1 Top 10 countries by the number of ecotypes in deletion mapping . . . . 43 4.1 Summary of Genetic Polymorphism Data . . . . . . . . . . . . . . . . 54 vi List of Figures 1.1 Distribution of error estimates of 75 lines before the imputation. Aver- age 1.396% with standard deviation 0.00772%. . . . . . . . . . . . . . 7 1.2 Distribution of error estimates of 75 lines after the imputation. Average is 1.595% with standard deviation 0.00834%. . . . . . . . . . . . . . . 8 1.3 Error estimates before the imputation (X-axis) vs. after the imputation (Y-axis). The red line denotes equality. . . . . . . . . . . . . . . . . . . 8 1.4 median intensity of array (X-axis) vs. mean of its correlations with 5 Col arrays (Y-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.5 Example illustrates the complex re-arrangements in the alignment between Ler contigs and Col genome, which makes it difficult to infer the dele- tions. The ultra fragmentation of Ler contigs is also on display. This region is around chr 1, 7.97Mb, plotted in the coordinates ofCol genome. Multiple contigs, each occupying one panel with a title, are shown here to have perfect-match probes, represented as arrow, within this region. The TE gene, between 7,970,000bp and 7,975,000bp, is missing in all Ler contigs. However it is not claimed as a deletion for QC because it is immediately surrounded by two differentLer contigs, which together reflect one deletion, one translocation and one inversion. The complex re-arrangements depicted here could come from contig assembly errors. 12 1.6 Intensity of two arrays from the same ecotype Col, which is the refer- ence genome. a, array 1. b, array 43. X-axis and Y-axis correspond to the position on the chip. The chip is divided into roughly 100X100 hexagons, each containing around 220 probes. Plotted is the median intensity of probes in each hexagon. The chip is a hybrid with the tiling probes occupying the upper 3=5 of the chip and the SNP probes occu- pying the bottom2=5 of the chip, separated aroundY = 620. . . . . . . 13 vii 1.7 Histogram of euclidean distance on the 1612X1612 chip of chromosome- adjacent probes with median around 750. It shows two probes adjacent on the chromosome are randomly placed on the chip. . . . . . . . . . . 14 1.8 2d intensity plot of two arrays for Col after WABQN. The SNP probes in the bottom 2=5th are not shown. The blank parts are the excluded blocks which have too high intra-block intensity variance. . . . . . . . . 16 1.9 Probe intensity from array 3, 4, 41, and 150, all from ecotypeLer plotted against array 1, Col, the reference. The data is from the same block of chip. The dense cloud around the diagonal represents the bulk of probes that do not change between reference and target arrays. The less-dense cloud underneath the diagonal represents the deletions inLer, which would be the portion trimmed out in LTS. On the other hand, few dots are above the diagonal, meaning few duplications in Ler. This is not an exception, rather a recurring theme for other ecotypes, which is one reason we chose to focus on deletions. Another observation is the linearity between target and reference intensity, not so much in array 4 and 41 (b and c), which turn out to be the low-quality arrays with median intensity below 250. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.10 Separation of intensity distribution between deleted probes and normal probes in one Ler array after different normalization procedures were applied. Copy=1 represents normal segments because A. thaliana is naturally an inbred line and thus homozygous in most part of genome. Copy=0 has some normal segments contained due to the loose crite- ria to derive normal segments. a, only quantile normalization across all arrays; b, WABQN and block-by-block LTS; c, WABQN, block- by-block LTS, and four-probe window median smoothing; d, WABQN, block-by-block LTS, and nine-probe window median smoothing. The extent of separation is measured in terms of the D-statistic (ks) of Kolmogorov- Smirnov test (bigger is better) and rank sum statistic (wlcx) of Wilcoxon test. p-values are not used because they are always zero due to the large sample size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.11 a, The deleted fraction of a segment (by color) depends on the ampli- tude, average probe intensity, X-axis, and the number of probes (logged) in that segment, Y-axis. b, The deleted fraction of a segment (by color) depends on the amplitude, average probe intensity, X-axis, and the num- ber of probes per kb (logged) in that segment, Y-axis. . . . . . . . . . . 25 viii 1.12 The deleted fraction of a segment (by color) depends on the amplitude, average probe intensity, X-axis, and the number of probes (logged) in that segment, Y-axis. a, array 3. The median intensity of array 3 is 668. b, array 41. The median intensity of array 41 is 157. These two arrays come from the same ecotype Ler. Although the rough trend agrees between two arrays, distribution of array 3 is shifted to the lower intensity end more than that of array 41 and so is the separation between the normal and deleted segments. . . . . . . . . . . . . . . . . . . . . 25 2.1 a, Genome-wide p-values for flowering time under 10 C environment from Wilcoxon test. b, Histogram ofp-values froma. . . . . . . . . . . 27 2.2 QQ plot shows a significant departure from the null model. Association result comes from flowering time under long day environment using Wilcoxon test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3 The upper panel shows the geographic distribution of the sampled A. thaliana ecotypes with the focus on Europe. Each dot represents an eco- type, colored according to the log(flowering time under long day envi- ronment). The lower panel is the PCA plot of ecotypes by the first two principle components. Ecotypes are again represented as colored dots, connected to their geographic counterparts in the upper panel through dashed lines. The PCA clustering roughly reflects the geographic rela- tionship between the ecotypes. The key point is the strong correlation between the second principal component (PC2) and the flowering phe- notype, which is a sign of the existence of the confounding issue. . . . . 34 2.4 Enrichment statistic vs log(p-value) cutoff. Flowering time under long day environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5 Decay of enrichment statistic vs cutoff (-log10(p-value)) for all flower- ing phenotypes, using the Wilcoxon approach. . . . . . . . . . . . . . . 36 2.6 Decay of enrichment statistic vs cutoff (-log10(p-value)) for all flower- ing phenotypes, using EMMA. EMMA has reduced p-values to a more reasonable range while enabling more enrichment ofapriori candidates genes among its top associations than Wilcoxon (Figure 2.5). . . . . . 37 2.7 Candidate SNPs are over-represented among strong associations. GWA analysis of the phenotype of flowering time at 10 C: P-values from the Wilcoxon rank-sum test are plotted against those from EMMA. Points corresponding to SNPs within 20 kb of a candidate gene are shown in red; the rest are shown in blue. The enrichment of the former over the latter in different parts of the distribution is as indicated. . . . . . . . . 38 ix 2.8 Candidate SNPs are over-represented among strong associations. Enrich- ment ratios across all flowering-related phenotypes, using the same significance- thresholds as in Figure 2.7. The blue surface is enrichment ratio=1. . . . 39 3.1 Geographic distribution of 917 ecotypes from which deletions have been mapped. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Length distribution of 17,699 discovered deletion loci. . . . . . . . . . 42 3.3 a, FDR vs frequency. b, FDR vs number of probes. . . . . . . . . . . . 43 3.4 a, Functional impact of deletions based on population frequency. b, Fraction of each gene class impacted by rare and common deletions. . . 44 3.5 a, Distribution of maximum r 2 between deletion and its nearby SNPs (within 20kb). b,r 2 depends on allele frequency and local SNP density. c,r 2 does not depend on deletion size. . . . . . . . . . . . . . . . . . . 45 3.6 Multi-allelic deletion locus around RFL1 and RPS5. Each red bar in the bottom deletion-haplotype panel denotes a deletion for a particular accession. RFL1 has a deletion allele within its promoter apart from the other deletion allele that covers most part of the gene. RPS5 has two overlapping deletion alleles with one more common than the other. . . . 46 3.7 Genome-wide pattern of deletion frequency correlated with nucleotide diversity level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.8 a, GWAS from seven phenotypes are plotted here in fourteen panels, two panels, SNP-GWAS and deletion-GWAS respectively, for each pheno- type. b, Gene RPS5 in deletion-GWAS is tagged by SNPs nearby or within. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.9 TE found under the peak where there is a paucity of SNP loci in flower- ing time (long day environment) association. . . . . . . . . . . . . . . 51 4.1 The relationship among tables in database stock 250k. . . . . . . . . . 55 4.2 The relationship among tables in database at. . . . . . . . . . . . . . . 56 4.3 The relationship among tables in database stock. . . . . . . . . . . . . 57 4.4 The relationship among tables in database chip, dbsnp, and genome. . 58 4.5 Map showing the location of 7075 ecotypes genotyped in 149SNP dataset. 59 x 4.6 Relationship among phenotypes is plotted by the first and second prin- cipal components from PCA (Principal Component Analysis). Each dot represents one phenotype, colored according to the category it belongs to. 61 4.7 a, Ecotype search interface, with auto-completion in action and support of regular-expression search. b, Table of ecotypes returned. . . . . . . . 67 4.8 One tab for each category of phenotypes. Under each tab is a dynamic table filled with phenotypes. Clicking on each phenotype shows a popup with links to detailed phenotype information, GWAS plots (Figure 4.10), and genes adjacent to top associations. . . . . . . . . . . . . . . . . . . 68 4.9 a, Phenotype for each ecotype, represented by dots, plotted by longitude and latitude. The plot is visualized through Google Motion Chart. b, Change the two axes in a to the first two principal components from PCA analysis on the genotype matrix. Mouse-over or clicking each dot in the chart displays the information of that ecotype. . . . . . . . . . . . 69 4.10 Genome-wide association p-values plot. Mouse-over each dot shows the position and p-value of the SNP. Clicking each dot results an 80kb zoom-in SNP page, (Figure 4.11). This visualization is done by Google Visualization Scatter Chart. . . . . . . . . . . . . . . . . . . . . . . . . 70 4.11 View for the SNP selected in Figure 4.10. There are five tabs in the view. First tab, GBrowse, is close-up of the association around the SNP. The “SNP Summary Info” tab lists the position, p-value, annotation of the SNP. The “All Phenotypes in which SNP is significant” tab lists other phenotypes in which this SNP is among the top 1000 associations. The “Ecotype Allele Phenotype BarChart” is described in Figure 4.12. The “Ecotype Allele Phenotype Table” is a table listing which ecotype car- ries which allele and the corresponding phenotype. . . . . . . . . . . . 71 4.12 Bar chart depicting the correspondence between ecotype allele and phe- notype. In this bar chart, each bar represents an ecotype and is colored according to the allele the ecotype carries. The height of each bar cor- responds to the phenotype of the ecotype. All ecotypes are sorted hor- izontally according to their phenotype. Mouse-over each bar displays the information of that ecotype. . . . . . . . . . . . . . . . . . . . . . . 72 4.13 a, Search GWAS by gene name with auto-completion in action. b, Table returned from the search. . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.14 Overall architecture of the web portal. . . . . . . . . . . . . . . . . . . 74 xi Abstract This thesis is about several studies related to GWAS (Genome-wide Association Study) and genomic features of Arabidopsis thaliana. The main genetic data used through- out are 1000 ecotypes genotyped on a hybridization microarray that harbors 1 million probes [Kim et al., 2007, Clark et al., 2007] for 250k SNPs and 1.4 million genome tiling probes. First we adapted a genotype calling method to call 250k SNPs, which form the genotype end of SNP-GWAS. We also developed a novel normalization scheme for tiling probes and a pipeline to call deletions robustly for nearly 1000 accessions. Second, we present work on the issue of population structure confounding encoun- tered during SNP-GWAS for flowering time traits. Particularly, development and results of an enrichment statistic are described in detail. Next, we introduce a deletion polymor- phism dataset and the GWAS based on it. Along the way, we find deletion data suitable to address some interesting population genetics questions. For example, why does A. thaliana self-fertilize while all its close relatives are outcrossers? Why is its genome much smaller? What is causing this? Lastly, in order to manage and visualize the large-scale mapping and population genetics results, we created an innovative framework which combines traditional rela- tional database and modern web 2.0 technologies. The powerful framework enabled us to navigate through results with ease and publish our findings to a wide-ranging audi- ence. xii Chapter 1 Detecting Polymorphism via Microarrays inA.thaliana This chapter is about the technical work behind the genotype calling for both the SNP and deletion polymorphisms. 1.1 SNP Genotype Calling The genotype calls are the foundation of all the following work. The accuracy of geno- type calls has a fundamental impact on all the analyses, results and conclusions reached. The final version reflects the best that we have done within our capacity and resources. The whole genotyping pipeline consists of running a modified Oligo pack- age [Carvalho et al., 2007], detailed in section 1.1.2, filtering the dataset in several steps to remove arrays and SNPs deemed of bad quality, imputation and other downstream analysis. We did not solely rely on the probabilities outputted by Oligo because the 20-ecotype dataset [Clark et al., 2007] used in training Oligo is a very limited representation of more than 1000 ecotypes. Oligo occasionally assigns high confidence to the calls in certain arrays that are found to be of very high mismatch rates through QC. After excluding the possibility of ecotype ID mis-labeling, we found out it is because of their peculiar intensity distribution, which is quite different from the arrays used in the training. 1 1.1.1 Platform The SNP genotyping was performed with the Affymetrix 250K SNP-tiling array (AtSNPtile1). One SNP-tiling array contains probe sets for 248,584 pre-selected SNPs and each SNP is represented by a probe set of four probes, i.e., combinations of two alleles and two strands (sense and anti-sense). Samples were processed at the microar- ray core facilities at the University of Chicago and the Children Hospital of Los Angeles affiliated with the University of Southern California. 1.1.2 Modifications to the Oligo SNP Calling Algorithm The Oligo [Carvalho et al., 2007] package was designed to call genotypes with the Affymetrix Human Mapping array sets. The calling algorithm summarizes each strand across arrays to estimate the cluster centers/spread for each SNP based on a training data set. Our modifications are as follows: (1) we modified the readcel.R file in the package such that it could read in the intensities of the 250k array’s custom probe sets (i.e., four perfect match probes for each SNP); (2) we only consider calling homozygotes since the Arabidopsis thaliana lines are close to 99% homozygous across the genome due to its selfing nature; (3) we used the Perlegen data, [Clark et al., 2007], as the training set, i.e., 20 ecotypes in 31 arrays (with duplicates). With a leave-one-out approach, we esti- mated the cluster parameters and applied logistic regression on predicted calls to get the confidence intervals for the two homozygous clusters. 1.1.3 Datasets Used in Filtering We have four Arabidopsis thaliana datasets from previous projects, all of which are leveraged in the QC step to help achieve high quality data. 2 The 2010 dataset, [Nordborg et al., 2005], is done by de-novo Sanger-sequencing technology of 1500 fragments randomly picked throughout the whole genome in 96 globally-sampled ecotypes to study population structure. Each fragment is about 700bp long. The 384 dataset is done by mass-spectrometry at 384 random loci genome-wide in a 96-sample different from the 2010 dataset. The Perlegen dataset, [Clark et al., 2007], is done by a whole-genome tiling array with four different probes for each position in 20 ecotypes picked from the 96-sample in 2010 dataset, from which all the SNP positions in the current project are chosen. The 149SNP dataset (unpublished), done by a mass-spectrometry-based technology from SEQUENOM Inc. at 149 loci picked out of the polymorphic loci in the 2010 dataset with minor allele frequency near 0.5. This dataset covers over 6000 global eco- types. The 2010 dataset is presumably of highest quality, followed by Perlegen dataset, then 384, 149SNP. The former two are of considerably better quality than the latter two. Due to its genome-wide coverage, the Perlegen dataset is used in SNP filtering while a dataset merging the other three, 2010, 384, 149SNP (in the order of precedence one dataset is taken over the other when calls for the same ecotype and same SNP differ in two datasets.) datasets, referred to as 2010-384-149, is used in whole-array filtering. 1.1.4 Pipeline Run Oligo Oligo was applied to all arrays with the training parameters, detailed in section 1.1.2. After running Oligo, we retained the calls whose posterior probability is above or equal 3 to 0.85 and mark the rest as missing, then we applied the following filters in sequential order. Filter Arrays by Mismatch Rate Any array whose mismatch rate from a comparison with the 2010-384-149 dataset is above 10% is discarded. The moderate criterion, 10%, is chosen because except for the 92 lines in the 2010 dataset (around 1500 overlapping loci), the other lines only have 60-150 loci from either the 149SNP or 384 dataset to compare with. Besides that, the error estimate for the 149SNP or 384 dataset is around 5%. The confidence interval of their mismatch rates is thus fairly wide. Filter SNPs by Mismatch Rate Any SNP whose mismatch rate from a comparison with the Perlegen dataset is above 10% is discarded. The moderate criterion, 10%, is chosen because the Perlegen dataset covers only 20 lines and the confidence interval is also wide. Filter SNPs by Missing Rate Any SNP whose missing rate is above 25% is discarded. This filter is applied because first, we are not very confident about the mismatch rates from the step above. Second, the missing rate is found to be correlated with the mismatch rate. A high missing rate suggests possible copy-number variants or other biological processes of interest. Third, the imputation algorithm, NPUTE [Roberts et al., 2007], used in the last step imputes every missing call without assigning confidence and letting NPUTE impute missing SNPs based on too few calls is thus not ideal. 4 Substitute Calls with Corresponding Ones in the Perlegen Dataset Due to the high-quality and whole-genome coverage of the Perlegen dataset, for the lines from the genotype data above which have corresponding Perlegen lines, we decided to substitute their calls with corresponding non-missing Perlegen calls. Remove Monomorphic SNPs This is a step to make sure that NPUTE [Roberts et al., 2007] will not impute SNPs that are monomorphic as they are not interesting in the association mapping. Impute All Remaining Missing Calls NPUTE [Roberts et al., 2007], a sliding-window nearest-neighbor searching algorithm, is chosen based on accuracy and performance. Its accuracy is comparable to fast- PHASE [Scheet and Stephens, 2006] while being considerably faster. One disadvantage is that it does not assign probability to the imputed calls. We address that by removing SNPs with too many missing calls in an earlier step. The only parameter of this algo- rithm, window size, which is defined as the number of SNPs within the sliding window, has almost no effect on the quality of the imputed calls once it is over 10. We tried 10, 20, 30 till 90 and settled upon 30. Remove Redundant Arrays and Untrusted Lines Redundant arrays are the ones that use the DNA from the same line. In the beginning, we did a couple of repeats to test reproducibility both within and between the core facilities. This step removed them all except the one with lowest mismatch rate compared against the 2010-384-149 dataset. We also removed 3 lines (Uod-2, Blh-1, Santa Clara) which we suspect are contaminants either because the genotype data do not agree among the 5 individual 2010, 384 and 149 datasets or our collaborators notify us that the phenotype does not match what is reported in literature. Insert SNPs/Deletions from FRI and CLF Region in the 2010 Dataset Both FRI and CLF are known to be involved in flowering time and the regions around them were fully sequenced as recorded in the 2010 dataset. For FRI, we incorporated the two known functional deletions and another newly discovered deletion in the 3’UTR region into the 250k data. For CLF, which is less known about its functional sites, all the SNPs around it were incorporated into the 250k data. Since the 2010 dataset only covers about half the lines in 250k data, we applied NPUTE [Roberts et al., 2007] to impute the alleles at these sites for the rest of the lines. Substitute Calls with Corresponding Ones in the 2010-384-149 Dataset Because the 2010-384-149 dataset is of better quality than the 250k calls, despite its limited genome coverage (at most 1500 or so loci for the 2010 lines), we substituted the 250k calls with their non-missing counterparts from the 2010-384-149 dataset. This step has limited improvement on the final association result due to the small coverage of the 2010-384-149 dataset Remove Monomorphic SNPs Again This step makes sure that the association mapping in the next step does not use any monomorphic SNPs. 1.1.5 Error Estimates Because of the technology (de novo Sanger-sequencing) adopted and more than 1000 overlapping loci with the 250k array, the 96 lines in the 2010 dataset, 6 Figure 1.1: Distribution of error estimates of 75 lines before the imputation. Average 1.396% with standard deviation 0.00772%. [Nordborg et al., 2005], serves as gold standard to estimate the error rates. All but one are genotyped by the 250k array, among which 20, with their calls substituted by the Perlegen dataset, do not reflect the quality of the 250k array and thus are excluded from the error estimate. Since no heterozygous genotype is called in the 250k array, all het- erozygous genotypes in the 2010 dataset are ignored. For the remaining 75 lines used in the error estimate, before the imputation, the overall error estimate is 1.388%. The error estimate averaged across 75 lines is 1.396% with standard deviation 0.00772%. Figure 1.1 shows the error estimates of 75 lines before the imputation. After the impu- tation, the overall error estimate is 1.595%. The error estimate averaged across 75 lines is 1.595% with standard deviation 0.00834%. Figure 1.2 shows the error estimates of 75 lines after the imputation. Furthermore, as shown in Figure 1.3, the increase of error rate by the imputation is below 0.5% except one line, Mr-0, whose error estimate is increased by 0.96%. 7 Figure 1.2: Distribution of error estimates of 75 lines after the imputation. Average is 1.595% with standard deviation 0.00834%. Figure 1.3: Error estimates before the imputation (X-axis) vs. after the imputation (Y- axis). The red line denotes equality. 8 1.2 Deletion Calling 1.2.1 Raw Data and Array Filtering A little more than 1.4 million probes unique in the reference genome were put on the chip with a slight focus on genic regions, especially exons. Given that the reference genome of ecotype Col is 120Mb, it is roughly one probe every 100 base. By design, the tiling array will not be able to detect any insertion relative to the reference genome in non-Col ecotypes. We started with 1266 arrays and 1179 unique ecotypes from around the world. There are five copies of Col, the reference genome, for QC purpose. Based on the QC done in the SNP genotype-calling in [Atwell et al., 2010], the median intensity of an array is correlated with the quality of an array. The higher the median intensity is, the lower the error rate is. Because the majority of the probes are of normal (copy=2) segments and do not represent change from the reference genome even for a distant ecotype, we use the correlation of probe intensity between one ecotype and each of the five Col arrays as a quality indicator. As shown in Figure 1.4, which plots the relationship between the median intensity of each array and mean of its correlations with 5 Col arrays, the quality of an array increases with the median intensity and then plateaus when median intensity is around 250, which is the cutoff to separate arrays of good and bad quality. We did not choose arrays based on the correlation because low correlation withCol will not be able to distinguish between arrays of ecotypes genetically distant from Col and bad arrays. A total of 119 arrays were removed because of low median intensity. We removed duplicates among the remaining 1147 arrays by retaining only the one with highest median intensity and ended up with 917 unique ecotypes. The phenotype data used in deletion-GWAS are same as published in [Atwell et al., 2010]. The sample size is either 96 or 192. 9 Figure 1.4: median intensity of array (X-axis) vs. mean of its correlations with 5 Col arrays (Y-axis) 1.2.2 Quality control datasets The 95Mb Ler contig sequences obtained by Sanger-sequencing (http://www. arabidopsis.org/browse/Cereon/index.jsp) are an independent source for us to get bona-fide deletions. We first aligned the Ler contig sequences to the ref- erence genome by Mummer [Kurtz et al., 2004] (parameters: maxmatch, maxgap=40, mincluster=60). Due to low ( 80% assuming Ler genome size is similar to Col) cov- erage of the Ler genome, simply calling any reference genome segment missing in Ler contig sequences a deletion inLer will result in an estimated 20% false positives, so we decided to focus on missing segments that are immediately surrounded by the sameLer 10 contig on the two ends in the alignment. However, most contigs are very small. Out of 81306 contigs, 1449 are of size>=5kb; only 144 are of size>=10kb. Also compli- cated re-arrangements (Figure 1.5) forced us to discard contigs that potentially contain deletions. In the end, we discovered 1835 deletions whose size is equal to or more than 300bp (726 if 500bp). There would be almost no false positives but the false negative rate should be very high. The high quality of deletions makes it good for comparing the intensity distribution of deletions versus normal segments. But the high false negative rate rules out the possibility for it to be used as a training dataset in tiling-array deletion calling. Due to the lack of high-quality genome-wide deletion data, we decided to use paired- end (PE) data from 65 lines as another quality control dataset. These lines were sequenced at 20X coverage using the Illumina Genome Analyzer, with fragment length of 70bp. The data are groomed for discovery of structural variation such as deletions, inversions, translocations, duplications. However, output by extant structural-variation detecting algorithms demonstrated that better algorithms need to be developed to call deletions along with inversions, translocations, etc. We used a simple and stringent cri- terion to derive deletions from PE data. All reads were aligned to the reference genome allowing 4% mismatches of the read length to get the read count, which was smoothed into 100bp-window coverage data and any 100bp fragment below coverage one was regarded as deleted. Adjacent deletions were merged to create the PE-derived deletion set. This deletion set is primarily used in training of the SVM deletion calling algorithm. 1.2.3 Normalization Procedure The normalization procedure consists of three steps. First, we corrected for the within-array spatial effect. Due to the non-uniform dis- tribution of reagents, DNA, fluorescence on a microarray chip, part of the chip would 11 Figure 1.5: Example illustrates the complex re-arrangements in the alignment between Ler contigs and Col genome, which makes it difficult to infer the deletions. The ultra fragmentation of Ler contigs is also on display. This region is around chr 1, 7.97Mb, plotted in the coordinates of Col genome. Multiple contigs, each occupying one panel with a title, are shown here to have perfect-match probes, represented as arrow, within this region. The TE gene, between 7,970,000bp and 7,975,000bp, is missing in all Ler contigs. However it is not claimed as a deletion for QC because it is immediately sur- rounded by two different Ler contigs, which together reflect one deletion, one translo- cation and one inversion. The complex re-arrangements depicted here could come from contig assembly errors. have systematically higher or lower intensity than the rest. Spatial effects are a common problem for most microarray technologies and is different from the long-range spatial correlation of probe intensities along the chromosome [Marioni et al., 2007]. It is par- ticularly pressing in our case (Figure 1.6) since we have arrays processed at two different core facilities in a two-year period. Instead of removing the long-range spatial effect via 12 Figure 1.6: Intensity of two arrays from the same ecotype Col, which is the reference genome. a, array 1. b, array 43. X-axis and Y-axis correspond to the position on the chip. The chip is divided into roughly 100X100 hexagons, each containing around 220 probes. Plotted is the median intensity of probes in each hexagon. The chip is a hybrid with the tiling probes occupying the upper3=5 of the chip and the SNP probes occupying the bottom 2=5 of the chip, separated aroundY = 620. a spline fitting [Marioni et al., 2007], we took a non-parametric approach to remove this effect. Because probes are randomly placed on the array, the order of probes on the chromo- some is not correlated with their order on the chip (Figure 1.7). Two large-enough ran- dom blocks within a chip should contain similar amounts of normal, deleted and dupli- cated probes. Hence they have the same intensity distribution and this warrants quantile normalization among the blocks within one array. Detailed algorithm is described in 1.1. Block size and how much overlapping exist between them is critical to the algorithm. A block size too small runs the risk of violating the same-distribution assumption because the number of deleted or duplicated probes might start differing too much. Small block size also increase the computational cost quadratically. However, block size can not be too big because spatial effect would be reflected in each block. Overlapping among 13 Figure 1.7: Histogram of euclidean distance on the 1612X1612 chip of chromosome- adjacent probes with median around 750. It shows two probes adjacent on the chromo- some are randomly placed on the chip. blocks is designed to get rid of the boundary effect. By comparing the intensity dis- tribution of normal probes and deleted probes known from PE-dataset, we chose the block size to be 100 and overlap to be 50. Figure (intensity 2D plot after spatial effect removal) shows the chip intensity after the correction. Second step: reference-target block-by-block LTS (Least Trimmed Squares) [Cheng and Li, 2005]. Our perspective is to find a transformation that matches the dis- tributions of hybridization levels of those probes corresponding to normal chromosome segments between arrays. We address three important issues. First, chip-specific spatial patterns exist due to uneven hybridization and measurement process. The spatial pattern 14 Algorithm 1.1 Algorithm to remove spatial effect for one array blockSize=100 blockOverlap=50 blocks = partition-array-space(blockSize, blockOverlap) m = maximum number of probes with a blocks blockIntensityList = [] for block in blocks do intensityList = block probe intensity in this array n = length(intensityList) ifn< 1=5m then (too few probes in this block) skip this block end if ifn<m then (make each block of even size) resampleList = re-sample (m-n) from intensityList add resampleList into intensityList end if append intensityList to blockIntensityList end for (carry out quantile normalization) blockIntensityList = quantile-normalize(blockIntensityList) probeIntensityList = [] for intensityList in blockIntensityList do add intensity of one probe to probeIntensityList if one probe belongs to more than one blocks then take median of its intensity end if end for return probeIntensityList to be corrected here refers to the pattern that is consistent across multiple arrays, which is specific to the chip or core facility. After first step of within-array-block-quantile- normalization, there is little but still visible spatial pattern remaining (Figure 1.8). Sec- ond, probe-sequence-specific effect exists because of varying binding affinity of differ- ent nucleotides, base position in the probe, etc. Instead of creating 101 (= 25X4 + 1) parameters to model the position-specific nucleotide effect [Carvalho et al., 2007], we 15 Figure 1.8: 2d intensity plot of two arrays for Col after WABQN. The SNP probes in the bottom 2=5th are not shown. The blank parts are the excluded blocks which have too high intra-block intensity variance. removed it via regressing the intensity against that of the same probe from the reference array, which has no genomic change. The probe intensity on the reference array acts as a surrogate for the sequence-specific effect. We did not use direct subtraction because the slope between target and reference intensity deviates quite a bit from 1 for some arrays, as shown in Figure 1.11. Third, in some cases a substantially large portion of genome are changed between a target and a reference array. For the purpose of normalization, we need to identify a subset that exclude those probes corresponding to the changing part of genome and/or abnormal probes due to experimental variation. LTS (Least Trimmed Squares) is a natural choice to achieve this goal. Substantial genomic change is pro- tected in LTS by setting an appropriate trimming fraction. The trimming fraction is set at 40%, which is enough for the genomic changes amongA.thaliana ecotypes, as shown in Figure 1.9. To take into account the cross-array spatial pattern of hybridization, we divide each array into sub-arrays and normalize probe intensities within each sub-array. Detailed algorithm is described in 1.2. 16 Algorithm 1.2 Algorithm to remove probe-sequence-specific effect of target-array based on ref-array-list blockSize=100 blockOverlap=50 blocks = partition-array-space(blockSize, blockOverlap) m = maximum number of probes with a blocks targetArrayProbeIntensityByMultiRefArrays = [] for ref-array in ref-array-list do targetArrayProbeIntensityListByOneRefArray = [] for block in blocks do target = probe intensity from the block of the target-array ref = intensity of probes from the same block of this ref-array run LTS (Least Trimmed Squares) as ref = a + b*target (The trimming fraction is set at 40%.) targetProbeIntensity = a+b*target-ref add targetProbeIntensity into targetArrayProbeIntensityListByOneRefArray if one target probe belongs to more than one block then take median of its intensity in targetArrayProbeIntensityListByOneRefArray end if append targetArrayProbeIntensityListByOneRefArray to targetArrayProbeIn- tensityByMultiRefArrays end for end for targetArrayProbeIntensityList = median of each probe in targetArrayProbeIntensity- ByMultiRefArrays return targetArrayProbeIntensityList Lastly, we carried out a smoothing step by assigning the median intensity in nine- probe windows to the center probe. This step further reduced the noise inherent in the probe intensity and increased the separation between deletion-probe intensity distribu- tion and normal-probe intensity distribution (Figure 1.10). This median smoothing step also brings the intensity distribution closer to the normal distribution modeled in the segmentation algorithm we used, GADA[Pique-Regi et al., 2008]. 17 Comparison with Quantile Normalization Quantile normalization or its variants, i.e. q-spline normalization, has become adefacto standard for normalization across multiple arrays. Here we compared our procedure with quantile normalization by looking at the intensity distribution of probes that are within the high-quality Ler-contig-derived deletions and probes that are of copy one according to Ler contig sequences (Figure 1.10). We advise extra caution if quantile normalization is used as the sole normalization procedure as it is clearly unable to get rid of the spatial effect and probe-sequence-specific effect. 1.2.4 Segmentation After normalization, we carried out segmentation for each array by running GADA [Pique-Regi et al., 2008] using the options “-M 5 -T 10 - 2.5”. GADA exploits the use of piecewise constant (PWC) vectors to represent genome copy number and sparse Bayesian learning (SBL) to detect copy number alteration breakpoints. It is extremely fast for breakpoint finding. The method consists of two primary steps. The first step is a Bayesian learning process which generates a list of candidate breakpoints and segment means while trying to strike an optimal balance between model fit (measured as residual sum of squares) and model sparseness (the number of breakpoints). The Bayesian learn- ing process is driven by a prior parameter,, which summarizes the user’s prior belief in the appropriate degree of segmentation (larger leads to greater segmentation). After this initial segmentation process, a t statistic is calculated for each breakpoint, which is a function of the mean and variance of probe intensity from two adjacent segments. The second step is a backwards elimination process which removes unconfident breakpoints with a level of significance (t statistic) less than some user-defined threshold, T. We tried different values of T and and found the two parameters compliment each other. “-T 10 18 - 2.5” produced enough segments but not too many, on the order of tens of thousands, for further pruning. Parameter M is the minimum number of probes for each segment and 5 probes correspond to the segment length in the range of hundreds of nucleotides, which is the minimum deletion length we are interested and confident about. 1.2.5 Deletion Calling by SVM After segmentation by GADA, we need to determine which segments are normal and which are deleted. We looked at how likely the segments are to be truly deleted by comparing them with PE-derived deletion set and observed that it depends on the mean intensity (amplitude), the number of probes, and the number of probes per kb in a seg- ment (Figure 1.11). Note the boundary between the true and false deletions is not linear. Based on this, we decided to use Support Vector Machine (SVM) to train the model and use those three features as the input. Figure 1.12 shows that despite all the normalization we put forth, cross-array hetero- geneity still exists. The boundary shifts but stays similar among arrays of similar median probe intensity. Thus for each of 65 arrays included in the training set, we trained the model separately and derived 65 SVM models. For each test array, we chose a set of models from the training arrays whose median intensity is either within 250 of that of test array or among the top 5 in terms of the distance between the two array’s median intensity. The latter criterion is to avoid the deletions of one test array to be called based on one or two SVM models. In case the test array is included in the training, its model will be excluded in the set of models to call its own deletions. The eventual normal or deletion call is the majority vote by different models. We used the LIBSVM package [Chang and Lin, 2001] and set radial basis function as the kernel, C = 10, = 1=3 (default). The prediction error is defined as the ratio of the known segments (both nor- mal and deleted) labeled wrongly by the model. Different parameter settings were found 19 not to change the prediction error much. Under the chosen parameter setting, the pre- diction error is around 33% according to 10-fold cross-validation. 1.2.6 Post-calling Processing Due to the noisiness of array intensity, an elevated level of probe intensity is often observed in the middle of a large deletion, which results in the deletion being called into two different ones. We devised an algorithm aiming to reduce these unnecessary breaking points. After SVM calling, adjacent deletions in one array are merged if both: the distance between deletions is less than 5kb, and it is less than 10% of the size of the larger of the two deletions. As segmentation is done one sample at a time, we required an approach to identify and combine deletions from different accessions that correspond to the same mutation event, equivalent to a deletion locus. The process is done through a graph clustering algorithm. Each deletion from one accession is a node in the graph. One edge connects two deletions from two different accessions if the overlap between them is more than 80% of the size of the smaller of the two deletions. All the deletions from different accessions in every connected component of this graph are considered as rising from one mutation event, and hence one deletion locus. The boundary of the locus is the median of the boundaries of all deletions included. 1.2.7 Estimating False Dicovery Rate and False Negative Rate For every CNV deletion, it is called a true discovery if more than 40% of it is covered by one or more deletions from the PE-derived deletion set and a false discovery otherwise. In case more than one deletion from the PE-derived set overlaps with the CNV deletion, the total overlap length is calculated cumulatively in order to get the overlap ratio. 20 False Discovery Rate (FDR) is estimated by the number of false discoveries over the total number. Binning is applied to plot FDR versus number of probes or frequency. False Negative Rate (FNR) is defined similarly using the PE-derived deletion set. Dele- tions within PE-derived deletion set first undergo a merging procedure which merges adjacent ones if both: distance between the two is less than 5kb, and is less than 10% of the size of the larger of the two. 1.2.8 Polarizing Deletions against OutgroupArabidopsisLyrata We received the alignment data between A. lyrata and A. thaliana (Hu et al. 2010, submitted) from Tina Hu. If a CNV deletion has 80% not covered by the alignment, it is regarded as a derived insertion relative to the outgroup. If a CNV deletion has 80% covered by the alignment, it is a derived deletion. All other deletions are ambiguous and excluded from outgroup-based analysis. This is by no means a perfect way to polarize the CNV deletions. The derived insertions that happened in non-Col ecotypes of A. thaliana will not be detected because the array is based on Col genome, and the same thing for the derived deletions. Also wrong assignments exist because we only have the sequence from one A. lyrata individual and some of these insertions or deletions could still be segregating in A. lyrata populations. Nevertheless, this is what we can do with the data available. 1.2.9 Deletion-GWAS Details In deletion polymorphism, there are only two states for one locus, either deleted or not, which makes it simple to apply GWAS machineries developed for SNP data. However, due to heterogeneity in the A. thaliana ecotypes sampled, in the same region of the genome we see deletions with different boundaries in different ecotypes. The different deletion loci are likely of independent origins. For the sake of GWAS, the overlapping 21 loci were split into non-overlapping ones so that each locus becomes bi-allelic, after which association methods used in [Atwell et al., 2010] were applied. 1.3 Discussion Turning raw data into genotype data has always been the most time-consuming part of a project, yet also the most important one. Our case has been helped by the availability of independent high-quality QC datasets so that we do not have to develop a pure statistical framework from scratch. We have to point out that our calling algorithm/pipeline is far from perfect although the errors are controlled so that results are only marginally affected. On the other hand, the improvement of technology outpaces that of calling algorithm and usually makes these algorithms obsolete very fast. 22 Figure 1.9: Probe intensity from array 3, 4, 41, and 150, all from ecotype Ler plotted against array 1, Col, the reference. The data is from the same block of chip. The dense cloud around the diagonal represents the bulk of probes that do not change between reference and target arrays. The less-dense cloud underneath the diagonal represents the deletions inLer, which would be the portion trimmed out in LTS. On the other hand, few dots are above the diagonal, meaning few duplications in Ler. This is not an exception, rather a recurring theme for other ecotypes, which is one reason we chose to focus on deletions. Another observation is the linearity between target and reference intensity, not so much in array 4 and 41 (b and c), which turn out to be the low-quality arrays with median intensity below 250. 23 Figure 1.10: Separation of intensity distribution between deleted probes and normal probes in one Ler array after different normalization procedures were applied. Copy=1 represents normal segments because A. thaliana is naturally an inbred line and thus homozygous in most part of genome. Copy=0 has some normal segments contained due to the loose criteria to derive normal segments. a, only quantile normalization across all arrays; b, WABQN and block-by-block LTS; c, WABQN, block-by-block LTS, and four- probe window median smoothing; d, WABQN, block-by-block LTS, and nine-probe window median smoothing. The extent of separation is measured in terms of the D- statistic (ks) of Kolmogorov-Smirnov test (bigger is better) and rank sum statistic (wlcx) of Wilcoxon test. p-values are not used because they are always zero due to the large sample size. 24 Figure 1.11: a, The deleted fraction of a segment (by color) depends on the amplitude, average probe intensity, X-axis, and the number of probes (logged) in that segment, Y- axis. b, The deleted fraction of a segment (by color) depends on the amplitude, average probe intensity, X-axis, and the number of probes per kb (logged) in that segment, Y- axis. Figure 1.12: The deleted fraction of a segment (by color) depends on the amplitude, average probe intensity, X-axis, and the number of probes (logged) in that segment, Y- axis. a, array 3. The median intensity of array 3 is 668. b, array 41. The median intensity of array 41 is 157. These two arrays come from the same ecotype Ler. Although the rough trend agrees between two arrays, distribution of array 3 is shifted to the lower intensity end more than that of array 41 and so is the separation between the normal and deleted segments. 25 Chapter 2 SNP-based Genome Wide Association Based on SNP polymorphism data from the previous chapter, we carried out a SNP- based Genome-Wide Association Study (GWAS) for 107 phenotypes by applying two methods, non-parametric Wilcoxon rank-sum test and structure-correction software EMMA [Kang et al., 2008]. Contrary to the usual sparsity of peaks in human GWAS [Consorti, 2007], GWAS in Arabidopsis thaliana has yielded numerous peaks beyond the conservative Bonferroni threshold (Figure 2.1 a) despite having a much smaller sam- ple size. A key issue is to prove true loci exist among the numerous peaks. Because of the diversity in the ecotypes sampled, the population structure confounding has become a big issue. In this chapter, we describe the confounding issue, how we dealt with it, and the development of an enrichment statistic to prove true signals are enriched in the top associations despite the confounding. This work has been published in [Atwell et al., 2010]. 2.1 Confounding by Population Structure The conventional wisdom is that low p-value of a SNP suggests the locus around it is involved in the phenotype. The distribution of p-values is uniform under the null model that there is no association between genotype and phenotype. If the proportion of low p-value SNPs is more than expected, (Figures 2.1 and 2.2), it further increases the probability that GWAS is finding significant associations. However, such an excess of low p-value SNPs could also be attributed to factors other than the causal relationship 26 Figure 2.1: a, Genome-wide p-values for flowering time under 10 C environment from Wilcoxon test. b, Histogram ofp-values froma. between genotype and phenotype. Population structure is one such example. The fact that the recombinant inbred lines used inArabidopsisthaliana GWAS come from around the world exacerbates the issue. Local adaptation is pervasive as A. thaliana occupies a broad range of environments, from tropical island with frequent storms to northern Europe, covered in snow for as long as half the year. A local population is forced to adapt key traits, such as the flowering time, to its surrounding environment in order to survive. Simultaneously, the populations are differentiating as they are increasingly isolated from each other geographically. The end result of the two processes is the 27 Figure 2.2: QQ plot shows a significant departure from the null model. Association result comes from flowering time under long day environment using Wilcoxon test. strong correlation between population structure and the phenotype, Figure 2.3, which could also generate significant associations between genotype and phenotype; hence, the problem of confounding by population structure. 2.2 Enrichment ofapriori Candidate Genes in GWAS Due to population structure confounding in several phenotypes, a p-value distribution highly skewed towards zero (Figure 2.1 b), is no longer a good indicator for the presence of true signals. We need to show the most associated peaks contain true signals in spite of population structure confounding. Even with methods that correct population struc- ture, a skewedp-value distribution is no guarantee that true causal loci have been found. 28 Ultimately, only biological experiments can verify that. However, most phenotypes are not entirely unknown upon GWAS scrutiny. A number of genes have been implicated in the literature through experiments such as gene knockout and transformation, which are defined as candidate genes. It is not a complete list but an enrichment of those genes around highly significant SNPs would be an indication that GWAS is working. We focus on the phenotypes that have enough peaks and a priori candidate genes for us to test if there is a statistically significant enrichment. The significant enrichment serves as a validation of the GWAS results and more importantly gives support to the peaks without any a priori candidate gene underneath. By replacing a priori candidate genes with groups from GeneOntology, we can discover potential new categories of genes involved in a phenotype. 2.2.1 Method We collect lists of candidate genes specific to each phenotype by either brows- ing through the relevant literature or from trusted sources such as TAIR (http: //arabidopsis.org/) that have curators browse through literature constantly to maintain such lists. The enrichment statistic we introduce here tests whether, given a list of a priori candidate genes, the SNPs near the candidate genes are more likely to be in the lower end of the p-value distribution than other SNPs. The null hypothesis is that there is no enrichment and thep-value distribution does not differ between the two groups of SNPs, which are equally likely to be found in the lower end. First, an arbitrary cutoff is setup to separate the distribution into the lower end and the upper one. Second, for the group of candidate-near SNPs, we count the number of SNPs below the cutoff, n c 1 , and the number above it,n c 2 , and divide the two numbers to getn c 1 =n c 2 . We do the same thing for 29 the group of non-candidate-near SNPs to getn r 1 =n r 2 . In the end, the enrichment statistic is the ratio between the two. E = n c 1 =n c 2 n r 1 =n r 2 (2.1) Under the null hypothesis, the p-value distribution does not differ between the two groups of SNPs and the numerator and denominator of the enrichment statistic should be same regardless of the cutoff. Thus, the enrichment statistic should always be one regardless of the cutoff. However the distribution of the statistic is unclear. In order to get the correct distribution of the enrichment statistic, we designed a permutation model whose null model reflects the real data. The key conditions of the data are the population structure among ecotypes and linkage disequilibrium between nearby SNPs. The first condition is achieved by re-sampling p-values from the original data. The second condition is met by making sure simulated p-values of nearby SNPs are re- sampled from SNPs in the same region. The re-sampling algorithm involves three steps. First, arrange the originalp-values in the order of SNPs. Second, simulate a uniformly- distributed integer, x, between 1 and the number of total SNPs. Third, assign the p- value of one SNP to the target SNP that is (x-1)-SNPs away on the left. For the first x SNPs, their (x-1)-target-SNPs are located in the end of the genome. We effectively treated a genome of multiple linear chromosomes as one circular genome. Finally, a p-value is evaluated by comparing the observed value of the enrichment statistic from one association result with the probability distribution of the statistic over 10000 re- samplings or more until at least three re-samplings produce the statistic equal or larger than the observed one. A slightly improved re-sampling procedure aiming to get over the boundary effect between chromosomes adds a step of chromosome-order re-shuffling in the beginning but that does not change the distribution of the enrichment statistic significantly. 30 The permutation scheme we designed is also robust to the variable number of genes in different candidate gene lists or overlapping genes or different gene size because it maintains the same gene list for each re-sampling. 2.2.2 Results The first result is the enrichment statistic versus -log(p-value) cutoff for association anal- ysis obtained by using Wilcoxon test on Long-day flowering time (Figure 2.4). As the p-value decreases towards zero, the enrichment statistic converges to one as expected. The enrichment statistic is generally above 3 if thep-value is below10 3 . The dip in the most significant portion is because of the small number of SNPs in that top range. The area chart in green in Figure 2.4 is the 95% confidence interval of the enrichment statistic from 10000 re-samplings. We also compared the p-value for enrichment statistic from the permutation-test to those from a hypergeometric test which assumes independence between all SNPs. Because of the wrong assumptions, the hypergeometric test produced far more significantp-values for the same enrichment statistic than permutation test. The enrichment statistic shows varying success in different phenotypes (Figure 2.5). We believe two explanations contributed to the observation. The lack of prior knowledge or molecular experiments for some phenotypes have made it difficult to gather a list of trustworthy candidate genes. Second, some phenotypes are simple traits involving potentially one or two genes, especially in disease resistance phenotypes. The tiny gene list makes it practically impossible to capture the enrichment through statistics. We observed that the enrichment statistic in association results run by EMMA is stronger than Wilcoxon (Figure 2.6), which is expected as EMMA has more power because of population structure correction. However, the observation comes with a caveat. When we compared the results from EMMA and Wilcoxon, the portion of SNPs that are deemed significant in Wilcoxon but not in EMMA also shows enrichment of 31 a priori candidate genes (Figure 2.7). The enrichment is not as high as that of the portion of SNPs that are only significant in EMMA but the observation exposes the less well-known side of EMMA, that it kills some signal while it is striving to get rid of population structure confounding. Unsurprisingly, enrichment is greatest among the SNPs that are strongly associated in both methods. This pattern is consistent across all flowering-related phenotypes (Figure 2.8), and it is thus clear that both methods have merits. In conclusion, despite confounding by population structure,apriori candidate genes are over-represented among the strongly associated SNPs regardless of whether we cor- rect for population structure or not. 2.3 Discussion The linear mixture model has been in use to correct population structure for a long time [Yu et al., 2006]. It uses a random effect to take away the population structure from the association. But clearly the method is not perfect. In dealing with the uncor- rected strong associations whose variation patterns mimic the population differentiation, instead of singling out those that are causal, it removes these associations altogether, which explains the enrichment ofapriori candidate genes among the the SNPs in which only Wilcoxon test is significant (Figure 2.7). Presumably, the linear mixed model itself or the EMMA implementation has some problems. For example, estimating the variance of random effect for every SNP in EMMA contradicts the notion that it represents the population structure effect for this accession. This problem has been addressed in the recent [Kang et al., 2010] paper. The other problem, however, is more pressing. Plot- ting likelihood over the variance estimates showed that EMMA is mostly picking the 32 parameter value on the boundary of its search range. This reflects the poor robustness of EMMA. Further research needs to be done to address these problems. 33 Figure 2.3: The upper panel shows the geographic distribution of the sampled A. thaliana ecotypes with the focus on Europe. Each dot represents an ecotype, colored according to the log(flowering time under long day environment). The lower panel is the PCA plot of ecotypes by the first two principle components. Ecotypes are again rep- resented as colored dots, connected to their geographic counterparts in the upper panel through dashed lines. The PCA clustering roughly reflects the geographic relationship between the ecotypes. The key point is the strong correlation between the second prin- cipal component (PC2) and the flowering phenotype, which is a sign of the existence of the confounding issue. 34 Figure 2.4: Enrichment statistic vs log(p-value) cutoff. Flowering time under long day environment. 35 Figure 2.5: Decay of enrichment statistic vs cutoff (-log10(p-value)) for all flowering phenotypes, using the Wilcoxon approach. 36 Figure 2.6: Decay of enrichment statistic vs cutoff (-log10(p-value)) for all flowering phenotypes, using EMMA. EMMA has reduced p-values to a more reasonable range while enabling more enrichment ofapriori candidates genes among its top associations than Wilcoxon (Figure 2.5). 37 Figure 2.7: Candidate SNPs are over-represented among strong associations. GWA analysis of the phenotype of flowering time at 10 C: P-values from the Wilcoxon rank- sum test are plotted against those from EMMA. Points corresponding to SNPs within 20 kb of a candidate gene are shown in red; the rest are shown in blue. The enrichment of the former over the latter in different parts of the distribution is as indicated. 38 Figure 2.8: Candidate SNPs are over-represented among strong associations. Enrich- ment ratios across all flowering-related phenotypes, using the same significance- thresholds as in Figure 2.7. The blue surface is enrichment ratio=1. 39 Chapter 3 Genome-wide Insertion/Deletion Polymorphism and GWAS Using the Deletion Polymorphism Insertion/deletion (indel) polymorphism is an important contributor to genetic variation, often with functional consequences. This study sets out to identify indel polymorphism among 917 global ecotypes of Arabidopsis thaliana by using a whole genome tiling array that is based on the reference genome. Due to the noisiness of the data, we limit our study to deletions and exclude duplications, relative to the reference genome. These are abundant and often have large phenotypic effects. This study provides a comprehensive overview of deletion loci, their frequency spectrum and functional roles through GWAS on 107 phenotypes [Atwell et al., 2010]. Note that by design our study is not able to discover insertions relative to the reference genome. Insertions described in this chapter are the sequences of the reference genome that could not be found in the outgroup Arabidopsislyrata. After polarizing the polymorphisms against theA.lyrata genome, we found that 85% of deletions relative to the reference genome are also deletions relative to the outgroup species; the remaining 15% are actually insertions in the reference genome. This is consistent with the apparent shrinkage of the A. thaliana genome relative to A. lyrata, and demonstrates that the process is still ongoing and is probably driven by selection. 40 Much of the indel polymorphism is due to transposable elements, which are still very active in the genome. In contrast to human CNV studies, most of the deletions relative to the reference genome in A. thaliana are not tagged by nearby SNPs withr 2 0:8, which makes the deletion-based GWAS complimentary, rather than similar to the SNP-GWAS. In some cases, for example, simple disease phenotypes with one or two causal loci, deletion- GWAS confirmed and narrowed down to the true causative loci reported previously [Atwell et al., 2010]. For complex traits like flowering time, new loci were found, pri- marily in regions where SNPs are missing. Several transposable elements found under the new association peaks appear to have regulatory effects on nearby flowering loci. The indel polymorphism and results of deletion-GWAS are viewable at http:// banyan.usc.edu:5000/CNV/index. 3.1 Overview of Deletion Data Deletions have been called for a total of 917 ecotypes collected from 30 countries; Table 3.1 and Figure 1.4. The whole pipeline of deletion-calling is described in Chap- ter 1.2. We discovered 17,699 distinct deletion loci with median size around 975bp. The average number of deletions per ecotype is 1010, equivalent to roughly 1Mb of the genome. The frequency of more than 62% of deletions is below 10%, which comprises 72% of total deletions in terms of accumulated size. Comparison with an independent deletion data for 65 ecotypes, (details in Chapter 1.2.7) shows an FDR around 24.9% and it decreases quickly as the frequency or the number of probes of the deletion increases; Figure 3.3. Among the rare deletions, < 10% AF (Allele Frequency), we found that around 32% of deletions overlapped with protein coding genes, 29% with TEs (Transposable 41 Figure 3.1: Geographic distribution of 917 ecotypes from which deletions have been mapped. Figure 3.2: Length distribution of 17,699 discovered deletion loci. 42 Table 3.1: Top 10 countries by the number of ecotypes in deletion mapping Rank Country Number of Ecotypes 1 Sweden 297 2 France 219 3 UK 126 4 Germany 91 5 Czech 39 6 USA 32 7 Netherlands 18 8 Spain 17 9 Italy 12 10 Russia 11 Figure 3.3: a, FDR vs frequency. b, FDR vs number of probes. Elements), 3% in other types of genes, the rest in intergenic regions; Figure 3.4 a. When it comes to common deletions,> 10% AF, the fraction of TEs expanded to 40% at the expense of the protein coding genes. Thus, protein-coding genes are more strongly associated with rare deletions while TEs are enriched in common deletions; Figure 3.4 b. The key parameter for linkage-disequilibrium-based studies, the proportion of dele- tions that can be tagged by nearby SNPs (within 20kb), depends on the deletion allele frequency and local SNP density, but not on deletion size (Figure 3.5). Overall, only 43 Figure 3.4: a, Functional impact of deletions based on population frequency. b, Fraction of each gene class impacted by rare and common deletions. 25.4% of deletions 10% AF (Allele Frequency) are captured withr 2 0:8, whereas only 2% of deletions< 10% AF are similarly tagged. This is in stark contrast with the number reported in human CNV data, [Conrad et al., 2009], in which 77% of deletions >= 5% MAF (Minor Allele Frequency) and 23% of deletions< 5% MAF are captured under the same criteria. One explanation would be TEs being the major component of deletion polymorphism inA.thaliana. 44 Figure 3.5: a, Distribution of maximumr 2 between deletion and its nearby SNPs (within 20kb). b,r 2 depends on allele frequency and local SNP density. c,r 2 does not depend on deletion size. We observed a number of deletion loci harboring multiple alleles. An example is shown at one of the pathogen-resistance genes, Figure 3.6. Pathogen-resistance genes are a very dynamic component of the genome due to the arms race between the host and pathogen. Theory regarding how deletions arise is not well developed and it is not clear whether the mechanism allows independent origin of different deletions at the same locus more often than the SNP mutation mechanism. Further analysis is required to answer these questions. 45 Figure 3.6: Multi-allelic deletion locus around RFL1 and RPS5. Each red bar in the bottom deletion-haplotype panel denotes a deletion for a particular accession. RFL1 has a deletion allele within its promoter apart from the other deletion allele that covers most part of the gene. RPS5 has two overlapping deletion alleles with one more common than the other. 46 3.2 Genome-wide Deletion Frequency Correlated with SNP Polymorphism Level In [Clark et al., 2007], a nonrandom distribution of SNP polymorphism levels across the genome was observed. Specifically, regions of high polymorphism extend from the centromeres to beyond the pericentromeric regions. It was attributed to three expla- nations: selection, variation of mutation rates and biases in array-based resequencing. Besides the three explanations, the fraction of missing data was mentioned to be signifi- cantly correlated with estimated polymorphism level. Here we show a strong correlation between polymorphism level, especially the intergenic one, and frequency of deletion (Figure 3.7). This suggests the more deletion one region contains, the more SNPs it will have. One possibility is that these are highly variable regions that contain active transposable elements and/or clusters of NB-LRR genes. As a result, the probability of having SNP or deletion in these regions is high. Alternatively, these SNPs are within the deletions and thus they are wrongly called in the accessions carrying the deletions. This could happen because the SNP-calling algorithm we employ assumes two clusters due to lack of heterozygotes in A. thaliana. Intensity of within-deletion allele could form a cluster distinct from that of the reference allele. The end result is probably a mixture of the two scenarios. 3.3 The Ongoing Genome Shrinkage inA.thaliana A. thaliana separated from its closest outgroup species A. lyrata about 5 million years ago. A.lyrata is an outcrossing species with a 200Mb genome (2n=16) whileA.thaliana self-fertilizes 97% of the time [Platt et al., 2010], with a 120Mb genome (2n=10). Based on the data from otherArabidopsis species, the ancestral state of the two species 47 Figure 3.7: Genome-wide pattern of deletion frequency correlated with nucleotide diver- sity level. 48 is probably close to A. lyrata. It is a mystery about what has caused the A. thaliana genome to shrink by almost half since the species split. We polarized the deletion polymorphisms against the A. lyrata genome, and found that 85% of deletions relative to the reference genome are also deletions relative to the outgroup species; the remaining 15% are actually insertions in the reference genome, which suggests the genome ofA.thaliana is still shrinking. Looking at the composition of deletions reveals most of the deletions occur within or near transposable elements (TEs) throughout the genome (Figure 3.4). The haphazard jumping of TEs are mostly deleterious because of gene disruption effects. 3.4 Deletion-based GWAS Because of the large sample size and abundant deletions genome-wide, deletion poly- morphisms make an ideal candidate for a GWAS scan. We ran Wilcoxon and EMMA [Kang et al., 2008] on 107 phenotypes [Atwell et al., 2010] after converting all deletion loci into bi-allelic forms (see Chapter 1.2.9). Deletions with a frequency< 10% were excluded due to their high FDR rate. In addition, association methods we used have low power to detect association in loci with low allele frequencies. The high density of 250K SNPs used in [Atwell et al., 2010] had convinced us that every causal locus has been tagged and deletion-based GWAS would not yield anything new until we found that only 25% deletions with frequency at least 10% are tagged by nearby SNPs within 20kb,r 2 0:8. Among the phenotypes in which deletion-GWAS finds the same peaks as the SNP- GWAS, the deletion loci under the peaks are tagged by the SNPs nearby. By looking at deletion-GWAS, we can now directly confirm the gene as the true causal locus; Fig- ure 3.8. Several disease-resistance phenotypes fall into this category. Complex traits 49 Figure 3.8: a, GWAS from seven phenotypes are plotted here in fourteen panels, two panels, SNP-GWAS and deletion-GWAS respectively, for each phenotype. b, Gene RPS5 in deletion-GWAS is tagged by SNPs nearby or within. such as flowering time are more complicated. Deletion-GWAS would find identical peaks and miss some, while also finding some new peaks. Paucity of deletion loci in certain regions usually explains why deletion-GWAS did not find some peaks from SNP- GWAS; Figure 3.9. Transposable elements, which make up the largest fraction among deletion loci, are often found under the new peaks from deletion-GWAS; Figure 3.9. Looking at the nearby genes suggests that these transposable elements probably affect the trait through regulatory effects. 50 Figure 3.9: TE found under the peak where there is a paucity of SNP loci in flowering time (long day environment) association. 51 3.5 Discussion The correlation between genome-wide nucleotide diversity and deletion frequency needs further clarification to find out if there are many SNP calling errors underlying those frequently deleted regions. Cases have been found where the 250k SNP calling pipeline still made calls for SNPs within deletions. This could alter the SNP-GWAS result if such errors are widely found. To prove that some deletion loci are truly multi-allelic, merely checking the bound- ary is not enough and we need to check the geographic distribution of different alleles at the same locus. Quantifying how many deletion loci are actually multi-allelic is also not trivial. This could be linked to selection studies to find loci that are under local selec- tion. The ongoing shrinkage part requires the frequency spectrum of the two classes to prove convincingly that deletions are more likely to be fixed than insertions and thus the process is still ongoing. In the case of complex traits for which the deletion-GWAS produces new TE- embedded peaks, an enrichment analysis is needed to show theapriori candidate genes are enriched near the TEs. 52 Chapter 4 Polymorphism-GWAS Database and Web Portal As large-scale genomic data described in prior chapters has become the norm in biolog- ical studies, the challenge has been shifted from how to generate large datasets to how to store, integrate, view, and search them. In this chapter, we describe the development of an Arabidopsis thaliana database hosting the geographic information, genetic poly- morphism data for over 6000 accessions and genome-wide association study (GWAS) results for 107 phenotypes [Atwell et al., 2010]. It is the largest collection ofArabidop- sis polymorphism data and GWAS results by far. We take advantage of the latest web technology, such as Ajax (Asynchronous JavaScript and XML), GWT (Google-Web-Toolkit), MVC (Module-View-Controller) web framework, and Object Relationship Mapper to create a web portal for the database to offer an integrated and dynamic view of geographic information, genetic polymor- phism and GWAS results. Essential search functionalities are also incorporated into the web-portal to aid reverse genetics research. The database and its web portal have proven to be a valuable resource to the Arabidopsis community. Its visualization scheme serves as an example for how biological data, especially GWAS, are presented and accessed through the web. In the end, we illustrate the potential to gain new insights by some examples of querying a database through the web interface and how it can be used to facilitate both the forward and reverse genetics research. Database web-interface URL:http://arabidopsis.usc.edu/ 53 Table 4.1: Summary of Genetic Polymorphism Data Dataset Name DB Name No. of Ecotypes Avg. No. of SNPs per Ecotype 2010 at 249 21485 384 dbsnp 96 384 Perlegen chip 20 1126230 149SNP stock 5788 149 250kSNP stock 250k 199 216132 MySQL dump download: http://arabidopsis.usc.edu/db_dump. sql.gz 4.1 Database Architecture and Content The whole database is made up of one sub-database hosting all GWAS data and its affiliated genotype and phenotype data (Figure 4.1) , four sub-databases that host low- throughput genetic polymorphism data generated by past technologies, used primarily for QC, and one genome database storing the genome sequences, gene annotations, etc (Figure 4.2,4.3,4.4). We make the dump of the whole database available on the web to facilitate bioinformatics access. 4.1.1 Collection of Genetic Polymorphism Data Over the past decade, several batches of genome-wide genetic polymorphism data of Arabidopsis thaliana have been generated by different technologies. The latest 250kSNP dataset [Atwell et al., 2010] supersedes all previous datasets in the sense of both marker density and the number of ecotypes sampled, Table 4.1. However, some single previous dataset still has more markers or more ecotypes. We make all of them available in the hope that researchers could find them useful for different purposes. 54 Figure 4.1: The relationship among tables in database stock 250k. The 2010 dataset, [Nordborg et al., 2005], stored in db at, was done by de novo Sanger-sequencing technology of 1500 fragments randomly picked throughout the whole genome in 96 globally-sampled ecotypes to study population structure. Each fragment is about 700bp long. The database has both raw fragments and processed polymorphism (SNP and small indels) data. The 384 dataset, in db dbsnp, was done by mass-spectrometry at 384 random loci genome-wide in a 96-ecotype sample that is different from that of the 2010 dataset. The Perlegen dataset [Clark et al., 2007], in db chip, was done by a whole-genome tiling array with four different probes for each position in 20 ecotypes, which is a subset 55 Figure 4.2: The relationship among tables in database at. of the 96-sample in 2010 dataset. It discovered 1,126,230 SNPs in the process, which forms the basis for the the latest 250k SNPs used in the GWAS of 107 phenotypes. The 149SNP [Platt et al., 2010], in db stock, is a low-density SNP dataset carried out by a mass spectrometry technology from SEQUENOM Inc. The 149 genome-wide loci were picked out of the polymorphic loci in 2010 dataset with minor allele frequency near 0.5. This dataset covers over 7000 global ecotypes and serves as a barcoding resource. The 250kSNP [Atwell et al., 2010], in db stock 250k, was carried out by the custom-designed chip described in the first chapter. The genotyped samples include a core set of 95 lines which have measures in all published phenotypes, plus an extra set 56 Figure 4.3: The relationship among tables in database stock. of 96 included in only phenotypes related to flowering time. Because slightly different sets were used in different experiments, the total number of lines genotyped is 199. The 2010 dataset is of highest quality, followed by Perlegen dataset, then 384, 149SNP. The former two are of considerably better quality than the latter two. Due to its genome-wide coverage, the Perlegen dataset is used in SNP filtering while a dataset merging the other three, 2010, 384, 149SNP, in the order of precedence one dataset is taken over the other when calls for the same ecotype and same SNP differ in two datasets. datasets, referred to as 2010-384-149, is used in quality-control for 250kSNP genotype-calling, [Atwell et al., 2010]. This database holds all of the data and provides the searching functionality regard- ing which strain is tied to which type of polymorphism data. This facilitates researchers to choose the follow-up based on the density of the polymorphism data available (Fig- ure 4.7). 57 Figure 4.4: The relationship among tables in database chip, dbsnp, and genome. 4.1.2 Meta-information of Arabidopsis thaliana Ecotypes in DB stock Over 7000 ecotypes were collected by various groups throughout the world (Figure 4.5) to be genotyped at 149 SNP loci. We have kept various meta-information related to an accession, such as IDs used in stock centers, alias, stock-parent, collector name, date of collection, GPS location in table ecotype to facilitate population genetic and ecological studies. Due to the scope of the project, tracking each individual ecotype has become an issue. GPS coordinates for a considerable number of ecotypes were wrong in the begin- ning. After original collectors were contacted to validate them, some data remains uncertain. Poor maintenance of prior records, human errors in recording and copying 58 Figure 4.5: Map showing the location of 7075 ecotypes genotyped in 149SNP dataset. the labels and machine errors in sequencing are all to blame. We came up with a way to discover the errors by checking the concordance between the genetic clustering based on 149SNP and geographic clustering. After filtering out poor-quality 149SNP genotype data, 5788 ecotypes were grouped into 1819 haplotype groups assuming the genotyp- ing error rate to be 5%, [Platt et al., 2010]. Contamination is declared when geographic discordance is found within a haplotype group. The obvious ones are ecotypes from the same haplotype group found to be in more than one continent. This includes the famous lab strains,Col andLer. It is also easy to spot when one or two ecotypes from the group shows up in a location far away from where the majority of the haplotype group are found. These cases happen when the ecotypes escaped from the lab into the wild, got mixed up in the growth chamber, got labeled wrongly, spread via human activity (boots, trains), etc. Correcting them requires not just genotyping data, but more importantly, the biological knowledge from scientists who do experiments with specific ecotypes. We make this public in an attempt to enlist the researchers worldwide to help correct the remaining errors. 59 These data also serve as a potential barcoding resource. It is no longer the case that researchers work with only one or two strains in the lab. Some strains have very similar names with the difference of one letter or hyphen, such as Tsu-1 and Tu-1. Inconsistency arises in research papers that refer to the same strain by different names or different strains by the same name. Genotyping at 149SNP loci is a cost-effective way to eliminate ambiguity. 4.1.3 Phenotype and Association Results in DB stock 250k The 107 phenotypes used in the recentArabidopsisthaliana GWAS [Atwell et al., 2010] are all available in the stock 250k database, courtesy of our numerous collaborators. The phenotypes fall into four categories: twenty-three of them are flowering time measures under different environmental conditions; another twenty-three are related to defense, such as response to specific bacterial pathogens, trichome density on the leaf disks, etc.; eighteen are element concentrations measured using inductively-coupled plasma mass spectroscopy (“ionomics“); and the remaining forty-three belong to a loosely defined group of developmental traits, including dormancy and plant senescence. The diversity and similarity between phenotypes is demonstrated in a principal compo- nent analysis plot (Figure 4.6).The sample size is around 96 for all phenotypes except certain flowering time phenotypes which include an additional sample of 96 ecotypes. A total of 214 association results, 107 phenotypes by two association methods, Wilcoxon and EMMA, based on the 250kSNP dataset can be found in the database. The full content of each association result is stored in the file and the database table results method has a link to the file. The top 1000 associations from each are stored directly in table results with the position of each SNP, the score or p-value, allele fre- quency, effect size (EMMA only), variance explained (EMMA only). Another table, 60 Figure 4.6: Relationship among phenotypes is plotted by the first and second princi- pal components from PCA (Principal Component Analysis). Each dot represents one phenotype, colored according to the category it belongs to. results gene, records the top 1000 associations in a gene-centric way by linking associ- ation scores, ranks to the genes within 20kb of SNPs. The specific relationship between all SNPs and their adjacent genes are stored in table snps context. More detailed SNP annotation such as whether it is in UTR, exon, intron, or intergenic; whether it is syn- onymous, non-synonymous, splice-donor, splice-acceptor or premature-stop-codon, is stored in table snp annotation. The basic information of the genes, including descrip- tion, type of gene, position and the associated GeneOntology terms, were fetched from NCBI Gene Database and stored in the genome database to ease cross-linking. 61 4.2 Functionality of the Web Portal Despite its strength in maintaining data integrity, a database without an easy-access interface fails to realize its potential in helping researchers with their work. The non- human-friendly way the data are stored, and the complex relationship among tables are just two factors making the database hard to interact with. We leave the technology behind the web portal to the next section while focusing on the functionality made pos- sible through the innovative interface. The interface has three main entry points. The ecotype page, http://arabidopsis.usc.edu/Accession/ (Fig- ure 4.7), provides entrance to the meta-information of ecotypes stored in the db. To facilitate the communication, we setup an ID system which uniquely identifies each ecotype. Each ID is affiliated with all sorts of names, i.e. nativename, name, alias, that have been used. The ecotype “By Name” page allows the researcher to search through all ecotypes while remembering only one or two letters of their names through the adoption of auto-completion and regular expression. The server would return a table of all search candidates with ID, polymorphism information, stock parent and geo- graphic information. A interactive map of where all candidates are located would also come up, implemented via the Google Map API. The entries in the table and map are linked. Clicking one in either would render the corresponding one in the other widget to be highlighted, which helps to identify ecotypes quickly. The “149SNP Haplo-Group” column in the returned table contains the clickable hap- lotype group [Platt et al., 2010] ID. Upon click, a new window pops up and displays all the ecotypes found identical to the selected ecotype. It gives researchers a visual impres- sion of how far one clone spreads or how likely contamination has occurred. Some of those clones might just be close relatives due to the low resolution of the 149SNP data. The phenotype page, http://arabidopsis.usc.edu/ DisplayResults/ (Figure 4.8), is the entry point to all GWAS results. It has 62 four tabs under which one category of phenotypes is displayed in one table. The table, with each row containing information relevant to a phenotype, is generated dynamically based on the data stored in the database. Content change inside the database is reflected immediately on the web. Clicking on the column head enables the user to sort all phenotypes based on the value of that column. Clicking on a row renders a small popup with three links. The first link leads to a window containing more detailed information and a histogram of the phenotype. This phenotype-only window also includes an interactive motion chart to view the ecotypes in four dimensions (any four combinations of phenotype, latitude, longitude, PC1, PC2) simultaneously (Figure 4.9). This offers an easy way to visualize the extent of population structure confounding in different phenotypes. The second link leads to a window of interactive GWAS plots (Figure 4.10) besides the phenotype tab. The last one leads to tables of genes adjacent to (within 20kb) top associated SNPs organized in different tabs by methods. Clicking any dot in the interactive GWAS plots would lead to a single SNP view, (Figure 4.11), which has five tabs: close-up of the association in GBrowse [Stein et al., 2002], a table summarizing the information related to the SNP, a table of associations from other phenotypes in which this SNP is ranked in the top 1000, an interactive motion chart displaying the ecotypes sorted by their phenotype or GPS val- ues (Figure 4.12), and a table displaying allele and the phenotype value for each ecotype. The “Search GWAS By Gene Name” page,http://arabidopsis.usc.edu/ DisplayResultsGene/geneForm/ Figure 4.13, allows a user to search for asso- ciations of one gene through all GWAS results. A gene that is close to a significant SNP is far more interesting to researchers than the SNP itself. Due to the complex LD struc- ture, the most significant SNP is usually not the true causal locus but just a marker and the distance to the true locus could be as far as 20kb. However, in a compact genome of A. thaliana, 20kb is a big distance in which tens or even hundreds of genes could 63 exist and they all have the same rank as a result. It is often the case that genes known to be involved in phenotypes through prior molecular biology work are buried deep in the list of top 1000-ranked associations. Although their ranks might not always be near the top, they tend to appear consistently in the associations of a similar group of pheno- types. This interface allows the researchers to do exactly that. They could put a gene they suspect to be involved in a phenotype into the cross-phenotype search. It returns a page, Figure 4.13, including phenotype, SNP, rank, score, MAF, MAC, beta, variance explained (the latter two for EMMA results only). In another step to bridge the gap between the database and the human memory, auto completion and regular expression functionality has been added. 4.3 Implementation of the Web Portal and Improve- ments Over Existing Architecture Our motto in implementing the web portal is to let the user be in charge of the data. To a user who is involved in genetic research of a certain trait, the ability to navigate between a genomic overview and a closeup, to search through all information as easily as possible, and to view the data in an interactive manner is what matters the most. With the goal of presenting the machine-friendly database in an integrative and dynamic manner suitable for human interaction, we apply the latest web technology, so-called web2.0, to our portal. We are technology-agnostic in the sense that any technology that can help achieve these goals would be utilized. The technology side of the project is a mixture of old and new ones. In terms of programming languages, it involves python, java, javascript, sql, html. However, four major parts stand out (Figure 4.14): a relational database (MySQL or PostgreSQL), a 64 SQL Toolkit and Object Relational Mapper (sqlalchemy, elixir), a MVC (Module-View- Controller) server (pylons), an Ajax-filled (Google Web Toolkit, Google Visualization Toolkit) client end. The relational database is nothing new. However, most current GWAS interfaces interact with the database through raw SQL query. The SQL Toolkit and Object Rela- tional Mapper (ORM), sqlalchemy, lets the user skip manual construction of complex SQL query that oftentimes involves dozens of tables while still giving us the full power and flexibility of a relational database. Its data abstraction layer allows construction and manipulation of SQL expressions in a platform agnostic way, enables a graph of objects and their dependencies to be loaded on demand, and offers entire graphs of object changes to be committed in one step. It significantly reduced the programming overhead of dealing with a database. MVC (module-view-controller) framework, unlike PhP or cgi-based frameworks, separates the logic and presentation of data. The module is the ORM described above, in charge of accessing and managing the raw data. The controller is the collection of server-end functions that process the mostly un-presentable raw data from the module and then send them to the client or browser for final display. GWT (Google Web Toolkit), a development toolkit for building and optimizing com- plex browser-based applications, brings the portal much closer to a desktop application with tooltips, popup dialogs, and especially auto-completion. Auto-completion has been such an essential part for any human-computer interaction. Blank filling without any hint is so quick to turn a potential user off and prompt him/her to leave the website. The Google visualization toolkit, which is tightly connected to GWT, enables fast and inter- active plots in the browser on the fly. The end product of integrating all these different technologies is a powerful user interface for viewing GWAS data. 65 4.4 Discussion However innovative our framework is, we still need to improve a couple of things. Access control of different data through user-authentication is one. The QTL data, which offers a complimentary picture to GWAS, needs to be integrated into the database. A more ambitious goal is to offer phenotype submission and full GWAS pipeline through the web. 66 Figure 4.7: a, Ecotype search interface, with auto-completion in action and support of regular-expression search. b, Table of ecotypes returned. 67 Figure 4.8: One tab for each category of phenotypes. Under each tab is a dynamic table filled with phenotypes. Clicking on each phenotype shows a popup with links to detailed phenotype information, GWAS plots (Figure 4.10), and genes adjacent to top associations. 68 Figure 4.9: a, Phenotype for each ecotype, represented by dots, plotted by longitude and latitude. The plot is visualized through Google Motion Chart. b, Change the two axes in a to the first two principal components from PCA analysis on the genotype matrix. Mouse-over or clicking each dot in the chart displays the information of that ecotype. 69 Figure 4.10: Genome-wide association p-values plot. Mouse-over each dot shows the position and p-value of the SNP. Clicking each dot results an 80kb zoom-in SNP page, (Figure 4.11). This visualization is done by Google Visualization Scatter Chart. 70 Figure 4.11: View for the SNP selected in Figure 4.10. There are five tabs in the view. First tab, GBrowse, is close-up of the association around the SNP. The “SNP Summary Info” tab lists the position,p-value, annotation of the SNP. The “All Phenotypes in which SNP is significant” tab lists other phenotypes in which this SNP is among the top 1000 associations. The “Ecotype Allele Phenotype BarChart” is described in Figure 4.12. The “Ecotype Allele Phenotype Table” is a table listing which ecotype carries which allele and the corresponding phenotype. 71 Figure 4.12: Bar chart depicting the correspondence between ecotype allele and pheno- type. In this bar chart, each bar represents an ecotype and is colored according to the allele the ecotype carries. The height of each bar corresponds to the phenotype of the ecotype. All ecotypes are sorted horizontally according to their phenotype. Mouse-over each bar displays the information of that ecotype. 72 Figure 4.13: a, Search GWAS by gene name with auto-completion in action. b, Table returned from the search. 73 Figure 4.14: Overall architecture of the web portal. 74 Bibliography [Atwell et al., 2010] Atwell, S., Huang, Y . S., Vilhjalmsson, B. J., Willems, G., Horton, M., Li, Y ., Meng, D., Platt, A., Tarone, A. M., Hu, T. T., Jiang, R., Muliyati, N. W., Zhang, X., Amer, M. A., Baxter, I., Brachi, B., Chory, J., Dean, C., Debieu, M., de Meaux, J., Ecker, J. R., Faure, N., Kniskern, J. M., Jones, J. D., Michael, T., Nemri, A., Roux, F., Salt, D. E., Tang, C., Todesco, M., Traw, M. B., Weigel, D., Marjoram, P., Borevitz, J. O., Bergelson, J., and Nordborg, M. (2010). Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298):627–631. [Carvalho et al., 2007] Carvalho, B., Bengtsson, H., Speed, T. P., and Irizarry, R. A. (2007). Exploration, normalization, and genotype calls of high-density oligonu- cleotide SNP array data. Biostatistics, 8:485–499. 10.1093/biostatistics/kxl042. [Chang and Lin, 2001] Chang, C.-C. and Lin, C.-J. (2001). LIBSVM:alibraryforsup- portvectormachines. Software available athttp://www.csie.ntu.edu.tw/ ˜ cjlin/libsvm. [Cheng and Li, 2005] Cheng, C. and Li, L. M. (2005). Sub-array normalization subject to differentiation. NucleicAcidsRes, 33(17):5565–73. automatic medline import. [Clark et al., 2007] Clark, R. M., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T. T., Fu, G., Hinds, D. A., Chen, H., Frazer, K. A., Huson, D. H., Scholkopf, B., Nordborg, M., Ratsch, G., Ecker, J. R., and Weigel, D. (2007). Common sequence polymorphisms shaping genetic diversity inArabidopsis thaliana. Science, 317:338–342. 10.1126/science.1138632. [Conrad et al., 2009] Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y ., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., MacArthur, D. G., MacDonald, J. R., Onyiah, I., Pang, A. W., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. (2009). Origins and functional impact of copy number variation in the human genome. Nature, advance online publication. 10.1038/nature08516. 75 [Consorti, 2007] Consorti, T. W. T. C. C. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–78. [Kang et al., 2010] Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.- Y ., Freimer, N. B., Sabatti, C., and Eskin, E. (2010). Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42(4):348–54. automatic medline import. [Kang et al., 2008] Kang, H. M., Zaitlen, N. A., Wade, C. M., Kirby, A., Heckerman, D., Daly, M. J., and Eskin, E. (2008). Efficient control of population structure in model organism association mapping. Genetics, 178:1709–1723. [Kim et al., 2007] Kim, S., Plagnol, V ., Hu, T. T., Toomajian, C., Clark, R. M., Ossowski, S., Ecker, J. R., Weigel, D., and Nordborg, M. (2007). Recombina- tion and linkage disequilibrium in Arabidopsis thaliana. Nat Genet, 39:1151–1155. 10.1038/ng2115. [Kurtz et al., 2004] Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. (2004). Versatile and open software for comparing large genomes. GenomeBiology, 5(2):R12. [Marioni et al., 2007] Marioni, J. C., Thorne, N. P., Valsesia, A., Fitzgerald, T., Redon, R., Fiegler, H., Andrews, T. D., Stranger, B. E., Lynch, A. G., Dermitzakis, E. T., Carter, N. P., Tavare, S., and Hurles, M. E. (2007). Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. GenomeBiol, 8(10):R228. automatic medline import. [Nordborg et al., 2005] Nordborg, M., Hu, T. T., Ishino, Y ., Jhaveri, J., Toomajian, C., Zheng, H., Bakker, E., Calabrese, P., Gladstone, J., Goyal, R., Jakobsson, M., Kim, S., Morozov, Y ., Padhukasahasram, B., Plagnol, V ., Rosenberg, N. A., Shah, C., Wall, J. D., Wang, J., Zhao, K., Kalbfleisch, T., Schulz, V ., Kreitman, M., and Bergelson, J. (2005). The pattern of polymorphism inArabidopsisthaliana. PLoSBiology, 3:e196 EP –. [Pique-Regi et al., 2008] Pique-Regi, R., Monso-Varona, J., Ortega, A., Seeger, R. C., Triche, T. J., and Asgharzadeh, S. (2008). Sparse representation and bayesian detec- tion of genome copy number alterations from microarray data. Bioinformatics, 24:309–318. [Platt et al., 2010] Platt, A., Horton, M., Huang, Y . S., Li, Y ., Anastasio, A. E., Mulyati, N. W., ˚ Agren, J., Bossdorf, O., Byers, D., Donohue, K., Dunning, M., Holub, E. B., Hudson, A., Le Corre, V ., Loudet, O., Roux, F., Warthmann, N., Weigel, D., Rivero, L., Scholl, R., Nordborg, M., Bergelson, J., and Borevitz, J. O. (2010). The scale of population structure inArabidopsisthaliana. PLoSGenet, 6(2):e1000843. 76 [Roberts et al., 2007] Roberts, A., McMillan, L., Wang, W., Parker, J., Rusyn, I., and Threadgill, D. (2007). Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows. Bioinformatics, 23(13):i401–7. [Scheet and Stephens, 2006] Scheet, P. and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. AmJHumGenet, 78(4):629–44. [Stein et al., 2002] Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J. E., Harris, T. W., Arva, A., and Lewis, S. (2002). The generic genome browser: a building block for a model organism system database. GenomeRes, 12(10):1599–610. automatic medline import. [Yu et al., 2006] Yu, J., Pressoir, G., Briggs, W. H., Vroh Bi, I., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., Kresovich, S., and Buckler, E. S. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. NatGenet, 38:203–208. 10.1038/ng1702. 77
Abstract (if available)
Abstract
This thesis is about several studies related to GWAS (Genome-wide Association Study) and genomic features of Arabidopsis thaliana. The main genetic data used throughout are 1000 ecotypes genotyped on a hybridization microarray that harbors 1 million probes [Kim et al., 2007, Clark et al., 2007] for 250k SNPs and 1.4 million genome tiling probes. First we adapted a genotype calling method to call 250k SNPs, which form the genotype end of SNP-GWAS. We also developed a novel normalization scheme for tiling probes and a pipeline to call deletions robustly for nearly 1000 accessions.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Comparative genomics of arabidopsis thaliana
PDF
Natural variation in Arabidopsis thaliana
PDF
Association mapping in Arabidopsis thaliana
PDF
Natural variation of Arabidopsis thaliana methylome and its impact on genome evolution
PDF
Linkage disequilibrium and its application to mapping in Arabidopsis thaliana
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Two-stage genotyping design and population stratification in case-control association studies
PDF
A population genomics approach to the study of speciation in flowering columbines
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Efficient two-step testing approaches for detecting gene-environment interactions in genome-wide association studies, with an application to the Children’s Health Study
PDF
Computational design for analysis of SNP association studies
PDF
Functional characterization of colon cancer risk-associated enhancers: connecting risk loci to risk genes
PDF
Genetic association studies of age-related macular degeneration from candidate gene to whole genome
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
A genome wide association study of multiple sclerosis (MS) in Hispanics
PDF
Transplantation and genetics of human pancreatic islets in diabetes: Approaches in translational medicine and statistics
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
New tools for whole-genome analysis of DNA replication timing and fork elongation in saccharomyces cerevisiae
PDF
A novel design of integrin α2β1 targeting peptide probe for molecular imaging in prostate cancer
PDF
Understanding the 3D genome organization in topological domain level
Asset Metadata
Creator
Huang, Yu (author)
Core Title
Analysis of genomic polymorphism in Arabidopsis thaliana
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
02/16/2011
Defense Date
10/30/2010
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
genome wide association studies,OAI-PMH Harvest,population genetics
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nordborg, Magnus (
committee chair
), Conti, David V. (
committee member
), Tavaré, Simon (
committee member
)
Creator Email
polyactis@gmail.com,polyactis@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3664
Unique identifier
UC1417184
Identifier
etd-Huang-4155 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-427625 (legacy record id),usctheses-m3664 (legacy record id)
Legacy Identifier
etd-Huang-4155.pdf
Dmrecord
427625
Document Type
Dissertation
Rights
Huang, Yu
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
genome wide association studies
population genetics