Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Exploring the genetic basis of complex traits
(USC Thesis Other)
Exploring the genetic basis of complex traits
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
EXPLORINGTHEGENETICBASISOFCOMPLEXTRAITS by QuanChen ADissertationPresentedtothe FACULTYOFTHEGRADUATESCHOOL UNIVERSITYOFSOUTHERNCALIFORNIA InPartialFulfillmentofthe RequirementsfortheDegree DOCTOROFPHILOSOPHY (COMPUTATIONALBIOLOGYANDBIOINFORMATICS) December2014 Copyright 2014 QuanChen Tomyparents. ii Acknowledgments FirstandforemostIwouldliketoexpressmysinceregratitudetomyadvisorProfessor FengzhuSun. ThroughallmyPh.D.years,Ihaveneverceasedtobeamazedbyhisvast and profound erudition as a scholar, his ingenuity and passion as a researcher, and his genuine care and insightful guidance as an educator. I feel extremely fortunate to have him as my academic advisor and as my constant inspiration in life. The work would neverhavebeenpossiblewithouthishelpandsupportallthroughtheyears. IwouldalsoliketothankProfessorXianghongZhouforinspiringthegeneticover- lap project, and plenty of useful discussion and advices along the way. Special thanks go to Professor Larry Goldstein, who offered numerous brilliant ideas on the genetic overlapproject. Iamhonoredtohavetheopportunitytoworkwithhim,thoughonlyfor a short period of time. It was certainly one of my most wonderful experiences here. I wouldalsoliketothankProfessorMichaelWatermanforhelpfulcommentsanddiscus- sion. Inaddition,IwouldliketothankProfessorTingChenforguidanceonmystudies and insightful suggestions on this work. The work has been supported by the National Institutes of Health (P50HG002790 and 1 U01 HL108634) and partially supported by theViterbiFellowship. Finally I wish to acknowledge my friends and colleagues for their generous help and support through the years. I am particularly grateful to Dr. Lin Wan who has been tremendous help to me in my early years as a Ph.D. student. I would also like to thank iii Dr. Li Xia, Dr. Wenhui Wang and all my group mates for their valuable advice and supportonthiswork. InadditionIwouldliketothankmypianoteacherChristineKim, my dear friends Yang Fu, Jing Zhang, Wangshu Zhang, Mengge Zhang, Jianghan Qu, Qingjiao Li, Pei Zhang, Min Xu, Shihua Zhang, Wenyuan Li, Chao Dai, Qiang Song, Ke Gong, Tianying Zhou, Che-yu Lee, Emad Bahrami-Samani, Misagh Bagherian and many more I cannot enumerate here. Your company has made this journey truly won- derful. iv Contents Dedication ii Acknowledgments iii ListofFigures viii ListofTables xiii Abstract xv 1 Introduction 1 1.1 ExploringtheGeneticBasisofComplexTraits . . . . . . . . . . . . . 1 1.2 ExploringtheCommonGeneticBasisofMultipleComplexTraits . . . 4 2 A Unified Approach for Allele Frequency Estimation, SNP Detection and AssociationStudiesBasedonPooledSequencingDataUsingEMAlgorithms 8 2.1 TheGeneralSetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 MaterialsandMethods . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.2 ComputationalMethods . . . . . . . . . . . . . . . . . . . . . 11 2.2.3 SimulationStudies . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.4 APooledSequencingDataSetRelatedtoType1Diabetes . . . 18 2.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 TheEffectsofMinorAlleleFrequency,SequencingErrorRate, NumberofIndividualsinthePoolsandNumberofPoolsonthe AccuracyofAlleleFrequencyEstimation . . . . . . . . . . . . 19 2.3.2 ResultsonthePowerofSNPCallingUsingtheLikelihoodRatio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 v 2.3.3 Results on the Power of Association Studies Using the Likeli- hoodRatioTest . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.4 ResultsontheAnalysisoftheType1DiabetesData . . . . . . 25 2.3.5 IdentifyingSNPsAssociatedwithType1Diabetes . . . . . . . 29 3 FindingGeneticOverlapsamongDiseasesBasedonRankedGeneLists 32 3.1 TheGeneralSetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.1 TestingwhetherGeneticOverlapExitsUsingScanStatistics . . 33 3.2.2 TestingwhetherGeneticOverlapExitsUsingWeightedSums . 37 3.2.3 EstimatingtheNumberofOverlappingDiseaseGenes . . . . . 39 3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3.1 SimulationResultsSummary . . . . . . . . . . . . . . . . . . . 41 3.3.2 GeneticOverlapofOMIMDiseases . . . . . . . . . . . . . . . 42 4 SummaryandFutureDirections 49 4.1 A Unified Approach for Allele Frequency Estimation, SNP Detection and Association Studies Based on Pooled Sequencing Data Using EM Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 FindingGeneticOverlapsamongDiseasesBasedonRankedGeneLists 51 4.3 FutureDirections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 ReferenceList 58 AppendixA SupplementaryMaterialsfor“AUnifiedApproachforAlleleFre- quencyEstimation,SNPDetectionandAssociationStudiesBasedonPooled SequencingDataUsingEMAlgorithms” 68 A.1 SupplementaryMethods . . . . . . . . . . . . . . . . . . . . . . . . . 68 A.1.1 DetailsofSimulationStudiesonEstimatingthePowerofAsso- ciationTestsUsingEM . . . . . . . . . . . . . . . . . . . . . . 68 A.2 SupplementaryResults . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.2.1 ResultsontheAccuracyof ^ f em UsingtheModifiedMeasuresof MSEandCg . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.2.2 The Effects of the Number of Chromosomes K in a Pool, the NumberofPoolsG,andtheNumberofReadsnontheEstima- tionAccuracyof ^ f em . . . . . . . . . . . . . . . . . . . . . . . 69 A.3 SupplementaryTables . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 A.4 SupplementaryFigures . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Appendix B Supplementary Materials for “Finding Genetic Overlaps among DiseasesBasedonRankedGeneLists” 86 B.1 SupplementaryResults . . . . . . . . . . . . . . . . . . . . . . . . . . 86 vi B.1.1 GeneticOverlapofOMIMDiseases . . . . . . . . . . . . . . . 86 B.1.2 GeneticOverlapofDiseasesintheNIHGWASCatalog . . . . 95 B.1.3 SimulationStudies . . . . . . . . . . . . . . . . . . . . . . . . 99 B.2 SupplementaryMethods . . . . . . . . . . . . . . . . . . . . . . . . . 108 B.2.1 DistributionofK . . . . . . . . . . . . . . . . . . . . . . . . . 108 B.2.2 DistributionofK r . . . . . . . . . . . . . . . . . . . . . . . . 109 B.2.3 DeterminingP-ValueofS M . . . . . . . . . . . . . . . . . . . 110 B.2.4 DeterminingP-Valueofp m . . . . . . . . . . . . . . . . . . . . 110 vii ListofFigures 2.1 Comparisonofperformancesbetween ^ f em and ^ f avg ,wheref =1%; start = 1%;n = 3000;K = 100; and G = 10. The two box plots on the left compare ^ f avg and ^ f em as an estimate of f and ^ f frac , where ^ f avg shows a considerable tendency of over-estimation compared to ^ f em . The upper right histogram shows how ^ f em deviates from f in 1000 simulations, whichappearstoberoughlyunbiasedlydistributedaroundf. Thelower right bar plot is a summary plot of the MSE for these two methods as estimatesoff andf frac ,showingsuperiorityof ^ f em bothasanestimator off andoff frac ,intermsofMSEandCg. . . . . . . . . . . . . . . . . 21 2.2 The power of detecting true SNPs at a type I error of 0.05, varying one parameter at a time while fixing all other parameters at default values. Default: (K;G;f;;)=(100;20;1%;0:5%;1:2). . . . . . . . . . . . 24 2.3 The power of detecting associated SNPs at a type I error of 0.05. Each subplot displays the effect of one parameter and the number of readsn (black indicatesn = 1000, and blue indicatesn = 3000) on the power of the test, while fixing all other parameters at default values. Default: (K;G;f;)=(100;10;1%;0:5%). . . . . . . . . . . . . . . . . . . . 26 2.4 dbSNPratioandTi/Tv(transition/transversion)ratioofthetop100vari- ants called by EM-SNP and SNVer, whose minor allele frequencies are lessthanthecorrespondingthresholdlabeledbythex-axis. . . . . . . . 28 3.1 Outlineofthemethods. GiventwodiseasesAandB,weobtainforeach disease a list of genes ranked by their strength of association to the dis- ease, regardless of the types of experiments that produce these ranks. Then based on our proposed genetic similarity measures, we can com- pute the value of dissimilarity between diseases A and B, and similarly for all other disease pairs, giving a dissimilarity matrix. The dissimi- laritymatrixprovidesaquantitativeestimateofthegeneticoverlapping pattern among a set of diseases, which is readily applicable to further analysis. Furthermore we can identify the genes underlying the genetic overlapsandprovideanoverlappingpatternforasetofdiseasesofinter- est. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 viii 3.2 PerformanceofWeiSumEexplainingtheMimMinerphenotypesimilar- ity,comparedwithtwoothersimplermethods#OMIMandHyperPwith differentcutoffpositions. Theaveragefoldchangeisusedasameasure for performance here, detailed definitions and explanations are in Sec- tion B.1.1. In particular, HyperP similarity performance is listed with varyingcutoffpositions10,20,50,100,200,500,1000. . . . . . . . . . . 44 3.3 A heatmap of the WeiSumE dissimilarity quantile matrix among 5 eye- relateddiseases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 A Venn diagram of the genetic overlap among 5 eye-related diseases. This Venn diagram shows the number of overlapping genes among any subsetofdiseasesfromthe5eye-relateddiseases. . . . . . . . . . . . . 48 4.1 TheTypeIerrorratesandpoweroftheweightedsumstatisticsv.s. their two dimensional extensions, taking d = 10;n = 25;000;o = 0 under nullando=1underalternative. . . . . . . . . . . . . . . . . . . . . . 56 4.2 TheTypeIerrorratesandpoweroftheweightedsumstatisticsv.s. their two dimensional extensions, taking d = 20;n = 25;000;o = 0 under nullando=1underalternative. . . . . . . . . . . . . . . . . . . . . . 57 A.1 The log values of the relative errors (RE) of ^ f avg and ^ f em for the 288 simulations. The four sections (72 simulations each) divided by the solid lines correspond to simulations with f = 0:1%;0:5%;1%;5%, respectively. For ^ f avg the decrease by section indicates the significant effect of f on RE. Within each section the two subsections (36 simu- lations each) divided by the dashed line correspond to simulations with n = 1000;3000, respectively. For ^ f avg the indifference between sub- sections indicate no effects of n on RE. Within each subsection the four quarters (9 simulations each) correspond to simulations with = 0:05%;0:1%;0:5%;1%, respectively. For ^ f avg the step-like shape indi- cateslittleeffectofG;K onRE,comparedtosignificanteffectof. . . 74 A.2 The effects off and on the accuracy of ^ f em as an estimator off. The histograms of ^ f em under different combinations of f and are shown, (K;G;n)=(100;10;3000). . . . . . . . . . . . . . . . . . . . . . . . 75 A.3 The effects of f and on the accuracy of ^ f em as an estimator of f frac . The histograms of ^ f em −f frac under different combinations of f and areshown,(K;G;n)=(100;10;3000). . . . . . . . . . . . . . . . . . 76 A.4 The effects of(K;G) on the estimation accuracy of ^ f em as an estimator off,(n;f; start )=(1000;0:01;0:01). . . . . . . . . . . . . . . . . . . 77 A.5 The effects of(K;G) on the estimation accuracy of ^ f em as an estimator off frac ,(n;f; start )=(1000;0:01;0:01). . . . . . . . . . . . . . . . . . 78 A.6 The effects of(K;G) on the estimation accuracy of ^ f em as an estimator off,(n;f; start )=(3000;0:01;0:01). . . . . . . . . . . . . . . . . . . 79 ix A.7 The effects of(K;G) on the estimation accuracy of ^ f em as an estimator off frac ,(n;f; start )=(3000;0:01;0:01). . . . . . . . . . . . . . . . . . 80 A.8 The power of SNP detection at a type I error of 0.05 using cases and controlstogether,(K;G;f;;)=(100;20;1;0:5%;1:2). . . . . . . . 81 A.9 Allele frequency spectrum of the top 100 variants called by EM-SNP andSNVer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.10 dbSNP proportion and Ti/Tv proportion of the top 100 variants called byEM-SNPandSNVer,groupedbyallelefrequencylevel. . . . . . . . 83 A.11 Allele frequency spectrum of the top 150 variants called by EM-SNP andSNVer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 A.12 The dbSNP ratio and Ti/Tv ratio of the top 150 variants called by EM- SNPandSNVer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 B.1 ScatterplotsofMimMiner,HPO,Lagediseasepairsimilarityranksver- sus each other for the 89 OMIM diseases. Each plot contains only dis- easepairspresentinbothphenotypesimilaritymatrices. . . . . . . . . 87 B.2 DistributionoftheNumberofOverlappingGenesbetweenDiseasePairs. 87 B.3 Distributionofthenumberofdiseasesassociatedwithpleiotropicgenes. For each gene, we refer to the number of diseases that it is identified as an overlapping gene with another disease as a measure of its extent of “pleiotropy”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 B.4 A histogram of the WeiSumE dissimilarity measure values between all diseasepairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 B.5 Disease pair rank by its negative MimMiner phenotype similarity v.s. rank by its genetic dissimilarity (WeiSumE, taking = 0), so that the more similar a pair is in its phenotype or genetic measure, the higher it is ranked by the corresponding meausre. Since 22 diseases from the 89 OMIM diseases we study are not included in MimMiner, this figure displaysallpairsoftheremaining67diseases. . . . . . . . . . . . . . . 89 B.6 The complete density fold change curve. The dashed line marks the value 1. The solid line is the density fold change curve of Fig B.5, representing the fold change of density relative to the density of the entireplot. Thedensityisdefinedbythenumberofdiseasepairswhose phenotypicandgeneticranksarebothlessorequaltothecorresponding xvalue,dividedbyx 2 ,i.e. thenumberofpointsintheleftbottomsquare inFigB.5,dividedbytheareaofthesquarewithsidelengthequaltothe correspondingx value. The density of the entire scatter plot in Fig B.5 is one over the total number of disease pairs, which is also the density ofthecasewhereonesetofranksisarandompermutationoftheother. 91 x B.7 The density fold change curve for disease pairs ranked in top 100 by both rankings. The dashed line marks the value 1. The solid line is the density fold change curve of Fig B.5, representing the fold change of density relative to the density of the entire plot. The density is defined asthenumberofdiseasepairswhoseranksarebothlessorequaltothe corresponding x value, divided by x 2 , i.e. the number of points in the left bottom square in Fig B.5, divided by the area of the square with sidelengthequaltothecorrespondingxvalue. Thedensityoftheentire scatter plot in Fig B.5 is one over the total number of disease pairs, which is also the expected density in the case that one set of ranks is a randompermutationoftheother. . . . . . . . . . . . . . . . . . . . . 92 B.8 TheblacklinesaretheaveragedensityfoldchangefromFigB.6corre- spondingtodifferentr or values. TheredlinesareforHPOsimilarity andthegreenlinesareforLagesimilarity. . . . . . . . . . . . . . . . . 93 B.9 Hierarchical clustering result of 89 OMIM diseases. A dendrogram for the hierarchical clustering result of 89 diseases in the OMIM database basedonWeiSumEgeneticsimilarity. . . . . . . . . . . . . . . . . . . 96 B.10 An example of the overlap vectors against random permutations of two disease pairs. Disease pair 1 is a closely related pair: Hypotrichosis 2 (HYPT2) and Peeling skin syndrome; Disease pair 2 is a distant pair: Hypotrichosis 2 (HYPT2) and Fibrochondrogenesis 2 (FBCG2). The blacklinesareobtainedbypermutingthefirstrankedlistfor1000times as 1000 the second ranked list, then taking average of each element of theresulting1000overlapvectors. . . . . . . . . . . . . . . . . . . . . 97 B.11 Hierarchical clustering result on 122 diseases in the catalog of GWAS datausingK 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 B.12 The effect of 2 on the Type I error rates of K r and S M under the null hypothesis and their power under the alternative hypothesis, tak- ing = 0:05, n = 11000, d = 10, o = 0 under null and o = 5 under alternative hypothesis. The vertical yellow lines indicate the true num- berofoverlappingassociatedgenes. . . . . . . . . . . . . . . . . . . . 101 B.13 Effect ofn: the Type I error rates ofK r andS M under the null hypoth- esis and their power under the alternative hypothesis, taking 2 = 0:1, =0:05,d =10,o=0undernullando=5underalternativehypoth- esis. The vertical yellow lines indicate the true number of overlapping associatedgenes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 B.14 Effect ofd: the Type I error rates ofK r andS M under the null hypoth- esis and their power under the alternative hypothesis, taking 2 = 0:1, = 0:05, n = 11000, o = 0 under null and o = 5 under alterna- tive hypothesis. The vertical yellow lines indicate the true number of overlappingassociatedgenes. . . . . . . . . . . . . . . . . . . . . . . . 103 xi B.15 Effect ofo: the Type I error rates ofK r andS M under the null hypoth- esis and their power under the alternative hypothesis, taking 2 = 0:1, = 0:05, n = 11000, d = 10, o = 0 under null and o = 1;5;10 under alternative hypothesis. The vertical yellow lines indicate the true numberofoverlappingassociatedgenes. . . . . . . . . . . . . . . . . . 104 B.16 The power of the three sets of weighted sums, taking 2 = 0:05;d = 10;n=25;000;o =1. . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B.17 The Type I error rates and power of the three sets of weighted sums, taking 2 = 0:05;d = 10;n = 25;000;o = 0 under null and o = 1 underalternative. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 B.18 The Type I error rates and power of the minimum p-value statistic p m from each of the three sets of weighted sums, taking 2 = 0:05;d = 10;n=25;000;o =0undernullando=1underalternative. . . . . . . 106 B.19 The number of overlapping genes determined by ^ o 1 ;^ o 2 and ^ o 3 , taking 2 = 0:1, = 0:05, n = 25000, d = 10 and o = 5. The horizontal yellowlineindicatesthetruenumberofoverlappingassociatedgenes. . 107 xii ListofTables 2.1 Scenarios where MSE em > MSE avg or Cg em > Cg avg out of 18 total scenariosforeachcell. . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2 The average RE of ^ f avg and ^ f em for different values of minor allele fre- quencyf andsequencingerrorrate. . . . . . . . . . . . . . . . . . . 22 2.3 The mean squared error (MSE) and Cg of ^ f em as an estimator for f or f frac forvariouscombinationsoff and,(K;G;n)=(100;10;3000). 23 2.4 Testing for association between common SNPs ( ^ f 0 ≥ 1%) and T1D with Fisher’s preliminary p-value at moste−5. In the table, ^ f 0 and ^ f 1 aretheestimatedminorallelefrequenciesusingtheEMalgorithminthe controlsandcases,respectively,andn i =⌊960× ^ f i ⌋; i=0;1. Fisher’s p-valueisthepreliminaryp-valueusingFisher’sexacttestasin[1]and the EM p-value is the p-value calculated based on the likelihood ratio statisticinequation2.8. . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1 Comparisonofdifferentstatisticsinsimulationstudies. Foreachlength of gene list n, the number of parameter settings out of a total of 144 differentsettingsthatamethodshowsthehighestpower,includingties, takingp-valuethresholdof0.05. . . . . . . . . . . . . . . . . . . . . . 41 3.2 The mean squared error (MSE) of different estimators of the number of overlapping genes when 2 = 0:1. 2 is a parameter describing the accuracy of the ranking algorithm. The larger 2 is, the less accurate the ranking algorithm is. n×1000 is the total number of genes in the genome. M isthecutoffpositionfora ^ o 1 ,whereonlyuptoM overlap- pinggenesareconsidered. . . . . . . . . . . . . . . . . . . . . . . . . 42 A.1 The values of 2 m (5) (defined in terms of MSE and Cg) for ^ f em as an estimator of f or f frac for different values of f and , (K;G;n) = (100;10;3000). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.2 The values of 2 m (10) (defined in terms of MSE and Cg) for ^ f em as an estimator of f or f frac for different values of f and , (K;G;n) = (100;10;3000). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.3 ThevaluesofMSEandCgfor ^ f em asanestimatoroff andf frac ,respec- tively,fordifferentvaluesof(K;G),(n;f; start )=(1000;0:01;0:01). 71 xiii A.4 ThevaluesofMSEandCgfor ^ f em asanestimatoroff andf frac ,respec- tively,fordifferentvaluesof(K;G),(n;f; start )=(3000;0:01;0:01). 71 A.5 The modified mean-squared-error 2 m (5) and 2 m (10) of the estimated valuesoff fordifferentnumbersofchromosomesinapoolK andnum- bersofpoolsG,(n;f; start )=(1000;0:01;0:01). . . . . . . . . . . . . 72 A.6 The modified mean squared error 2 m (5) and 2 m (10) of the estimated valuesoff fordifferentnumbersofchromosomesinapoolK andnum- bersofpoolsG,(n;f; start )=(1000;0:01;0:01). . . . . . . . . . . . . 72 A.7 The total number of SNPs (2nd and 5th columns), the number of SNPs in the dbSNP database (3rd and 6th columns), and the number of novel SNPs (4th and 7th columns) among the top a) 100 and b) 150 EM- SNP or SNVer called SNPs with minor allele frequency at most 0.2% accordingtoeitherf em orf avg . ThedbSNPratiofortheSNPscalledby EM-SNPismuchhigherthanthatforSNVer. . . . . . . . . . . . . . . 72 A.8 Thenumberoftransitions(Ti)(2nd,4thand6thcolumns)andthenum- ber of transversions (Tv) (3rd, 5th and 7th columns) among the top a) 100 and b) 150 EM-SNP or SNVer called SNPs with minor allele fre- quencyatmost0.2%accordingtoeitherf em orf avg . TheTi/Tvratiofor theSNPscalledbyEM-SNPismuchhigherthanthatforSNVer. . . . . 73 B.1 Themeansquarederror(MSE)ofdifferentestimatorsforthenumberof overlappinggenes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 xiv Abstract Thisthesisaimstoexplorethegeneticbasisofcomplextraits. Itaddressesthisproblem intwoparts. The first part stems from the general difficulties encountered in previous efforts to explain complex traits with discovered associated genetic variants. Accordingly, increasing attention has been paid to rare variants that have been overlooked in tradi- tional methods. To accommodate the special properties of rare variant studies, pooled sequencing is often used as a cost-effective design. Based on pooled sequencing data, we develop a unified approach for rare variants detection, allele frequency estimation and association studies. The method is thoroughly evaluated on both simulation data andarealre-sequencingdataset,andhasshownsuperiorperformance. Thesecondpartfurtherstudiestheproblemoffindingcommongeneticbasisbetween traits. We propose a set of test statistics to detect genetic overlap between diseases, and a set of estimators to measure the size of genetic overlap. Performance of these methods under various scenarios is evaluated in simulation studies. Based on this, we choose the most appropriate method to study the relationship between diseases from the OMIM database and diseases in the catalog of GWAS. Both applications demon- stratetheeffectivenessofourmethodsinprovidinginsightsintotherelationshipsamong diseases based on their shared genetic mechanisms, and in identifying common genes underlyingthegeneticoverlapbetweendiseases. xv Chapter1 Introduction 1.1 ExploringtheGeneticBasisofComplexTraits Finding genomic variants associated with complex traits is one of the most important problems in modern genomics. Genome-wide association studies (GWAS) based on common variants have been the dominant approach to achieve this objective [2]. How- ever,thegenomicvariantsidentifiedinGWASstudiesoftenexplainonlyasmallportion of the phenotypic variation related to heritable human diseases, a phenomenon known as “missing heritability” in genomics literature [3]. This missing heritability problem has led to increasingly skeptical views of the common disease-common variant (CD- CV)hypothesiswhichpredictsthatcommondisease-causingalleles,orvariants,willbe foundinallhumanpopulationsthatmanifestagivendisease. Ontheotherhand,interest instudiesonrarevariantswithminorallelefrequencieslessthan1%isgrowing[4,5]. Studies of rare variants are complicated by the low minor allele frequencies of rare variants. The development of next generation sequencing (NGS) technologies such as Illumina and Roche 454 has made it possible to sequence a large number of reads eco- nomically. Despitethatprogress,sequencingalargenumberofindividualsseparatelyis stillcostly formost biological laboratories. One frequently adoptedapproach toreduce sequencing cost in the search of rare variants is pooled sequencing, where mixtures of geneticmaterialsfromseveralindividualsaregroupedtogethertoformapoolforasin- gle sequencing. While this design greatly lowers the sequencing cost, it also makes it hardtodistinguishtruegeneticpolymorphismsfromsequencingerrors, estimateminor 1 allele frequencies at the polymorphic loci, and perform association studies on the rare variants. Severalresearchgroupshaveusedpooledsequencingtodetectrarevariantsthatare associatedwithcomplextraitssuchasretinitispigmentosa,diabetes,cancer,andinflam- matoryboweldisease[1,6–8]. Therearegenerallytwotypesofpoolingdesigns. Oneis poolingoftaggedsampleswitheachindividualtaggedbyauniqueshortbarcode. Inthis design, the genomic origins of the reads can be identified. However, barcoding many individualsanddistinguishingthesebarcodesfromeachothercanstillbeachallenging task. Thesecondtypeofpoolingistomixthegeneticmaterialsfromdifferentindividu- alswithouttagging,andthengeneratereadsfromthemixtureofgeneticmaterialsusing NGS. With this design, the identities of the individuals from whom the reads originate cannot be identified. In this dissertation, we concentrate on the second type of pooling design. Several groups have developed SNP calling methods based on pooled sequencing data [8–12]. Out et al. [8] modeled the number of sequencing errors as a Poisson ran- dom variable and identified SNPs by comparing the number of minor alleles within the reads with the Poisson distribution. For rare variants with minor allele frequencies similar to or lower than the sequencing error rate, this approach could miss many true variants if the pool size is relatively large. Druley et al. [9] developed a SNP identifica- tion method, SNPSeeker, that can be applied to large pools by using control sequences without SNPs. In many studies, control sequences may not be available, making this approach impractical. Also, the program can only be used to analyze Illumina data. Bansal et al. [10] developed a method called CRISP to identify rare variants by com- paring the minor allele frequencies across multiple pools using contingency tables. It wasshownthatCRISPoutperformsSNPSeekerintermsofaccuracy,butCRISPismore computationally demanding. Altmann et al. [13] improved the computational speed of 2 CRISPandidentifiedSNPsasthevariantswithdifferentminorallelefrequenciesacross at least two pools. Wei et al. [11] developed a statistical tool, called SNVer, for variant identification. Foreachpool,SNVerfirstdefinedap-valuebytestingthehypothesisthat the minor allele frequency is above a given threshold and then combined the p-values for individual pools to give an overall p-value using Simes methods as in [12]. This algorithmmakesitpossibletoranktheobservedvariantssothatthetoprankedonesare more likely to be true SNPs. However, the algorithm needs a pre-specified sequencing error rate which can be difficult to do because the sequencing error rates can be posi- tion dependent. In the above studies, the investigators are mainly concerned with the detectionofSNPs;theydonotaimtoestimateminorallelefrequencies. Inordertoestimateminorallelefrequenciesinpoolingstudies,severalgroupsdevel- oped statistical models for the sampling of individuals and the sampling of reads from the individuals in the pools [14,15]. These studies assumed a pre-defined constant sequencingerrorrateacrossdifferentloci. However,sequencingerrorratescanvaryfor differentlocidependingonthenucleotidecontentsofthesurroundinggenomicregions. The effects of mis-specifying the sequencing error rate on minor allele frequency esti- mation, SNP detection and power of association studies are not clear. Using similar models, Lee et al. [16] studied an optimal design in pooling studies. This involved the number of individuals in each pool and the number of pools. Recently, Chen et al. [17] considered more complex issues such as uneven sampling of individuals, different cov- erage of the minor and major alleles due to either PCR amplification or reads mapping, andreadsqualityscores. InChapter2,wedevelopnewmethodsforestimatingminorallelefrequencies,SNP detection, and association studies using pooled sequencing data based on the models in [14–16]. In contrast to methods developed in previous studies, we do not assume that the sequencing error rate is constant. Instead, we estimate the sequencing error 3 rate for each position together with the minor allele frequency based on the minor and major allele counts for all the pools using an expectation-maximization (EM) algo- rithm. We show that the naive estimation of the allele frequency by the fraction of minorallelesinthereadscanbesignificantlyinflated,especiallyforrarevariants,while the EM approach can give an unbiased estimate of the minor allele frequency in all sit- uations we studied. The estimation accuracy of the EM algorithm increases with the number of reads and the number of pools, but decreases with the number of chromo- somesineachpool. Basedontheallelefrequencyestimation,wedevelopaSNPcalling method, EM-SNP, and an association test using likelihood ratio statistics. The likeli- hood ratio statistic used in EM-SNP is then used to rank candidate polymorphic loci to determine if they are true polymorphisms. Using a real re-sequencing dataset, we showthat,forrarevariantswithminorallelefrequencieslowerthan1%,thefractionof dbSNPs (http://www.ncbi.nlm.nih.gov/projects/SNP/) among the SNPs called by EM- SNP is higher than that of SNVer. Similarly, the transition/transversion ratio of rare variants called by EM-SNP is higher than that of SNVer. These observations show that EM-SNP outperforms SNVer at calling rare variants with minor allele frequencies less than1%. 1.2 Exploring the Common Genetic Basis of Multiple ComplexTraits In addition to the variants identified by genome-wide association studies (GWAS) that are associated with a single trait, these findings have also provided an unprecedented insight into the biological mechanism of common genetic variants underlying multiple complex traits, especially human diseases. Susceptibility genes common to different 4 related diseases are found in numerous studies [18–28], providing evidence in neurol- ogy, psychiatry, and some other autoimmune diseases that several different diseases mayshareacertainextentofgeneticoverlap. Blonigenetal.[23]foundgeneticoverlap between psychopathic personality traits and composites of internalizing and external- izing. A review study in autism, schizophrenia and bipolar disorder discussed various studiesthatprovidedstrongevidencethattheseneurodevelopmentaldisordersmayshare common pathogenic mechanisms, so specific genetic loci and alleles may contribute to thesusceptibilityofdifferentdiseases[24]. Dammeetal.[25]conductedacase-control study of amyotrophic lateral sclerosis and spinocerebellar ataxia type 2 that revealed genetic overlap between the two diseases. Another twin study in neurocognition and schizophrenia showed that 92% of the phenotypic correlation between intelligence and schizophrenia is due to common additive genetic factors [26]. There was also evidence ofcommongeneticfactorslinkedtobothAlzheimerdiseaseandvasculardementia[27]. In another study of three autoimmune diseases, the variation within the TAGAP gene wasfoundtobeassociatedwithallthreediseases[28]. These studies suggest that exploring the common genetic risk factors for related diseases or phenotypes can help advance our understanding of disease etiologies [23]. The genetic overlap can also help identify new disease genes and suggest important commonbiologicalpathwaysthatcouldbeinformativeinidentifyingtherapeutictargets formultiplediseases[28]. Ontheotherhand,therisinginterestinidentifyingpleiotropicgenesfromthegenetic findingsofrelateddiseases[29]callsforefficientmethodstoidentifydisease-associated variantscommontomultiplediseases. Tostudythegeneticoverlapamongdifferentphe- notypes,Rzhetskyetal.[30]proposedaprobabilisticmodeltodescribethelinkageofan individual’s genetic variation to multiple disease status at different stages of one’s life, 5 and built a phenotypic disease network consisting of 657 diseases. This analysis indi- cated that genetic overlap among different disease phenotypes is so prevalent that most disorders studied are rooted in these shared genetic variations one way or another. Cot- sapas et al. [31] developed a method called Cross Phenotype Meta-Analysis (CPMA) which detects the association of a SNP to multiple phenotypes, and clustered SNPs potentially associated with multiple immune diseases. Suthram et al. [32] used gene expression data of several diseases and a protein interaction network to study the com- mon“functionalmodules”fordiseases. Foreachdisease,theydefinedascorebasedon thedifferencebetweencasesandcontrolsforeachgene,calledgeneactivityscore;then they defined the “module response score” as the average of gene activity score; finally these module scores consist a vector for each disease. Based on these feature values, the authors presented the hierarchical clustering of the diseases, the network presenta- tion of the diseases, and drug repositioning. Linghu et al. [33] studied the association between diseases by their mutual predictability, a quantitative measure of how genes related to one disease can predict the genes related to the other disease. Another study ingeneticoverlapisbetweenCrohn’sDisease(CD)andPsoriasis(PS)[34]. Indications of overlapping genetic or environmental factors between these two diseases include the observationthatthetwodiseasesoccurtogethermorefrequentlythanexpected,andthat individual GWAS studies for each of them find common SNPs that are associated with bothdiseases. Moreover, when there are multiple studies for each disease, the study of genetic overlap among diseases may involve ranking genes for each study first then using data integration methods to integrate multiple ranked lists [35–41]. Therefore, it is neces- sary to consider the problem of finding genetic overlap among diseases under a more general setting, based on ranked lists of genes that could be obtained from any hetero- geneous types of studies, rather than from a specific type of data or a homogeneous 6 study. A closely related paper [42] studied the problem of list-intersection test for gene lists, including its control of Type I error, within-set FDR and sensitivity. Several other studies [38,43–49] also addressed related topics of detecting overlap between ranked gene lists. However, most methods either use a fixed cutoff position and only consider intersection between lists above that position, or use a weighting parameter to impose a higher weight on the top part of lists; or using varying cutoffs to select the one that produces the most significant result then evaluate the significance of that result, which inducesheaviercomputationburden. InChapter3,wedevelopnovelmethodstostudydiseaserelationshipsbasedontheir geneticsimilarity. Weproposeaparameter-freestatisticandanestimatorofthegenetic overlap size that show relatively good performance in simulations, indicating strong medicalsignificanceinapplication. ThenweapplyourproposedWeiSumEgeneticsim- ilarity measure to detect genetic overlaps between diseases from the OMIM database. In addition, we include another application on diseases in the catalog of GWAS from NationalHumanGenomeResearchInstituteinAppendixB. 7 Chapter2 AUnifiedApproachforAllele FrequencyEstimation,SNPDetection andAssociationStudiesBasedon PooledSequencingDataUsingEM Algorithms 2.1 TheGeneralSetup 2.1.1 Background Genome-wide association studies (GWAS) have identified many common polymor- phisms associated with complex traits. However, these associated common variants explain only a small fraction of the phenotypic variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for “missing” heritability are drawing increasing attention, particularly for rare variant studies that often require a large sample size and, thus, extensive sequencing effort. Although the development of next generation sequencing (NGS) technologies has made it possible to sequence a large number of reads economically and efficiently, it is still often cost prohibitive to sequencethousandsofindividualsthataregenerallyrequiredforassociationstudies. A 8 more efficient and cost-effective design would involve pooling the genetic materials of multiple individuals together and then sequencing the pools, instead of the individuals. Thispooledsequencingapproachhasimprovedtheplausibilityofassociationstudiesfor rare variants, while, at the same time, posed a great challenge to the pooled sequencing dataanalysis,essentiallybecauseindividualsampleidentityislost,andNGSsequencing errorscouldbehardtodistinguishfromlowfrequencyalleles. 2.1.2 Results A unified approach for estimating minor allele frequency, SNP calling and association studiesbasedonpooledsequencingdatausinganexpectationmaximization(EM)algo- rithm is developed in this chapter. This approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, number of individ- uals in each pool, and the sequencing depth on the estimation accuracy of minor allele frequencies. We show that the naive method of estimating minor allele frequencies by taking the fraction of observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach can give an unbiased estimate of the minor allele frequency under all scenarios studied in this chapter. A SNP calling approach, EM-SNP,forpooledsequencingdatabasedontheEMalgorithmisthendevelopedand comparedwithanotherrecentSNPcallingmethod,SNVer. WeshowthatEM-SNPout- performs SNVer in terms of the fraction of dbSNPs among the called SNPs, as well as transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the associationbetweenvariantsandtypeIdiabetes. 9 2.1.3 Conclusions TheEM-basedapproachfortheanalysisofpooledsequencingdatacanaccuratelyesti- mate minor allele frequencies, call SNPs, and find associations between variants and complextraits. Thisapproachisespeciallyusefulforstudiesinvolvingrarevariants. 2.2 MaterialsandMethods 2.2.1 Notations Consideralocusalongthegenome. Letf bethefrequencyoftheminorallele(denoted as“1”),and1−f bethefrequencyofthemajorallele(denotedas“0”)inapopulation. Wealsoconsiderthefollowingpotentialsequencingerrorrates: P(1|0)=; P(0|1)=: AssumethatatotalofGpoolsofindividualsaresequencedandeachpoolcontainsK=2 individuals (K chromosomes). For each pool g, a total of n g reads are mapped to this locus, withn 0g reads carrying the major allele andn 1g reads carrying the minor allele. Thusn g =n 0g +n 1g . LetC be the number of chromosomes carrying the minor allele among theK chro- mosomesinapool. ThenC followsabinomialdistribution,i.e P{C =k}=Bin(k;K;f)= ( K k ) f k (1−f) K−k ; k =0;1;2;··· ;K: ConditionalonC =k,theprobabilitythatasequencereadcoveringavariantcarries theminoralleleis p k = k K (1−)+ K−k K : 10 Thus,theprobabilityofobservingthedatafortheg-thpoolis, P g (n 0g ;n 1g |f;;)= K ∑ k=0 Bin(n 1g ;n g ;p k )Bin(k;K;f): (2.1) Since the pools can be considered independent, the likelihood of observing the data forallthepoolsis L(f;;)= G ∏ g=1 P g (n 0g ;n 1g |f;;): Given the above likelihood expression and the data {(n 0g ;n 1g );g = 1;2;··· ;G}, our objectivesareasfollows • Findthemaximumlikelihoodestimateof(f;;). • DeterminewhetheranobservedvariantisatrueSNPornot,i.e. SNPcalling. • Findthevariantsassociatedwithaphenotypeofinterest. 2.2.2 ComputationalMethods Anexpectation-maximization(EM)approachforallelefrequencyestimation Based on the likelihood function, an approximate solution to the maximum likelihood estimation of the parameters can be obtained using the EM algorithm. We consider the followingmissingdata: • C g : thenumberofchromosomescarryingtheminoralleleintheg-thpool; • I gi : the true underlying allele state (major (0) or minor (1)) of read i in the g-th pool; • r gi : the observed allele state (major (0) and minor (1)) of thei-th read in theg-th pool. 11 Wealsousethefollowingnotation: C = ∑ G g=1 C g , T g = ∑ ng i=1 I gi , T 11 = ∑ G g=1 ∑ ng i=1 I gi r gi , T 10 = ∑ G g=1 ∑ ng i=1 I gi (1−r gi ), T 01 = ∑ G g=1 ∑ ng i=1 (1−I gi )r gi , T 00 = ∑ G g=1 ∑ ng i=1 (1−I gi )(1−r gi ). Basedontheabovenotation,thecompletelog-likelihoodis: logP(T (g) ij ;C g ;n 1g ;n 0g ;g =1;··· ;G|f;;) =Clogf +(GK−C)log(1−f)+ G ∑ g=1 { log ( K Cg )( ng n 1g ) ( n 1g T (g) 01 )( n 0g T (g) 10 ) +T (g) 1: log C g K +(n g −T (g) 1: )log(1− C g K ) } +T 11 log(1−)+T 10 log+T 01 log+T 00 log(1−): (2.2) Supposethatthevalueof=(f;;)atthet-thiterationis (t) =(f (t) ; (t) ; (t) ). Themaximization(M)-stepgives: f (t+1) = E (t) (C) G×K ; (t+1) = E (t) (T 01 ) E (t) (T 01 )+E (t) (T 00 ) ; (t+1) = E (t) (T 10 ) E (t) (T 10 )+E (t) (T 11 ) : NotetheexpectationE (t) istakenwhentheparametersareat (t) . Theexpectation(E)-stepisformulatedasfollows: E (t) (C g |Data)= ∑ K k=0 k( K k )f k (1−f) K−k ( ng n 1g )(p k ) n 1g (1−p k ) n 0g ∑ K k=0 ( K k )f k (1−f) K−k ( ng n 1g )(p k ) n 1g (1−p k ) n 0g ; (2.3) 12 and E (t) (C|Data)= G ∑ g=1 E (t) (C g |Data); (2.4) wherealltheparametersintheequationsareofthevaluestakenatthet-thstep. FromEquations2.3and2.4,weareabletoobtaintherecursiveformulaforf. NextwecalculateE (t) (T 11 |Data). Notethat E (t) (I gi r gi |Data)= ∑ K k=0 k K (1−)Bin(n 1g −1;n g −1;p k )Bin(k;K;f) P((n 0g ;n 1g )) ; which does not depend on i, and we denote it as E(I g r g |(n 0g ;n 1g )). The denominator P((n 0g ;n 1g ))canbecalculatedfromEquation2.1. Thus, E (t) (T 11 |Data)= G ∑ g=1 n g E(I g r g |(n 0g ;n 1g )): (2.5) Similarly, we can derive the formulas for E (t) (T 10 |Data), E (t) (T 01 |Data) and E (t) (T 00 |Data),andtherecursiveformulasforand canbederivedfromthem. SNPidentificationusingEM Due to sequencing errors, the observed variants may contain a significant amount of false positives, i.e. loci that are not truly polymorphic. Thus, before testing for associ- ations with phenotypes, we need to determine the true polymorphic sites. This step is especiallyimportantinthecaseofrarevariantssincethesequencingerrorratesforNGS couldbeclosetoorevenhigherthantheminorallelefrequencies. Consideracase-controlstudywithagroupofcaseindividualsandanothergroupof control individuals. Letf 1 andf 0 be the minor allele frequencies at a locus among the cases and controls, respectively. Denote f = (f 0 ;f 1 ) and 0 = (0;0). We can test if an 13 observed variant is a true SNP using the likelihood ratio test for H 0 : f 0 = f 1 = 0 vs. H 1 :f 0 >0orf 1 >0: =2(l f̸=0 −l f=0 ) H 0 ∼ 1 4 I 0 + 1 2 2 1 + 1 4 2 2 ; (2.6) wherel f is the maximum log-likelihood of the observed data for both the cases and the controls. Note that the null hypothesis f = 0 is on the boundary of the region of the parameters of interest. Therefore, the asymptotic distribution of is 1 4 I 0 + 1 2 2 1 + 1 4 2 2 when the number of pools is large according to [50], where I 0 is the point mass at 0 and 2 i ; i = 1;2 are the chi-square distributions with i degrees of freedom. When the numberofpoolsisrelativelysmall,simulationapproachesforthenulldistributionof areneededtoobtaintheasymptoticdistribution. We can also test if an observed variant is a true SNP using cases or controls sep- arately. For the control pools, we conduct a likelihood ratio test for H 0 : f 0 = 0 vs. H 1 :f 0 >0. Similarly,wereplacef 0 byf 1 forthecasepools. Wethenusethestatistic i =2(l f i ̸=0 −l f i =0 ) H 0 ∼ 1 2 I 0 + 1 2 2 1 ; i=1;2; (2.7) to test each hypothesis, where l f 1 and l f 0 are the maximum log-likelihood of the data for the cases and controls, respectively. Because the null hypothesis f i = 0 is on the boundary of parameter region f i > 0, the statistic i has an asymptotic distribution 1 2 I 0 + 1 2 2 1 when the number of pools is large according to [50]. We refer to the above methodforSNPidentificationasEM-SNP. 14 TestingforassociationsbetweenaSNPandaphenotypeincase-controlstudies WetestifaSNPisassociatedwithaphenotypeofinterestusingthelikelihoodratiotest again. Here we test the alternative hypothesis H 1 : f 1 ̸= f 0 versus the null hypothesis H 0 :f 1 =f 0 . Thisassociationtestisconductedbythelikelihoodratioteststatistic: =2[l(unrestricted ^ f 0 ; ^ f 1 ;^ ; ^ )−l(restricted ^ f;^ ; ^ )] H 0 ∼ 2 1 : (2.8) Thisstatistichasanasymptoticchi-squaredistributionwith1degreeoffreedom. 2.2.3 SimulationStudies We use simulations to evaluate our approaches for allele frequency estimation, SNP detection and test for association. A large range of parameter space is considered to see how different parameters affect the performance of our methods. These parameters include minor allele frequency (f), sequencing error rate (), the number of chromo- somes in each pool (K), the number of pools (G) and the relative risk for a disease (). Pooleddatageneration In our simulations, we set = and choose four starting values of = = 0:05%;0:1%;0:5%;1% corresponding to different sequencing error rates ranging from low to high. The sequencing error rate for current NGS technologies is around 1% and weexpectthatitwilldecreaseasthetechnologiesimprove. Therefore,wealsoconsider much lower sequencing error rate in our studies. For the allele frequency, we choose four valuesf = 0:1%;0:5%;1%;5%. Loci with minor allele frequencies above 5% are consideredtobecommon. Wewanttostudyifitispossibletoestimatetheminorallele frequency even if it is lower than the sequencing error rate. We let the read number 15 n = 1000 and 3000, which is around the sequencing depth in [1]. To study the effect ofthenumberofchromosomes,weletK = 50;100;200. Thenumberofpoolsissetat G=10;20;50. Since the sequencing error rate can vary from locus to locus and from one pool to another, we generate 1000 i (= i ;i = 1;··· ;1000) valuesfrom a normal distribution with mean equal to the starting values of, and variance equal to 0.1 times the starting values. Finally, we generate 1000 pooled data sets with each combination of the five parameters(K;G;n;f;)basedon i (= i );i=1;··· ;1000. Measuringtheaccuracyoftheallelefrequencyestimation Foreachofthe4×4×2×3×3=288combinations,wedothefollowing: 1. In the i-th simulation, we use the EM algorithm derived above to estimate the parameters (f;;), denoted as ( b f (i) em ;b i ; b i ). We also consider a naive estimate for the minor allele frequency as the fraction of observed minor alleles in the observedreads, b f (i) avg = ∑ G g=1 n 1g ∑ G g=1 n g : (2.9) 2. RepeatStep1)forR=1000times. 3. Compute the mean squared error (MSE) of ^ f em and ^ f avg from the true population minorallelefrequencyf, MSE( ^ f em )= 1 R R ∑ i=1 ( b f (i) em −f) 2 ; MSE( ^ f avg )= 1 R R ∑ i=1 ( b f (i) avg −f) 2 : 16 4. Compute the MSE of ^ f em and ^ f avg from the fraction of chromosomes carrying minorallelesf frac =C=(KG)inthepools, Cg( ^ f em )= 1 R R ∑ i=1 ( b f (i) em −f frac ) 2 ; Cg( ^ f avg )= 1 R R ∑ i=1 ( b f (i) avg −f frac ) 2 : We use both MSE and Cg to compare the accuracy of the EM algorithm with the naive approachofestimatingf bythefractionofreadscarryingtheminorallele. Generatingcase-controldatatostudythepowerofSNPidentificationandassocia- tionstudiesusingEM In order to evaluate the power of SNP identification using EM-SNP and test for associ- ation, we simulate case-control data as follows. When generating the control data, we assumethattheminorallelefrequencyisf andthatthelocusisunderHardy-Weinberg equilibrium. Whengeneratingthegenotypesofthecaseindividuals,weassumethatthe penetrance(theprobabilityanindividualisaffected)ofthegenotypes00,01and11are g 0 =0:01;g 0 ;and 2 g 0 ,respectively. Inoursimulations,wechoose=1:2;2;and4. We can use the case or control samples separately or combine them for SNP detec- tionasinthe“SNPidentificationusingEM”subsection. Forexample,weconsiderboth thecasesandcontrolsjointly. Thelog-likelihoodratiostatistic(or i ifweusecaseor controlsamplesseparately)definedinequation2.6isusedtotestifanobservedvariant is a true polymorphic locus or not. For a given type I error (= 0.05 in our study), we claim that the variant is truly polymorphic if > t , where t is the threshold corre- sponding to type I error . If the threshold can not be found theoretically, we can do parametric bootstrap to find the threshold. Firstly, assume that the variant site is not 17 polymorphic,estimatetheallelefrequencyanderrorrateusingthemaximumlikelihood approach. Secondly, generate the reads data as in our model a large number of times (R = 1000), and obtain the empirical distribution of . For a given type I error , we findtheupper percentilet . Finally,thenullhypothesisisrejectedifthevalueofis atleastt . Foragivenrelativerisk,werepeatthesesteps1000timesandthepoweris thefractionoftimesthatthelocusiscalledaspolymorphic. Similar approaches can be used to study the power of association studies using the poolingdesign. Fordetails,seeAppendixA. 2.2.4 APooledSequencingDataSetRelatedtoType1Diabetes Weuseourmethodtostudythepooledsequencingdatarelatedtotype1Diabetesdataset (T1D) in [1] and compare the results with current methods for SNP identifiction [11]. The data was generated using DNA samples of 480 T1D patients and 480 healthy con- trols, arranged in 20 DNA pools, with 48 patients/controls in each pool. Roche 454 sequencingwasusedtosequence144targetregionsthatcoverexonsandsplicesitesof 10candidategenes. WeuseMOSAIK(http://bioinformatics.bc.ed u/marthlab/Mosaik) to map the reads to the human reference genome (hg19) with parameters -hs 15 -p 12 -mmp 0.05 -act 26 -mhp 100 -bw 51 as recommended in its documentation. MOSAIK is a widely used reference-guided assembler that hashes the whole reference genome and locate information in the hash table using a ‘jump database’ [51–53]. Then we use SAMtools (http://samtools.sourceforge.net/) [54] to pileup the reads onto the target regions. We also remove indels and keep loci that are covered by at least one read in each pool. Finally, we use ANNOVAR [55] to annotate theidentifiedSNPs. 18 2.3 Results We first present our results on the effects of various parameters on the estimation accu- racy of the minor allele frequency using the EM algorithm. We then present the results on the power of SNP detection and association studies. Finally, we present our results on the analysis of the data in [1] using the approaches in the “Materials and Methods” section. 2.3.1 The Effects of Minor Allele Frequency, Sequencing Error Rate,NumberofIndividualsinthePoolsandNumberofPools ontheAccuracyofAlleleFrequencyEstimation WecompareourEMestimate ^ f em withthenaiveestimate ^ f avg forminorallelefrequency f. Table 2.1 gives a brief summary of the comparisons between these two methods. Each cell corresponds to the number of scenarios that the mean squared error (using either MSE or Cg) of ^ f em exceeds ^ f avg . It shows that ^ f em consistently outperforms ^ f avg wheneverf ≤0:1%orwhenf ≤1%and≥0:5%,whichcoversthetypicalsituations of rare variant studies under current NGS technologies. Moreover, the advantage of the EM method increases as allele frequency f decreases and as sequencing error rate increases, which is reasonable since it becomes more difficult for a naive estimate such as ^ f avg to distinguish true minor alleles from sequencing errors as allele frequency decreasesandsequencingerrorrateincreases. Ontheotherhand,theEMmethodshows greater superiority since it is specifically designed for the purpose. However, when the sequencing error rate is very low, for example, less than one out of 2000 andf ≥ 1%, thesimplenaiveestimationmethodworksreasonablywell. Figure2.1givesanexampleofacommonpooledsequencingsettingof start = 1%, n = 3000,K = 100,G = 10, and a minor allele frequency off = 1%. The upper left 19 =0:05% =0:1% =0:5% =1% MSE Cg MSE Cg MSE Cg MSE Cg f =0:1%; 0 0 0 0 0 0 0 0 f =0:5% 9 5 4 3 0 0 0 0 f =1% 13 10 9 7 0 0 0 0 f =5% 17 16 17 16 12 12 7 7 Table2.1: ScenarioswhereMSE em > MSE avg orCg em > Cg avg outof18totalscenarios foreachcell. panel shows that ^ f avg suffers from an evident over-estimation of bothf andf frac , while ^ f em appearstobeanunbiasedestimateoff. Theupperrightpanelshowsthehistogram of ^ f em over1000simulations. Thelowerleftpanelshowstheboxplotof ^ f avg −f frac and ^ f em −f frac , respectively. It shows that ^ f em −f frac centers around 0, which suggests that the variance of f frac might be responsible for the majority of the variance of ^ f em . The bar plot of the MSE for both ^ f avg and ^ f em as estimates of f and f frac in the lower right panelquantitativelydemonstratesthesuperiorityof ^ f em over ^ f avg intermsoftheirmean squarederrors. TheRelativeErrorsof ^ f avg and ^ f em inEstimatingMinorAlleleFrequencyf We measure the bias of an estimator by the relative error (RE) defined asRE = 100× | ^ f −f|=f, where ^ f is the mean of the estimates of f across all replications. The log valuesoftheREofall288simulationsforboth ^ f avg and ^ f em aregiveninFigureA.1. The figure shows that the number of reads n in each pool, the number of chromosomes K andthenumberofpoolsGhavelittleeffectontheREof ^ f avg ,whiletheallelefrequency f and the sequencing error rate play a dominant role in affecting RE. We further demonstrate the average effects of f and on the RE by computing the average RE basedonthevaluesoff andacrossalldifferent(n;G;K)inTable2.2. Itisinteresting toobservethatfixing,theREof ^ f avg decreaseslinearlywithf;whilefixingf,theRE of ^ f avg increases linearly with . Thus, ^ f avg suffers most severely in the case of rare 20 f frac f avg f em 0.000 0.010 0.020 0.030 f= 0.01 α start =0.01 f em Frequency 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0 20 40 60 n= 3000 K= 100 G= 10 f avg - f frac f em - f frac −0.010 0.000 0.010 MSE avg MSE em Cg avg Cg em 0e+00 4e−05 8e−05 MSE avg =mean((f avg - f true ) 2 ) MSE em =mean((f em - f true ) 2 ) Cg avg =mean((f avg - f frac ) 2 ) Cg em =mean((f em - f frac ) 2 ) Figure 2.1: Comparison of performances between ^ f em and ^ f avg , wheref = 1%; start = 1%;n = 3000;K = 100; and G = 10. The two box plots on the left compare ^ f avg and ^ f em asanestimateoff and ^ f frac ,where ^ f avg showsaconsiderabletendencyofover- estimation compared to ^ f em . The upper right histogram showshow ^ f em deviates fromf in 1000 simulations, which appears to be roughly unbiasedly distributed aroundf. The lowerrightbarplotisasummaryplotoftheMSEforthesetwomethodsasestimatesof f andf frac ,showingsuperiorityof ^ f em bothasanestimatoroff andoff frac ,intermsof MSEandCg. polymorphisms and high sequencing error rate. It can be seen from Table 2.2 that the REof ^ f em inestimatingminorallelefrequencyf issignificantlylowercomparedtothat of ^ f avg forrarepolymorphismsatf ≤1%. Nextwepresentourresultsfortheeffectsof(K;G;n;f;)ontheestimationaccu- racyofminorallelefrequencyf usingtheEMalgorithm. Wenotethattheestimationof 21 =0:05% =0:1% =0:5% =1% f RE ^ favg RE ^ fem RE ^ favg RE ^ fem RE ^ favg RE ^ fem RE ^ favg RE ^ fem 0.1% 52.0 9.4 102.0 15.6 502.0 72.0 1000.0 146.0 0.5% 10.3 4.0 20.2 4.9 99.5 13.3 199.0 26.5 1% 5.0 3.3 10.0 3.7 49.2 5.7 98.3 9.5 5% 1.0 5.4 1.9 6.0 9.1 6.7 18.1 6.3 Table2.2: TheaverageREof ^ f avg and ^ f em fordifferentvaluesofminorallelefrequency f andsequencingerrorrate. =f(0|1)ishighlyunreliable(datanotshown). Thisphenomenoncanbeexplainedas follows. When minor allele frequency f is low, the expected number of chromosomes having the minor allele in each pool is also low. When the number of poolsG is small, the estimation of can be difficult with a small number of chromosomes carrying the minor allele. Thus, we do not show detailed results on the estimation of. Despite the factthat cannotbereliablyestimated,theothertwoparametersf andcanbereliably estimatedusingtheEMapproach. The Effects of Minor Allele Frequency f and Sequencing Error Rate on the EstimationAccuracyof ^ f em To study the effects of minor allele frequency f and sequencing error rate on the estimation accuracy of ^ f em , as an estimator of both f and f frac , we fix (K;G;n) = (100;10;3000). The histograms of ^ f em under each combination of f and are shown in Figure A.2. We observe that ^ f em is roughly unbiasedly distributed aroundf, but the variance of ^ f em as an estimator of f is relatively large. The source of this variance, however,islargelyduetothevarianceoff frac ,ratherthanthealgorithmitself,asshown inFigureA.3,wherethehistogramsof ^ f em −f frac istightlydistributedaround0,withthe majority of the variance shown in Figure A.2 disappeared. This is an explicit evidence that the variance of ^ f em consists mostly of the variance of f frac . Thus, ^ f em might serve 22 better as an estimator of f frac than of f. We also observe as a general trend that ^ f em appearstobearoughlyunbiasedestimatorforbothf andf frac ,anditsvarianceappears to be affected less by but significantly by f. This observation is also confirmed in Table2.3whereMSEandCgfordifferentcombinationsoff andareshown. =0:05% =0:1% =0:5% =1% f MSE Cg MSE Cg MSE Cg MSE Cg 0:1% 9.8e-7 4.3e-8 9.7e-7 1.2e-7 3.2e-6 2.8e-6 1.1e-5 1.1e-5 0:5% 5.5e-6 1.9e-7 5.4e-6 1.9e-7 6.7e-6 1.5e-6 1.0e-5 5.8e-6 1% 1.2e-5 8.6e-7 1.2e-5 9.6e-7 1.3e-5 2.3e-6 1.6e-5 5.6e-6 5% 5.5e-5 1.3e-5 5.9e-5 1.7e-5 8.3e-5 4.3e-5 1.0e-4 7.0e-5 Table 2.3: The mean squared error (MSE) and Cg of ^ f em as an estimator for f or f frac forvariouscombinationsoff and,(K;G;n)=(100;10;3000). Toreducetheeffectofafewoutliersof ^ f em ontheMSEandCgcalculation,wealso modifiedthedefinitionsofMSEandCgbyremovingthetopandbottom%ofitsvalues and recalculate the values of MSE and Cg. The results on the modified measures are presentedinAppendixAandthequalitativeresultsontheperformanceof ^ f em continue tohold. We also studied the effects of (K;G;n) on the estimation accuracy of ^ f em and the details of the simulation results are given in Appendix A. It was observed that the accuracyincreaseswithGandnasexpected. However,theaccuracydecreaseswiththe numberofindividualsK ineachpool. 2.3.2 Results on the Power of SNP Calling Using the Likelihood RatioTest We next study the effects of (K;G;n;f;) on the power of SNP detection using the likelihoodratioapproachforthecaseandthecontrolsamples,respectively. Thenumber of reads in each pool (n) is set at either 1000 or 3000 as in the above simulations. We 23 0.00 0.01 0.02 0.03 0.04 0.05 0.70 0.80 0.90 1.00 f power n=1000 control n=3000 control n=1000 case n=3000 case 0.002 0.004 0.006 0.008 0.010 0.90 0.92 0.94 0.96 0.98 1.00 α power n=1000 control n=3000 control n=1000 case n=3000 case 50 100 150 200 0.90 0.92 0.94 0.96 0.98 1.00 K power n=1000 control n=3000 control n=1000 case n=3000 case 10 20 30 40 50 0.90 0.92 0.94 0.96 0.98 1.00 G power n=1000 control n=3000 control n=1000 case n=3000 case Figure 2.2: The power of detecting true SNPs at a type I error of 0.05, varying one parameter at a time while fixing all other parameters at default values. Default: (K;G;f;;)=(100;20;1%;0:5%;1:2). start from default values of the parameters (K;G;n;f;) = (100;20;1:2;1%;0:5%). Then we change one of these parameters and keep all the other parameters at default. Figure 2.2 shows the results for such a study and the results for using case and control samplestogetheraregiveninFigureA.8. It can be seen from Figure 2.2 that at a type I error rate of 0.05, the power is con- sistentlywellabove0.9inallscenariosdemonstratedhereexceptfortheextremelyrare variant case off = 0:1%. The power tends to increase with the minor allele frequency 24 orthenumberofpools,whileitdecreaseswithsequencingerrorrateornumberofindi- viduals in each pool. The power also increases with the number of reads in each pool. We observe that the power using the case samples is slightly higher than that using the controlsamples. Thisobservationcanbeexplainedbythefactthatthefrequencyofthe minor allele in the cases is higher than that in the controls, resulting in higher power of SNPdetection. 2.3.3 Resultson the Power ofAssociation Studies Using the Likeli- hoodRatioTest We also study the effects of (K;G;n;f;) on the power of detecting associations betweenSNPsandphenotypesusingthelikelihoodratioapproachforthecaseandcon- trol samples together. The parameter setting is similar to that in the above subsection exceptthatherewealsolettherelativerisktobe1.2,2,and4,respectively. Figure2.3 showshoweachparameteraffectsthepowerofdetectingtheassociation. ItcanbeseenfromFigure2.3thatatatypeIerrorrateof0.05,thepowerincreases withandapproaches1asgoesupto4,whichhappensinallscenariosdemonstrated here except for the extremely rare variant scenario of f = 0:1%. The power increases with allele frequency, pool size or number of pools, while it seems robust with respect tochangesinsequencingerrorrate. 2.3.4 ResultsontheAnalysisoftheType1DiabetesData AlleleFrequencyEstimationandSNPCallingintheControlSamples We apply our approach to analyze the pooled sequencing data in [1]. First, we conduct SNPcallingusingbothourmethodEM-SNPandSNVer(parametersetting-bq20-a0-f 0-p1-t0)[11],aprogramthathasbeenshowntooutperformseveralotherprogramsfor 25 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.2 0.4 0.6 0.8 1.0 λ power f=0.001 f=0.005 f=0.01 f=0.05 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.2 0.4 0.6 0.8 1.0 λ power α=0.0005 α=0.001 α=0.005 α=0.01 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.2 0.4 0.6 0.8 1.0 λ power K=50 K=100 K=200 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.2 0.4 0.6 0.8 1.0 λ power G=10 G=20 G=50 Figure 2.3: The power of detecting associated SNPs at a type I error of 0.05. Each subplot displays the effect of one parameter and the number of readsn (black indicates n = 1000, and blue indicatesn = 3000) on the power of the test, while fixing all other parametersatdefaultvalues. Default: (K;G;f;)=(100;10;1%;0:5%). SNP calling including CRISP, SAMtools, and GATK. Unlike many previous programs calling variants as SNPs or not, SNVer ranks variants according to their potential of beingtrueSNPsusingthep-valuedefinedintheprogram. Asalikelihoodratiotest,EM- SNP can also rank the variants by the magnitude of the likelihood ratio. The estimated allele frequency spectrum of the top 100 called SNPs by either EM-SNP or SNVer is given in Figure A.9. Note that 5 variants have dominant non-reference alleles and they are excluded from both lists for a fair comparison. In the SNVer list, we also exclude the variants that are removed in the preprocessing stage of EM-SNP. Both frequency 26 spectrums of the variants called by EM-SNP and SNVer tend to concentrate on the lowerfrequencyrange. EvaluationoftheSNPCallingResults A standard approach to evaluate the effectiveness of a SNP calling method is to com- pare the fraction of dbSNPs [56] among the top ranked SNPs, defined as the dbSNP ratio. The rationale is that if a SNP calling method is reasonable, it should be able to detect the SNPs that are already in the dbSNP database because these SNPs have been reportedinpreviousstudies. Therefore,ahigherdbSNPratioindicatespotentiallybetter performance of the SNP calling method. Figure 2.4 shows the cumulative dbSNP ratio of the top 100 called variants whose minor allele frequencies are less than a threshold. To further demonstrate the effect of minor allele frequency on the performance of EM- SNP and SNVer, we also show the dbSNP ratio in different windows of minor allele frequenciesinFigureA.10. In terms of the dbSNP ratio for the top 100 called variants, EM-SNP consistently outperformsSNVerunderallallelefrequencythresholds,andEM-SNPdisplayssignifi- cantsuperiorityespeciallyforlowfrequencyvariants. InTableA.7,wegiveanexample of the dbSNP ratio among the top 100 SNP calls with f em ≤ 0:2% for the two meth- ods. Usingatotalof480controlsamples,EM-SNPidentifiesvariantswithminorallele frequency less than 0.2% with a high dbSNP ratio, which serves as an evidence of its superior performance in rare variant scenarios. On the other hand, the upper left panel of Figure A.10 shows that the relative performance of EM-SNP and SNVer based on dbSNP ratio depends on minor allele frequency. EM-SNP detects more rare variants and has higher dbSNP ratio at minor allele frequency less than 1%. Whereas this rel- ative performance of EM-SNP and SNVer is reversed for minor allele frequency above 1%. Thus,EM-SNPismostusefulindetectingrarevariants. 27 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.2 0.4 0.6 0.8 dbSNP ratio threshold dbSNP ratio with freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0 1 2 3 4 Ti/Tv ratio of all variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0 1 2 3 4 5 Ti/Tv ratio of known variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.5 1.0 1.5 2.0 Ti/Tv ratio of novel variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer Figure 2.4: dbSNP ratio and Ti/Tv (transition/transversion) ratio of the top 100 vari- ants called by EM-SNP and SNVer, whose minor allele frequencies are less than the correspondingthresholdlabeledbythex-axis. AnothercriteriontoevaluateSNPcallsisthetransition-transversion(Ti/Tv)ratio. It is well known that transitions are much more frequent than transversions in evolution, andthenumberoftransitionsoverthenumberoftransversions,referredtoasTi/Tvratio, inknownSNPsisexpectedtobebetween2and4[57]. Figure2.4showsthatEM-SNP gives a consistently higher Ti/Tv ratio throughout the entire allele frequency range for boththewholesetofcalledvariantsandthenovelset. Fortheknownvariants,theTi/Tv ratiotrendsofthetwomethodsaresimilartoeachother. TableA.8givesanexampleof theTi/Tvratioamongthetop100SNPcallswithf EM < 0:2%byEM-SNPandSNVer. 28 TheeffectofminorallelefrequencyontherelativeperformanceofEM-SNPandSNVer intermsofTi/TvratioissimilartothatintermsofdbSNPratio(FigureA.10). Wealsoconsiderthetop150rankedSNPsandthecorrespondingfiguresandtables areshownasFigureA.11-A.12andTablesA.7-A.8. Thequalitativeconclusionsarethe same. 2.3.5 IdentifyingSNPsAssociatedwithType1Diabetes Next we study the association of the identified variants with type 1 diabetes (T1D). We first look at the common SNPs with estimated minor allele frequencies above 1% in the controls as in [1] and want to see if we can identify the common SNPs associated withT1D.Withtheestimatedallelefrequencies,wecanestimatethenumbersofminor alleles in the controls and in the cases separately. Based on the estimated counts, we obtain a preliminary p-value based on Fisher’s exact test as in [1]. However, the p- valueobtainedthiswayisnotaccurateasitdoesnotconsiderthevariationinestimating the allele frequencies using the EM algorithm. Therefore, we use the likelihood ratio statistic defined in equation 2.8 to calculate another p-value that is given in the last column in Table 2.4, where we only list SNPs with preliminary p-value less than10 −5 . The p-value obtained through the likelihood ratio test reflects the true p-value better becauseittakesthevariationinestimatingtheallelefrequencyintoaccount. The SNP rs3184504 residing within gene SH2B3 has an EM p-value of 8:4e−7. This SNP was also found to be associated with a preliminary p-value of 5e−7 in the originalstudy[1]andthecorrespondinggenewaspreviouslyidentifiedtobeassociated with T1D. The fractions of observed minor alleles in controls and cases in the original study were 53% and 41.5%, respectively, and our mapping results are consistent with the estimates. The EM estimated minor allele frequencies in the controls and cases are 52% and 41%, which are slightly smaller than the observed values since we took 29 SNP Gene ^ f 0 n 0 ^ f 1 n 1 Fisher’sp-value EMp-value rs3184504 SH2B3 0.52 499 0.41 394 1.9e-6 8.4e-7 rs7076103 IL2RA 0.19 178 0.10 93 4.5e-8 2.7e-7 rs2476601 PTPN22 0.09 86 0.16 151 8.1e-6 9.2e-6 Table 2.4: Testing for association between common SNPs ( ^ f 0 ≥ 1%) and T1D with Fisher’s preliminary p-value at most e − 5. In the table, ^ f 0 and ^ f 1 are the estimated minorallelefrequenciesusingtheEMalgorithminthecontrolsandcases,respectively, andn i =⌊960× ^ f i ⌋; i=0;1. Fisher’sp-valueisthepreliminaryp-valueusingFisher’s exacttestasin[1]andtheEMp-valueisthep-valuecalculatedbasedonthelikelihood ratiostatisticinequation2.8. sequencing errors into account. The SNP rs7076103 within the gene IL2RA has a pre- liminary Fisher’s p-value of 4.5e-8 and an EM p-value of 2.7e-07. This SNP was not reported to be associated with any phenotypes according to the catalog of GWAS stud- ies (http://www.genome.gov/gwastudies/) to date. Nevertheless, other SNPs within the IL2RAgenewerefoundtobeassociatedwithT1DbyCooperetal.[58]beforethepub- licationof[1]andbytwootherstudies[59,60]afterthepublicationof[1]usingdifferent approaches. Barrett et al. [59] used genome-wide association studies giving a p-value of 1.0e-13 while Huang et al. [60] used imputation of the genotypes based on the 1000 genomeprojectsyieldingap-valueof5e-9. Thesenewstudiessupportourresultonthe significant association of rs7076103 with T1D. The estimated minor allele frequencies in the controls and cases were 18.8% and 14.8% in [1] giving a p-value of 0.02 which is not significant after adjusting for multiple testing. This example shows that even for common polymorphisms, the EM approach can help to find likely associations that the naive approach can not. The SNP rs2476601 was found to be associated with T1D in severalstudiesbeforethepublicationof[1]andwasconfirmedinarecentstudyin[61] that was published after the publication of [1]. All these studies support our results for theassociationofcommonpolymorphismswithT1D. 30 For rare polymorphisms ( ^ f 0 < 1%) within the controls, we first use the naive approach described above to obtain a preliminary Fisher’s p-value for every SNP. Due to the low minor allele frequencies of the rare variants, none of the p-values is smaller than0.001. WedidnotpursuetheassociationofindividualvariantswithT1Dfurther. 31 Chapter3 FindingGeneticOverlapsamong DiseasesBasedonRankedGeneLists 3.1 TheGeneralSetup Tounderstanddiseaserelationshipsintermsoftheirgeneticmechanisms,itisimportant to study the common genetic basis between different diseases. Although discoveries on pleiotropic genes related to multiple diseases abound, methods flexibly applicable to various types of datasets generated from different studies or experiments are needed to gain insights into the genetic relationships among a large number of diseases. We develop a set of genetic similarity measures to gauge the genetic overlap between dis- eases, and also several estimators of the number of overlapping disease genes between diseases. These methods are based on ranked gene lists so that they could be robustly applied to different types of data. We first define and investigate the performance of a genetic similarity measure for evaluating the similarity between human diseases in simulation studies. Then we apply the methods to diseases from the OMIM database. We show that our proposed genetic measure achieves superior performance comparing with simpler methods in explaining phenotype similarities between diseases. Further- more,weidentifycommongenesunderlyingthegeneticoverlapbetweendiseasepairs. With an example of five vision-related diseases, we demonstrate how our methods can provide insights into the relationships among diseases based on their shared genetic mechanisms. 32 3.2 Methods We propose two sets of statistics based on ranked lists to detect and measure the extent ofgeneticoverlapbetweendiseasesgivenalistofgenesrankedaccordingtotheimpor- tance of their contributions to each disease: scan statistics K r and S M , and weighted sumstatisticsWeiSumEandWeiSumV. Fig 3.1 is a schematic outline of the workflow of our methods. Given two diseases A and B, we obtain for each disease a list of genes ranked by their strength of asso- ciation to the disease, regardless of the types of experiments that produce these ranks. Then based on our proposed genetic similarity measures, we compute the value of dis- similarity between diseases A and B, and similarly for all other disease pairs, giving a dissimilarity matrix. Finally we identify the common genes contributing to the genetic overlap between each disease pair and depict the pattern of the genetic overlap among the diseases of interest. In addition, we propose three estimators for the number of overlappingdiseasegenes. 3.2.1 TestingwhetherGeneticOverlapExitsUsingScanStatistics ABasicStatisticandItsExtensions. Givenn genes and the ranked list of these genes according to their relationship to each oftwodiseases,wefirstintroduceabasicstatisticthatmeasurestheextentofthegenetic overlapbetweenthetwodiseases,thenproposesomeextensionsonthisstatistic. Supposethroughsomepreviousstudieswearegiventworankedlists,whereineach list genes are ranked decreasingly based on their contributions to a disease. To address the question of whether the two diseases have a common genetic foundation in their respective etiologies, or in other words a significant “genetic overlap”, we develop a basicstatistictodescribetheextentofthis“geneticoverlap”asfollows. LetG d (k);d = 33 Ranked Gene List A Ranked Gene List B Disease A Disease B expriments expriments Dissimilarity Measure (A,B) (A,B) Dissimilarity Matrix Genetic Overlap Pattern (A,B) Figure 3.1: Outline of the methods. Given two diseases A and B, we obtain for each disease a list of genes ranked by their strength of association to the disease, regardless of the types of experiments that produce these ranks. Then based on our proposed geneticsimilaritymeasures,wecancomputethevalueofdissimilaritybetweendiseases A and B, and similarly for all other disease pairs, giving a dissimilarity matrix. The dissimilarity matrix provides a quantitative estimate of the genetic overlapping pattern among a set of diseases, which is readily applicable to further analysis. Furthermore we can identify the genes underlying the genetic overlaps and provide an overlapping patternforasetofdiseasesofinterest. 1;2 denote the set of topk ranked genes in listd corresponding to thed-th disease. We proposethefollowingstatistic: K =min{k :G 1 (k) ∩ G 2 (k)̸=}; 34 where is the empty set. If two diseases have genetic overlaps, K is expected to be small. Underthenullmodelthatthetworankedlistsarerandomshufflingofthegenes, thep-valuecanbedefinedas p−value=P 0 (K ≤k); wherek istheobservedvalueofK forthetwodiseases. This formulation of the null is equivalent to fixing the order of then genes for one disease,randomlyorderthegenesfortheotherdiseaseandlet K =min{k : min 1≤i≤k {O i }≤k}; where O i is the ranking order of the i-th gene for the second disease. Therefore, K follows the distribution of a shuffled order of one sequence under the null hypothesis, andthep-valueis: P 0 (K ≤k)=1− k−1 ∏ i=0 ( 1− k n−i ) : (3.1) DerivationofEquation3.1isgiveninSectionB.2.1. We can further extend the above studies tor ≥ 1 overlaps. LetK r be the minimum number of top ranked genes needed to observe r overlapping genes. We use K r as a statistic to test the hypothesis that the two gene lists have overlaps versus the null hypothesisthatthetworanksarenotrelated(randomlyshuffled)[62,63]. Similarly,the p-valueis: P 0 {K r ≤c}=1− r−1 ∑ c=0 ( k c ) ∏ c−1 i=0 (k−i) ∏ 2k−c−1 j=k (n−j) ∏ k−1 l=0 (n−l) ; (3.2) wherecistheobservedvalueofK r . DerivationofEquation3.2isgiveninSectionB.2.2. 35 As in the use of r-scan statistics to locate genes or transcription factor binding site clusters[62–66],thereisnogoldstandardfordeterminingtheoptimalvalueofr forthe K r statistics,andactually,theoptimalvalueofrmaydependonthediseasesofinterest. We may fix a pre-specified value ofr and use the statisticK r as a test statistic. For a fixed type I error, we can find a thresholdk r () to be the maximumc such that the p-value defined in Equation 3.2 is less than or equal to. We reject the null hypothesis that the two genetic diseases have no overlapping genes if the observedK r is less than or equal tok r (). This test is valid in the sense that it has a controlled type I error rate and it should have reasonable power under the alternative hypothesis, since K 1 should besmallandinturnallK r ’sshouldbesmallerthanexpectedunderthenullmodel. In reality, we do not know the true number of overlapping genes, which makes it hard to fixr in advance. In order to estimater from the data rather than pre-specifying it,weproposeanalternativeteststatisticasfollows. AnAlternativeTestStatistic. Underthenullhypothesisthatthetwodiseasesdonothaveoverlappinggenes,weexpect that the p-value using K r , p r , will approximate a uniform distribution within the unit interval [0, 1] although they are not independent for given diseases. On the other hand, under the alternative hypothesis that the two diseases do have overlapping genes, the value of p r is expected to decrease until r reaches the true number of overlaps. Thus, we can choose the smallest p-value among p 1 ;p 2 ;··· ;p M where M is the maximum numberofgenesweconsider. Let R M =arg min 1≤r≤M p r : (3.3) 36 Note that R M is a random variable. The alternative statistic we use to test the null hypothesisversusthealternativehypothesisis S M =P(K R M ≤k r M ); (3.4) wherek r M istheobservedvalueofK R M fortherealdata. ItisimportanttonotethatS M cannotberegardedasp-valueofthetestthoughsince we are taking the minimum of allp r for1 ≤ r ≤ M. Instead it should be regarded as a teststatistic. SinceS M followsthesamedistributionforagiventotalnumberofgenes(n)andthe maximum number of genes to consider (M) under the null model, we may obtain the distributionofS M firstusingsimulationsasdescribedinSectionB.2.3. 3.2.2 TestingwhetherGeneticOverlapExitsUsingWeightedSums Section 3.2.1 proposed two sets of test statistics based on the number of overlapping genes up to a certain position to gauge the overlap between two ranked lists. In this sectionwewilldiscussadifferentsetofteststatisticsthattakethecompleteoverlapping patternacrosstworankedlistsintoaccount. Yangetal.[45]proposedasimilarityscore (OrderedList), defined as the weighted sum of the number of overlapping genes X i on topofthelistsuptoeachpositioni,withanexponentiallydecreasingweighte −i ,which werefertolaterasWeiSumO: WeiSumO= n ∑ i=1 e −i X i : However,itisnotclearhowtochoosetheparameter forrealstudies. 37 Based on this idea, we propose two weighted sums that normalize the number of overlapping genes X i by its expectation (WeiSumE) or by its standard deviation (WeiSumV): WeiSumE= n ∑ i=1 e −i X i EX i ; (3.5) WeiSumV = n−1 ∑ i=1 e −i X i (X i ) : (3.6) where EX i = i 2 n and (X i )= i(n−i) n √ n−1 ; sinceX i followsahypergeometricdistributionHyper(i;i;n). Each of the three definitions of weighted sum statistics generates a weighted sum corresponding to a certain value, therefore giving three sets of statistics with respect to a set of possible values. Simulations can be used to obtain the null distributions andp-valuep foreach. Notethatwhen =0,WeiSumEsimplifiestoWeiSumE ∗ as definedinEq3.11. Similarly, as in Section 3.2.1, for each of the three weighted sums we also propose onesinglestatisticacrossaseriesof’sbychoosingtheminimump-value: p m =min p ; wherep isthep-valuegivenbyaweightedsumstatisticwithaspecific. For the choice of ’s we adopt the default series of values in the R package OrderedList [45,67]. There they set a minimum weight to 10 −5 , and the magnitude of determines up to what position of the list the assigned weighte −i = 10 −5 . Then 38 they set a series of positions i = 100, 150, 200, 300, 400, 500, 750, 1000, 1500,2000, 2500,andtakethecorresponding =−log(10 −5 )=iasthedefault series. Inaddition, wealsoadd =0,imposingnoexponentialdecay. To determine p ’s and the p-value of the single statistic p m for each weighted sum (denotedasWeiSumX)outofWeiSumO,WeiSumEandWeiSumVforagivenlistpair, wefollowtheproceduredescribedinSectionB.2.4. 3.2.3 EstimatingtheNumberofOverlappingDiseaseGenes Todeterminethenumberofoverlappingassociatedgenes,wemaylookatR M asdefined in Equation 3.3, which givesthe minimum p-value, and takethe number of overlapson topofthelistsuptoK R M asanestimator: ^ o 1 =X K R M : (3.7) Notethat^ o 1 mightnotalwaysequalR M sincethenumberofoverlapsactuallyobserved inthetopK R M genesmaybelargerthanR M . In addition, we propose a few alternative approaches to determine the number of overlappinggenes. Foranygeneg,definetheBernoullitrialB g =1ifthegeneranksamongthetopK r genesforbothdiseases,andB g =0otherwise: O = n ∑ g=1 B g ; whereO isthenumberofoverlappinggenesinthetopK r oftherankedlists. 39 LetP 0 n = P 0 (B g = 1), the probability of a gene ranked among topK r among both lists of length n, under the null hypothesis H 0 that the two diseases do not overlap. Then: E H 0 O =nP 0 n =n ( K r n ) 2 : Also, the expected position on the list where the first overlap occurs under the null modelis: E H 0 K 1 = n ∑ k=1 P 0 (K 1 ≥k): (3.8) Intuitively, a true overlapping gene associated with both diseases is more likely to occurasanoverlapbeforearandomoverlapoccurs, aswescrolldownthelists. There- fore,wecanestimatethemaximumpositiononthelistwherethereisexpectedtobeless than one random overlap, and take the number of overlaps observed up to that position asanestimatorofthenumberofoverlappinggenes: ^ o 2 = min 1≤r≤100 {r :E H 0 O =K 2 r =n> 1}−1 = min 1≤r≤100 {r :K r > √ n}−1; (3.9) orthenumberofoverlapsobservedearlierthanthepositionwhereafirstrandomoverlap isexpectedtooccurunderthenullmodel: ^ o 3 = min 1≤r≤100 {r :K r >E H 0 K 1 }−1: (3.10) 40 3.3 Results 3.3.1 SimulationResultsSummary Weconductsimulationstudiestoevaluatetheperformanceofourmethodsunderdiffer- ent conditions and compare the best performing statistic from each category with each otherinTable3.3.1. ThesimulationproceduresandresultsareinSectionB.1.3. n 1000 6000 11000 25000 K 1 32 56 63 63 S 1 32 56 63 63 WeiSumE ∗ 27 46 66 94 p m 32 53 64 73 Table 3.1: Comparison of different statistics in simulation studies. For each length of gene listn, the number of parameter settings out of a total of 144 different settings that amethodshowsthehighestpower,includingties,takingp-valuethresholdof0.05. We find that one of our proposed statistics, WeiSumE ∗ , excels in detecting genetic overlapbetweenlongergenelists,especiallyforn=25000,whichisaboutthenumber of genes in the full human genome, therefore is most suitable for the study of human diseases: WeiSumE ∗ =n n ∑ i=1 X i i 2 ; (3.11) whereX i isthe number of overlappinggenes between two rankedgene lists among the top i ranked genes, and n is the total number of genes in a list. This is a special case ofthesetofWeiSumEstatisticsasexplainedintheMethodssection. However,wewill refertoitasWeiSumEintheResultssectionforconvenience. Table 3.3.1 gives an example of the mean squared error of different estimators of the number of overlapping genes in one parameter setting. It shows that ^ o 3 gives the 41 mostreliableestimationofthenumberofdiseasegenes,especiallyforhumandata(n∼ 25;000). 2 0:1 n(×1000) 1 6 11 25 ^ o 1 (M =5) 7.92 9.76 10.61 11.78 ^ o 1 (M =10) 9.46 12.68 13.88 15.79 ^ o 1 (M =20) 26.62 30.39 33.17 36.65 ^ o 2 2.52 3.47 4.09 5.03 ^ o 3 2.67 3.44 3.74 4.35 Table 3.2: The mean squared error (MSE) of different estimators of the number of overlapping genes when 2 = 0:1. 2 is a parameter describing the accuracy of the rankingalgorithm. Thelarger 2 is,thelessaccuratetherankingalgorithmis. n×1000 isthetotalnumberofgenesinthegenome. M isthecutoffpositionfora^ o 1 ,whereonly uptoM overlappinggenesareconsidered. 3.3.2 GeneticOverlapofOMIMDiseases We apply our proposed genetic similarity measures to study diseases from the OMIM database (http://www.ncbi.nlm.nih.gov/omim) and use ENDEAVOUR (2.44) [68, 69] (http://homes.esat.kuleuven.be/ ∼ bioiuser/endeavour/index.php) to rank the human genes. ENDEAVOURisageneprioritizationtoolwithvariousoptionsofdata-sources. To ensure the reliability of ranking, we consider diseases with at least five associated genes recorded in the OMIM database, which gives a total of 89 diseases. Then for eachdisease,wetakeitscorrespondingOMIMgenesasthetraininggenestoinputinto ENDEAVOURincludingalldata-sourcesprovidedandrankallthehumangenes. ConsistencyamongDifferentPhenotypeSimilarities. Beforewemoveontostudyingthegeneticrelationshipsamongthe89OMIMdiseases, itisinterestingtolookattheirphenotypesimilarityfirst. Weinvestigatethreephenotype similarity measures: HPO [70], MimMiner [71] and another similarity matrix provided 42 in Lage et al. [72]. Fig B.1 shows the rank of one similarity measure score versus another. Note that for each pair of similarity measures, we only display disease pairs available in both phenotype similarity matrices. As is shown in the figure, the three phenotypesimilaritymeasuresweaklycorrelatewitheachother,demonstratingthelack ofconsistencyinsuchproduceddiseaserelationships. ExplainingPhenotypeSimilaritywithGeneticSimilarity. Seeingthatconsistencyamongdifferentphenotypesimilaritiesislow,weproceedtoask howwellthegeneticsimilaritymeasureWeiSumEcanexplainthephenotypesimilarity, taking MimMiner as an example. For comparison, we also consider two other simpler measures of genetic similarity to see if the phenotype similarity can be explained with them as well. The first one (#OMIM) is simply the number of genes present in the OMIMdatabaseforbothdiseases. Anothermeasure(HyperP)isthep-valueofahyper- geometric test using a fixed cutoff, which is the probability of observing at least the number of overlapping genes between the two lists up to a fixed cutoff position under thenullhypothesisthatthetwolistsarerandompermutationofeachother. Werankthe WeiSumE genetic similarity, #OMIM similarity, HyperP similarity and the MimMiner phenotypesimilarityofalldiseasepairs,andcompareallthreegeneticoverlapmeasures with MimMiner to obtain the average density fold change of each method, as shown in Fig 3.2. It shows that WeiSumE performs the best out of the three measures of genetic similarity to explain the MimMiner phenotype similarity without having to set a cutoff thresholdasinHyperP. Next we look into the disease pairs where genetic similarity and phenotype simi- larity disagree. For example, the disease pair peeling skin syndrome and hypotrichosis 2 (HYPT2) displays very low MimMiner phenotype similarity but very high genetic similarity. According to OMIM description, both diseases are caused by mutations in 43 #OMIM HyperP_10 HyperP_20 HyperP_50 HyperP_100 HyperP_200 HyperP_500 HyperP_1000 WeiSumE Average Density Fold Change 0.0 0.5 1.0 1.5 2.0 2.5 WeiSumE HyperP #OMIM Figure 3.2: Performance of WeiSumE explaining the MimMiner phenotype similarity, compared with two other simpler methods #OMIM and HyperP with different cutoff positions. The average fold change is used as a measure for performance here, detailed definitions and explanations are in Section B.1.1. In particular, HyperP similarity per- formanceislistedwithvaryingcutoffpositions10,20,50,100,200,500,1000. the CDSN gene, where peeling skin syndrome is caused by a homozygous mutation and HYPT2 is caused by a heterozygous mutation. Therefore, despite the lack of phe- notype similarity, this pair of diseases do share a common genetic basis. A contrary example is the pair of maturity-onset diabetes of the young (MODY) and noninsulin- dependent diabetes mellitus (NIDDM), which shows high MimMiner phenotype simi- laritybutverylowgeneticsimilarity. MODYisanautosomaldominantformofdiabetes caused by insulin secretion defects, while NIDDM is a polygenic disease characterized by insulin resistance. Therefore, these two subtypes of diabetes must be treated differ- entlydespitetheirresemblanceinphenotypes. Theseexamplesofdiscrepancybetween geneticandphenotypesimilaritydemonstratetheimportanceofrelatingdiseasesbased 44 on their genetic mechanisms, since effective drug treatments and therapeutic interven- tionsaddresstheunderlyingmechanismsofdiseases,ratherthanphenotypicsymptoms. IdentifyingCommonGenesunderlyingtheGeneticOverlap. Having studied the ability of our proposed genetic similarity measure to explain the phenotype similarity, we attempt to further identify the common genes underlying the respectivegeneticmechanismsofthetwodiseases. Forthispurposeweproposeseveral estimators of the number of common disease genes based on ranked gene lists in the Methods Section, and take the estimated number of overlapping genes from the top of liststobethecommongenesunderlyingthetwodiseasemechanisms. Fromsimulation studiesresultsshowninTable3.3.1,wewillusethe ^ o 3 estimator. Inviewsofthegeneticoverlapsizesofdiseasepairs,asshowninFigB.2,themajor- ity of disease pairs share genetic overlap to some extent, mostly around 1 to 16 over- lappinggenes. Thepleiotropyofgenes,ontheotherhand,aremoderate: outofthefull human genome that we ranked, 2920 genes are considered pleiotropic by our method. The distribution of pleiotropy extent of the genes is shown in Fig B.3. We find that the extentofpleiotropyvariessignificantly. Somegenesareextremelypleiotropicandrelate to a vastly diverse set of diseases. The most pleiotropic gene is CREBBP (CREB bind- ingprotein). Thisgeneplaysimportantrolesinmanybiologicalactivities. Forinstance, CREBBP may function as tumor suppressor gene or oncogene in prostate cancer, and may serve as potential therapeutic target [73]; Mutations in CREBBP result in neural tube defects in mice [74] and modest associations were also observed in human [75]. Bothresultswerere-confirmedinourstudy. Nextwetake5eye-relateddiseases: leberopticatrophy,maculardegeneration(age- related, 1, ARMD1), cataract (autosomal dominant), retinitis pigmentosa (RP), color- blindness(partial,deutanseries,CBD),asanexampleanddemonstratehowourgenetic 45 Genetic Dissimilarity of 5 Eye−related Diseases LEBER OPTIC ATROPHY ARMD1 CATARACT RP CBD LEBER OPTIC ATROPHY ARMD1 CATARACT RP CBD 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Figure 3.3: A heatmap of the WeiSumE dissimilarity quantile matrix among 5 eye- relateddiseases. overlap measures explain their common genetic basis. We plot in Fig 3.3 a heat map showing the dissimilarity quantile matrix of the 5 diseases measured by WeiSumE dis- similaritymeasure,whereeachvalueisthefractionoftheWeiSumEdistributionshown in Fig B.4 among all disease pairs that is less than the dissimilarity value of the given disease pair. This figure provides clear evidence that Cataract (autosomal dominant), Retinitis pigmentosa (RP) and Colorblindness (partial, Deutan series, CBD) share a substantial common genetic basis, while Leber optic atrophy and Macular degenera- tion,age-related,1(ARMD1)appeartobegeneticallymoredistanttoanyotherofthe5 diseases. Tofurtherinvestigatethegenesthatcontributetotheirgeneticoverlap,Fig3.4shows aVenndiagramofthenumberofgenesunderlyingthegeneticoverlapamongthe5dis- eases. There is no gene underlying all 5 eye-related diseases, conforming to the pattern shown in Fig 3.3 that Leber optic atrophy and ARMD1 are genetically distant to the 46 other 3 diseases. However, it is interesting that 2 genes are identified to be underly- ing the genetic overlap among 4 diseases, each contributing to Leber optic atrophy or ARMD1 and the closely related 3 diseases, despite the low genetic overlap of Leber optic atrophy or ARMD1 and the three. Under close investigation, we find that these 2 genes are CDH23 (cadherin-related 23) and COL7A1 (collagen, type VII, alpha 1), both widely pleiotropic, residing within the very tail of the pleiotropy extent distribu- tioninFigB.3. CDH23,contributingtoLeberopticatrophyandthe3closediseases,is underlying the genetic overlap of 171 disease pairs; COL7A1, contributing to ARMD1 and the 3 close diseases, is underlying the genetic overlap of 1085 disease pairs. Muta- tions within both genes are shown in literature [76,77] that may cause vision loss. In contrast, the 32 genes underlying the genetic overlap of Cataract, RP and CBD include genes that are more specific to the three diseases, therefore contributing more signif- icantly to their genetic overlap. One example is the gene PDE6A (phosphodiesterase 6A, cGMP-specific, rod, alpha), contributing to the genetic overlap among the 3 dis- eases exclusively. According to NCBI records, this gene encodes a subunit of a key phototransduction enzyme and participates in processes of transmission and amplifica- tionofthevisualsignal. 47 Figure 3.4: A Venn diagram of the genetic overlap among 5 eye-related diseases. This Venn diagram shows the number of overlapping genes among any subset of diseases fromthe5eye-relateddiseases. 48 Chapter4 SummaryandFutureDirections 4.1 A Unified Approach for Allele Frequency Estima- tion, SNP Detection and Association Studies Based onPooledSequencingDataUsingEMAlgorithms InChapter2,wedevelopedanEMalgorithmbasedunifiedapproachforminorallelefre- quencyestimation,SNPcallingandassociationstudies,applicabletopooledsequencing datawheregeneticmaterialsofmultipleindividualsarepooledtogether. Thisstudydif- fers from previous studies in that we estimate sequencing error rate for each position whilepreviousstudiesgenerallyassumeapre-specifiedsequencingerrorrateacrossall sequenced regions. Since sequencing error rate depends on the genomic context, it is essential that the sequencing error rate be estimated specifically for different loci. In a pooling design without tagging, the origin of the reads is not known, and it is impossi- bletoobtaintheindividualgenotypesfromthepooleddata. Therefore,wemodelledthe pooledsequencingdataasa“missingvalue”problemanddesignedanEMalgorithmto estimatetheminorallelefrequencyandsequencingerrorrate. Wefirststudiedtheeffectsofminorallelefrequency,sequencingerrorrate,number of pools, number of individuals in each pool, and the sequencing depth in each pool, on the estimation accuracy of the minor allele frequency. It was shown that the naive approach,whichestimatestheminorallelefrequencybythefractionofobservedminor alleles in the reads, can significantly over-estimate the true minor allele frequency, and 49 that the effect is most severe for rare variants. The EM based algorithm, on the other hand,canestimatetheminorallelefrequencyinarelativelyunbiasedmanner. Although the variance of this estimator seems to be relatively large, a major part of the variation comes from the sampling of individuals from the population rather than the algorithm itself. We also show that the estimation accuracy of the EM algorithm increases with thenumberofpoolsandsequencedepthasexpected. However,theestimationaccuracy decreaseswiththenumberofindividualsineachpool,mostlikelybecausemoreexten- sive pooling induces greater loss of information. Secondly, we used a likelihood ratio statistic based on the estimated parameters from EM to call SNPs. With the real data from [1], in terms of the dbSNP ratio, we showed that EM-SNP outperforms SNVer for rare variants with minor allele frequency less than 1%. We also showed that the transition/transversion ratio of the called SNPs for rare variants based on EM-SNP is higher than that of the called SNPs by SNVer. These two independent pieces of evi- dence demonstrate that EM-SNP is superior to SNVer in the discovery of rare variants. However,theextentofthisadvantagedecreasesasminorallelefrequencyincreasesdue to the tradeoff between EM-SNP’s bias adjustment for the estimation of minor allele frequenciesandextravariationintroducedintheEMalgorithm. Finally,weappliedour approach to reanalyze the case-control data from [1] and showed that we can find the associated common SNPs. Unfortunately we did not find any significantly associated rare variants. One possible explanation is that the power of finding rare variants asso- ciated with complex traits is generally low as a consequence of the low frequencies of minoralleles. We made several simplifying assumptions in our study. First and foremost, we did not consider errors introduced by mapping the reads to the reference genome. The mapping of Roche 454 data still has many challenges, in particular, in regions around 50 homopolymers, and further development of algorithms for mapping is needed. Sec- ondly,althoughweassumedthattheamountofgeneticmaterialfromeachindividualis thesameforeachpool,thisassumptioncanbeviolated. Toovercomethisproblem,one approach would assume that the fractions of genetic material from individuals follow a Dirichlet distribution [17]. Thirdly, the called SNPs by EM-SNP still have many false positives since the Ti/Tv ratio for the called novel SNPs is still low compared to the known SNPs. Further improvements in SNP calling are needed. Finally, the computa- tionalspeedoftheEMbasedapproachcanberelativelyslow,andthemethodcannotbe appliedtowholegenomeassociationstudiesalthoughthisisnotaproblemfortargeted sequencingstudiesasin[1]. Thesearethetopicsforfutureresearch. 4.2 FindingGeneticOverlapsamongDiseasesBasedon RankedGeneLists Thetraditionaldiseaseclassificationsystemgroupsdiseaseswithsimilarclinicalsymp- tomsandphenotypictraits. Thus,diseaseswithentirelydifferentunderlyingpathologies could be grouped together, leading to similar treatment design. Such problems may be avoided if diseases can be classified based on their genetic mechanisms. In fact, recent researchshowedthatmultiplediseasescouldsharethesamesetofmal-functionalgenes. Groupingdiseaseswithsimilarpathogenesismechanismscouldinspirenovelstrategies for effective re-positioning existing drugs and therapies. The key challenge is how to assessthegeneticsimilaritybetweentwodiseases, andhowtoidentifythecontributing genes. In Chapter 3, we aim to detect and identify genetic overlaps among different dis- eases. Twogroupsofstatisticsandthreeestimatorsofthenumberofoverlappinggenes 51 aredeveloped. Thefirstgroupisbasedonscanstatisticsconsideringthenumberofover- lappingtoprankedgenesbetweentwolists. Thesecondgroupisbasedontheweighted sum of the number of top ranked genes in two lists. We evaluate the effectiveness of these statistics by comparing their power in detecting genetic overlaps under a variety ofscenarios. Wealsostudytheeffectsofvariousparameterssuchasthereliabilityofthe ranking, the number of associated genes, the number of overlapping associated genes, and the total number of genes under study. As expected, the reliability of the ranking significantlyaffectsthepowerofthesestatistics. Inaddition,thetwogroupsofstatisticshavedifferentmerits. Thescanbasedstatis- ticscanbeappliedtosituationswhereonlyafractionofthetoprankedgenesareavail- ableforeachdisease,whichisthecasefortheNIHcatalogofGWASwhereonlyasso- ciations with p-value less than 10 −5 are reported. Among this group, the statistic K 1 performs reasonably well in most situations. However, these statistics are generally lesspowerfulthantheweightedsumstatisticsthatconsideralltheelementsinthelists. Amongtheweightedsumstatistics,WeiSumEperformsverywellinmostscenarioswe studied. Moreover,itovercomesthedifficultyofchoosingweightsfordifferentdiseases. For applications, we use WeiSumE to measure genetic overlap among diseases in the OMIM database based on ranked gene lists produced by ENDEAVOUR. We show that our method demonstrates superior performance relative to other simple methods in explainingthephenotypesimilarity. Furthermore,welookintodiseasepairsdisplaying major discrepancy between their genetic and phenotype similarity. For disease pairs high in genetic similarity but low in phenotype similarity, the known common genetic variants responsible for these disease pairs support their common genetic basis. On the otherhand,somediseasepairsshowhighphenotypesimilaritybutlowgeneticsimilarity since mutations responsible for these diseases may be involved in different pathways. In addition, we show the overall pattern of genetic overlap sizes between disease pairs 52 thatwestudy,andthepatternofpleiotropyextentofgenes. Finallywedemonstrateina specificexampleoffivevision-relateddiseaseshowourmethodscanprovideimportant biologicalinsightsintotheirgeneticmechanisms. Despitethesesignificantfindings,thisstudyhassomelimitations. First,thesimilar- itymeasuresdependpurelyontherankedgenelistswithoutexplicitlyconsideringtheir reliability. Whenexperimentaldatafrommultiplestudiesforeachdiseaseareavailable, we may be able to associate with each gene a confidence score that describe quantita- tively how the gene is related to the disease. Instead of transforming the confidence score to a rank which may reduce the power of detecting the genetic overlaps, we can use the confidence score directly to study the disease relationships. Second, the statis- tics in this chapter depend on the number of overlapping genes among the top genes for both lists. They have the advantages of being simple and easy to compute. Never- theless we may also consider statistics depending on the number of overlapping genes amongdifferentnumbersoftopgenesinthefirstlistandthesecondlist. Finally,weuse simulations to approximate the distributions of several statistics studied in this chapter. Theoreticalresultsontheirdistributionswillhelptocomputethestatisticalsignificance moreaccuratelyandefficiently. Thesearethetopicsforfuturestudies. 53 4.3 FutureDirections Anaturalextensiontoourmethodsonrankedgenelistsistoconsideratwo-dimensional version of the WeiSumE and WeiSumV statistics to detect the genetic overlap between diseases: WeiSumE 2 = ∑ 1≤i;j≤n X i;j EX i;j ; (4.1) WeiSumV 2 = ∑ 1≤i;j<n X i;j (X i;j ) : (4.2) whereX i;j isthenumberofoverlapsbetweentopigenesinlist1andtopj genesinlist 2,and EX i;j = ij n and (X i;j )= √ ij(n−i)(n−j) n √ n−1 ; The two dimensional extension of the weighted sum statistics is based on the two dimensionaloverlappingmatrixX =(X ij ),ratherthanourpreviouslyusedoverlapping vectorX = (X 1 ;··· ;X n ). Asaresult,itoffersalesscompressedrepresentationofthe overlapping pattern between two ranked lists, and thus preserves a richer depiction of the information, specifying the number of overlaps between any different portions of toprankedelementsineachlist. Regardingtheextendedtwo-dimensionalstatistics,onequestionofinterestwouldbe whetherweareabletoderive(orapproximate)itsdistributionunderthenullhypothesis that the two lists are random permutations of each other, based on which we can derive itsp-value. Thisgoalis,however,bynomeansasimpleonetoachieve. Nevertheless, without theoretical results, we are still able to carry out simulation studies to have a general understanding of the potential performances and issues of the method. An immediate difficulty we encountered is the considerable computation 54 burdenonceweextendittotwodimensionsandattempttoconducttensofthousandsof simulations to evaluate its performance. As a preliminary approach to relieve the issue, weadoptanexponentiallyincreasingstepsize s k = ⌊ n 25000 e k 13:5 ⌋ k =1;··· ;100 where n is the length of the gene list, and zero-valued s k ’s are discarded. In this way we can extract certain rows and columns of the overlapping matrixX = (X i;j ) to con- structasmallermatrix ~ X =(X i;j );i;j =1;1+s 1 ;··· ;1+ ∑ l k=1 s k ;···. Forexample when n = 25000, we can obtain a 100 × 100 matrix ~ X, which has 10000 numbers, insteadofan=25000lengthoverlapvector. Despitethatthetwodimensionalstatistics are based on only less than half the size of numbers recorded in the one dimensional case, the performance of the two dimensional statistics is surprisingly better than the one dimension ones in most scenarios. We show two examples in Fig 4.1 and Fig 4.2, wheretwodimensionalstatisticsconsistentlyshowhigherpowerthantheircorrespond- ing one dimensional cases. Thus we may conjecture that, even with more compressed information, the two-dimensional extensions capture the overlapping pattern between two ranked lists in a more efficient way, removing much of the redundant information intheonedimensionalcase. Wemayexpectevenbetterperformanceasweincreasethe numberofstepsandsearchforbetterwaystodefinethestepsizes. 55 σ 2 =0.01 σ 2 = 0.05 σ 2 = 0.1 Type I Error 0.00 0.02 0.04 σ 2 =0.01 σ 2 = 0.05 σ 2 = 0.1 Power 0.0 0.2 0.4 0.6 0.8 1.0 n=25000 a=10 o=1 WeiSumE2D WeiSumV2D WeiSumE1D WeiSumV1D Figure4.1: TheTypeIerrorratesandpoweroftheweightedsumstatisticsv.s. theirtwo dimensional extensions, takingd = 10;n = 25;000;o = 0 under null ando = 1 under alternative. 56 σ 2 =0.01 σ 2 = 0.05 σ 2 = 0.1 Type I Error 0.00 0.02 0.04 σ 2 =0.01 σ 2 = 0.05 σ 2 = 0.1 Power 0.0 0.2 0.4 0.6 0.8 1.0 n=25000 a=20 o=1 WeiSumE2D WeiSumV2D WeiSumE1D WeiSumV1D Figure4.2: TheTypeIerrorratesandpoweroftheweightedsumstatisticsv.s. theirtwo dimensional extensions, takingd = 20;n = 25;000;o = 0 under null ando = 1 under alternative. 57 ReferenceList [1] SergeyNejentsev,NeilWalker,DavidRiches,MichaelEgholm,andJohnA.Todd. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type1diabetes. Science,324:387–389,2009. [2] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, DaleRNyholt,PamelaAMadden,AndrewCHeath,NicholasGMartin,GrantW Montgomery, Michael E Goddard, and Peter M Visscher. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42:565– 569,2010. [3] Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, AravindaChakravarti,JudyH.Cho,AlanE.Guttmacher,AugustineKong,Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll, and Peter M. Visscher. Finding the missing heritability of complex diseases. Nature, 461:747–753,2009. [4] Matthew R. Nelson, Daniel Wegmann, Margaret G. Ehm, Darren Kessner, Pamela St. Jean, Claudio Verzilli, Judong Shen, Zhengzheng Tang, Silviu-Alin Bacanu,DanaFraser,LilingWarren,JenniferAponte,MatthewZawistowski,Xiao Liu,HaoZhang,YongZhang,JunLi,YunLi,LiLi,PeterWoollard,SimonTopp, Matthew D. Hall, Keith Nangle, Jun Wang, Gonalo Abecasis, Lon R. Cardon, Sebastian Zllner, John C. Whittaker, Stephanie L. Chissoe, John Novembre, and VincentMooser. Anabundanceofrarefunctionalvariantsin202drugtargetgenes sequencedin14,002people. Science,337:100–104,2012. [5] Jacob A. Tennessen, Abigail W. Bigham, Timothy D. OConnor, Wenqing Fu, Eimear E. Kenny, Simon Gravel, Sean McGee, Ron Do, Xiaoming Liu, Goo Jun, Hyun Min Kang, Daniel Jordan, Suzanne M. Leal, Stacey Gabriel, Mark J. Rieder,GoncaloAbecasis,DavidAltshuler,DeborahA.Nickerson,EricBoerwin- kle, Shamil Sunyaev, Carlos D. Bustamante, Michael J. Bamshad, and Joshua M. 58 Akey.Evolutionandfunctionalimpactofrarecodingvariationfromdeepsequenc- ingofhumanexomes. Science,337:64–69,2012. [6] Paola Benaglio, Terri L McGee, Leonardo P Capelli, Shyana Harper, Eliot L Berson,andCarloRivolta. Nextgenerationsequencingofpooledsamplesreveals newSNRNP200mutationsassociatedwithretinitispigmentosa. HumanMutation, 32:E2246–2258,2011. [7] Manuel A Rivas, Mlissa Beaudoin, Agnes Gardet, Christine Stevens, Yashoda Sharma,ClarenceKZhang,GabrielleBoucher,StephanRipke,DavidEllinghaus, Noel Burtt, Tim Fennell, Andrew Kirby, Anna Latiano, Philippe Goyette, Todd Green,JonasHalfvarson,TalinHaritunians,JoshuaMKorn,FinnyKuruvilla,Car- oline Lagac, Benjamin Neale, Ken Sin Lo, Phil Schumm, Leif Trkvist, Marla C Dubinsky, Steven R Brant, Mark S Silverberg, Richard H Duerr, David Altshuler, Stacey Gabriel, Guillaume Lettre, Andre Franke, Mauro D’Amato, Dermot P B McGovern,JudyHCho,JohnDRioux,RamnikJXavier,andMarkJDaly. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatoryboweldisease. NatureGenetics,43:1066–1073,2011. [8] Astrid A Out, Ivonne J H M van Minderhout, Jelle J Goeman, Yavuz Ariyurek, Stephan Ossowski, Korbinian Schneeberger, Detlef Weigel, Michiel van Galen, PeterEMTaschner,CarliMJTops,MartijnHBreuning,Gert-JanBvanOmmen, JohanTdenDunnen,PeterDevilee,andFrederikJHes. Deepsequencingtoreveal newvariantsinpooledDNAsamples. HumanMutation,30:1703–1712,2009. [9] Todd E Druley, Francesco L M Vallania, Daniel J Wegner, Katherine E Varley, OliviaLKnowles,JacquelineABonds,SarahWRobison,ScottWDoniger,Aaron Hamvas, F Sessions Cole, Justin C Fay, and Robi D Mitra. Quantification of rare allelicvariantsfrompooledgenomicDNA. NatureMethods,6:263–265,2009. [10] Vikas Bansal. A statistical method for the detection of variants from next- generationresequencingofDNApools. Bioinformatics,26:i318–i324,2010. [11] Zhi Wei, Wei Wang, Pingzhao Hu, Gholson J. Lyon, and Hakon Hakonarson. SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generationsequencingdata. NucleicAcidsResearch,39:e132,2011. [12] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Bur- rowsWheelertransform. Bioinformatics,26:589–595,2010. [13] Andre Altmann, Peter Weber, Carina Quast, Monika Rex-Haffner, Elisabeth B. Binder, and Bertram Mller-Myhsok. vipR: variant identification in pooled DNA usingr. Bioinformatics,27:i77–i84,2011. 59 [14] Su Yeon Kim, Yingrui Li, Yiran Guo, Ruiqiang Li, Johan Holmkvist, Torben Hansen, Oluf Pedersen, Jun Wang, and Rasmus Nielsen. Design of association studies with pooled or un-pooled next-generation sequencing data. Genetic Epi- demiology,34:479491,2010. [15] Tao Wang, Chang-Yun Lin, Thomas E Rohan, and Kenny Ye. Resequencing of pooled DNA for detecting disease associations with rare variants. Genetic Epi- demiology,34:492–501,2010. [16] JoonSangLee,MurimChoi,XitingYan,RichardPLifton,andHongyuZhao. On optimal pooling designs to identify rare variants through massive resequencing. GeneticEpidemiology,35:139–147,2011. [17] Xiaowei Chen, Jennifer B Listman, Frank J Slack, Joel Gelernter, and Hongyu Zhao. Biases and errors on allele frequency estimation and disease association tests of next generation sequencing of pooled samples. Genetic Epidemiology, 2012(Epubaheadofprint). [18] Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos, Jayashri P Mehta, Francis S Collins, and Teri A Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America,106(23):9362–9367,2009. [19] Damini Jawaheer, Michael F. Seldin, Christopher I. Amos, Wei V. Chen, Russell Shigeta, Joanita Monteiro, Marlene Kern, Lindsey A. Criswell, Salvatore Albani, J. Lee Nelson, Daniel O. Clegg, Richard Pope, Harry W. SchroederJr., S. Louis BridgesJr., David S. Pisetsky, Ryk Ward, Daniel L. Kastner, Ronald L. Wilder, Theodore Pincus, Leigh F. Callahan, Donald Flemming, Mark H. Wener, and Peter K. Gregersen. A genomewide screen in multiplex rheumatoid arthritis fam- ilies suggests genetic overlap with other autoimmune diseases. American Journal ofHumanGenetics,68(4):927–936,2001. [20] Patrick Danoy, Karena Pryce, Johanna Hadler, Linda A. Bradbury, Claire Farrar, Jennifer Pointon, Michael Ward, Michael Weisman, John D. Reveille, B. Paul Wordsworth, Millicent A. Stone, Walter P. Maksymowych, Proton Rahman, DafnaGladman,RobertD.Inman,MatthewA.Brown,Australo-Anglo-American Spondyloarthritis Consortium (TASC), and Spondyloarthritis Research Consor- tium of Canada (SPARCC). Association of variants at 1q32 and STAT3 with ankylosingspondylitissuggestsgeneticoverlapwithcrohn’sdisease. PLoSGenet, 6(12):e1001195,2010. [21] Deborah J. Smyth, Vincent Plagnol, Neil M. Walker, Jason D. Cooper, Kate Downes, Jennie H.M. Yang, Joanna M.M. Howson, Helen Stevens, Ross 60 McManus, CiscaWijmenga, GrahamA.Heap, PatrickC.Dubois, DavidG.Clay- ton, Karen A. Hunt, David A. van Heel, and John A. Todd. Shared and distinct genetic variants in type 1 diabetes and celiac disease. New England Journal of Medicine,359(26):2767–2777,2008. [22] Jeffrey F. Scherrer, Hong Xian, Kathleen K. Bucholz, Seth A. Eisen, Michael J. Lyons,JackGoldberg,MingTsuang,andWilliamR.True. Atwinstudyofdepres- sion symptoms, hypertension, and heart disease in middle-aged men. Psychoso- maticMedicine,65(4):548–557,2003. [23] Daniel M. Blonigen, Brian M. Hicks, Robert F. Krueger, Christopher J. Patrick, and William G. Iacono. Psychopathic personality traits: heritability and genetic overlap with internalizing and externalizing psychopathology. Psychological Medicine,35(05):637–648,2005. [24] Liam S. Carroll and Michael J. Owen. Genetic overlap between autism, schizophreniaandbipolardisorder. GenomeMedicine,1(10):102,2009. [25] P. Van Damme, J. H. Veldink, M. van Blitterswijk, A. Corveleyn, P. W. J. van Vught, V. Thijs, B. Dubois, G. Matthijs, L. H. van den Berg, and W. Robberecht. ExpandedATXN2CAGrepeatsizeinALSidentifiesgeneticoverlapbetweenALS andSCA2. Neurology,76(24):2066–2072,2011. [26] TimotheaToulopoulou,MarcoPicchioni,FruhlingRijsdijk,MeiHua-Hall,Ulrich Ettinger, Pak Sham, and Robin Murray. Substantial genetic overlap between neu- rocognition and schizophrenia: Genetic modeling in twin samples. Archives of GeneralPsychiatry,64(12):1348,2007. [27] R N Kalaria and C Ballard. Overlap between pathology of alzheimer disease and vascular dementia. Alzheimer Disease and Associated Disorders, 13 Suppl 3:S115–123,1999. [28] SteveEyre,AnneHinks,JohnBowes,EdwardFlynn,PaulMartin,AnthonyGWil- son, Ann W Morgan, Paul Emery, Sophia Steer, Lynn J Hocking, David M Reid, Pille Harrison, Paul Wordsworth, Yorkshire Early Arthritis (YEAR) Consortium, Biologics in RA Control (BIRAC) Consortium, Wendy Thomson, Jane Worthing- ton, and Anne Barton. Overlapping genetic susceptibility variants between three autoimmune disorders: rheumatoid arthritis, type 1 diabetes and coeliac disease. ArthritisResearch&Therapy,12(5):R175,2010. [29] Shanya Sivakumaran, Felix Agakov, Evropi Theodoratou, JamesG. Prendergast, LinaZgaga,TeriManolio,IgorRudan,PaulMcKeigue,JamesF.Wilson,andHarry Campbell. Abundantpleiotropyinhumancomplexdiseasesandtraits. TheAmer- icanJournalofHumanGenetics,89(5):607–618,2011. 61 [30] Andrey Rzhetsky, David Wajngurt, Naeun Park, and Tian Zheng. Probing genetic overlapamongcomplexhumanphenotypes. ProceedingsoftheNationalAcademy ofSciences,104(28):11694–11699,2007. [31] ChrisCotsapas,BenjaminF.Voight,ElizabethRossin,KasperLage,BenjaminM. Neale, Chris Wallace, Gonalo R. Abecasis, Jeffrey C. Barrett, Timothy Behrens, JudyCho,PhilipL.DeJager,JamesT.Elder,RobertR.Graham,PeterGregersen, Lars Klareskog, Katherine A. Siminovitch, David A. van Heel, Cisca Wijmenga, Jane Worthington, John A. Todd, David A. Hafler, Stephen S. Rich, Mark J. Daly, and on behalf of the FOCiS Network of Consortia. Pervasive sharing of genetic effectsinautoimmunedisease. PLoSGenet,7(8):e1002254,2011. [32] SilpaSuthram,JoelT.Dudley,AnnieP.Chiang,RongChen,TrevorJ.Hastie,and Atul J. Butte. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol,6(2):e1000662,2010. [33] BolanLinghu,EvanS.Snitkin,ZhenjunHu,YuXia,andCharlesDeLisi.Genome- wide prioritization of disease genes and identification of disease-disease associa- tions from an integrated human functional linkage network. Genome Biology, 10(9):R91,2009. [34] David Ellinghaus, Eva Ellinghaus, Rajan P. Nair, Philip E. Stuart, Tnu Esko, Andres Metspalu, Sophie Debrus, John V. Raelson, Trilokraj Tejasvi, Majid Belouchi, Sarah L. West, Jonathan N. Barker, Sulev Kks, Klli Kingo, Tobias Balschun, Orazio Palmieri, Vito Annese, Christian Gieger, H. Erich Wichmann, Michael Kabesch, Richard C. Trembath, Christopher G. Mathew, Gonalo R. Abecasis,StephanWeidinger,SusannaNikolaus,StefanSchreiber,JamesT.Elder, Michael Weichenthal, Michael Nothnagel, and Andre Franke. Combined analysis ofgenome-wideassociationstudiesforcrohndiseaseandpsoriasisidentifiesseven shared susceptibility loci. The American Journal of Human Genetics, 90(4):636– 647,2012. [35] Priit Adler, Raivo Kolde, Meelis Kull, Aleksandr Tkachenko, Hedi Peterson, Jri Reimand, and Jaak Vilo. Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biology, 10(12):R139,2009. [36] Anne-Laure Boulesteix and Martin Slawski. Stability and aggregation of ranked genelists. BriefingsinBioinformatics,10(5):556–568,2009. [37] XutaoDeng, JunXu, andCharlesWang. Improvingthepowerfordetectingover- lapping genes from multiple DNA microarray-derived gene lists. BMC Bioinfor- matics,9(Suppl6):S14,2008. 62 [38] Giuseppe Jurman, Samantha Riccadonna, Roberto Visintainer, and Cesare Furlanello. Algebraic comparison of partial lists in bioinformatics. PLoS ONE, 7(5):e36540,2012. [39] Shili Lin and Jie Ding. Integration of ranked lists via cross entropy monte carlo withapplicationstomRNAandmicroRNAstudies. Biometrics,65(1):918,2009. [40] Shili Lin. Space oriented rank-based data integration. Statistical Applications in GeneticsandMolecularBiology,9:Article20,2010. [41] Vasyl Pihur, Susmita Datta, and Somnath Datta. RankAggreg, an r package for weightedrankaggregation. BMCBioinformatics,10(1):62,2009. [42] Loki Natarajan, Minya Pu, and Karen Messer. Statistical tests for the intersection ofindependentlistsofgenes: Sensitivity,FDR,andtypeierrorcontrol.TheAnnals ofAppliedStatistics,6(2):521–541,2012. [43] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Ben- jaminL.Ebert,MichaelA.Gillette,AmandaPaulovich,ScottL.Pomeroy,ToddR. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-basedapproachforinterpretinggenome-wideexpressionprofiles. Pro- ceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550,2005. [44] W. Fury, F. Batliwalla, P.K. Gregersen, and Wentian Li. Overlapping probabili- ties of top ranking gene lists, hypergeometric distribution, and stringency of gene selectioncriterion. In28thAnnualInternationalConferenceoftheIEEEEngineer- inginMedicineandBiologySociety,2006.EMBS’06,pages5531–5534,2006. [45] Xinan Yang, Stefan Bentink, Stefanie Scheid, and Rainer Spang. Similarities of ordered gene lists. Journal of Bioinformatics and Computational Biology, 4(3):693–708,2006. [46] Helge G. Roider, Thomas Manke, Sean O’Keeffe, Martin Vingron, and Ste- fan A. Haas. PASTAA: identifying transcription factors associated with sets of co-regulatedgenes. Bioinformatics,25(4):435–442,2009. [47] Seema B. Plaisier, Richard Taschereau, Justin A. Wong, and Thomas G. Graeber. Rankrankhypergeometricoverlap: identificationofstatisticallysignificantoverlap between gene-expression signatures. Nucleic Acids Research, 38(17):e169–e169, 2010. [48] Shengyu Ni and Martin Vingron. R2KS: a novel measure for comparing gene expression based on ranked gene lists. Journal of Computational Biology, 19(6):766–775,2012. 63 [49] MichaelAntosh,DavidFox,LeonNCooper,andNicolaNeretti. CORaL:compar- isonofrankedlistsforanalysisofgeneexpressiondata. JournalofComputational Biology,20(6):433–443,2013. [50] Steven G. Self and Kung-Yee Liang. Asymptotic properties of maximum likeli- hood estimators and likelihood ratio tests under nonstandard conditions. Journal oftheAmericanStatisticalAssociation,82:605–610,1987. [51] Aaron R Quinlan, Donald A Stewart, Michael P Strmberg, and Gbor T Marth. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods,5:179–181,2008. [52] PaulFlicekand EwanBirney. Sense fromsequencereads: methodsfor alignment andassembly. NatureMethods,6:S6–S12,2009. [53] LaDeana W Hillier, Gabor T Marth, Aaron R Quinlan, David Dooling, Ginger Fewell, Derek Barnett, Paul Fox, Jarret I Glasscock, Matthew Hickenbotham, Weichun Huang, Vincent J Magrini, Ryan J Richt, Sacha N Sander, Donald A Stewart, Michael Stromberg, Eric F Tsung, Todd Wylie, Tim Schedl, Richard K Wilson,andElaineRMardis. Whole-genomesequencingandvariantdiscoveryin c.elegans. NatureMethods,5:183–188,2008. [54] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence align- ment/mapformatandSAMtools. Bioinformatics,25:2078–2079,2009. [55] Kai Wang, Mingyao Li, and Hakon Hakonarson. ANNOVAR: functional anno- tation of genetic variants from high-throughput sequencing data. Nucleic Acids Research,38:e164–e164,2010. [56] StephenT.Sherry,MinghongWard,andKarlSirotkin. dbSNPDatabaseforsingle nucleotide polymorphisms and other classes of minor genetic variation. Genome Research,9:677–679,1999. [57] MarkADePristo, EricBanks, RyanPoplin, KiranVGarimella, JaredRMaguire, ChristopherHartl,AnthonyAPhilippakis,GuillermodelAngel,ManuelARivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, and Mark J Daly. A framework for variation discovery and genotyping using next-generation DNAsequencingdata. NatureGenetics,43:491–498,2011. [58] Jason D Cooper, Deborah J Smyth, Adam M Smiles, Vincent Plagnol, Neil M Walker, James E Allen, Kate Downes, Jeffrey C Barrett, Barry C Healy, Josyf C 64 Mychaleckyj, James H Warram, and John A Todd. Meta-analysis of genome- wide association study data identifies additional type 1 diabetes risk loci. Nature Genetics,40:1399–1401,2008. [59] Jeffrey C Barrett, David G Clayton, Patrick Concannon, Beena Akolkar, Jason D Cooper,HenryAErlich,CcileJulier,GrantMorahan,JrnNerup,ConcepcionNier- ras, Vincent Plagnol, Flemming Pociot, Helen Schuilenburg, Deborah J Smyth, Helen Stevens, John A Todd, Neil M Walker, and Stephen S Rich. Genome-wide associationstudyandmeta-analysisfindthatover40lociaffectriskoftype1dia- betes. NatureGenetics,41:703–707,2009. [60] Jie Huang, David Ellinghaus, Andre Franke, Bryan Howie, and Yun Li. 1000 genomes-based imputation identifies novel and refined associations for the well- come trust case control consortium phase 1 data. European Journal of Human Genetics,20:801–805,2012. [61] VincentPlagnol, Joanna M. M. Howson, Deborah J. Smyth, Neil Walker, Jason P. Hafler, Chris Wallace, Helen Stevens, Laura Jackson, Matthew J. Simmonds, Polly J. Bingley, Stephen C. Gough, John A. Todd, and Type 1 Diabetes Genet- ics Consortium. Genome-wide association analysis of autoantibody positivity in type1diabetescases. PLoSGenetics,7:e1002216,2011. [62] ChingferChenandSamuelKarlin.r-scanstatisticsofapoissonprocesswithevents transformed by duplications, deletions, and displacements. Advances in Applied Probability,39(3):799–825,2007. [63] Samuel Karlin and Chingfer Chen. r-scan extremal statistics of inhomogeneous poissonprocesses. LectureNotes-MonographSeries,45:287–290,2004. [64] Warren J. Ewens, Warren John Ewens, and Gregory Grant. Statistical Methods in Bioinformatics: AnIntroduction. Springer,2006. [65] S. Karlin and V. Brendel. Chance and statistical significance in protein and DNA sequenceanalysis. Science,257(5066):39–49,1992. [66] AndrewD.Smith,PavelSumazin,DebopriyaDas,andMichaelQ.Zhang. Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics, 21(suppl1):i403–i412,2005. [67] Claudio Lottaz, Xinan Yang, Stefanie Scheid, and Rainer Spang. OrderedLista bioconductorpackagefordetectingsimilarityinorderedgenelists.Bioinformatics, 22(18):2315–2316,2006. [68] SteinAerts,DietherLambrechts,SunitMaity,PeterVanLoo,BertCoessens,Fred- erik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem 65 Hassan, Peter Carmeliet, and Yves Moreau. Gene prioritization through genomic datafusion. NatureBiotechnology,24(5):537–544,2006. [69] Lon-Charles Tranchevent, Roland Barriot, Shi Yu, Steven Van Vooren, Peter Van Loo, Bert Coessens, Bart De Moor, Stein Aerts, and Yves Moreau. Endeavour update: a web resource for gene prioritization in multiple species. Nucleic Acids Research,36(suppl2):W377–W384,2008. [70] Peter N. Robinson, Sebastian Kohler, Sebastian Bauer, Dominik Seelow, Denise Horn,andStefanMundlos. Thehumanphenotypeontology: Atoolforannotating and analyzing human hereditary disease. American Journal of Human Genetics, 83(5):610–615,2008. [71] MarcA.vanDriel,JornBruggeman,GertVriend,HanG.Brunner,andJackA.M. Leunissen. A text-mining analysis of the human phenome. European Journal of HumanGenetics,14(5):535–542,2006. [72] Kasper Lage, E. Olof Karlberg, Zenia M. Strling, Pll lason, Anders G. Ped- ersen, Olga Rigina, Anders M. Hinsby, Zeynep Tmer, Flemming Pociot, Niels Tommerup, Yves Moreau, and Sren Brunak. A human phenome-interactome net- workofproteincomplexesimplicatedingeneticdisorders. NatureBiotechnology, 25(3):309–316,2007. [73] Zaki Shaikhibrahim, Andreas Lindstrot, Reinhard Buettner, and Nicolas Wernert. Analysis of laser-microdissected prostate cancer tissues reveals potential tumor markers. InternationalJournalofMolecularMedicine,28(4):605–611,2011. [74] Vasker Bhattacherjee, Kristin H. Horn, Saurabh Singh, Cynthia L. Webb, M. Michele Pisano, and Robert M. Greene. CBP/p300 and associated transcrip- tional co-activators exhibit distinct expression patterns during murine craniofacial and neural tube development. The International Journal of Developmental Biol- ogy,53(7):1097–1104,2009. [75] Wei Lu, Adrian R. Guzman, Wei Yang, Claudia J. Chapa, Gary M. Shaw, Robert M. Greene, M. Michele Pisano, Edward J. Lammer, Richard H. Finnell, and Huiping Zhu. Genes encoding critical transcriptional activators for murine neuraltubedevelopmentandhumanspinabifida: acase-controlstudy. BMCMed- icalGenetics,11(1):141,2010. [76] Julie M Schultz, Rashid Bhatti, Anne C Madeo, Amy Turriff, Julie A Muskett, Christopher K Zalewski, Kelly A King, Zubair M Ahmed, Saima Riazuddin, Nazir Ahmad, Zawar Hussain, Muhammad Qasim, Shaheen N Kahn, Meira R Meltzer, Xue Z Liu, Murali Munisamy, Manju Ghosh, Heidi L Rehm, Ekaterini T Tsilou, Andrew J Griffith, Wadih M Zein, Carmen C Brewer, Sheikh Riazuddin, 66 and Thomas B Friedman. Allelic hierarchy of CDH23 mutations causing non- syndromic deafness DFNB12 or usher syndrome USH1D in compound heterozy- gotes. JournalofMedicalGenetics,48(11):767–775,2011. [77] P.l. Dighiero, S. Balayre, J.-J. Gicquel, and P. Vabres. Corneal recurrent erosions andmutationsinthegeneCOL7A1. ARVOMeetingAbstracts,45(5):1510,2004. 67 AppendixA SupplementaryMaterialsfor“A UnifiedApproachforAlleleFrequency Estimation,SNPDetectionand AssociationStudiesBasedonPooled SequencingDataUsingEM Algorithms” A.1 SupplementaryMethods A.1.1 Details of Simulation Studies on Estimating the Power of AssociationTestsUsingEM TostudythepoweroftestingforassociationbetweenaSNPandaphenotypeofinterest using the method in the “Testing for associations between a SNP and a phenotype in case-control studies” subsection, we first generate case-control pooled sequencing data on loci that are not associated with the phenotype and repeat the process R = 1000 times. Wethenobtaintheempiricaldistributionsof. ForagiventypeIerror ,wetake the upper percentile t as the threshold to decide whether a locus is associated with 68 the phenotype. Then we simulate case-control data with true association forR = 1000 times, and compute the fraction of times that the value of is abovet , which is taken asanestimateofthepowerofassociationwiththephenotype. A.2 SupplementaryResults A.2.1 ResultsontheAccuracyof ^ f em UsingtheModifiedMeasures ofMSEandCg There are a few outlier estimates that are far away from the true value. This indicates that the standard MSE criterion used in the main text may not effectively reflect the estimation accuracy of the parameters. Thus, we modify the MSE as follows. First, remove the upper percent and the lower percent of the estimated values of ^ f em . Second, calculate the mean squared error of the remaining ^ f em ’s, denoted as 2 m (), wherethesubscriptmreferstothismodifiedversion. Herewelet=5and10andthe valuesof 2 m ()’sarepresentedinsupplementaryTablesA.1andA.2,correspondingto the16scenariosinsupplementaryFiguresA.2andA.3. A.2.2 TheEffectsoftheNumberofChromosomesK inaPool,the NumberofPoolsG, and theNumber ofReadsn onthe Esti- mationAccuracyof ^ f em TostudytheeffectsofthenumberofchromosomesK inapoolandthenumberofpools Gontheestimationaccuracyof ^ f em asanestimatoroff,wefixf =1%; start =1%and plotthehistogramsof ^ f em withn=1000insupplementaryFigureA.4. Itshowsthat ^ f em maintainsthenicepropertyofunbiasednessaroundthetruevaluef asshowninthe“The 69 effects of minor allele frequencyf and sequencing error rate on the estimation accu- racyof ^ f em ”subsection,while ^ f em asanestimatoroff displaysconsiderablevarianceat the same time. However, in supplementary Figure A.5 where we plot the histograms of ^ f em −f frac whenn=1000,thevarianceof ^ f em −f frac issignificantlyreducedcompared to ^ f em insupplementaryFigureA.4. ThesmallerthevalueofK is,themoresignificant the reduction in variance is, which is also confirmed quantitatively in supplementary TableA.3. ThisobservationindicatesthatwhenthenumberofchromosomesK ineach pool is relatively low, ^ f em is a fairly accurate estimate of the fraction of chromosomes f frac carryingminoralleles,andthevarianceof ^ f em ismostlyduetothevarianceoff frac , which is an intrinsic variance of the sampling procedure from the population. As the number of chromosomes K in each pool increases, the variance induced by the algo- rithmitselfplaysamoreimportantrole,whilethevarianceoff frac contributeslesstothe variance of ^ f em due to the larger sample size. In addition, supplementary Figures A.4 andA.5showaconsistenttrendofanincreaseintheaccuracyof ^ f em (ordecreaseinthe MSE) as the number of pools G increases and as the number of chromosomes K in a pooldecreases. ThiscanalsobeseenfromsupplementaryTableA.3. To study the effect of the number of reads n on the estimation accuracy of f em for f, we also let n = 3000. The results are presented in supplementary Figures A.6 and A.7, and supplementary Tables A.4 and Table A.6. It can be seen that the accuracy of ^ f em increaseswiththenumberofreadsineachpool. A.3 SupplementaryTables 70 =0:05% = 0:1% =0:5% = 1% f MSE Cg MSE Cg MSE Cg MSE Cg 0:1% 6.3e-07 3.7e-08 6.4e-07 1.2e-07 1.5e-06 9.3e-07 3.6e-06 3.2e-06 0:5% 3.4e-06 1.1e-07 3.4e-06 1.6e-07 4.1e-06 9.6e-07 5.1e-06 1.5e-06 1% 7.4e-06 3.8e-07 7.5e-06 5.1e-07 8.1e-06 1.4e-06 9.1e-06 2.7e-06 5% 3.3e-05 9.4e-06 3.4e-05 9.6e-06 3.8e-05 1.2e-05 4.9e-05 2.0e-05 TableA.1: Thevaluesof 2 m (5)(definedintermsofMSEandCg)for ^ f em asanestimator off orf frac fordifferentvaluesoff and,(K;G;n)=(100;10;3000). 2 m (10) = 0:05% =0:1% =0:5% =1% f MSE Cg MSE Cg MSE Cg MSE Cg 0:1% 4.9e-07 3.7e-08 4.8e-07 1.2e-07 7.9e-07 2.2e-07 1.1e-06 4.0e-07 0:5% 2.4e-06 1.1e-07 2.4e-06 1.5e-07 2.9e-06 6.6e-07 3.3e-06 8.6e-07 1% 5.2e-06 3.6e-07 5.2e-06 4.9e-07 5.7e-06 1.4e-06 6.2e-06 2.2e-06 5% 2.4e-05 8.7e-06 2.4e-05 8.8e-06 2.6e-05 1.1e-05 3.2e-05 1.3e-05 Table A.2: The values of 2 m (10) (defined in terms of MSE and Cg) for ^ f em as an estimatoroff orf frac fordifferentvaluesoff and,(K;G;n)=(100;10;3000). G= 10 G =20 G =50 K MSE Cg MSE Cg MSE Cg 50 3.3e-05 1.1e-05 1.5e-05 4.3e-06 5.5e-06 2.0e-06 100 5.3e-05 4.3e-05 4.5e-05 4.0e-05 1.8e-05 1.6e-05 200 5.2e-05 5.4e-05 5.9e-05 5.9e-05 3.8e-05 3.7e-05 TableA.3: ThevaluesofMSEandCgfor ^ f em asanestimatoroff andf frac ,respectively, fordifferentvaluesof(K;G),(n;f; start )=(1000;0:01;0:01). G= 10 G =20 G =50 K MSE Cg MSE Cg MSE Cg 50 2.4e-05 4.2e-06 1.4e-05 2.6e-06 5.1e-06 1.4e-06 100 1.6e-05 5.6e-06 7.6e-06 2.4e-06 2.8e-06 6.8e-07 200 2.5e-05 2.4e-05 1.6e-05 1.4e-05 7.5e-06 6.8e-06 TableA.4: ThevaluesofMSEandCgfor ^ f em asanestimatoroff andf frac ,respectively, fordifferentvaluesof(K;G),(n;f; start )=(3000;0:01;0:01). 71 G=10 G=20 G=50 K 50 100 200 50 100 200 50 100 200 2 m;MSE (5) 1.8e-05 2.6e-05 2.8e-05 8.8e-06 1.9e-05 3.3e-05 3.0e-06 6.8e-06 1.6e-05 2 m;Cg (5) 5.3e-06 2.1e-05 2.8e-05 1.6e-06 1.8e-05 3.4e-05 5.0e-07 5.7e-06 1.6e-05 2 m;MSE (10) 1.2e-05 1.8e-05 1.8e-05 6.2e-06 1.2e-05 2.1e-05 2.1e-06 4.1e-06 9.2e-06 2 m;Cg (10) 4.1e-06 1.5e-05 1.9e-05 1.3e-06 1.2e-05 2.2e-05 5.1e-07 3.2e-06 8.9e-06 Table A.5: The modified mean-squared-error 2 m (5) and 2 m (10) of the estimated val- ues of f for different numbers of chromosomes in a pool K and numbers of pools G, (n;f; start )=(1000;0:01;0:01). G=10 G=20 G=50 K 50 100 200 50 100 200 50 100 200 2 m;MSE (5) 1.4e-05 9.1e-06 1.1e-05 7.2e-06 4.1e-06 6.7e-06 2.7e-06 1.5e-06 2.9e-06 2 m;Cg (5) 2.3e-06 2.7e-06 1.0e-05 1.1e-07 9.2e-07 6.0e-06 1.9e-08 2.9e-07 2.6e-06 2 m;MSE (10) 1.0e-05 6.2e-06 7.5e-06 5.0e-06 2.8e-06 4.2e-06 1.9e-06 1.1e-06 1.8e-06 2 m;Cg (10) 1.2e-06 2.2e-06 7.5e-06 1.2e-07 8.1e-07 4.0e-06 1.7e-08 2.6e-07 1.7e-06 Table A.6: The modified mean squared error 2 m (5) and 2 m (10) of the estimated val- ues of f for different numbers of chromosomes in a pool K and numbers of pools G, (n;f; start )=(1000;0:01;0:01). variantscalled f EM < 0:2% f avg < 0:2% #total #dbSNP #novel #total #dbSNP #novel a)Top100SNPs EM-SNP 11 9 2 13 11 2 SNVer 10 1 9 10 0 10 b)Top150SNPs EM-SNP 29 18 11 30 22 8 SNVer 30 4 26 26 2 24 Table A.7: The total number of SNPs (2nd and 5th columns), the number of SNPs in thedbSNPdatabase(3rdand6thcolumns),andthenumberofnovelSNPs(4thand7th columns) among the top a) 100 and b) 150 EM-SNP or SNVer called SNPs with minor allele frequency at most 0.2% according to eitherf em orf avg . The dbSNP ratio for the SNPscalledbyEM-SNPismuchhigherthanthatforSNVer. 72 variantscalled total known(dbSNP) novel #Ti #Tv #Ti #Tv #Ti #Tv b)Top100SNPs EM-SNP 9 2 7 2 2 0 SNVer 3 7 1 0 2 7 b)Top150SNPs EM-SNP 20 9 13 5 7 4 SNVer 13 17 3 1 10 6 TableA.8: Thenumberoftransitions(Ti)(2nd,4thand6thcolumns)andthenumberof transversions(Tv)(3rd,5thand7thcolumns)amongthetopa)100andb)150EM-SNP orSNVercalledSNPswithminorallelefrequencyatmost0.2%accordingtoeitherf em or f avg . The Ti/Tv ratio for the SNPs called by EM-SNP is much higher than that for SNVer. 73 A.4 SupplementaryFigures 0 50 100 150 200 250 0 2 4 6 f avg simulation index log(RE avg ) 0 50 100 150 200 250 0 2 4 f em simulation index log(RE em ) Figure A.1: The log values of the relative errors (RE) of ^ f avg and ^ f em for the 288 sim- ulations. The four sections (72 simulations each) divided by the solid lines correspond to simulations with f = 0:1%;0:5%;1%;5%, respectively. For ^ f avg the decrease by section indicates the significant effect off on RE. Within each section the two subsec- tions (36 simulations each) divided by the dashed line correspond to simulations with n = 1000;3000, respectively. For ^ f avg the indifference between subsections indicate no effects of n on RE. Within each subsection the four quarters (9 simulations each) correspond to simulations with = 0:05%;0:1%;0:5%;1%, respectively. For ^ f avg the step-likeshapeindicateslittleeffectofG;K onRE,comparedtosignificanteffectof. 74 f=0.001 α start =5e−04 f em Frequency 0.000 0.002 0.004 0 150 300 n= 3000 K= 100 G= 10 f=0.001 α start =0.001 f em Frequency 0.000 0.002 0.004 0 150 300 n= 3000 K= 100 G= 10 f=0.001 α start =0.005 f em Frequency 0.000 0.004 0.008 0 100 250 n= 3000 K= 100 G= 10 f=0.001 α start =0.01 f em Frequency 0.000 0.005 0.010 0.015 0.020 0 100 250 n= 3000 K= 100 G= 10 f=0.005 α start =5e−04 f em Frequency 0.000 0.004 0.008 0.012 0 40 100 n= 3000 K= 100 G= 10 f=0.005 α start =0.001 f em Frequency 0.000 0.004 0.008 0.012 0 40 100 n= 3000 K= 100 G= 10 f=0.005 α start =0.005 f em Frequency 0.000 0.004 0.008 0.012 0 40 100 n= 3000 K= 100 G= 10 f=0.005 α start =0.01 f em Frequency 0.000 0.005 0.010 0.015 0 40 80 n= 3000 K= 100 G= 10 f=0.01 α start =5e−04 f em Frequency 0.000 0.010 0.020 0 20 50 n= 3000 K= 100 G= 10 f=0.01 α start =0.001 f em Frequency 0.005 0.015 0.025 0 20 60 n= 3000 K= 100 G= 10 f=0.01 α start =0.005 f em Frequency 0.005 0.015 0.025 0 20 40 60 n= 3000 K= 100 G= 10 f=0.01 α start =0.01 f em Frequency 0.000 0.010 0.020 0.030 0 20 60 n= 3000 K= 100 G= 10 f=0.05 α start =5e−04 f em Frequency 0.02 0.04 0.06 0 10 30 n= 3000 K= 100 G= 10 f=0.05 α start =0.001 f em Frequency 0.02 0.04 0.06 0.08 0 10 25 n= 3000 K= 100 G= 10 f=0.05 α start =0.005 f em Frequency 0.00 0.02 0.04 0.06 0.08 0 20 40 60 n= 3000 K= 100 G= 10 f=0.05 α start =0.01 f em Frequency 0.00 0.02 0.04 0.06 0 20 40 60 n= 3000 K= 100 G= 10 Figure A.2: The effects of f and on the accuracy of ^ f em as an estimator of f. The histograms of ^ f em under different combinations of f and are shown, (K;G;n) = (100;10;3000). 75 f=0.001 α start =5e−04 f em - f frac Frequency −1e−03 0e+00 1e−03 0 200 500 n= 3000 K= 100 G= 10 f=0.001 α start =0.001 f em - f frac Frequency −0.0010 0.0000 0.0010 0.0020 0 200 500 n= 3000 K= 100 G= 10 f=0.001 α start =0.005 f em - f frac Frequency 0.000 0.004 0.008 0 200 500 n= 3000 K= 100 G= 10 f=0.001 α start =0.01 f em - f frac Frequency 0.000 0.005 0.010 0.015 0.020 0 200 400 n= 3000 K= 100 G= 10 f=0.005 α start =5e−04 f em - f frac Frequency −0.001 0.001 0.003 0 200 500 n= 3000 K= 100 G= 10 f=0.005 α start =0.001 f em - f frac Frequency −0.001 0.000 0.001 0.002 0.003 0 200 500 n= 3000 K= 100 G= 10 f=0.005 α start =0.005 f em - f frac Frequency −0.002 0.002 0.004 0.006 0 200 500 n= 3000 K= 100 G= 10 f=0.005 α start =0.01 f em - f frac Frequency 0.000 0.005 0.010 0 200 400 n= 3000 K= 100 G= 10 f=0.01 α start =5e−04 f em - f frac Frequency −0.010 0.000 0.005 0.010 0 200 400 n= 3000 K= 100 G= 10 f=0.01 α start =0.001 f em - f frac Frequency −0.010 0.000 0.005 0.010 0 200 400 n= 3000 K= 100 G= 10 f=0.01 α start =0.005 f em - f frac Frequency −0.010 −0.005 0.000 0.005 0 200 400 n= 3000 K= 100 G= 10 f=0.01 α start =0.01 f em - f frac Frequency −0.010 0.000 0.010 0 100 200 n= 3000 K= 100 G= 10 f=0.05 α start =5e−04 f em - f frac Frequency −0.03 −0.01 0.01 0 40 80 n= 3000 K= 100 G= 10 f=0.05 α start =0.001 f em - f frac Frequency −0.03 −0.01 0.01 0 40 80 n= 3000 K= 100 G= 10 f=0.05 α start =0.005 f em - f frac Frequency −0.04 −0.02 0.00 0.02 0 40 80 n= 3000 K= 100 G= 10 f=0.05 α start =0.01 f em - f frac Frequency −0.04 −0.02 0.00 0.02 0 40 100 n= 3000 K= 100 G= 10 Figure A.3: The effects off and on the accuracy of ^ f em as an estimator off frac . The histogramsof ^ f em −f frac underdifferentcombinationsoff andareshown,(K;G;n)= (100;10;3000). 76 K= 50 G= 10 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 20 40 60 80 f= 0.01 α start =0.01 n= 1000 K= 50 G= 20 f em Frequency 0.000 0.005 0.010 0.015 0.020 0.025 0 10 20 30 40 f= 0.01 α start =0.01 n= 1000 K= 50 G= 50 f em Frequency 0.005 0.010 0.015 0.020 0.025 0 10 20 30 40 50 f= 0.01 α start =0.01 n= 1000 K= 100 G= 10 f em Frequency 0.00 0.01 0.02 0.03 0.04 0.05 0 10 20 30 40 f= 0.01 α start =0.01 n= 1000 K= 100 G= 20 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 10 20 30 40 50 f= 0.01 α start =0.01 n= 1000 K= 100 G= 50 f em Frequency 0.005 0.015 0.025 0.035 0 20 40 60 80 100 f= 0.01 α start =0.01 n= 1000 K= 200 G= 10 f em Frequency 0.00 0.01 0.02 0.03 0.04 0.05 0 10 20 30 40 f= 0.01 α start =0.01 n= 1000 K= 200 G= 20 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 10 20 30 40 f= 0.01 α start =0.01 n= 1000 K= 200 G= 50 f em Frequency 0.01 0.02 0.03 0.04 0 20 40 60 80 f= 0.01 α start =0.01 n= 1000 Figure A.4: The effects of (K;G) on the estimation accuracy of ^ f em as an estimator of f,(n;f; start )=(1000;0:01;0:01). 77 K= 50 G= 10 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0 100 200 300 400 f= 0.01 α start =0.01 n= 1000 K= 50 G= 20 f em - f frac Frequency 0.000 0.005 0.010 0 50 100 150 200 f= 0.01 α start =0.01 n= 1000 K= 50 G= 50 f em - f frac Frequency 0.000 0.005 0.010 0 50 100 150 f= 0.01 α start =0.01 n= 1000 K= 100 G= 10 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0 10 30 50 70 f= 0.01 α start =0.01 n= 1000 K= 100 G= 20 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0 20 40 60 80 f= 0.01 α start =0.01 n= 1000 K= 100 G= 50 f em - f frac Frequency −0.005 0.005 0.010 0.015 0.020 0.025 0 20 60 100 f= 0.01 α start =0.01 n= 1000 K= 200 G= 10 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0.04 0 10 20 30 40 f= 0.01 α start =0.01 n= 1000 K= 200 G= 20 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0 10 20 30 40 50 f= 0.01 α start =0.01 n= 1000 K= 200 G= 50 f em - f frac Frequency 0.00 0.01 0.02 0.03 0 20 40 60 80 f= 0.01 α start =0.01 n= 1000 Figure A.5: The effects of (K;G) on the estimation accuracy of ^ f em as an estimator of f frac ,(n;f; start )=(1000;0:01;0:01). 78 K= 50 G= 10 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 50 100 150 f= 0.01 α start =0.01 n= 3000 K= 50 G= 20 f em Frequency 0.005 0.010 0.015 0.020 0.025 0.030 0 20 40 60 80 100 f= 0.01 α start =0.01 n= 3000 K= 50 G= 50 f em Frequency 0.005 0.010 0.015 0.020 0 10 30 50 70 f= 0.01 α start =0.01 n= 3000 K= 100 G= 10 f em Frequency 0.000 0.010 0.020 0.030 0 20 40 60 f= 0.01 α start =0.01 n= 3000 K= 100 G= 20 f em Frequency 0.005 0.010 0.015 0.020 0.025 0 10 20 30 40 f= 0.01 α start =0.01 n= 3000 K= 100 G= 50 f em Frequency 0.005 0.010 0.015 0.020 0 10 30 50 f= 0.01 α start =0.01 n= 3000 K= 200 G= 10 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 10 30 50 f= 0.01 α start =0.01 n= 3000 K= 200 G= 20 f em Frequency 0.00 0.01 0.02 0.03 0.04 0 20 40 60 80 f= 0.01 α start =0.01 n= 3000 K= 200 G= 50 f em Frequency 0.005 0.010 0.015 0.020 0.025 0.030 0 10 20 30 40 50 60 f= 0.01 α start =0.01 n= 3000 Figure A.6: The effects of (K;G) on the estimation accuracy of ^ f em as an estimator of f,(n;f; start )=(3000;0:01;0:01). 79 K= 50 G= 10 f em - f frac Frequency 0.000 0.005 0.010 0 200 400 600 800 f= 0.01 α start =0.01 n= 3000 K= 50 G= 20 f em - f frac Frequency 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0 200 400 600 f= 0.01 α start =0.01 n= 3000 K= 50 G= 50 f em - f frac Frequency 0.000 0.002 0.004 0.006 0.008 0.010 0 200 400 600 f= 0.01 α start =0.01 n= 3000 K= 100 G= 10 f em - f frac Frequency −0.010 −0.005 0.000 0.005 0.010 0.015 0 50 100 150 200 f= 0.01 α start =0.01 n= 3000 K= 100 G= 20 f em - f frac Frequency 0.000 0.005 0.010 0 50 100 150 f= 0.01 α start =0.01 n= 3000 K= 100 G= 50 f em - f frac Frequency 0.000 0.002 0.004 0.006 0.008 0.010 0 20 40 60 80 100 f= 0.01 α start =0.01 n= 3000 K= 200 G= 10 f em - f frac Frequency −0.01 0.00 0.01 0.02 0.03 0 20 40 60 f= 0.01 α start =0.01 n= 3000 K= 200 G= 20 f em - f frac Frequency −0.01 0.00 0.01 0.02 0 20 40 60 80 100 f= 0.01 α start =0.01 n= 3000 K= 200 G= 50 f em - f frac Frequency −0.005 0.000 0.005 0.010 0.015 0.020 0 10 30 50 f= 0.01 α start =0.01 n= 3000 Figure A.7: The effects of (K;G) on the estimation accuracy of ^ f em as an estimator of f frac ,(n;f; start )=(3000;0:01;0:01). 80 0.00 0.01 0.02 0.03 0.04 0.05 0.90 0.92 0.94 0.96 0.98 1.00 f power n=1000 n=3000 0.002 0.004 0.006 0.008 0.010 0.95 0.96 0.97 0.98 0.99 1.00 α power n=1000 n=3000 50 100 150 200 0.95 0.96 0.97 0.98 0.99 1.00 K power n=1000 n=3000 10 20 30 40 50 0.95 0.96 0.97 0.98 0.99 1.00 G power n=1000 n=3000 FigureA.8: ThepowerofSNPdetectionatatypeIerrorof0.05usingcasesandcontrols together,(K;G;f;;)=(100;20;1;0:5%;1:2). 81 EM−SNP top 100 called variants f em Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 5 10 15 20 25 30 EM−SNP top 100 called variants f avg Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 SNVer top 100 called variants f em Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 5 10 15 20 SNVer top 100 called variants f avg Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 5 10 15 Figure A.9: Allele frequency spectrum of the top 100 variants called by EM-SNP and SNVer. 82 f em ≤ 0.005 0.005<f em ≤ 0.01 0.01<f em ≤ 0.05 f em > 0.05 dbSNP proportion in EM−SNP’s and SNVer’s top 100 variants 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 EM−SNP novel SNPs SNVer novel SNPs EM−SNP/SNVer dbSNPs f em ≤ 0.005 0.005<f em ≤ 0.01 0.01<f em ≤ 0.05 f em > 0.05 Ti/Tv proportion in EM−SNP’s and SNVer’s top 100 variants 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 EM−SNP Tv SNVer Tv EM−SNP/SNVer Ti f em ≤ 0.005 0.005<f em ≤ 0.01 0.01<f em ≤ 0.05 f em > 0.05 Ti/Tv proportion in EM−SNP’s and SNVer’s konwn variants 0 5 10 15 20 0 5 10 15 20 EM−SNP Tv SNVer Tv EM−SNP/SNVer Ti f em ≤ 0.005 0.005<f em ≤ 0.01 0.01<f em ≤ 0.05 f em > 0.05 Ti/Tv proportion in EM−SNP’s and SNVer’s novel variants 0 5 10 15 0 5 10 15 EM−SNP Tv SNVer Tv EM−SNP/SNVer Ti Figure A.10: dbSNP proportion and Ti/Tv proportion of the top 100 variants called by EM-SNPandSNVer,groupedbyallelefrequencylevel. 83 EM−SNP top 150 called variants f em Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50 60 EM−SNP top 150 called variants f avg Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50 60 SNVer top 150 called variants f em Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50 SNVer top 150 called variants f avg Frequency 0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50 Figure A.11: Allele frequency spectrum of the top 150 variants called by EM-SNP and SNVer. 84 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 0.6 dbSNP ratio threshold dbSNP ratio with freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.5 1.0 1.5 2.0 Ti/Tv ratio of all variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0 2 4 6 8 Ti/Tv ratio of known variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.5 1.0 1.5 Ti/Tv ratio of novel variants called threshold Ti/Tv ratio for freq less than threshold EM−SNP SNVer FigureA.12: ThedbSNPratioandTi/Tvratioofthetop150variantscalledbyEM-SNP andSNVer. 85 AppendixB SupplementaryMaterialsfor“Finding GeneticOverlapsamongDiseases BasedonRankedGeneLists” Thesupplementarymaterialsfilecontainstwoparts. Thefirstpartincludessupplemen- tary results on OMIM diseases clustering, another application of clustering diseases in the NIH GWAS catalog, and simulation studies. The second part includes supplemen- tarymethods. B.1 SupplementaryResults B.1.1 GeneticOverlapofOMIMDiseases Fig B.1 to Fig B.4 are referenced in “Clustering Diseases Based on Genetic Overlaps amongRankedGeneLists”. DiseasePairGeneticSimilarityv.s. PhenotypeSimilarity We rank the WeiSumE genetic similarity and the MimMiner [71] phenotype similarity ofalldiseasepairs,wherethemostsimilarpairisrankedasfirstandthemostdissimilar rankedlast. ThenweplotthetworanksofdiseasepairsagainsteachotherinFigureB.5. We observe a generally uniform distribution across the plane except a relatively small area of concentration in the bottom left corner. This observation suggests that, on one 86 0 500 1000 1500 0 500 1000 1500 HPO v.s. MimMiner MimMiner similarity score rank HPO similarity score rank 0 50 100 150 0 50 100 150 Lage v.s. MimMiner MimMiner similarity score rank Lage similarity score rank 0 50 100 150 0 50 100 150 Lage v.s. HPO HPO similarity score rank Lage similarity score rank Figure B.1: Scatter plots of MimMiner, HPO, Lage disease pair similarity ranks versus each other for the 89 OMIM diseases. Each plot contains only disease pairs present in bothphenotypesimilaritymatrices. 0 3 6 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 Distribution of the Number of Overlapping Genes between Disease Pairs Number of Overlapping Genes between Disease Pairs 0 50 100 150 200 FigureB.2: DistributionoftheNumberofOverlappingGenesbetweenDiseasePairs. hand, for disease pairs that do not share a common genetic foundation or do not show considerable phenotype similarity, their similarity ranks approximate a 2-dimensional 87 2 4 6 8 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 Distribution of Pleiotropy pleiotropy 0 200 400 600 800 Figure B.3: Distribution of the number of diseases associated with pleiotropic genes. For each gene, we refer to the number of diseases that it is identified as an overlapping genewithanotherdiseaseasameasureofitsextentof“pleiotropy”. Histogram of WeiSumE WeiSumE Frequency 0.0 0.2 0.4 0.6 0.8 0 100 200 300 400 500 600 FigureB.4: AhistogramoftheWeiSumEdissimilaritymeasurevaluesbetweenalldis- easepairs. uniform random distribution; on the other hand, for disease pairs that do have a signifi- cantgeneticoverlap,theytendtobeaccompaniedbystrongsimilaritiesinphenotypeas well. 88 0 500 1000 1500 2000 0 500 1000 1500 2000 β = 0 standardized WeiSumE dissimilarity rank (−) MimMiner similarity score rank Figure B.5: Disease pair rank by its negative MimMiner phenotype similarity v.s. rank byitsgeneticdissimilarity(WeiSumE,taking =0),sothatthemoresimilarapairisin itsphenotypeorgeneticmeasure,thehigheritisrankedbythecorrespondingmeausre. Since22diseasesfromthe89OMIMdiseaseswestudyarenotincludedinMimMiner, thisfiguredisplaysallpairsoftheremaining67diseases. Next we look further into disease pairs that MimMiner phenotype similarity is best explained with WeiSumE genetic similarity, i.e. the bottom left corner in Fig B.5. For anyrankr,weconsiderdiseasepairsrankedbyMimMinerhigherorequaltor astruly related pairs, and consider disease pairs ranked by a genetic overlap measure higher or equaltor aspairsclaimedtobesignificantlyrelated. Thenwemayusethetruepositive rate (TPR) and the positive predictive value (PPV) to gauge the results ranked by the variousgeneticoverlapmeasuresweproposed: TPR r = TP TP+FN and PPV r = TP TP+FP : (B.1) Similarly, we also consider the bottom left square in Fig B.5 with each rank up to r, and define the density of this square as the number of points in the square divided by the area of the square. Thus the density of the entire scatter plot in Fig B.5 is one over the total number of disease pairs. Under perfect situation where the genetic overlap 89 ranking of disease pairs completely agrees with MimMiner ranking, the true positive rate (TPR) and positive predictive value (PPV) should always be one for all r, while thedensityshouldbeoneoverr;whereasundertherandomsituationwherethegenetic overlap ranking is a random permutation of the MimMiner ranking, the expectations of the true positive rate (TPR) and the positive predictive value (PPV) should both be r over the total number of disease pairs, while the expectation of density should be one overthetotalnumberofdiseasepairs,whichequalsthedensityoftheentirescatterplot in Fig B.5. Therefore for eachr, the fold change of density under the perfect situation relative to the expected density under the random situation, is equal to the fold change ofTPRandPPVundertheperfectsituationrelativetotheexpectedTPRandPPVunder therandomsituation. Weplotthedensityfoldchangeoftheobserveddensityrelativeto the expected density againstr in Fig B.6. Each rank valuer represents a square on the bottom left corner of Fig B.5 with its side length equal tor, which includes the disease pairsrankedhigherthanthisvaluebybothsimilaritymeasures. FigB.7providesazoomed-inversionofFigB.6,displayingthedensityfoldchange curves of the top 100 disease pairs in the left bottom corner of Fig B.5. The solid line isthedensityfoldchangecurveofleftbottomsquaresinFigB.5relativetotherandom case, where one set of ranks is a random permutation of the other. The figure shows a drasticdifferencebetweentheactualdensitycurveandthepermutedone,andtheactual densityshootsupatanearlyrankvaluethenquicklydropsaroundthepermutationden- sitylevel. ThisconfirmsourpreviousobservationinFigB.5thatsomediseasepairsare concentrated in the bottom left corner, while the rest are roughly uniformly distributed, indicating that WeiSumE successfully captures the disease pairs whose genetic overlap contributessignificantlytotheirphenotypesimilarity. Having compared the genetic similarity measure WeiSumE with MimMiner phe- notype similarity, it is natural to ask whether the situation would remain similar with 90 0 500 1000 1500 2000 0 50 100 150 200 250 rank fold change of density Figure B.6: The complete density fold change curve. The dashed line marks the value 1. The solid line is the density fold change curve of Fig B.5, representing the fold changeofdensityrelativetothedensityoftheentireplot. Thedensityisdefinedbythe numberofdiseasepairswhosephenotypicandgeneticranksarebothlessorequaltothe correspondingxvalue,dividedbyx 2 ,i.e. thenumberofpointsintheleftbottomsquare inFigB.5,dividedbytheareaofthesquarewithsidelengthequaltothecorresponding x value. The density of the entire scatter plot in Fig B.5 is one over the total number of disease pairs, which is also the density of the case where one set of ranks is a random permutationoftheother. different genetic similarity measures and phenotype similarity measures. Therefore we consider two other phenotype measures from HPO [70] and Lage et al. [72], and com- pare them against all genetic overlap measures studied in the Methods section in the main text, across a range of parameter choices. Fig B.8 shows the average density fold change for varying phenotype similarity measures, genetic overlap measures and parameters. The higher the average, the more concentrated is the scatter plot of genetic rank v.s. phenotype rank, the more consistent the genetic measure is with the pheno- type measure. Fig B.8 shows that our genetic overlap measure is most consistent with MimMiner. 91 0 20 40 60 80 100 0 50 100 150 200 250 rank fold change of density Figure B.7: The density fold change curve for disease pairs ranked in top 100 by both rankings. The dashed line marks the value 1. The solid line is the density fold change curve of Fig B.5, representing the fold change of density relative to the density of the entire plot. The density is defined as the number of disease pairs whose ranks are both lessorequaltothecorrespondingxvalue,dividedbyx 2 ,i.e. thenumberofpointsinthe leftbottomsquareinFigB.5,dividedbytheareaofthesquarewithsidelengthequalto the correspondingx value. The density of the entire scatter plot in Fig B.5 is one over thetotalnumberofdiseasepairs,whichisalsotheexpecteddensityinthecasethatone setofranksisarandompermutationoftheother. ClusteringOMIMDiseases Next we cluster the 89 OMIM diseases. For clustering, dissimilarity measures among thediseasesareneededandthuswechangethesimilaritymeasuretodissimilaritymea- sure as follows. We apply a normalized version of each of the genetic similarity mea- sures,forexampleWeiSumE: Dissimilarity(Disease 1 ;Disease 2 )= max E −WeiSumE max E −min E ; (B.2) 92 0 20 40 60 80 100 0 1 2 3 4 r or M average density fold change K S MimMiner HPO Lage 0.00 0.02 0.04 0.06 0.08 0.10 0 1 2 3 4 β average density fold change WeiSumE WeiSumV WeiSumO MimMiner HPO Lage Figure B.8: The black lines are the average density fold change from Fig B.6 corre- sponding to differentr or values. The red lines are for HPO similarity and the green linesareforLagesimilarity. where max E =WeiSumE(apairofidenticallyrankedlists) and min E =WeiSumE(apairofreverselyrankedlists): Inthiswayweobtainadissimilaritymatrixofthediseases,basedonwhichweconduct hierarchicalclusteringusingaverageagglomerationmethod. 93 Figure B.9 shows an example of the clustering results of the OMIM diseases using WeiSumE. Here we take an arbitrary 0.2 threshold of the height, which separates the two bulks of values that the normalized WeiSumE falls into as shown in Fig B.4, and look at clusters below this threshold. Several interesting results are observed. Cluster I containspeelingskinsyndromeandhypotrichosis2. Thesetwodiseasesarebothcaused by mutations in the CDSN gene, as explained in the main text. Cluster II contains six diseases related to hearing loss. According to the NIH Genetics Home References, one of the characteristics of OSMED syndromes is severe hearing loss and the features of OSMED are similar to those of the Weissenbacher-Zweymller syndrome. Hearing loss is also one of the features for Stickler syndrome. Although hearing loss is not listed as a symptom for fibrochondrogenesis 2, the reported cases died at birth and it is possible that hearing loss was not diagnosed. On the other hand, fibrochondrogenesis 2 share manyfeaturessimilartoSticklersyndromejustifyingtheirgeneticoverlaps. ClusterIII contains two cancer/testis antigen family 47. Cluster IV contains two diseases, suscep- tibilitytohepatitisCvirusandsusceptibilitytosarcoidosis,bothrelatedtoliver. In addition to the small clusters, we also observe some large clusters enriched with certain type of diseases. For example, cluster V contains four mitochondrial related diseases. ClusterVIcontainsfourdiseasesallrelatedtoeyeimpairment. Thebigcluster VIIisenrichedwithvariouscancerswhileclusterVIIIisenrichedwithdiseasesrelated totheimmunesystem. Given the clustering result in Fig B.9, we take a close look into the relationship of highly similar diseases and relatively more distantly related diseases in terms of their ranked gene lists. We take Hypotrichosis 2 (HYPT2) and Peeling skin syndrome from Cluster I as an example of disease pairs with a profound genetic overlap (Disease pair 1); then we take Hypotrichosis 2 (HYPT2) from Cluster I and Fibrochondrogenesis 2 (FBCG2) from Cluster II as an example of disease pairs that do not share a significant 94 genetic basis (Disease pair 2). Fig B.10 gives an intuitive demonstration of how these methods work to discern the genetic relationship between diseases, by displaying the overlap vector of the two disease pairs against random permutations. Overlap vector is defined as the number of overlapping genesX i on top of the lists up to each positioni. The two disease pairs represent contrasting degrees of genetic overlap: the overlapped pair starts to digress from the permutation average early from the top part of the ranks, whereasthedistantpairremainsclosetothepermutationaverageandonlydigresslater along the ranked list. This difference in the overlap vectors, together with the drastic dissimilarity difference shown in Fig B.9, confirms that WeiSumE is an efficient form of representing and compressing the genetic overlap information in ranked lists, and it capturesthedifferenceintheearlystageofrankedlistswithhighsensitivity. B.1.2 GeneticOverlapofDiseasesintheNIHGWASCatalog Next we study the catalog of GWAS data from the National Human Genome Research Institute (http://www.genome.gov/). Instead of the full genome, only genes reported to be significantly associated with each disease are included, therefore we cannot use the class of weighted sum tests here. In addition, for each disease, we only consider genes that are found significant in at least two studies, and rank the genes according to each gene’s median p-values in different studies, obtaining 122 diseases in total with a ranked gene list for each. The length of lists differs from 1 to 87, which is equivalent to the situation where only a number (n d < n) of genes are ranked (for disease d), whiletherest(n−n d )aredeemedunimportant,orinotherwordstyingforrankn d +1. Therefore we are able to transform the problem into an equivalent one complying with the test assumption that gene lists consist of the same set of genes (in this case the full humangenome). ThenweuseK 1 toscanforthefirstoverlapbetweeneachpairoflists, arbitrarily setting the total length of the complete list n to 25;000, which is about the 95 IV II III I VI VII VIII 0.8 0.6 0.4 0.2 0.0 HYDA TIDIFORM MOLE, RECURRENT , 1; HYDM1 SPERMA TOGENIC FAILURE, Y−LINKED, 2; SPGFY2 DIABETES MELLITUS, TRANSIENT NEONA T AL, 1 RETINITIS PIGMENTOSA 11; RP11 TRICHOHEP A TOENTERIC SYNDROME 2; THES2 BARDET−BIEDL SYNDROME; BBS MITOCHONDRIAL COMPLEX I DEFICIENCY LEIGH SYNDROME; LS MITOCHONDRIAL COMPLEX IV DEFICIENCY LEBER OPTIC A TROPHY WILLIAMS−BEUREN SYNDROME; WBS LEUKOENCEPHALOP A THY WITH VANISHING WHITE MA TTER; VWM ADRENAL HYPERPLASIA, CONGENIT AL, DUE TO 21−HYDROXYLASE DEFICIENCY NEURAMINIDASE DEFICIENCY PONTOCEREBELLAR HYPOPLASIA, TYPE 2C; PCH2C ASTHMA, SUSCEPTIBILITY TO HYPOTRICHOSIS 2; HYPT2 PEELING SKIN SYNDROME ALCOHOL DEPENDENCE SUDDEN INFANT DEA TH SYNDROME HYPERTROPHIC NEUROP A THY OF DEJERINE−SOTT AS CA T ARACT , AUTOSOMAL DOMINANT USHER SYNDROME, TYPE I; USH1 RETINITIS PIGMENTOSA; RP BLUE CONE MONOCHROMACY; BCM COLORBLINDNESS, P ARTIAL, DEUT AN SERIES; CBD EHLERS−DANLOS−LIKE SYNDROME DUE TO TENASCIN−X DEFICIENCY FIBROCHONDROGENESIS 2; FBCG2 DEAFNESS, AUTOSOMAL RECESSIVE 53; DFNB53 DEAFNESS, AUTOSOMAL DOMINANT 13; DFNA13 WEISSENBACHER−ZWEYMULLER SYNDROME; WZS STICKLER SYNDROME, TYPE III; STL3 OTOSPONDYLOMEGAEPIPHYSEAL DYSPLASIA; OSMED EHLERS−DANLOS SYNDROME, TYPE III INTERVERTEBRAL DISC DISEASE; IDD LEPROSY , SUSCEPTIBILITY TO, 4; LPRS4 BLOOD GROUP , CHIDO/RODGERS SYSTEM CANCER/TESTIS ANTIGEN FAMIL Y 47, MEMBER A5; CT47A5 CANCER/TESTIS ANTIGEN FAMIL Y 47, MEMBER A10; CT47A10 SARCOIDOSIS, SUSCEPTIBILITY TO, 2; SS2 NARCOLEPSY 7; NRCLP7 P ARKINSON DISEASE, LA TE−ONSET; PD PHEOCHROMOCYTOMA DIABETES MELLITUS, NONINSULIN−DEPENDENT; NIDDM JUVENILE MYELOMONOCYTIC LEUKEMIA; JMML HEP A TOCELLULAR CARCINOMA GLIOMA SUSCEPTIBILITY 1; GLM1 LUNG CANCER GASTRIC CANCER COLORECT AL CANCER; CRC MESOTHELIOMA, MALIGNANT; MESOM OVARIAN CANCER CENTRAL HYPOVENTILA TION SYNDROME, CONGENIT AL; CCHS THYROID CARCINOMA, P APILLARY TESTICULAR GERM CELL TUMOR; TGCT LEUKEMIA, ACUTE MYELOID; AML RENAL CELL CARCINOMA, NONP APILLARY; RCC ESOPHAGEAL CANCER SQUAMOUS CELL CARCINOMA, HEAD AND NECK; HNSCC FANCONI ANEMIA, COMPLEMENT A TION GROUP A; FANCA ENDOMETRIAL CANCER BREAST CANCER PROST A TE CANCER EWING SARCOMA; ES MYCOBACTERIUM TUBERCULOSIS, SUSCEPTIBILITY TO OBESITY MA TURITY−ONSET DIABETES OF THE YOUNG; MODY NEURAL TUBE DEFECTS TETRALOGY OF FALLOT; TOF CONOTRUNCAL HEART MALFORMA TIONS; CTHM STROKE, ISCHEMIC COMPLEMENT COMPONENT 2 DEFICIENCY; C2D MACULAR DEGENERA TION, AGE−RELA TED, 1; ARMD1 HEMOL YTIC UREMIC SYNDROME, A TYPICAL, SUSCEPTIBILITY TO, 4; AHUS4 MALARIA, MILD, SUSCEPTIBILITY TO BLEEDING DISORDER, PLA TELET−TYPE, 11; BDPL T11 AUTOINFLAMMA TION, LIPODYSTROPHY , AND DERMA TOSIS SYNDROME; ALDD BARE L YMPHOCYTE SYNDROME, TYPE I SARCOIDOSIS, SUSCEPTIBILITY TO, 1; SS1 HEP A TITIS C VIRUS, SUSCEPTIBILITY TO PSORIASIS SUSCEPTIBILITY 1; PSORS1 SPONDYLOARTHROP A THY , SUSCEPTIBILITY TO, 1; SPDA1 SEVERE CUT ANEOUS ADVERSE REACTION, SUSCEPTIBILITY TO RHEUMA TOID ARTHRITIS; RA A TYPICAL MYCOBACTERIOSIS, FAMILIAL SYSTEMIC LUPUS ERYTHEMA TOSUS; SLE PSORIA TIC ARTHRITIS, SUSCEPTIBILITY TO HEP A TITIS B VIRUS, SUSCEPTIBILITY TO MALARIA, SUSCEPTIBILITY TO V Figure B.9: Hierarchical clustering result of 89 OMIM diseases. A dendrogram for the hierarchical clustering result of 89 diseases in the OMIM database based on WeiSumE geneticsimilarity. 96 0 5000 10000 15000 20000 0 5000 10000 15000 20000 HYPOTRICHOSIS 2; HYPT2 v.s. PEELING SKIN SYNDROME i O i permutation average Disease Pair 1 0 5000 10000 15000 20000 0 5000 10000 15000 20000 HYPOTRICHOSIS 2; HYPT2 v.s. FIBROCHONDROGENESIS 2; FBCG2 i O i permutation average Disease Pair 2 Figure B.10: An example of the overlap vectors against random permutations of two disease pairs. Disease pair 1 is a closely related pair: Hypotrichosis 2 (HYPT2) and Peeling skin syndrome; Disease pair 2 is a distant pair: Hypotrichosis 2 (HYPT2) and Fibrochondrogenesis 2 (FBCG2). The black lines are obtained by permuting the first ranked list for 1000 times as 1000 the second ranked list, then taking average of each elementoftheresulting1000overlapvectors. length of the full human genome. If there is no overlap between a pair of lists, we set thep-valuearbitrarilyto1. InthiswayweobtainadissimilaritymatrixconsistingofK 1 testp-values. FigB.11 showstheclusteringresultofthe122diseasesbasedonthisdissimilaritymatrix. 97 1.0 0.8 0.6 0.4 0.2 0.0 r=1 Warfarin maintenance dose Vitamin D levels Urate levels Uric acid levels Thyroid cancer Sphingolipid levels Sex hormone−binding globulin levels Response to metformin Response to hepatitis C treatment Renal cell carcinoma Platelet counts Phospholipid levels (plasma) Pancreatic cancer Paget’s disease Optic disc parameters Nonalcoholic fatty liver disease Neuroblastoma Narcolepsy Migraine Menopause (age at onset) Male−pattern baldness Lp (a) levels IgE levels Hypothyroidism Hepatitis B Iron status biomarkers Hemoglobin Iron levels Cutaneous nevi Melanoma Co"ee consumption Central corneal thickness Calcium levels Bladder cancer Bilirubin levels Autism Attention de#cit hyperactivity disorder Atrial #brillation Amyotrophic lateral sclerosis Alcohol dependence Adverse response to aromatase inhibitors Adiponectin levels Cognitive performance Chronic lymphocytic leukemia Asthma Systemic sclerosis Rheumatoid arthritis Follicular lymphoma Primary biliary cirrhosis Celiac disease Vitiligo Multiple sclerosis Parkinson’s disease T ype 1 diabetes Kawasaki disease Systemic lupus erythematosus HIV−1 control Psoriasis Ankylosing spondylitis Behcet’s disease In$ammatory bowel disease Acute lymphoblastic leukemia (childhood) Fibrinogen Crohn’s disease Vitamin B12 levels Fasting plasma glucose Glycated hemoglobin levels Alzheimer’s disease Alzheimer’s disease (late onset) Age−related macular degeneration Cholesterol, total Metabolic syndrome Cardiovascular disease risk factors HDL cholesterol LDL cholesterol Lipoprotein−associated phospholipase A2 activity and mass T riglycerides C−reactive protein Metabolic traits Activated partial thromboplastin time Metabolite levels Prostate cancer Thyrotoxic hypokalemic periodic paralysis Breast cancer Colorectal cancer Urinary bladder cancer Hodgkin’s lymphoma Bone mineral density Ulcerative colitis Bone mineral density (hip) Bone mineral density (spine) Chronic obstructive pulmonary disease Pulmonary function Soluble ICAM−1 Venous thromboembolism Coronary heart disease Intracranial aneurysm Glaucoma Glaucoma (primary open−angle) Smoking behavior Glioma Lung adenocarcinoma Lung cancer Body mass index Restless legs syndrome T ype 2 diabetes Obesity Obesity (extreme) PR interval Electrocardiographic traits QT interval Hypertension Diastolic blood pressure Systolic blood pressure Bipolar disorder Blood pressure Schizophrenia Orofacial clefts Polycystic ovary syndrome Mean platelet volume Height Menarche (age at onset) II I III V IV VI VII VIII IX X XI XII FigureB.11: Hierarchicalclusteringresulton122diseasesinthecatalogofGWASdata usingK 1 . 98 Cluster I contains two related traits, urate and uric acid levels. Cluster II contains three traits related to hemoglobin. Cluster III contains two traits related to the skin. The big cluster IV is enriched with diseases related to the immune system. Cluster V contains two Alzheimer diseases. Next comes a big cluster VI enriched with choles- terolandmetaboliterelatedtraits. ClusterVIIisenrichedwithseveralcancersincluding urinary bladder, colorectal, prostate and breast cancers. Cluster VIII is enriched with traits related to bone mineral density. This cluster also contains Ulcerative colitis and Hodgkins lymphoma that usually result in the decrease of bone density. The next big cluster IX is enriched with the lung and pulmonary traits including lung cancer, smok- ing, and pulmonary function. Cluster X mostly contains traits related to obesity and type 2 diabetes. Blood pressure related traits are mainly in cluster XI. Two psychiatric disorders, schizophreniaandbiopolargrouptogether inclusterXII.Theseobservations validatetheeffectivenessofusingGWASassociationresultstoclassifythetraits. B.1.3 SimulationStudies TestingtheNullHypothesisagainstAlternative–ScanStatistics Tostudy theTypeI errorrate of thetesting statisticsK r andS M underthenull hypoth- esis and the power under the alternative hypothesis, we conduct the following simula- tions. Supposewehaveatotalofngenes. 1. Randomly choose d genes to be associated with the first disease and the other genes are not associated. Each associated gene corresponds to X + ∼ N(1; 2 ), and each non-associated gene corresponds to X − ∼ N(0; 2 ). Rank the genes accordingtothevalueofX’s. 2. RepeatStep1fortheseconddisease,buttheassociatedgeneshasooverlapswith thedgenesassociatedwiththefirstdisease. 99 3. Compute the statisticsK r ;r = 1;2;··· ;100 andS M ;M = 1;2;··· ;100. Com- pute the p-values corresponding to K r and S M using Equation B.3 and Equa- tionB.4. Rejectthenullhypothesisifthecorrespondingp-valueislessthan. 4. Repeatsteps1-3for10,000times,andcomputethefractionoftimesthatthenull hypothesis is rejected. The Type I error rate and the power is computed as the fractionoftimesthenullhypothesisisrejectedforo=0ando> 0,respectively. Conduct the above simulations using each combination of the various values of parametersasfollows: 1. 2 =0:01;0:05;0:1;0:5. 2. M =1;2;··· ;100. 3. P-valuecutoff =0:01;0:05;0:1. 4. Totalnumberofgenesn=1000;6000;11000;25000. 5. Numberofassociatedgenesd =1;5;10;20. 6. Numberofoverlappingassociatedgeneso=0;1;5;10,whereo≤d. Fig B.12 shows an example of the effect of variance 2 on the Type I error rates of K r andS M under the null hypothesis and their power under the alternative hypothesis. The Type I error rates under the null hypothesis are reasonably controlled for both K r and S M , regardless of the choice of r and M, or the magnitude of variance. However, the power of both tests under the alternative hypothesis are significantly influenced by themagnitudeofvariance,wherepowerreducesdrasticallyas 2 increases,withbarely any power when 2 is as large as 0.5. Therefore, the performance of both K r and S M reliesonadecentaccuracyofthegivenrankingalgorithm. 100 0 20 40 60 80 100 0.035 0.040 0.045 0.050 0.055 Type I error of K r r Type I error of K r σ 2 =0.01 σ 2 =0.05 σ 2 =0.1 σ 2 =0.5 0 20 40 60 80 100 0.035 0.040 0.045 0.050 0.055 Type I error of S M M Type I error of S M σ 2 =0.01 σ 2 =0.05 σ 2 =0.1 σ 2 =0.5 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of K r r Power of K r σ 2 =0.01 σ 2 =0.05 σ 2 =0.1 σ 2 =0.5 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of S M M Power of S M σ 2 =0.01 σ 2 =0.05 σ 2 =0.1 σ 2 =0.5 Figure B.12: The effect of 2 on the Type I error rates of K r and S M under the null hypothesis and their power under the alternative hypothesis, taking = 0:05, n = 11000, d = 10, o = 0 under null and o = 5 under alternative hypothesis. The vertical yellowlinesindicatethetruenumberofoverlappingassociatedgenes. Similarly,FigureB.13,B.14andB.15showtheeffectsofthetotalnumberofgenes n,thenumberofassociatedgenesdandthenumberofoverlappinggenesoontheType I error rates ofK r andS M under the null hypothesis and their power under the alterna- tivehypothesis. TheTypeIerrorofbothstatisticsiswellcontrolledunderallscenarios. Thepowerofbothstatisticshasaslightdecreasingtrendasthetotalnumberofgenesn increases, or as the number of associated genesd increases, indicating different organ- isms with different gene numbers and different diseases with different associated gene numbers will only influence the power of tests at a moderate level, with fewer total or associated gene number giving slightly higher power. In contrast, the number of over- lapping genes o shows dramatic effect on the test power. Both tests display a much higher power when the associated genes of the two diseases are mostly the same than 101 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of K r r Type I error of K r n=1000 n=6000 n=11000 n=25000 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of S M M Type I error of S M n=1000 n=6000 n=11000 n=25000 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of K r r Power of K r n=1000 n=6000 n=11000 n=25000 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of S M M Power of S M n=1000 n=6000 n=11000 n=25000 FigureB.13: Effectofn: theTypeIerrorratesofK r andS M underthenullhypothesis and their power under the alternative hypothesis, taking 2 = 0:1, = 0:05, d = 10, o = 0 under null and o = 5 under alternative hypothesis. The vertical yellow lines indicatethetruenumberofoverlappingassociatedgenes. when only a few associated genes of one disease overlap with the other disease. This result indicates that both tests function as an evaluation of the relative genetic overlap- ping level, i.e. the size of the genetic overlap between the two diseases relative to the sizeoftheirrespectivegeneticbasis. In summary, the test works best when the ranking algorithm is reliable and the genetic overlap consists of a considerable portion of the genetic basis of the diseases. The organism complexity and the disease complexity only have marginal effects on the testpower. 102 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of K r r Type I error of K r d=5 d=10 d=20 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of S M M Type I error of S M d=5 d=10 d=20 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of K r r Power of K r d=5 d=10 d=20 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of S M M Power of S M d=5 d=10 d=20 FigureB.14: Effectofd: theTypeIerror ratesofK r andS M underthenull hypothesis andtheirpowerunderthealternativehypothesis,taking 2 =0:1, =0:05,n=11000, o = 0 under null and o = 5 under alternative hypothesis. The vertical yellow lines indicatethetruenumberofoverlappingassociatedgenes. TestingtheNullHypothesisagainstAlternative–WeightedSums To compare the three sets of weighted sum statistics, we consider the same parameter settings as in Section B.1.3. Under each parameter setting, we conduct 10,000 simula- tionsandcomputethecorrespondingp-valuesasdescribedinSection3.2.2. Fig B.16 shows an example of the power of the three sets of weighted sums and the minimum p-value statistic p m obtained from each of them, taking 2 = 0:05;d = 10;n=25;000;o =0undernullando=1underthealternative. TheTypeIerrorrates andthepowercurveacrossthecompleterangeofthe valuesweconsiderareshownin FigB.17. Thisisasituationwheretheextentofoverlapisrelativelylowcomparedwith theextentofsignal(numberofassociatedgenes),andtherankingalgorithmisdecently reliable. ThethreeweightedsumsallhavewellcontrolledTypeIerrorratesaround0.05 103 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of K r r Type I error of K r o=0 0 20 40 60 80 100 0.00 0.01 0.02 0.03 0.04 0.05 Type I error of S M M Type I error of S M o=0 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of K r r Power of K r o=1 o=5 o=10 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Power of S M M Power of S M o=1 o=5 o=10 Figure B.15: Effect ofo: the Type I error rates ofK r andS M under the null hypothesis andtheirpowerunderthealternativehypothesis,taking 2 =0:1, =0:05,n=11000, d = 10, o = 0 under null and o = 1;5;10 under alternative hypothesis. The vertical yellowlinesindicatethetruenumberofoverlappingassociatedgenes. 0.00 0.01 0.02 0.03 0.0 0.2 0.4 0.6 0.8 1.0 σ 2 =0.05 n=25000 d=10 o=1 β power WeiSumE WeiSumV WeiSumO Figure B.16: The power of the three sets of weighted sums, taking 2 = 0:05;d = 10;n=25;000;o =1. 104 0.00 0.02 0.04 0.06 0.08 0.10 0.00 0.01 0.02 0.03 0.04 0.05 0.06 σ 2 =0.05 n=25000 d=10 o=1 β Type I error WeiSumE WeiSumV WeiSumO 0.00 0.02 0.04 0.06 0.08 0.10 0.0 0.2 0.4 0.6 0.8 1.0 σ 2 =0.05 n=25000 d=10 o=1 β power WeiSumE WeiSumV WeiSumO FigureB.17: TheTypeIerrorratesandpowerofthethreesetsofweightedsums,taking 2 =0:05;d =10;n=25;000;o=0undernullando=1underalternative. throughouttherangeof valuesthatwedefineinSection3.2.2. However,thepowerof WeiSumOisdrasticallyinfluencedbythechoiceof;whereasbyaddinganormalizing term, WeiSumE and WeiSumV turn out to be less affected by the choice of . The normalizationeffectisespeciallyobviousforWeiSumE,whereitmaintainsagenerally constant level of power starting at = 0. Similarly for the minimum p-value statistic p m ,WeiSumEdemonstratesanadvantageovertheothertwo,asisshowninFigB.18. 105 WeiSumE WeiSumV WeiSumO σ 2 =0.05 n=25000 d=10 o=1 0.0 0.2 0.4 0.6 0.8 1.0 T1error power FigureB.18: TheTypeIerrorratesandpoweroftheminimump-valuestatisticp m from each of the three sets of weighted sums, taking 2 = 0:05;d = 10;n = 25;000;o = 0 undernullando=1underalternative. In summary, We have proposed two classes of statistics to test for genetic overlap between diseases. The first class includes K r and S M , where results show that K 1 generallyhasthehighestpoweramongallK r ’s,andS 1 hasthehighestpoweramongall S M ’s. The second class includes WeiSumE and WeiSumV. We also include WeiSumO which is previously proposed by Yang et al. [45] for comparison. WeiSumE taking = 0 shows the highest power or remains close to the highest power in most cases, without the need for choosing. Therefore, we compare these winners from each type of statistics to see which one excels in most situations. Table 3.3.1 shows for each length of gene lists, the number of times a method shows the highest power (including ties,takingp-valuethresholdof0.05). 106 O ^ 1 (M=5) O ^ 1 (M=10) O ^ 1 (M=20) O ^ 2 O ^ 3 0 5 10 15 20 σ 2 =0.1 n=25000 d=10 o=5 Figure B.19: The number of overlapping genes determined by ^ o 1 ;^ o 2 and ^ o 3 , taking 2 =0:1, =0:05,n=25000,d=10ando=5. Thehorizontalyellowlineindicates thetruenumberofoverlappingassociatedgenes. DeterminingtheNumberofOverlappingGenes For the same set of simulations as in Section B.1.3, we can also estimate the number ofoverlappingassociated genes. Fig B.19showsan exampleof the number ofoverlap- ping genes determined by ^ o 1 ;^ o 2 and ^ o 3 . In this figure, ^ o 3 tends to be unbiased and the estimatedvaluescenteraroundthetruenumberofoverlappinggenes. Toobtainabigpictureoftheperformanceofdifferentestimators,wecomparethem in terms of their mean squared error (MSE) in Table B.1. It shows that when 2 is low,thatiswhentherankingalgorithmishighlyreliable, ^ o 2 performsbestinestimating the number of overlapping genes. However the estimation accuracy of ^ o 2 deteriorates rapidlyas 2 increases,indicatingthattheaccuracyof ^ o 2 dependsstronglyontheaccu- racy of the ranking algorithm. In contrast, ^ o 3 provides a fairly accurate estimation and stableperformanceacrossdifferentparameters,andisnotseverelyaffectedbytheaccu- racyoftherankingalgorithm( 2 ). Therefore,whenlittleknowledgeabouttheaccuracy 107 Table B.1: The mean squared error (MSE) of different estimators for the number of overlappinggenes. 2 0:01 0:05 0:5 n(×1000) 1 6 11 25 1 6 11 25 1 6 11 25 ^ o 1 (M =5) 5.20 4.77 4.68 4.61 5.48 5.43 5.50 5.71 14.54 14.75 14.71 14.68 ^ o 1 (M =10) 2.73 1.06 0.66 0.40 3.50 2.73 2.74 3.26 21.23 21.46 21.35 21.42 ^ o 1 (M =20) 10.43 3.66 2.37 1.29 12.18 7.33 7.18 7.30 66.45 66.66 69.40 68.72 ^ o 2 0.95 1.48 1.51 1.68 0.97 1.44 1.44 1.57 18.84 20.29 20.80 21.01 ^ o 3 2.60 3.61 3.78 4.08 2.57 3.46 3.58 3.77 15.16 16.86 17.24 17.59 oftherankingalgorithmisavailable,wewouldrecommendusing ^ o 3 astheestimatorof thenumberofoverlappinggenes. B.2 SupplementaryMethods B.2.1 DistributionofK Consider the following equivalent problem. n envelopes are labeled 1;2;··· ;n, and n letters are also labeled 1;2;··· ;n. Randomly put the n letters one-by-one in the n envelopes one-by-one starting from the envelope with the smallest number. Stop the experimentwhentheminimumletterlabelvalueequalsoneoftheenvelopelabelvalues. LetK bethenumberoftrials. WhatisthedistributionofK? SinceK =1ifandonlyiftheletter1isinsertedintotheenvelope1withprobability 1=n, P{K =1}=1=n: Furthermore, K > k if and only if all the first k letters are put into envelopes k + 1;k+2;··· ;n. Theprobabilityis P{K >k}= (n−k)(n−k−1)···(n−k−(k−1)) n(n−1)···(n−(k−1)) = k−1 ∏ i=0 ( 1− k n−i ) : 108 □ B.2.2 DistributionofK r NotethatK r >k whenatmostr−1oftheletterslabeled{1;2;··· ;k}areputintothe envelopeslabeled{1;2;··· ;k}. Theprobabilitythatexactlycletterslabeled{1;2;··· ;k}areinsertedintoenvelopes with labels{1;2;··· ;k} can be calculated as follows. First choosec letters from those labeled{1;2;··· ;k} with ( k c ) choices. LetE c be the event that thec chosen letters are putinanyoftheenvelopeslabeled{1;2;··· ;k},andtheotherk−cofthefirstk letters areinsertedintoenvelopeslabeled{k+1;k+2;··· ;n}. TheprobabilityE c is P(E c )= P k c P n−k k−c P n k = k(k−1)···(k−c+1)(n−k)(n−k−1)···(n−2k+c+1) n(n−1)···(n−k+1) ; whereP k 0 =1andP k c =k(k−1)···(k−c+1); c> 0. Thus, the probability that exactly c letters from {1;2;··· ;k} are inserted into envelopeswithlabels{1;2;··· ;k}is ( k c ) P(E c ). Thereforewehave p−value = 1−P{K r >k} = 1− r−1 ∑ c=0 ( k c ) P(E c ) = 1− r−1 ∑ c=0 ( k c ) k(k−1)···(k−c+1)(n−k)(n−k−1)···(n−2k+c+1) n(n−1)···(n−k+1) : (B.3) □ 109 B.2.3 DeterminingP-ValueofS M Supposethegenelistforthefirstdiseaseisrankedas1;2;3;··· ;n. 1. Permutetherankingofthegenesfortheseconddisease; 2. For the c-th permutation, compute the observed value of K r ; 1 ≤ r ≤ M, denotedask (c) r ,andthecorrespondingprobabilityp (c) r =P(K r ≤k (c) r ). 3. ComputethestatisticsS (c) M =min 1≤r≤M p (c) r . 4. Repeatsteps2and3forC =100;000timestoobtaintheapproximatedistribution ofS M . We use the 100,000 values ofS M to approximate its distribution for a given pair of nandM. Basedonthisdistribution,wecanestimatethep-valueofanobservedS M by p−value= #{c:S (c) M ≤S M } 100;000 : (B.4) B.2.4 DeterminingP-Valueofp m To determine p ’s and the p-value of the single statistic p m for each weighted sum (denotedasWeiSumX)outofWeiSumO,WeiSumEandWeiSumVforagivenlistpair, wedothefollowing: 1. Compute WeiSumX for the pair of lists corresponding to the series (12) of values. 2. Fixthefirstlistas1;2;··· ;n,thenrandomlypermutethelist10,000timesasthe second list, givingC = 10;000 list pairs. ComputeWeiSumX() (c) for each list pairtoobtainthenulldistributionofWeiSumX(). 110 3. Compare WeiSumX() computed in Step 1 with the null distribution obtained in Step2togenerateaseriesofp-valuesfordifferent’s: p = #{c:WeiSumX() (c) ≤WeiSumX()} 10;000 : 4. Taketheminimump-valuep m acrossall’s: p m =min p : 5. Toobtainthenulldistributionofp m ,stillusetheC =10;000listpairsgenerated in Step 2. For each list pair, compute WeiSumX() and compare it with theC = 10;000listpairstoobtainap-valueandtheminimump-valuep m acrossdifferent ’s,givingthenulldistributionofp m : p (c) = #{t:WeiSumX() (t) ≤WeiSumX() (c) } 10;000 ; p (c) m =min p (c) : 6. Computethep-valuesofp m ’sfromStep4bycomparingp m ’swiththenulldistri- butionobtainedinStep5: p−value= #{c:p (c) m ≤p m } 10;000 : (B.5) 111
Abstract (if available)
Abstract
This thesis aims to explore the genetic basis of complex traits. It addresses this problem in two parts. ❧ The first part stems from the general difficulties encountered in previous efforts to explain complex traits with discovered associated genetic variants. Accordingly, increasing attention has been paid to rare variants that have been overlooked in traditional methods. To accommodate the special properties of rare variant studies, pooled sequencing is often used as a cost-effective design. Based on pooled sequencing data, we develop a unified approach for rare variants detection, allele frequency estimation and association studies. The method is thoroughly evaluated on both simulation data and a real re-sequencing dataset, and has shown superior performance. ❧ The second part further studies the problem of finding common genetic basis between traits. We propose a set of test statistics to detect genetic overlap between diseases, and a set of estimators to measure the size of genetic overlap. Performance of these methods under various scenarios is evaluated in simulation studies. Based on this, we choose the most appropriate method to study the relationship between diseases from the OMIM database and diseases in the catalog of GWAS. Both applications demonstrate the effectiveness of our methods in providing insights into the relationships among diseases based on their shared genetic mechanisms, and in identifying common genes underlying the genetic overlap between diseases.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Exploring the genetic basis of quantitative traits
PDF
Adaptive set-based tests for pathway analysis
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Shortcomings of the genetic risk score in the analysis of disease-related quantitative traits
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Between genes and phenotypes: an integrative network-based Monte Carlo method for the prediction of human-gene phenotype associations
PDF
Neuroimaging in complex polygenic disorders
PDF
Understanding the genetic architecture of complex traits
PDF
Applications and improvements of background adjusted alignment-free dissimilarity measures
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Discovery of complex pathways from observational data
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Exploring the application and usage of whole genome chromosome conformation capture
Asset Metadata
Creator
Chen, Quan
(author)
Core Title
Exploring the genetic basis of complex traits
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/06/2014
Defense Date
09/26/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
allele frequency,disease relationship,genetic overlap,OAI-PMH Harvest,pooled sequencing
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu Z. (
committee chair
), Goldstein, Larry (
committee member
), Waterman, Michael S. (
committee member
), Zhou, Xianghong Jasmine (
committee member
)
Creator Email
iamchenquan@gmail.com,quanchen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-487384
Unique identifier
UC11286821
Identifier
etd-ChenQuan-3001.pdf (filename),usctheses-c3-487384 (legacy record id)
Legacy Identifier
etd-ChenQuan-3001.pdf
Dmrecord
487384
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Chen, Quan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
allele frequency
disease relationship
genetic overlap
pooled sequencing