Novel computational methods of disease gene and variant discovery, parallelization and applications

Author: Hui Yang
Conferring program: USC Neuroscience Graduate Program
Degree conferred: PhD in Neuroscience
Degree conferral date: August 2017

Acknowledgements

I wish to give my greatest and deepest acknowledgement to my adviser, Professor Kai Wang, for his continuous support, guidance and mentoring. Great thanks to my lab colleagues, especially Chengliang Dong, Yunfei Guo, Leandro Lima, Shuangchao Ma and Mingyang Cai, for their help, inspiring comments and fun. My great gratitude also goes to Dr. Daniel Campbell, Dr. Wange Lu and Dr. Paul Thomas for their effort and time on my thesis committee.

Table of Contents

Introduction
Chapter 1. Phenotype-based disease gene prioritization tool
    Abstract
    Introduction
        The daunting challenge of information extraction from an NGS experiment
        Phenolyzer: a tool integrating multiple sources of prior knowledge
    Materials and Methods
        Compilation of gene-disease databases
        Disease and phenotype term interpretation
        Seed gene set generation
        Seed gene set growth and model training
        Gene Disease Score System
        Gene Prediction Score System
        Tool comparison
    Results
        Performance evaluation on known Mendelian disease genes
        Performance evaluation on known complex disease genes
        Performance evaluation on novel disease genes
        Phenolyzer performance with phenotype data only
        The impact of input tokenization on Phenolyzer
        Case study on exome sequencing and copy number variation data
        Combination of Phenolyzer and a variant discovery tool
        Case study for Prader–Willi syndrome
    Discussion
Chapter 2: Application of Phenolyzer to Neuroscience
    Abstract
    Introduction
        The need for better functional characterization of mental disorder genetics
        The extension of Phenolyzer for integration of new databases
        PsyPPRes: one implementation within the NeuroComplex framework
        Differential PSD protein cluster patterns during mouse development
    Methods and materials
        Biochemistry and mass spectrometry
        PsyPPRes software development
        Enrichment analysis and hierarchical clustering
    Results
        The association between the PSD core proteins and schizophrenia
        PsyPPRes and its utility
        Differential protein cluster patterns during development
    Discussion
Chapter 3: HadoopCNV: a Hadoop-based parallel tool to detect CNVs and LOH
    Abstract
    Introduction
    Materials and Methods
        Simulated data set
        Testing data set (NA12878)
        Testing data set (10-member pedigree)
        Implementation of HadoopCNV
        CNV calling on WGS data by other software
        Performance evaluation
        Availability
    Results
        Overview of the approach
        Speed comparison with other CNV calling methods
        Evaluation on simulated datasets
        Detection of loss-of-heterozygosity (LOH)
        Evaluation on the NA12878 sample
        Comparison of CNV calling accuracy on a 10-member pedigree
    Discussion
Conclusion
References

Introduction

Before I start my thesis, I sincerely want to address the importance and contribution of computational biologists, because at times it is difficult for traditional biologists to appreciate their efforts. As next-generation sequencing has become more and more popular, it brings the promise and hope that a series of very difficult but vital questions about basic biology could be answered, and that multiple severe diseases could be cured. In 2015, the Obama administration initiated the Precision Medicine Initiative, promoting the research and development of medicine and treatment tailored to an individual's genomic information, disease history and physical condition. However, the strong reliance on genomic sequencing brings new challenges to most traditional biologists and clinicians, especially considering the unprecedented scale of the data, the required programming and data-processing skills, and the technical challenges of each individual analysis. Thus, bioinformatics and computational biology have never been as important as they are in the genomic era.

At the same time, computational biologists are not generally appreciated by traditional biologists, especially when their work does not include production of original data but only a meta-analysis or data mining of public data. For example, one author wrote a paper in PLOS Biology after computational biologists who make sense of published data were described as "research parasites" by the editor-in-chief of the New England Journal of Medicine. In this paper, titled "All biology is computational biology", he argued that "the next modern synthesis in biology will be driven by mathematical, statistical, and computational methods being absorbed into mainstream biological training, turning biology into a quantitative science" [1]. In my view, claiming that "all biology is computational biology" might be an overstatement, but there is no doubt that computational biology is a necessary and vital component of today's biology and deserves more and more attention, spanning not only bioinformatics and genomic data analysis in particular, but also computational neuroscience, synthetic biology, biological system simulation and so on.
My PhD accomplishments can be categorized as contributions to genomic data analysis, mainly gene and variant discovery and prioritization, for the benefit of research scientists and clinicians facing an overwhelming amount of information. My thesis is divided into three chapters. The first focuses on Phenolyzer, a tool developed mainly by myself that helps clinicians and scientists harness the power of phenotype information for gene and variant discovery. The second chapter is a collaborative work on unveiling the interactomes of PSD (postsynaptic density) proteins across different developmental stages in mouse, together with a web server that lets users access the protein interaction data interactively. The third chapter describes a copy number variation (CNV) detection tool built on the Hadoop framework, to address the scale-up issues that pervade modern population genomic analysis.

Chapter 1. Phenotype-based disease gene prioritization tool

Abstract

In the era of next-generation sequencing (NGS), hundreds or even thousands of individual genomes can be sequenced in a single study. In each individual, tens of thousands of single-nucleotide variants (SNVs), indels and structural variants (SVs) can be identified, but only a handful are relevant to the disease or phenotype of interest. Prioritizing candidate variants or genes from sequencing data therefore poses substantial challenges. Prior biological knowledge and phenotype information may help pinpoint genes that contribute to disease, and several prioritization tools utilize known genotype-phenotype relationships. Although these tools demonstrated the clear utility of phenotype data for improving disease gene and variant discovery, none of them can interpret user input as free text, and none is available for large-scale analysis because their command-line tools are inaccessible. Here we developed Phenolyzer, a tool that uses prior information to implicate genes involved in diseases. Phenolyzer exhibits superior performance over competing methods for prioritizing Mendelian and complex disease genes, based on disease or phenotype terms entered as free text. Phenolyzer is available at http://phenolyzer.wglab.org. This chapter covers my work on Phenolyzer and wANNOVAR, published in two different journals [2, 3].

Introduction

The daunting challenge of information extraction from an NGS experiment

In a typical study using human whole genome or exome sequencing data, tens of thousands of single nucleotide variants (SNVs), indels and structural variants (SVs) can be identified, but only a handful are relevant to a disease or phenotype of interest. Prioritizing candidate variants or genes from sequencing data therefore poses substantial challenges[4]. Several computational tools, including ANNOVAR[5], snpEff[6], VEP[7], Jannovar[8] and VAT[9], address this problem mainly by employing a number of variant filtering steps, such as keeping non-synonymous and splice variants, or keeping variants with high conservation scores and low alternative allele frequencies[10]. However, these tools solely utilize information from the variants and sequences themselves, based either on evolutionary conservation or on protein structure alterations, while neglecting the most important information: phenotype.
Prior biological knowledge and phenotype information may also help pinpoint genes that contribute to disease, and several prioritization tools utilize known genotype-phenotype information. However, the input for most of these tools is limited to training gene lists (for example, ENDEAVOUR[11]), specific disease identifiers (for example, MedSim[12] and Phevor[13]), or ontology identifiers (for example, Phen-Gen[14]). Such requirements may restrict usage and accessibility for average biologists. Several other tools can take keywords as input, such as GeneCards[15], PosMed[16] and SNPs3d[17]. Yet others use phenotype information to improve the prediction or prioritization of disease genes. For example, Phenomizer assesses an input query consisting of standard Human Phenotype Ontology (HPO) terms and generates a diagnosis with P values for each of the ~7,000 rare diseases in the HPO database[18]. PHIVE[19] improves the identification of disease genes by incorporating phenotype information from model organism studies in cross-species comparisons. Together, these approaches demonstrate the clear utility of phenotype data in disease gene finding. However, no tool is able to utilize existing disease nomenclature systems to interpret user input as free text, which sometimes includes multiple disease or phenotype terms.

Phenolyzer: a tool integrating multiple sources of prior knowledge

Here we introduce a computational tool called Phenolyzer to prioritize human disease genes based on disease or phenotype information provided by users as free text. Phenolyzer includes multiple components: 1) a tool to map user-supplied phenotypes to related diseases, 2) a resource that integrates existing knowledge on known disease genes, 3) an algorithm to predict novel disease genes, 4) a machine learning model that integrates multiple features to score and prioritize all candidate genes, and 5) a network visualization tool to examine gene-gene and gene-disease relationships.

Phenolyzer works by an intuitive approach: it interprets user-supplied disease or phenotype terms as related disease names, which are then used to query pre-compiled databases to find and score relevant seed genes. The seed genes are then expanded to include related genes, based on several types of gene-gene relationships, such as exhibiting a protein-protein interaction, sharing a biological pathway or gene family, or transcriptionally regulating or being regulated by another gene. Finally, these different types of scores, compiled from seed gene ranking, protein-protein interaction, shared biological pathways, shared gene families and transcription regulation, are integrated to generate a ranked candidate gene list, together with a detailed explanation to track the sources of information used to compile the scores (Figure 1). Phenolyzer is available as a web server at http://phenolyzer.wglab.org, and a command-line version can also be downloaded for batch processing. The user-friendly web server interface takes user input and generates results within minutes. The results page includes several tabs, including a summary tab with a word cloud and links to output files, a gene-disease-term interaction network, a bar plot of the 500 highest-ranked genes with normalized scores, and detailed record-tracking information with raw scores and links to external databases.
Figure 1. Workflow of Phenolyzer. (1) Disease match: each disease or phenotype query term is separately translated into sets of disease names by word match, offspring search, synonym retrieval and phenotype interpretation in disease name databases. (2) Gene query: each retrieved disease name is queried in the gene-disease databases by exact match, to obtain a list of genes. (3) Gene score system: a score based on the type and confidence of the gene-disease relationship is generated for each gene corresponding to each disease name. Then, for each input term, a weighted sum score is calculated for each reported gene by adding all the scores retrieved in the previous step. The seed gene set is generated by collating all the genes of all input terms, and each gene score is normalized. (4) Seed gene growth: candidate disease genes are expanded beyond the seed gene set based on four types of gene-gene relationships; scores are calculated for all genes that connect with seed genes. (5) Gene ranking: all the information is integrated to generate a score for each gene, with the weights trained from a logistic regression model. The scores are renormalized to produce the final prioritized gene list. HPRD, Human Protein Reference Database; HTRI, Human Transcriptional Regulation Interaction Database.

Materials and Methods

Compilation of gene-disease databases

Phenolyzer incorporates a list of gene-disease databases pre-compiled from several data sources, including OMIM[20], Orphanet[21], ClinVar[22], GeneReviews[23] and the GWAS Catalog[24]. This list may expand in the future. It also integrates four different gene-gene relationship databases: HPRD[25] contains human protein interaction data; NCBI BioSystems[26] contains records from KEGG, BioCyc, Reactome, the Pathway Interaction Database, WikiPathways and Gene Ontology; HGNC Gene Family[27] contains gene family information; and HTRI[28] contains human transcription factor interaction information. Gene symbols are standardized to Entrez Gene identifiers. If a gene symbol in any gene-disease or gene relationship database has no corresponding symbol or synonym in the Entrez Gene database, it is removed from the results.

Disease and phenotype term interpretation

Currently, Phenolyzer uses the CTD MEDIC vocabulary[29], Disease Ontology[30], precompiled OMIM disease synonyms[20], the Human Phenotype Ontology database[31] and OMIM descriptors to expand and interpret a given term into a full set of specific disease names.

Figure 2. Illustration of the disease or phenotype term interpretation process. The term is first processed through a word match against several different data sources: DO (Disease Ontology), the CTD MEDIC disease vocabulary, HPO (Human Phenotype Ontology), OMIM synonyms, OMIM descriptors, and Phenolyzer's compiled disease vocabulary. After the first match, disease names are directly returned for Phenolyzer's compiled disease names and OMIM synonyms. For DO and CTD, an ontology search retrieves all the descendant disease names and synonyms. OMIM descriptors are mapped into OMIM diseases with a conditional probability as the reliability. For HPO, an ontology search first finds all the descendant phenotypes, then the phenotypes are mapped into diseases with reliabilities.
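To make the offspring search concrete, here is a minimal, runnable Python sketch (not Phenolyzer's actual implementation) that walks a toy is-a hierarchy and collects all descendant disease names and synonyms; the CHILDREN and SYNONYMS tables are illustrative assumptions standing in for the compiled DO/CTD vocabularies:

```python
# Minimal sketch of the "offspring search" step over a disease ontology.
# The toy is-a hierarchy and synonym table below are illustrative
# assumptions; Phenolyzer actually queries compiled DO/CTD vocabularies.
from collections import deque

CHILDREN = {
    "muscular dystrophy": ["duchenne muscular dystrophy",
                           "limb-girdle muscular dystrophy"],
    "duchenne muscular dystrophy": [],
    "limb-girdle muscular dystrophy": [],
}
SYNONYMS = {
    "duchenne muscular dystrophy": ["DMD", "muscular dystrophy, Duchenne type"],
}

def offspring_search(term):
    """Return the matched name plus all descendant names and synonyms."""
    names, queue = set(), deque([term])
    while queue:
        node = queue.popleft()
        if node in names:                 # guard against revisiting a node
            continue
        names.add(node)
        names.update(SYNONYMS.get(node, ()))
        queue.extend(CHILDREN.get(node, ()))
    return names

print(sorted(offspring_search("muscular dystrophy")))
```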
A disease or phenotype term set, $\mathrm{Term} = \{\mathrm{Term}_k\}$, with one or multiple terms, is used as input to describe the diseases or phenotypes of a given patient. Each $\mathrm{Term}_k$ is interpreted into a set of specific disease and phenotype names, $\mathrm{Disease} = \{\mathrm{Disease}_i\}$. For example, {'Muscular dystrophy', 'Duchenne'} is an input term set. The separation into two parts increases the chance of matching a larger list (and hence more candidate genes): the fine-grained term 'Duchenne muscular dystrophy' may not be the most accurate diagnosis for the patient (there are dozens of types of muscular dystrophy), and the separation reduces the chance of false-negative results caused by an excessively fine-grained query term that is not in the Phenolyzer data sources. At the same time, the final score of the true candidate gene is not affected substantially. Basic regular expression techniques are used to process each term into a normalized format in which every run of consecutive non-word characters is collapsed into a single white space.

Word matching is conducted between each term and the names or synonyms in the several disease databases mentioned above, as well as the disease names from all the pre-compiled gene-disease databases. Each term is treated as a single phrase without tokenization, to examine whether it is contained within the disease or phenotype records in the databases. Even though no tokenization is conducted here, the large number of available synonyms in disease vocabularies results in a high chance of a final match in the disease-gene databases (Figure 2).

For matched disease names, all their synonyms and descendant disease names (and their synonyms) are returned with a reliability score of 1.00. For matched phenotype names, disease reliabilities are calculated differently for HPO and OMIM descriptors. For HPO, the reliability depends on the phenotype-disease mapping descriptions given in the HPO annotation file retrieved from its website: 'rare' maps to 0.05, 'occasional' to 0.075, 'frequent' to 0.33, 'typical' to 0.5, 'variable' to 0.5, 'common' to 0.75, 'hallmark' to 0.90 and 'obligate' to 1.00. If no such description is available, the mapping is treated as 'frequent'. For OMIM descriptors, the conditional probability P(Disease|Descriptor) is used as the reliability; this is the probability of the disease given the matched descriptors, based on the OMIM disease description file. Finally, if a disease name has several reliability scores from several sources, only the highest reliability score is kept, avoiding duplicated calculation for the same disease.

Seed gene set generation

Each disease name $\mathrm{Disease}_i$ in the extended full disease name set is queried in the pre-compiled gene-disease databases, using a case-insensitive match. Each time a gene is found to be directly associated with a disease, a score is calculated. We calculate a weighted score for each gene given an individual term:

$$S(\mathrm{Gene}, \mathrm{Term}_k) = \frac{\sum_{\mathrm{Disease}_i \in \mathrm{Disease}} \mathrm{Score}(\mathrm{Gene}, \mathrm{Disease}_i) \times \mathrm{Reliability}(\mathrm{Disease}_i)}{\mathrm{Count}(\mathrm{Disease})}$$

Here $\mathrm{Score}(\mathrm{Gene}, \mathrm{Disease}_i)$ is the sum of the corresponding scores over all five sources in the pre-compiled gene-disease database, with individual scores calculated by the Gene Disease Score System (see below). $\mathrm{Count}(\mathrm{Disease})$ is the overall number of records of the diseases corresponding to the term that occur in the gene-disease databases. $\mathrm{Reliability}(\mathrm{Disease}_i)$ is the reliability of the retrieved full disease name. One advantage of this formula is that it supports queries with multiple terms. For example, if 'Alzheimer' and 'brain' are entered as different terms, the genes corresponding to the more specific word 'Alzheimer' will have higher scores, because it has a smaller number of corresponding records. Thus, when each term's scores are added, the most specific term dominates; additionally, genes shared between terms always receive higher scores.
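As an illustration of this weighted scoring, the following is a minimal Python sketch under assumed toy data; the term normalization mirrors the regular-expression step above, the matched diseases, scores and reliabilities are invented placeholders, and Count(Disease) is read here as the total number of matched gene-disease records for the term:

```python
# Sketch of the per-term seed gene score
#   S(Gene, Term_k) = sum_i Score(Gene, Disease_i) * Reliability(Disease_i)
#                       / Count(Disease)
# All data below are invented placeholders, not real database records.
import re

def normalize_term(term):
    """Collapse every run of non-word characters into a single space."""
    return re.sub(r"\W+", " ", term).strip().lower()

# disease name -> {gene: (Score(Gene, Disease_i), Reliability(Disease_i))}
matches = {
    "duchenne muscular dystrophy": {"DMD": (1.00, 1.00)},
    "becker muscular dystrophy":   {"DMD": (0.75, 0.33)},  # 'frequent' HPO mapping
}

def seed_score(gene, matches):
    count = sum(len(genes) for genes in matches.values())   # Count(Disease)
    total = sum(score * rel
                for genes in matches.values()
                for g, (score, rel) in genes.items() if g == gene)
    return total / count

print(normalize_term("Muscular dystrophy, Duchenne-type"))  # muscular dystrophy duchenne type
print(round(seed_score("DMD", matches), 3))                 # 0.624
```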
The final weighted sum score of a seed gene, considering all reported gene-disease relationships, is calculated as

$$S_{\mathrm{Reported}}(\mathrm{Gene}, \mathrm{Term}) = \sum_{\mathrm{Term}_k \in \mathrm{Term}} S(\mathrm{Gene}, \mathrm{Term}_k)$$

It is then normalized to be between 0 and 1 by dividing by the maximal score:

$$\tilde{S}(\mathrm{Gene}, \mathrm{Term}) = \frac{S_{\mathrm{Reported}}(\mathrm{Gene}, \mathrm{Term})}{\max\{S_{\mathrm{Reported}}(\mathrm{Gene}, \mathrm{Term})\}}$$

Seed gene set growth and model training

The seed gene set is grown based on four different types of gene relationship databases: HPRD, NCBI BioSystems, HGNC Gene Family and HTRI. For each seed gene $\mathrm{Gene}_i$, each gene $\mathrm{Gene}_j$ receives a score for each type of relationship with the seed gene:

$$S_{\mathrm{relation}}(\mathrm{Gene}_j) = \sum_{\mathrm{Gene}_i} \mathrm{Score}_{\mathrm{relation}}(\mathrm{Gene}_i, \mathrm{Gene}_j) \times \tilde{S}(\mathrm{Gene}_i, \mathrm{Term})$$

where $\tilde{S}(\mathrm{Gene}_i, \mathrm{Term})$ is the normalized seed gene score generated in the previous step. The logic here is that if a seed gene has a higher score, its related genes will also have higher scores. $\mathrm{Score}_{\mathrm{relation}}(\mathrm{Gene}_i, \mathrm{Gene}_j)$ comes from the Gene Prediction Score System (see below). Each gene is processed to generate a score vector $\mathbf{x} = [X_1, X_2, X_3, X_4, X_5, X_0]$, where $X_1 = \tilde{S}(\mathrm{Gene}_i, \mathrm{Term})$, $X_2 = S_{\mathrm{HPRD}}(\mathrm{Gene}_i)$, $X_3 = S_{\mathrm{BioSystems}}(\mathrm{Gene}_i)$, $X_4 = S_{\mathrm{HGNC}}(\mathrm{Gene}_i)$, $X_5 = S_{\mathrm{HTRI}}(\mathrm{Gene}_i)$ and $X_0 = 1$. The initial weight vector is set to $\mathbf{w} = [1.0, 0.1, 0.05, 0.05, 0.05, -0.5]$. A positive gene has $y_n = 1$ and a negative gene has $y_n = -1$. Negative genes are selected randomly from the Entrez Gene database, excluding true positive genes, at 10 times the number of true positive genes. Logistic regression is used to model the score vector against $y_n$:

$$P(y_n \mid \mathbf{x}_n) = \frac{1}{1 + e^{-y_n \mathbf{w}^{\mathsf{T}} \mathbf{x}_n}}$$

A gradient descent algorithm is used, with 10,000 steps and learning rate $\eta = 1$, to minimize the cost function

$$E(\mathbf{w}) = -\frac{1}{N} \ln \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \mathbf{w}^{\mathsf{T}} \mathbf{x}_n}\right)$$

At each step,

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta \nabla E(\mathbf{w}(t))$$

The training dataset comes from four different sources and includes four different diseases: type 1 diabetes (MIM 222100) with 439 Entrez genes from the Type 1 Diabetes Database[32], type 2 diabetes (MIM 125853) with 522 Entrez genes from the Type 2 Diabetes Database[33], Crohn's disease (MIM 266600) with 207 Entrez genes from the literature[34], and coronary artery disease with 604 Entrez genes from the Coronary Artery Disease Gene Database[35]. These represent some of the diseases best studied by genome-wide and candidate gene association studies. After learning, the optimized weight vector is identified and the score with trained weights is calculated as

$$S_{\mathrm{logistic}}(\mathrm{Gene}_i, \mathrm{Term}) = \mathbf{w}^{\mathsf{T}} \mathbf{x}_n - X_0$$

which is then normalized to be between 0 and 1:

$$\tilde{S}_{\mathrm{logistic}}(\mathrm{Gene}_i, \mathrm{Term}) = \frac{S_{\mathrm{logistic}}(\mathrm{Gene}_i, \mathrm{Term})}{\max\{S_{\mathrm{logistic}}(\mathrm{Gene}_i, \mathrm{Term})\}}$$
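The training procedure above can be sketched in a few lines of NumPy; this is an illustrative reconstruction rather than Phenolyzer's actual code, and the feature matrix and labels below are random placeholders for the real gene feature vectors:

```python
# Illustrative reconstruction of the training step: gradient descent on
# the logistic log-loss with learning rate 1 and 10,000 steps. Feature
# order follows the text: [X1..X5, X0] with X0 = 1.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.hstack([rng.random((N, 5)), np.ones((N, 1))])  # [X1..X5, X0 = 1]
y = np.where(rng.random(N) < 0.09, 1.0, -1.0)         # roughly 1:10 positives

w = np.array([1.0, 0.1, 0.05, 0.05, 0.05, -0.5])      # initial weights from the text

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 1.0                                             # learning rate
for _ in range(10_000):                               # 10,000 gradient steps
    margins = y * (X @ w)                             # y_n * w^T x_n
    grad = -(X.T @ (y * sigmoid(-margins))) / N       # gradient of E(w)
    w -= eta * grad

print("trained weights:", np.round(w, 3))
```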
Gene Disease Score System

For each gene-disease pair in the gene-disease databases, the score is the product of two parameters: α represents what kind of study was used to infer the gene-disease relationship, and β represents to what extent the relationship has been confirmed. These parameters are defined by ad hoc measures specific to each database, as described below.

For OMIM, α is 0.25, 0.5, 0.75 and 1 for Disorder Codes 1, 2, 4 and 3, respectively, while β is 0.25, 0.5, 0.75 and 1 for Status Codes I, L, P and C, respectively. The Disorder Codes are defined as follows: 1, 'the disorder is placed on the map based on its association with a gene, but the underlying defect is not known'; 2, 'the disorder has been placed on the map by linkage; no mutation has been found'; 3, 'the molecular basis for the disorder is known; a mutation has been found in the gene'; and 4, 'a contiguous gene deletion or duplication syndrome; multiple genes are deleted or duplicated causing the phenotype'. The Status Codes are defined as follows: I, 'inconsistent - results of different laboratories disagree'; L, 'limbo - evidence not as strong as provisional, but included for heuristic reasons'; P, 'provisional - based on evidence from one laboratory or one family'; and C, 'confirmed - observed in at least two laboratories or in several families'. For ClinVar, α is set at 0.25, and β is the reference count divided by the maximum reference count. For Orphanet, α is 0.25 for 'role in the phenotype of' or 'candidate gene tested in', 0.50 for 'part of a fusion gene in', 'modifying germline mutation in' or 'modifying somatic mutation in', 0.75 for 'major susceptibility in', and 1.00 for 'disease-causing germline mutations in' or 'disease-causing somatic mutations in', whereas β is the reference count divided by the maximum reference count. For GWAS, α is set at 0.25, and β is 1 - P value. For GeneReviews, α and β are both set to 1. The score for a gene-disease pair from a specific data source is then calculated as αβ.

Gene Prediction Score System

The parameters for the Gene Prediction Score System are defined by ad hoc measures specific to each database. For an HGNC gene family, the score is set at 1. For BioSystems, each entry has already been assigned an individual score by NCBI, which is normalized to between 0 and 1 by dividing by the maximum score; for all the BioSystems entries containing the pair of genes of interest, the maximum normalized score is returned as the final score for the pair. For HPRD, the score starts at 0; if the two genes have an interaction annotated as 'in vivo', the score is increased by 1; if 'in vitro', by 0.5; and if 'yeast two hybrid', by 0.25. The final score is then normalized to between 0 and 1 by dividing by 1.75. For HTRI, the score is calculated by normalizing the PubMed reference count by the maximum count, to range from 0 to 1, and transcription factors (the source of the regulating relation) are penalized by multiplication by 0.25.
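As a worked example of these ad hoc mappings, here is a small Python sketch computing the OMIM gene-disease score αβ and the HPRD gene-gene prediction score; the dictionaries simply transcribe the mappings stated above:

```python
# Worked example of the OMIM gene-disease score (alpha * beta) and the
# HPRD gene-gene prediction score, transcribing the mappings in the text.

OMIM_ALPHA = {1: 0.25, 2: 0.50, 4: 0.75, 3: 1.00}         # by Disorder Code
OMIM_BETA = {"I": 0.25, "L": 0.50, "P": 0.75, "C": 1.00}  # by Status Code

def omim_score(disorder_code, status_code):
    return OMIM_ALPHA[disorder_code] * OMIM_BETA[status_code]

HPRD_EVIDENCE = {"in vivo": 1.0, "in vitro": 0.5, "yeast two hybrid": 0.25}

def hprd_score(evidence_types):
    """Sum the evidence increments, then normalize by the maximum, 1.75."""
    return sum(HPRD_EVIDENCE[e] for e in set(evidence_types)) / 1.75

# A confirmed (C) entry with known molecular basis (Disorder Code 3) scores
# 1.0; an 'in vivo' plus 'in vitro' interaction scores (1.0 + 0.5) / 1.75.
print(omim_score(3, "C"))                                  # 1.0
print(round(hprd_score(["in vivo", "in vitro"]), 3))       # 0.857
```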
Tool comparison

For short disease names, the disease name is used directly as input. For long disease names, our best effort was made to obtain an optimal result; for example, if no match is found using the full name, we divide the long disease name into shorter terms. For Phevor, a variant file with all the genes mutated to the same extent was generously provided by Marc V. Singleton (one of the developers of Phevor) specifically for testing Phevor. We could therefore first enter the disease terms in the Disease Ontology field and the phenotype terms in the Human Phenotype Ontology field to generate the gene profile, then run Phevor with the supplied variant file and obtain a final gene list with scores. The gene ranking depends only on the disease or phenotype terms, not on the mutations, so we can use these results for comparison and benchmarking. Since Phevor takes at most 5 terms for one type of input, in the rare cases where there were more than 5 available terms, only the first 5 were selected.

We observed that in Phevor's results, a large number of genes may share the same rank. For example, in the output for 'liver cirrhosis', 7,715 genes have the same rank. To deal with this issue, we assign the smallest rank to all these genes. To ensure a fair comparison, we apply the same smallest-rank rule when the same situation arises for other tools.

For Phenomizer, the disease or phenotype term is used to search for features and diseases. All the related features are then added to the patient's features, and the clinical diagnosis is generated. In an extreme example, 'intellectual disability' retrieves 185 diseases, and all the feature annotations of these diseases were manually added. The same features were also used as input for Phenolyzer to demonstrate its performance on the 14 monogenic diseases and the complex diseases.

We made our best effort to ensure that a gene list was retrieved from each tool for each disease or phenotype. If a term could not generate any record, we tried a more general term. The only exception is 'longevity' for Phenomizer and Phevor, where 'lifespan' and 'life span' were also tried but no result could be generated. Unless stated otherwise, all the comparisons shown for Phenolyzer were conducted without the add-on databases (additional databases outside of the core Phenolyzer databases) and with the weights trained by logistic regression.
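As a side note on the tie-handling rule above, assigning every tied gene the smallest rank of its group corresponds to the 'min' ranking method in common statistics libraries; a small illustrative sketch:

```python
# The smallest-rank rule for ties corresponds to method="min" in SciPy's
# rankdata; the scores below are illustrative.
from scipy.stats import rankdata

scores = [0.9, 0.5, 0.5, 0.5, 0.1]                 # higher score = better
ranks = rankdata([-s for s in scores], method="min")
print(ranks)                                       # [1. 2. 2. 2. 5.]
```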
Results

Performance evaluation on known Mendelian disease genes

We first tested Phenolyzer on a small data set of 14 very well-established Mendelian diseases, each known to be caused by mutations in only one or two genes[36]. Given the input phenotype, we examined how Phenolyzer ranks the disease causal gene against all other genes in the genome, as 'top 1', 'top 10', 'top 20' or 'below top 20'. If a disease has no corresponding record, it is categorized as 'below top 20'; for the two diseases with two causal genes, the ranks of the two genes were averaged and rounded down to the nearest integer. Because of the relatively small size of this data set, we also manually tested several other tools (most of which do not offer batch processing), including Phenomizer, GeneCards, SNPs3d, PosMed and Phevor. Phenolyzer successfully prioritized all disease causal genes as 'top 1' (Figure 3a). SNPs3d failed on only one gene, which it ranked 'top 3'. PosMed does not score genes well for monogenic diseases, probably because its algorithm relies purely on the co-occurrence of gene-disease pairs in the literature. Phevor did not work well, likely because its algorithm does not weight different ontology scores. Phenomizer works well, considering that its input is the phenotype terms for each disease rather than the disease name itself.

We next expanded our evaluation to 590 known inherited disease genes compiled in a newborn sequencing study[37]. Because of the large size of this dataset, we were unable to compare against other tools through their web servers. We found that 81.2% of the genes were ranked 'top 1', 90.7% were ranked 'top 10', and overall 93.4% could be identified by Phenolyzer in its ranked list of candidate genes. Our analysis thus clearly demonstrates that Phenolyzer is adept at finding genes known to be associated with Mendelian diseases.

Figure 3. Comparison between Phenolyzer and other tools to find well-known monogenic disease genes and predict recently published novel disease genes. (a) The ranking distribution of genes for 14 monogenic diseases. (b) The ranking distribution of 55 recently published disease genes from four human genetics journals.

Performance evaluation on known complex disease genes

We next examined several datasets to evaluate the ability of Phenolyzer to prioritize candidate genes for complex diseases. For simplicity, we tested four complex diseases with the inputs 'cancer', 'autism', 'rheumatoid arthritis' and 'anemia', using 517 genes in the Cancer Gene Census from COSMIC (Catalogue of Somatic Mutations in Cancer)[38], 22 genes strongly associated (FDR < 0.05) with autism from exome sequencing[39], 634 genes from DRAP (Database of Rheumatoid Arthritis related Polymorphisms)[40], and 121 genes at 75 loci associated with red blood cell phenotypes and anemia[41], respectively. All the reported genes were used as positive genes, and all other genes were used as the negative set (though we acknowledge that some of these may still be genuine disease genes). We used ROC (receiver operating characteristic) curves and AUC (area under the curve) values to evaluate each tool (Figure 4). Our results demonstrate that Phenolyzer compares favorably against all the other tools. For example, for the autism genes identified from an exome sequencing study, Phenolyzer achieved an AUC above 0.85, while none of the other tools achieved an AUC higher than 0.81. PosMed performs nearly as well as Phenolyzer, but it does not find as many true positive genes, which leads to a lower AUC. Phevor's ontology propagation algorithm works well to retrieve a large list of genes and has the ability to discover novel genes. SNPs3d, GeneCards and Phenomizer can only generate a limited number of candidate genes in their output and thus do not perform well, though tools such as Phenomizer were not designed to prioritize genes for complex diseases. In addition, we also evaluated Phenolyzer with only the seed gene list (without the seed gene growth step) and found that its performance is greatly reduced, underscoring the importance of seed gene growth for finding genes not documented in Phenolyzer's disease-gene knowledgebase.

Figure 4. Comparison between Phenolyzer and other tools to find disease genes for cancer, rheumatoid arthritis, autism and anemia. The AUC values and ROC curves show the performance comparison between Phenolyzer and other tools on four gene sets of different complex diseases. (a, c, e and g) For each tool, the AUC is calculated as the area under the ROC curve. (b, d, f and h) The ROC curve plots the true positive rate versus the false positive rate. 'Phenolyzer Phenotype' is the Phenolyzer result with phenotype terms as input (the same input as Phenomizer). 'Phenolyzer Logistic' is Phenolyzer with weights trained by the logistic regression model, compared with 'Phenolyzer no training'. 'Phenolyzer Seed' is Phenolyzer's seed gene result without the seed gene growth step, representing only the genes found in Phenolyzer's disease-gene mapping knowledgebase.
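For readers who wish to reproduce this kind of evaluation, a minimal sketch of the ROC/AUC computation follows; the gene universe, positive set and tool scores are toy placeholders, and genes absent from a tool's output are assumed here to receive a score of 0:

```python
# Sketch of the ROC/AUC evaluation: curated disease genes are positives,
# all other genes negatives. All names and scores below are placeholders.
from sklearn.metrics import roc_auc_score, roc_curve

genome = ["BRAF", "PKLR", "DMD", "TTN", "ACTB", "GAPDH"]
positives = {"BRAF", "PKLR"}                       # curated disease genes
tool_scores = {"BRAF": 0.95, "PKLR": 0.40, "TTN": 0.60}

y_true = [1 if g in positives else 0 for g in genome]
y_score = [tool_scores.get(g, 0.0) for g in genome]  # missing genes -> 0

print(round(roc_auc_score(y_true, y_score), 3))    # 0.875
fpr, tpr, _ = roc_curve(y_true, y_score)           # points for the ROC plot
```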
Performance evaluation on novel disease genes

In addition to known disease genes, we further evaluated the performance of Phenolyzer in prioritizing novel disease genes. For this analysis, we identified 55 disease genes published between June and August 2014 in four scientific journals (Nature Genetics, American Journal of Medical Genetics, American Journal of Human Genetics and Human Molecular Genetics). These discoveries were novel findings at the time of our analysis, so it is unlikely that Phenolyzer could use exact gene-disease records from any knowledgebase. We submitted the phenotypes or diseases described in each of the original publications to Phenolyzer and examined whether the reported genes could be found in the ranked list of candidate genes. Compared to competing tools, Phenolyzer achieves the highest response rate, top 5% ratio and top 50% ratio. Phevor works well for identifying a large list of candidate genes, but did not excel at ranking the causal genes at the top of the list. PosMed performs nearly as well as Phenolyzer thanks to its powerful literature mining engine, but it has a lower response rate. Phenomizer performs well on the genes for which it can generate predictions. GeneCards performs poorly on this task. SNPs3d is not included here since it cannot generate a large number of candidate genes (Figure 3b).

Phenolyzer performance with phenotype data only

Unlike Phenomizer, Phenolyzer can analyze both concrete disease names and lists of phenotype terms. We re-analyzed the 14 monogenic and four complex diseases using only a list of phenotype terms as input, and found that Phenolyzer performs similarly to Phenomizer on the known monogenic disease genes (Figure 5); on the complex diseases they also perform similarly, but Phenomizer is limited by the number of candidate genes in its results (Figure 4).

Figure 5. Evaluation of Phenolyzer with phenotype terms as input. Phenolyzer with phenotype terms (rather than disease names) as input is able to prioritize most genes as 'top 1' for the 14 monogenic diseases, similar to Phenomizer.

The impact of input tokenization on Phenolyzer

Given the implementation of Phenolyzer, if a long disease name is tokenized into multiple short terms, the chance of a match for each short term is larger than for the original long term; as a result, the number of genes in the result should increase and the recall rate may improve. The cost of the improved recall, however, is that precision may suffer. To demonstrate this, we thoroughly tokenized 13 of the 14 Mendelian disease names used here; for example, 'Spermatogenic failure, nonobstructive, Y-linked' is tokenized into 'Spermatogenic;failure;nonobstructive;Y;linked', with each single word treated as a separate term. Our results show that thorough tokenization increased the number of returned genes from 2,809 to 22,222 (median) and from 3,287 to 21,286 (mean), at the cost of incorrectly prioritizing 3 of 15 genes.
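The thorough tokenization step can be reproduced with a one-line regular-expression split; a minimal sketch using the example name from the text:

```python
# Thorough tokenization: split a long disease name on runs of non-word
# characters so that every single word becomes its own query term.
import re

name = "Spermatogenic failure, nonobstructive, Y-linked"
terms = [t for t in re.split(r"\W+", name) if t]
print(terms)  # ['Spermatogenic', 'failure', 'nonobstructive', 'Y', 'linked']
```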
Case study on exome sequencing and copy number variation data

To better illustrate the power of Phenolyzer, we present three case studies on using Phenolyzer to analyze exome sequencing and copy number variation (CNV) data to identify disease causal genes.

One group identified a BRAF mutation that drives the growth of papillary craniopharyngiomas, and targeted genotyping identified this mutation in all papillary craniopharyngiomas[42]. They first analyzed whole-exome sequencing data from a cohort of adamantinomatous (n=12) and papillary (n=3) craniopharyngiomas and identified a small number of non-synonymous somatic mutations in both subtypes, comprising 490 variants within 444 genes. This gene list was supplied to Phenolyzer, together with the disease term 'craniopharyngiomas' (Figure 6). BRAF shows up as the top gene. We also evaluated the competing tools with the same term as input, restricting the results to the same input gene list. PosMed ranked BRAF 6th. GeneCards and SNPs3d returned result lists without the BRAF gene. As Phevor has no record for the disease term 'craniopharyngiomas', we used 'craniopharyngioma', and BRAF was ranked 321st.

Figure 6. Phenolyzer's results on four case studies. (a and b) The candidate gene lists generated from two studies on 'craniopharyngiomas' and 'SHORT syndrome' were used as input to Phenolyzer. The network plots show that BRAF and PIK3R1 are the genes with the highest scores for each disease, respectively. (c) For the CNV study of 'osteoporosis', the significant CNV regions were used as input, and the Phenolyzer network successfully identified the correct gene, UGT2B17. (d) Combined with wANNOVAR, we first filtered the variants into a small list, then supplied all the genes in the variant list as input to Phenolyzer. The correct gene, PKLR, was identified as the top gene for 'hemolytic anemia'.

Another group identified a PIK3R1 mutation underlying SHORT syndrome and verified its functional impact in fibroblasts[43]. Whole-exome sequencing was conducted in 6 individuals from two families affected by SHORT syndrome. After variant calling and filtering, 22 variants within 22 genes remained in the candidate list. This gene list was then provided as input to Phenolyzer, together with the term 'SHORT syndrome'. In Phenolyzer's network, PIK3R1 is easily seen to be the most significant gene (Figure 6). After gene selection, PosMed also ranked PIK3R1 as top 1. GeneCards ranked PIK3R1 2nd. SNPs3d did not include PIK3R1 in its output. Phevor had no record for the term 'SHORT syndrome' at all. In addition, we also tried 'partial lipodystrophy', 'low body mass index', 'short stature', 'progeroid face' and 'Rieger anomaly' (the phenotype descriptions in the original paper for SHORT syndrome) as input, and PIK3R1 was still ranked as the 'top 1' gene by Phenolyzer. We used 'Lipodystrophy', 'Weight loss', 'Progeroid facial appearance', 'Rieger anomaly' and 'Short stature' as HPO terms for Phevor, and PIK3R1 was also ranked top 1.

In the third example, a CNV study identified a CNV affecting the gene UGT2B17 as a contributing factor for osteoporosis[44]. Case-control genome-wide CNV analyses were conducted using Affymetrix Human Mapping 500K Arrays, with a cohort of 350 Han Chinese individuals with a history of hip osteoporosis and 350 healthy matched controls. 727 significant CNVs were called in this study. These regions were used as input to Phenolyzer, together with the disease term 'osteoporosis'. In Phenolyzer's network plot, UGT2B17 shows up as the top gene (Figure 6). We also evaluated the competing tools by first compiling a gene list from these CNVs (1,510 genes in total). PosMed ranked UGT2B17 4th, and GeneCards ranked it 3rd. For SNPs3d and Phevor, the final gene lists did not include UGT2B17.

Combination of Phenolyzer and a variant discovery tool

To serve users with whole genome or exome sequencing data rather than just a list of genes, we also implemented a wANNOVAR-Phenolyzer pipeline for analyzing VCF files at http://wannovar.wglab.org[45]. Disease or phenotype terms are accepted as optional input fields, and Phenolyzer is then called to prioritize candidate genes directly from the wANNOVAR output.
In the wANNOVAR results page, the gene list prioritized by Phenolyzer can be retrieved directly, and a link to the Phenolyzer results page is available as the 'Network Visualization' link. For example, we previously reported an exome sequencing study that identified a mutation in PKLR as an 'unrelated finding' in a patient with hemolytic anemia, through a study originally designed to uncover the genetic basis of attention deficit and hyperactivity disorder (ADHD)[46]. The VCF file was used as input to wANNOVAR, with 'rare recessive Mendelian disease' selected as the disease model. In total, 87 variants remained after filtration, and their corresponding genes were then automatically submitted by wANNOVAR as input to Phenolyzer, together with the term 'anemia' or 'hemolytic anemia'. In the resulting network, PKLR is ranked top with the term 'hemolytic anemia' (Figure 6); with the term 'anemia', however, PKLR is ranked third. Thus, integrating wANNOVAR and Phenolyzer in the same pipeline can facilitate and expedite the identification of disease causing variants, while more refined disease terms can further improve the accuracy of disease gene finding when the disease can be caused by multiple genes.

Case study for Prader–Willi syndrome

Here we present an extensive pedigree with phenotypic characterization, including 14 individuals across three generations with multiple different phenotypes. Individual K10032-10232 has PWS (Prader–Willi syndrome), a complex genetic condition characterized by hypotonia and developmental delay in infancy and chronic overeating and obesity in childhood, often associated with mild intellectual impairment.

Figure 7. Three de novo heterozygous deletions with a total size of about 5.5 Mb were reported, with the hg19 genomic coordinates of the breakpoints at chr15:22,749,401-23,198,800 (~449 kb), chr15:23,608,601-28,566,000 (~4.96 Mb) and chr15:28,897,601-28,992,600 (~95 kb).

We represented this proband's phenotypes as HPO (Human Phenotype Ontology) terms, so Phenolyzer could take the three deletions as genomic input and the HPO terms as phenotypic input, then generate results that prioritize the genes within the deletions and display the underlying interactions. As shown below, among all genes in the deletion regions, SNRPN, NDN and thirteen other genes are the most confident candidates; of these, SNRPN and NDN are already reported in the OMIM database. This is a straightforward case demonstrating Phenolyzer's ability to help prioritize candidate genes from variant information. Phenolyzer identified two important candidate genes in the variant regions detected in K10032-10232, NDN and SNRPN, both reported to be associated with PWS in OMIM. Phenolyzer also identified the HPO term-disease-gene network underlying this discovery, from which we can tell that GABRB3, GABRG3 and GABRA5 belong to the same gene family, also lie in the variant regions, and are related to the phenotypes according to Phenolyzer (Figure 8).

Figure 8. Phenolyzer network analysis of WGS gene findings, HPO terms and disease types. Phenolyzer revealed the diagnosis of PWS and how genes in the deletion regions link to the phenotypes represented by HPO terms. The most disease-relevant genes are shown as seed genes, alongside predicted genes in the deletion regions; the other node types are input HPO terms and predicted diseases. Yellow lines indicate that the two connected genes are within the same biosystem, while green lines indicate that the two genes are within the same gene family.
Discussion

Compared to similar tools, Phenolyzer is better at prioritizing candidate genes for Mendelian and complex diseases. The tool also works well for finding novel disease-gene associations, highlighting its ability to help formulate new biological hypotheses. For the ever increasing number of human disease sequencing studies, Phenolyzer will help users leverage prior biological knowledge and phenotype information to expedite scientific discovery.

Phenolyzer follows a strictly defined procedure comprising four steps: term interpretation, seed gene generation, seed gene growth and data integration. For term interpretation, at least a word match is needed to process the term, and the chance of matching a term depends on the vocabularies of several different disease ontology systems, the synonyms provided by OMIM, and the phenotype descriptors provided by the Human Phenotype Ontology. Improvements can be made in the future to better cover all possible terms, including the integration of medical dictionaries and standardized vocabularies such as Webster's medical dictionary and the ICD-9 and ICD-10 vocabularies. For the seed gene generation step, other databases will be incorporated in the future, such as DECIPHER[47]. We note that several heuristics are used in this step (the Gene Disease Score System) to translate association information and publication counts into scores; the underlying logic is that the translated score is normalized proportionally to the rank percentage. Similar heuristics are used in the seed gene growth step (the Gene Prediction Score System). Despite the use of heuristics, our results demonstrate the effectiveness of the approach. To facilitate the integration of gene-disease and gene-gene relationship databases, we have built an add-on system for the Phenolyzer command-line tool: as long as users compile their own databases into a pre-defined format, it is straightforward to add them to the Phenolyzer system. To demonstrate this, we integrated the Mentha protein interaction database[48], the Genetic Association Database[49], DisGeNET[50] and the GeneCards database into our web server as options, each with an ad hoc weight.

The logistic regression with gradient descent is designed to optimize the weights for integrating the scores of seed genes and the different relationship types. We examined how performance changes if we instead assign ad hoc weights to each feature (Figure 4). Although the logistic regression model was trained on four diseases (type 1 and type 2 diabetes, coronary artery disease and Crohn's disease) that are entirely different from the test set, the improvement over 'no training' is still significant. We therefore used the optimized weights rather than arbitrarily selected weights in our software. In addition, we provide a 'Weight Adjust' option for users to apply their own weights, or even to turn off a database by setting its weight to zero.

In the tool comparison, we noticed inconsistent results between the monogenic diseases and the complex diseases. We think the reason is that the monogenic disease and complex disease comparisons are two different types of comparison. In the monogenic disease comparison, as long as a tool has the gene-disease mapping in a database, can retrieve the record and can prioritize it as top 1, it achieves perfect performance.
In the complex disease comparison, however, we chose four disease gene sets, two from public databases and two from the scientific literature, and each disease has a large number of positive genes. A computational tool therefore needs both high precision and high recall to perform well: precision means that the disease genes are ranked above the others, and recall means that a large candidate gene list can be produced that includes the true disease genes. It is therefore not surprising that SNPs3d performs differently between the two types of task, as SNPs3d has high precision but very limited recall.

Although Phenolyzer works well for finding candidate genes for many diseases, its ability to identify the correct genes is to a large extent limited by the available biological knowledge. For example, if a phenotype or disease has never been reported to be associated with any gene, Phenolyzer is much less likely to find the candidate genes, unless a similar or related phenotype is available. Nevertheless, since Phenolyzer gives each candidate gene a continuous normalized score from 0 to 1, integrating the Phenolyzer score with other kinds of variant prediction scores (for example, SIFT or PolyPhen scores) may further increase the power to find candidate genes and variants, even for previously uncharacterized genes. Finally, we believe that an improved algorithm integrating both prior gene-disease and gene-gene relationship knowledge and an improved score for variant deleteriousness may offer the greatest power to prioritize variants from whole genome and exome sequencing data.

Chapter 2: Application of Phenolyzer to Neuroscience

Abstract

Phenolyzer is a tool we developed to conduct phenotype-based disease gene prioritization, as described in Chapter 1. Building on the logic and algorithms of Phenolyzer, we developed a framework called NeuroComplex to facilitate the prioritization of mental disorder genes and variants. In one implementation, using proteomics data comprising 3,527 in vivo protein-protein interactions across 44 protein complexes, the NeuroComplex framework directly identifies genes and their interactions in protein complexes and visualizes the network. Through this example, we demonstrate the flexibility and extensibility of our NeuroComplex framework, which could be extended in the future to integrate more information related to mental disorders as more data become available. In addition, my analysis helped identify differential PSD protein cluster patterns during mouse development.

Introduction

The need for better functional characterization of mental disorder genetics

Mental disorders are common diseases that cause either suffering or an impaired ability to function in ordinary life. Worldwide, more than one in three people in most countries report sufficient criteria for at least one mental disorder at some point in their life [51]. In the US, 46% of people are afflicted with a mental illness at some point in their life [52]. The loss of earnings due to psychiatric disease in the US is estimated at $200 billion a year [53]. The management and treatment of these disorders therefore pose a significant economic and social burden on society. Given the heterogeneous nature of the disorders, better molecular characterization of the diseases in individual patients is likely to be necessary for developing effective personalized treatment strategies.
Clarifying the genetics underlying different types of mental disorders will provide insights into their etiology and pathology, facilitating the development of new treatments. Many rare and common variants have already been reported to be associated with mental disorders, yet more remain to be discovered. Despite the large amount of high-throughput genomics data generated on mental disorders (such as autism and schizophrenia), there is a lack of integrative methods to systematically prioritize variants that confer susceptibility to mental disorders; additionally, despite the many candidate variants associated with mental disorders, their disease-contributing mechanisms remain unclear, so appropriate functional follow-up experiments cannot be designed. Altogether, these problems have created a huge gap between the large amounts of genomics data and our understanding of their functional impact on mental diseases, ultimately delaying the development of targeted therapeutic approaches.

The extension of Phenolyzer for integration of new databases

As described in Chapter 1, Phenolyzer integrates disease-gene, disease-phenotype, gene-gene, disease ontology and phenotype ontology databases, which gives Phenolyzer better performance and more complete information than other similar tools. However, the existing databases included in Phenolyzer may not be thorough enough for a specific domain such as mental disorders. To provide flexibility and extensibility, we made database integration in Phenolyzer much easier, by providing a simple format and command-line argument for users to add their own databases. On Phenolyzer's website, we have already implemented a few additional integrations, such as the Mentha gene-gene interaction database and several disease-gene databases provided by our users, including a small list for IgA nephropathy. With this extension framework, we were able to implement our NeuroComplex framework and integrate the protein interaction data from our collaborators.

PsyPPRes: one implementation within the NeuroComplex framework

The postsynaptic density (PSD) is a signaling and supportive structure at the postsynaptic site, comprising around 1,500-2,000 proteins compacted into a very small space and interacting extensively with each other. Although quite a few PSD proteins have been individually and extensively studied, their overall interactions are comparatively less studied. Recent research has revealed that PSD protein complexes are highly enriched for schizophrenia risk factors. However, these studies mostly treat the PSD as a static structure, without sufficient effort to study its dynamics across developmental stages. With our collaborators, we mainly focused on three PSD protein complexes obtained by immuno-isolation of the 'hub proteins' Dlg4 (MAGUKs), Dlgap1 (DLGAPs) and Shank3 (SHANKs), at embryonic day 14 (e14), postnatal day 7 (p7), p14, and adult mouse prefrontal cortex (PFC). We showed that the interactions of these 'hub proteins' are dynamic during development, with both shared and distinct interactions between the hubs. Based on the NeuroComplex framework, we developed a new tool called PsyPPRes (Psychiatric Protein Pathways Resource), which gives users direct access to the large-scale proteomics data for psychiatric diseases.
Differential PSD protein cluster patterns during mouse development

At the same time, through this collaboration we discovered a very interesting pattern of differential clustering during mouse development. These dynamics likely reflect the changing interactomes of PSD scaffold proteins during development, which we interpret as the process of PSD maturation.

Methods and materials

Biochemistry and mass spectrometry

Protein interactions were considered positive if at least two peptides were present in triplicate assays and absent in controls. We determined protein complexes at embryonic day 14 (e14), postnatal days 7 and 14 (p7, p14) and adulthood (12-16 weeks old). Adult samples were determined in both PSD and non-PSD fractions. PSD and non-PSD fractions were prepared as described [54]. The LC system was coupled to a hybrid linear ion trap-Fourier transform ion cyclotron resonance LTQ-FT (FTICR) 7 Tesla mass spectrometer (LC/MS). Standard methods were used for Western blotting (WB) and immunocytochemistry (ICC) [54].

PsyPPRes software development

PsyPPRes is implemented on top of the original Phenolyzer framework, following the NeuroComplex logic, to facilitate functional characterization and visual display of protein interaction data for mental disorders. It is implemented in Perl CGI, HTML and JavaScript, built on the Phenolyzer engine and its extension command-line functions, with Cytoscape.js used for network visualization. The weights of Phenolyzer's original disease-gene database are reduced to a tiny value (10^-6) or to zero, so that the results mainly represent the PSD complex proteins, whose weights are set to 1.0 and which are treated as direct associations rather than protein-protein interactions. Different diseases such as autism, schizophrenia, etc. are treated the same way with respect to the PSD complex interactors, although each maps to a different gene set under Phenolyzer's original framework.

Enrichment analysis and hierarchical clustering

Enrichment analysis for schizophrenia risk factors is based on Fisher's exact test, conducted in R. The hierarchical clustering and the matrix/dendrogram plots were also produced in R, with the help of the "corr" package in addition to R's native packages. The similarity measure is the Jaccard index: the size of the intersection of two sets divided by the size of their union. We observed that the counts of PSD proteins within individual complexes vary from 10 to 140, which is a good fit for Fisher's exact test. The schizophrenia risk genes were compiled from two previous publications [55, 56], comprising 343 and 640 genes, respectively.

Complex                               Protein count
Pde4dip-Adult-non PSD                 118
Tnik-p14                              107
Tnik-Adult-PSD                        82
Tnik-Adult-non PSD                    96
TNIK-hNPC                             122
Tnik-p7                               75
Nckap1-Adult                          25
Homer1-Adult-PSD                      36
Hcn1-Adult-non PSD                    38
Fmr1-Adult-non PSD                    132
AKAP9-hiPSC                           69
Mycbp2-e14                            129
Tnik-kinase domain-Adult-non PSD      105
Cyfip2-Adult-PSD                      31
Herc2-e14                             54
Tnik-kinase domain-Adult-PSD          46
Fxr1-e14                              125
TNIK-hiPSC                            110
Fmr1-Adult                            43
Tsc1-e14                              62
Syngap1-Adult-PSD                     78
Fmr1-e14                              133
Mycbp2-Adult                          20
Huwe1-e14                             10
Tsc1-Adult                            47
Cnksr2-Adult-PSD                      21
Fxr1-Adult-non PSD                    46
AKAP9-hNPC                            85
Tnik-kinase domain-e14                91
Cyfip1-Adult-PSD                      114
Dlgap1-Adult-PSD                      140
Shank3-Adult-PSD                      108
PDE4DIP-hNPC                          131
Pde4dip-e14                           103
PDE4DIP-hiPSC                         101
Tnik-e14                              131
Huwe1-Adult                           42
Cnksr2-Adult-non PSD                  33
Pex5l-Adult-non PSD                   43

Table 1. The counts of proteins within each PSD complex. The smallest complex contains only 10 proteins, while the largest contains 140.
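For concreteness, the sketch below reproduces the two analyses just described, in Python rather than R: a 2x2 Fisher's exact test for risk-gene enrichment in one complex, and Jaccard-based hierarchical clustering of complexes. The gene sets, risk list and background size are toy placeholders, not the real proteomics data.

```python
import numpy as np
from scipy.stats import fisher_exact
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def enrichment_p(complex_genes, risk_genes, background_size):
    """One-sided Fisher's exact test on the 2x2 table of
    complex membership versus risk-gene status."""
    a = len(complex_genes & risk_genes)      # in complex and risk set
    b = len(complex_genes) - a               # in complex only
    c = len(risk_genes) - a                  # in risk set only
    d = background_size - a - b - c          # in neither
    return fisher_exact([[a, b], [c, d]], alternative="greater")[1]

def jaccard(s1, s2):
    """Jaccard index: intersection size over union size."""
    return len(s1 & s2) / len(s1 | s2)

# Toy complexes (placeholders for the real immuno-purification data).
complexes = {
    "Dlg4-Adult": {"DLG4", "GRIN1", "TNIK", "SYNGAP1"},
    "Shank3-Adult": {"SHANK3", "DLGAP1", "TNIK", "HOMER1"},
    "Fmr1-e14": {"FMR1", "CYFIP1", "CYFIP2"},
}
risk = {"GRIN1", "TNIK", "DLGAP1"}
print({n: enrichment_p(g, risk, 20000) for n, g in complexes.items()})

# Hierarchical clustering on (1 - Jaccard) distances.
names = list(complexes)
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1 - jaccard(complexes[names[i]], complexes[names[j]])
        dist[i, j] = dist[j, i] = d
tree = linkage(squareform(dist), method="average")
print(dict(zip(names, fcluster(tree, t=2, criterion="maxclust"))))
```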
Result

The association between the PSD core proteins and schizophrenia

To demonstrate the association between the PSD complex proteins and one important mental disorder, schizophrenia, we conducted an enrichment test on each PSD complex to determine whether it shows significant enrichment of schizophrenia (SCZ) risk proteins compared with random expectation. With Fisher's exact test, we observed that 17 of 25 complexes have significant enrichment in at least one schizophrenia risk category (Figure 9). In the figure, SCZ_GWAS represents the GWAS results from the Psychiatric GWAS Consortium [57], with SINGLE or MULTIPLE indicating whether a single gene or multiple genes were identified at the same GWAS locus; SCZ_DE_NOVO represents de novo single-nucleotide variants [58], and SCZ_DE_NOVO_CNV represents de novo copy number variants (CNVs) [56].

Figure 9. Fisher's exact test results for the enrichment analysis of 25 PSD protein complexes. The y-axes show the negative base-10 logarithm of the P value. The blue line marks the P < 0.05 significance level and the red line marks P < 0.01. 17 out of 25 PSD protein complexes show at least one significant enrichment.

PsyPPRes and its utility

In addition to Phenolyzer's basic databases for psychiatric diseases - such as OMIM, Orphanet, the GWAS Catalog and GeneReviews for gene-disease information - we added the 3527 in-vivo protein-protein interactions collected by immuno-purification and proteomics across 44 protein complexes, most of which are enriched for schizophrenia risk factors. This new tool has a web interface similar to Phenolyzer's, where the user can input one of the following diseases: 'schizophrenia', 'autism', 'bipolar', 'developmental delay', 'epilepsy', 'intellectual disability', or a 'no disease' option, together with a list of gene names. The tool integrates the protein-protein interaction data with Phenolyzer's basic databases and places high weights on the proteomics data, so it can display interaction networks based on our interaction data. In the resulting network (Figure 10), the 'hub' of each cluster is displayed with its interacting genes pointing to it with arrows, and Phenolyzer's gene-disease interaction information is displayed as well. As an example, we show a schizophrenia query result below, specifically addressing the protein interactions of TNIK. From the result, we can see that TNIK is a 'hub protein' with many interacting proteins, such as DLG1, GRIN1 and YWHAZ. This not only makes our collaborators' data directly available to users with visualization, but also demonstrates the feasibility of our NeuroComplex logic for extending the current Phenolyzer system to mental disorders, for which there is a growing need for better characterization and information extraction from genomics data (Figure 10).

Figure 10. The output network from PsyPPRes. A yellow directed arrow indicates the relation between an interacting protein and the core protein (from source to target). A red line indicates a protein interaction from the Mentha database. As shown, TNIK is a core complex protein with many proteins interacting with it.

Differential protein cluster patterns during development

Starting from three core PSD scaffold interactomes - Dlg4, Dlgap1 and Shank3, using knock-out/knock-in mouse lines - we conducted immuno-purification and determined the protein-protein interactions of these core proteins, yielding in total 3527 in-vivo protein-protein interactions across 44 protein complexes.
This represents a detailed spatio-temporal profile of protein interactions among PSD components. After determining all the interactions, I further assessed the similarity between components at different developmental stages. Specifically, we conducted a hierarchical clustering analysis on these 44 protein complexes and plotted the similarity matrix based on the Jaccard index. From the result, we observed five major clusters, with adult complexes mostly co-clustering with other adult complexes, and embryonic complexes mostly co-clustering with embryonic or postnatal complexes, the only exception being the Huwe1 complex.

Figure 11. The similarity matrix plot for the preliminary 33 complexes, using the Jaccard index as the similarity measure. We set the number of clusters to five and can observe two large clusters in the matrix: most adult complexes cluster together, while most embryonic and postnatal complexes cluster together.

We also plotted the dendrogram of the hierarchical clustering result, from which it is readily observed that adult PSD complexes generally cluster together, apart from embryonic PSD complexes (Figure 12). This differential cluster pattern between adult and embryonic protein complexes reveals the dynamics of PSD scaffold proteins and their interactomes, indicating a developmental switch in the protein interactions within individual complexes formed by PSD scaffold proteins and their interactors. This is a major discovery of our collaborative paper, and I emphasize here that this developmental switch deserves more research effort, especially regarding how PSD scaffolds switch their major interactors and what the biological impacts are.

Figure 12. Hierarchical clustering of protein complexes showing three clusters: one with the two HUWE1 complexes (adult and embryonic day 14), one with mainly embryonic protein complexes plus PDE4DIP-Adult-extrasynaptic, and the other with only adult complexes. The dendrogram also demonstrates the separation of adult and embryonic protein complexes: even when cut into six clusters, adult proteins and embryonic proteins still fall into different clusters.

Discussion

In this chapter, we applied Phenolyzer's network logic specifically to psychiatric disease genes, integrated with our collaborators' PSD protein complex interaction data, to provide a helpful tool for the neuroscience community to identify specific genes and protein interactions associated with psychiatric disease risk proteins. This tool further demonstrates the flexibility and extensibility of Phenolyzer for easily integrating new data sources into our framework. Additionally, as a side discovery, we found differential cluster patterns of PSD proteins during development, which merit further attention and research from the neuroscience community to delve into the dynamics of synaptic protein complexes. However, although these applications demonstrate the potential utility and extensibility of Phenolyzer for neuroscience, and the plausibility of the NeuroComplex framework, we must point out that more data on mental disorders are clearly needed to realize disorder-specific variant prioritization and discovery. Also, the current NeuroComplex framework is limited to protein-protein interaction, disease-gene interaction and disease/phenotype ontology databases. A more sophisticated framework will probably be needed to process large amounts of clinical data, which are often high-dimensional, into usable information.
Chapter 3: HadoopCNV: a Hadoop-based parallel tool to detect CNVs and LOH

Abstract

BACKGROUND: Whole-genome sequencing (WGS) data may be used to identify copy number variations (CNVs). Existing CNV detection methods mostly rely on read depth or paired-end distance, while neglecting allelic intensity ratios. Additionally, most CNV callers are not scalable to a large number of samples.

METHODS: To facilitate large-scale and rapid CNV detection from WGS data, we developed a Dynamic Programming Imputation based algorithm called HadoopCNV, which infers copy number changes from both allelic frequency and read depth information. Our implementation is built on the Hadoop framework, enabling multiple compute nodes to work in parallel.

RESULTS: Compared to two widely used tools, CNVnator and LUMPY, our method has similar or better performance on both simulated data sets and the NA12878 individual. Additionally, analysis of a 10-member pedigree showed that HadoopCNV has a Mendelian precision slightly better than CNVnator's. Furthermore, HadoopCNV can accurately infer loss of heterozygosity (LOH), while the other tools cannot. More importantly, HadoopCNV on a 32-node cluster requires only 1.6 hours for a human genome with 30X coverage.

CONCLUSIONS: The combination of high-resolution, allele-specific read depth from WGS data with the Hadoop framework enables efficient and accurate detection of CNVs. This work has been pre-published on the open-access preprint platform bioRxiv [59].

Introduction

Copy number variation (CNV) refers to a copy number change of a DNA fragment, usually between one kilobase (kb) and five megabases (Mb) in size [60]. CNV is one of the major forms of genomic variation. Based on published high-quality data on healthy individuals of various ethnicities, it is estimated that 4.8-9.5% of the genome contributes to CNVs [61]. Numerous studies have reported that CNVs are associated with a large number of human diseases [62-64]. Before the next-generation sequencing (NGS) era, most CNVs documented in the literature were identified through microarrays, with help from software tools such as PennCNV [65], QuantiSNP [66] and many others [67-70]. Although CNV discovery with microarrays succeeded in finding a large number of disease-associated CNVs, the limited resolution of microarrays cannot capture all CNVs, so a large number of disease-contributing CNVs may have been overlooked.

NGS technologies are increasingly used to find disease-contributing variations. The high-throughput capacity of NGS makes it practical to identify most CNVs in the genome that were previously difficult to discover. Many pioneering studies have demonstrated the feasibility of discovering CNVs from NGS data by examining the alignment files directly. Several approaches were developed, each with different advantages and disadvantages [71]. For small-scale CNVs, paired-end mapping and split-read approaches are useful, and tools such as BreakDancer [72], AGE [73], GASV [74], PINDEL [75] and LUMPY [76] were developed and are widely used today. For larger CNVs, read-depth based approaches such as CNV-Seq [77], BIC-Seq [78], JointSLM [79] and CNVnator [80] have been demonstrated to be more useful. Among them, CNVnator [80] uses a mean-shift algorithm to identify the boundaries of CNVs and proceeds with multiple refinement steps and GC correction, demonstrating the clear utility of read depth for CNV calling.
Assembly-based approaches have also been useful for CNV detection, such as the algorithms used by the Cortex assembler [81] and TIGRA-SV [82], but these approaches are typically not computationally efficient enough for genome-wide analysis. Finally, several tools were developed specifically for CNV detection from exome sequencing data, such as XHMM [83], CODEX [84], ExomeCNV [85], ExomeCopy [86] and ExCopyDepth [87].

The Hadoop framework has been demonstrated to handle large-scale sequence data [88]. Here we present a scalable CNV calling algorithm within a big data framework, HadoopCNV, which rapidly extracts read depth information from sequence alignment (BAM) files. HadoopCNV integrates three components into a MapReduce-based pipeline that runs seamlessly on environments ranging from a single computer to thousands of computers. The first component measures depths for each allele at user-specified intervals. The second component smooths these measurements by defining bins of a user-specified size (100 bp by default) and storing summary statistics for each bin. The final component invokes a Viterbi scoring algorithm using the model previously described [89]. HadoopCNV leverages the Hadoop Distributed File System (HDFS), making it possible to store a large number of samples and call CNVs on them in a distributed manner across tens, hundreds or even thousands of nodes, thus providing a big data solution to genome analysis. Considering that some large-scale genome centers are migrating to Hadoop to store alignment (BAM) files, and that several computational tools have already been developed to perform genome analysis on BAM files [90], the availability of HadoopCNV also allows seamless integration into existing analytical pipelines. This work is under peer review and has already been posted on an open-access preprint platform [59].

Materials and Methods

Simulated data set

We used the human genome (hg19) as our reference genome for data simulation. We created two independent copies of the genome with single-nucleotide variants (SNVs) based on allele frequencies of the AFR population from the 1000 Genomes Project [91]. Next, we simulated both deletions and duplications on chromosomes 1 to 22, with sizes of 1 kb, 1.5 kb, 2 kb, 2.5 kb, 3 kb, 3.5 kb, 4 kb, 5 kb, 6 kb, 8 kb, 10 kb, 20 kb, 30 kb, 40 kb, 50 kb, 75 kb, 100 kb, 150 kb, 200 kb and 500 kb, and with a minimum distance of 100 kb between any two CNVs. These variants were used as the ground truth in our comparison. Next, we generated simulated Illumina paired-end short reads with the tool ART [92] and aligned them to the hg19 reference genome using BWA-MEM [93]. The generated BAM files were used as input for HadoopCNV, CNVnator and LUMPY. The LOHs were simulated based on real LOHs from the MCG CNV Database (http://www.cghtmd.jp/CNVDatabase/cnv-summary-map).

Testing data set (NA12878)

The whole-genome sequencing (WGS) data for HapMap subject NA12878 were retrieved from multiple resources, because this subject has been sequenced multiple times at different institutions: (1) we downloaded the high-coverage (>30X) data from the 1000 Genomes Project site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/); this is part of the pilot project, in which six genomes were sequenced at high coverage. (2) We downloaded the high-coverage (>30X) data generated on the HiSeq X Ten by the Garvan Institute (https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten.html).
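Returning to the simulated data set described above, the sketch below places non-overlapping deletions and duplications of the listed sizes along one chromosome while enforcing the 100 kb minimum spacing. The rejection-sampling placement strategy is an assumption for illustration; the thesis does not specify how positions were drawn.

```python
import random

# CNV sizes from the Methods, in base pairs.
SIZES = [1_000, 1_500, 2_000, 2_500, 3_000, 3_500, 4_000, 5_000,
         6_000, 8_000, 10_000, 20_000, 30_000, 40_000, 50_000,
         75_000, 100_000, 150_000, 200_000, 500_000]
MIN_GAP = 100_000   # minimum distance between any two CNVs

def place_cnvs(chrom_len, sizes=SIZES, max_tries=10_000):
    """Return non-overlapping (start, end, type) CNVs, one deletion and
    one duplication per size, separated by at least MIN_GAP."""
    placed = []
    for size in sizes:
        for cnv_type in ("DEL", "DUP"):
            for _ in range(max_tries):
                start = random.randrange(0, chrom_len - size)
                end = start + size
                # Accept only if far enough from every placed CNV.
                if all(start - e >= MIN_GAP or s - end >= MIN_GAP
                       for s, e, _ in placed):
                    placed.append((start, end, cnv_type))
                    break
    return sorted(placed)

print(place_cnvs(chrom_len=50_000_000)[:5])
```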
Testing data set (10-member pedigree)

We also compared the performance of HadoopCNV with several other software tools using a WGS dataset from a 10-member pedigree generated in house. For Illumina sequencing, we quantified the samples using the Qubit dsDNA BR Assay Kit from Invitrogen (Carlsbad, CA, USA), and 1 μg of each sample was sent out for WGS on the Illumina HiSeq 2000 platform. Sequencing libraries were generated from 100 ng of genomic DNA using the Illumina TruSeq Nano LT kit, according to the manufacturer's recommendations. The quality of each library was evaluated with the Agilent Bioanalyzer high-sensitivity assay (less than 5% primer dimers) and quantified by qPCR (Kapa Biosystems, CT, USA). The pooled library was sequenced in three lanes of a HiSeq 2000 paired-end 100 bp flow cell. The fraction of clusters passing initial filtering was above 80%, and the fraction of bases at or above Q30 was above 85%. These samples were used to assess the Mendelian precision of CNV calls, a widely accepted measure of error rates for CNV calling algorithms.

Implementation of HadoopCNV

Short reads were aligned to the reference genome using BWA-MEM [93], and the resulting BAM file is used as input for HadoopCNV. Processing three billion bases encoded in BAM files poses a significant I/O burden. While it is possible to efficiently subsample a (random or otherwise) fraction of the sites because BAM files are indexed, this suboptimal approach may miss critical sites that are informative for copy number changes. More importantly, this heuristic is likely to miss many of the polymorphic sites that are informative for heterozygosity status. To alleviate the heavy data processing demands, we exploited a mature parallel programming framework, Apache Hadoop. Hadoop implements the MapReduce (MR) design, in which very large datasets are divided into subsets that are physically stored on multiple distributed data nodes of a cluster. By processing data in a Map stage, bandwidth demands on both shared storage and local area network links are minimized. After the Map step by the Mappers, which can be considered "embarrassingly parallel", data are summarized in a Reduce stage, which entails shuffling some data across the Mappers running on different compute nodes. In the context of CNV calling, we carry out the following pipeline, which consists of three MR components executed in sequential order:

Depth Caller MR: This component is the most data-intensive step, as it entails parsing allele-specific mapping and read quality scores at each base for every read across all sites of a whole genome. Mappers in this component use the Hadoop-BAM library [94] to extract single sequence alignment/map (SAM) records from the BAM files. SAM records are exposed by the Java API of HTSlib, which is distributed as part of SAMtools [95]. A Mapper aligns read information within a fixed-size view; in practice, we set the default to a 100 base pair view, though this is a user-adjustable parameter. Key-value pairs are emitted by the Mapper, where the key comprises the chromosome and the first base pair position of the view. One-byte integer arrays are emitted as values, in which the even indices (assuming a 0-based index scheme) store the allele codes and the odd indices store the respective Phred scores [96] of the allele stored in the preceding array element.
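The sketch below mimics the Depth Caller mapper's emission format in Python (the actual implementation is Java MapReduce built on Hadoop-BAM). The allele codes and the per-site emission granularity are illustrative assumptions; the real Mapper batches all sites of a view into a single interleaved byte array.

```python
# A Python mimic of the Depth Caller mapper's output format. Allele
# codes are assumed for illustration only.
ALLELE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
VIEW = 100  # default view size in base pairs (user-adjustable)

def map_read(chrom, pos, bases, phreds):
    """Emit one (key, value) pair per site: the key is (chromosome,
    first position of the 100 bp view); the value is a byte pair with
    the allele code at the even index and its Phred score at the odd
    index, mirroring the interleaved layout described above."""
    for offset, (base, q) in enumerate(zip(bases, phreds)):
        site = pos + offset
        key = (chrom, (site // VIEW) * VIEW)
        yield key, bytes([ALLELE_CODE[base], q])

# A 4 bp read starting at position 198 spans two adjacent views.
for key, value in map_read("chr1", 198, "ACGT", [30, 32, 31, 29]):
    print(key, list(value))
```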
Reducers collect all byte arrays corresponding to a chromosome and view, and output site- and allele-specific quality-weighted depth scores at base pair resolution, where the depth score is the sum of the quality scores converted from Phred scores, each computed as 1 - 10^(-0.1 x phred_score).

Region Binner MR: The output from the Depth Caller is enormous (it can be hundreds of GB), as it contains the cross product of the average number of alleles and the number of sites, with each record stored as ASCII text on HDFS. The statistics we are most interested in are the median overall depth count and the Hidden Markov Model (HMM) state-specific mean squared errors (SSMSE) conditional on the observed allelic intensity ratios. To compute the SSMSE, we first determine which sites within a bin are polymorphic with high confidence, based on the user-supplied VCF file. Here we adopted the BAF (B allele frequency) concept from PennCNV, defined as the fraction of reads carrying the alternative allele over all reads; however, to simplify matters, we take the minor allele as the "B" allele, so that BAF values are constrained to the range 0 to 0.5. Heterozygous sites are determined from the user-provided VCF file. The SSMSE is then calculated by averaging the squared errors for each state: SSMSE = BAF^2 for all sites with homozygous SNVs; otherwise,

$$\mathrm{SSMSE} = \begin{cases} \min\left(\mathrm{BAF}^2,\ (\mathrm{BAF}-0.25)^2,\ (\mathrm{BAF}-0.5)^2\right) & \text{if } CN = 0 \\ \mathrm{BAF}^2 & \text{if } CN = 1 \\ (\mathrm{BAF}-0.5)^2 & \text{if } CN = 2 \\ \mathrm{BAF}^2 & \text{if } CN = 2 \text{ with LOH} \\ (\mathrm{BAF}-0.33)^2 & \text{if } CN = 3 \\ \min\left((\mathrm{BAF}-0.25)^2,\ (\mathrm{BAF}-0.5)^2\right) & \text{if } CN = 4 \end{cases}$$

Output from this component's Reducer includes the start and end location of the bin, the median depth score, the SSMSE vector, and the total number of sites in the bin.

CNV Caller MR: At this point we treat the window-specific summary statistics described above as observed data points for CNV inference. For a given chromosome, the Reducer of this component takes all observed summary statistics, sorted by position, as input, where each window is represented as a node. The Dynamic Programming Imputation (DPI) method is a generalization of HMMs that has been successfully applied to CNV inference in microarrays [89]. In contrast to a standard HMM, where one assumes a probability distribution for each state in the emission matrix, the DPI method loosens this constraint by minimizing losses rather than maximizing a likelihood at each state. At each node, the median depth is transformed to a y value: first, the mean depth of the whole chromosome is calculated as μ and its standard deviation as σ; second, the depth at each node is normalized by subtracting μ and dividing by σ, i.e., y_i = (depth_i - μ) / σ. By default, there are a total of six states (s_0, s_1, s_2, s_3, s_4, s_5), corresponding to copy number 0, copy number 1, copy number 2 homozygous, copy number 2 heterozygous, copy number 3 and copy number 4, respectively, with expected y values of -1.0, -0.5, 0, 0, 0.5 and 1.0. The state space can be readily expanded to accommodate more copy numbers.

The loss function is defined as

$$L(\mathbf{s}) = \sum_{i=1}^{n} L_{depth}(y_i, s_i) + \alpha \sum_{i=1}^{n} L_{BAF}(x_i, s_i) + \lambda_1 \sum_{i=1}^{n} \left| \bar{y}(s_i) \right| + \lambda_2 \sum_{i=1}^{n} \left| \bar{y}(s_i) - \bar{y}(s_{i-1}) \right|$$

where L_depth(y_i, s_i) is the squared error of y_i from the expected y in state s_i, L_BAF(x_i, s_i) is the average SSMSE calculated in the previous step for state s_i, ȳ(s_i) is the expected y for state s_i, and λ1 and λ2 are tuning parameters for the penalties: a large λ1 penalizes any deviation from copy number 2, and a large λ2 penalizes a state transition. In practice, λ1 is set to 0.01 and λ2 to 1.6.
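To make the model concrete, here is a minimal Python sketch of the two computations above: the per-bin SSMSE term (for heterozygous sites; homozygous SNV sites would simply contribute BAF^2 in every state) and the dynamic programming minimization of this loss, as described in the next paragraph. The state names, toy values and NumPy layout are illustrative assumptions, not the actual HadoopCNV Java implementation.

```python
import numpy as np

# Six default states and their expected normalized depths (y):
# CN0, CN1, CN2 homozygous (LOH), CN2 heterozygous, CN3, CN4.
STATES = ["CN0", "CN1", "CN2_LOH", "CN2", "CN3", "CN4"]
EXPECTED_Y = np.array([-1.0, -0.5, 0.0, 0.0, 0.5, 1.0])

def site_ssmse(baf, state):
    """Squared BAF error of one heterozygous site under one state,
    following the piecewise definition above (BAF folded into [0, 0.5])."""
    if state == "CN0":
        return min(baf**2, (baf - 0.25)**2, (baf - 0.5)**2)
    if state in ("CN1", "CN2_LOH"):
        return baf**2
    if state == "CN2":
        return (baf - 0.5)**2
    if state == "CN3":
        return (baf - 0.33)**2
    return min((baf - 0.25)**2, (baf - 0.5)**2)   # CN4

def bin_ssmse(bafs):
    """Average the per-site squared errors within one bin, per state."""
    return np.array([sum(site_ssmse(b, s) for b in bafs) / len(bafs)
                     for s in STATES])

def minimize_loss(y, ssmse, alpha=1.0, lam1=0.01, lam2=1.6):
    """Viterbi-style dynamic programming over the loss
    L(s) = sum L_depth + alpha * sum L_BAF + lam1 * sum |ybar(s_i)|
         + lam2 * sum |ybar(s_i) - ybar(s_{i-1})|."""
    n, k = len(y), len(STATES)
    local = ((y[:, None] - EXPECTED_Y[None, :]) ** 2    # L_depth
             + alpha * ssmse                            # L_BAF
             + lam1 * np.abs(EXPECTED_Y)[None, :])      # CN2 deviation
    trans = lam2 * np.abs(EXPECTED_Y[:, None] - EXPECTED_Y[None, :])
    cost, back = local[0], np.zeros((n, k), dtype=int)
    for i in range(1, n):
        step = cost[:, None] + trans    # rows index the previous state
        back[i] = step.argmin(axis=0)
        cost = step.min(axis=0) + local[i]
    path = [int(cost.argmin())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [STATES[s] for s in reversed(path)]

# Toy data: two heterozygous-looking bins, then four deletion-like bins.
y = np.array([0.02, -0.03, -0.52, -0.50, -0.47, -0.51])
het, hom = [0.48, 0.50], [0.02, 0.05]
ssmse = np.vstack([bin_ssmse(het), bin_ssmse(het)] + [bin_ssmse(hom)] * 4)
print(minimize_loss(y, ssmse))   # CN2, CN2, then CN1 for the last four
```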
This loss function is minimized by dynamic programming from the first node to the last, with discrete states at each node; the Viterbi algorithm is used to retrieve the optimal state at each node.

Post-processing: The output from the steps above is a copy number and state for each bin across the whole genome. We post-process these calls by recursively merging neighboring deletions or duplications whenever the gap between them is less than 20% of their total length, until no two deletions or duplications can be merged further. By default, we filter out CNVs shorter than 500 bp, as they are more likely caused by noise, given that HadoopCNV does not use split-read or paired-end distance information. HadoopCNV is therefore complementary to methods that leverage split-read or paired-end distance information, and their combined use may help reveal the full range of CNV sizes.

CNV calling on WGS data by other software

CNVnator was run with a 100 bp bin, as recommended in its README file; it uses a "mean-shift" algorithm from image processing to call CNVs. LUMPY was run with the recommended pipeline from its documentation at https://github.com/arq5x/lumpy-sv; it focuses on paired-end read and split-read information to infer breakpoints and CNVs. For LUMPY, the output contains two breakpoint estimates per CNV boundary, which are averaged into a single point for the start or end of the CNV.

Performance evaluation

All CNV calls from each tool were first compiled into a standard format and then tested with the same comparison script. The CNV calls were first sorted by physical position, and overlapping CNVs were merged into a single CNV. The precision and recall for NA12878 CNVs were calculated against the published gold standard CNV calls as true positives [97]. Two different standards were used to determine overlap: 1) the touch-based standard treats two CNVs as overlapping if they share at least a single base pair; 2) the 50% reciprocal standard counts two CNVs as overlapping only if their shared length is no less than 50% of their combined (union) length. The concordance rate is calculated as the number of overlapping CNVs divided by the total number of CNVs.

Availability

HadoopCNV is available at https://github.com/WGLab/HadoopCNV. Users need to install Hadoop 2.0+ on the system, which can be a computing cluster with many nodes or merely a single machine; users without access to large-scale clusters can therefore still run HadoopCNV in single-node mode.

Result

Overview of the approach

The basic framework of HadoopCNV is described below (Figure 13). Since whole-genome sequencing (WGS) data are generally large, a scalable infrastructure for generating CNV calls is critical, especially when a large number of samples must be processed in a short period of time. We adopted Apache Hadoop (http://hadoop.apache.org/), an open-source infrastructure that allows reliable and scalable distributed processing of very large data sets across clusters of multiple computers, using HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator) and the MapReduce programming model. HDFS is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers.
YARN is a framework for job scheduling and cluster resource management, covering tasks such as managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high-availability features of Hadoop. HadoopCNV takes the BAM and VCF files of a WGS dataset as input. The BAM file needs to be available on HDFS, while the corresponding VCF file can be located anywhere on the user's local file system. The CNV calling algorithm is then executed on multiple machines simultaneously under the MapReduce paradigm. The output from HadoopCNV is a list of regions, each annotated as a deletion or duplication with the corresponding copy number. HadoopCNV is freely available at http://www.github.com/WGLab/HadoopCNV.

Figure 13. Overview of HadoopCNV. All BAM files are stored in the Hadoop File System, while the VCF file can be stored on a local hard drive. 1) Alignments are extracted from each BAM file and depth information is generated for each bin in each chromosome by the Depth Caller MapReduce. 2) The reads for each bin in each chromosome are aggregated and, with the help of the VCF information, the median depth and SSMSE are extracted for each bin by the Region Binner MapReduce. 3) Finally, the median depth and SSMSE are aggregated for each chromosome, and the copy number for each position range is calculated, which is the final output of the software.

Speed comparison with other CNV calling methods

To illustrate the speed advantage of HadoopCNV, we compared the running times of LUMPY [98], CNVnator [80] and HadoopCNV on a BAM file with ~30X coverage for NA12878, generated by the 1000 Genomes Project. We recorded timestamps before each CNV calling program was started and after it finished. Since the performance of any Hadoop system depends on the number of data nodes, we measured the running time of HadoopCNV on clusters with 2, 6, 10, 16 and 32 data nodes. With 32 data nodes, HadoopCNV takes only 1.6 hours to process one whole genome, about one sixth of the time used by CNVnator (Figure 14). Meanwhile, we confirmed that the speed scales approximately linearly with the number of data nodes in the Hadoop cluster, demonstrating the efficacy of our data-parallel algorithm.

Figure 14. Speed comparison for HadoopCNV with 2, 6, 10, 16 and 32 nodes, CNVnator and LUMPY. HadoopCNV takes only about 1.6 hours for one WGS dataset with 32 nodes. The number of replicate experiments is 3 for HadoopCNV and 5 for CNVnator and LUMPY. Error bars represent the standard deviation.

Evaluation on simulated datasets

To evaluate the accuracy of different CNV callers, including HadoopCNV, we simulated 20 deletions and 20 duplications of different sizes in each chromosome (480 CNVs in total) and then simulated individual genomes with these CNVs (see Methods for details). From the genome sequence, we then simulated sequencing reads, consisting of 100-bp reads in FASTQ files at 40X genome coverage. We mapped these reads to the human genome (build hg19) with BWA [93], generated BAM files, and then tested HadoopCNV, CNVnator and LUMPY on these files. This approach allows us to evaluate the tools' concordance with the ground truth and to assess how performance changes as the minimal CNV length threshold increases. We assessed two types of "concordance" measures: touch (any overlap) and 50% reciprocal overlap (see Methods for details).
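A minimal sketch of the two overlap standards and the concordance rate as defined in the Methods. CNV calls are represented as hypothetical (start, end) intervals on one chromosome, and the union-based reading of the 50% reciprocal rule follows the wording in the Methods.

```python
def touch(a, b):
    """Touch standard: two CNVs overlap if they share at least 1 bp.
    Each CNV is a (start, end) interval on the same chromosome."""
    return max(a[0], b[0]) < min(a[1], b[1])

def reciprocal_50(a, b):
    """50% reciprocal standard as worded in the Methods: the shared
    length must be at least 50% of the combined (union) length."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return False
    union = max(a[1], b[1]) - min(a[0], b[0])
    return shared >= 0.5 * union

def concordance(calls, truth, standard):
    """Concordance rate: overlapping calls over total calls."""
    hits = sum(any(standard(c, t) for t in truth) for c in calls)
    return hits / len(calls) if calls else 0.0

truth = [(1_000, 3_000), (10_000, 15_000)]
calls = [(1_200, 2_900), (9_000, 11_000), (40_000, 41_000)]
print(concordance(calls, truth, touch))          # 2/3
print(concordance(calls, truth, reciprocal_50))  # 1/3
```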
For deletions, HadoopCNV has consistently higher concordance with the ground truth than CNVnator under both standards, and performance similar to LUMPY's. For duplications, HadoopCNV performs similarly to CNVnator and LUMPY; when the length threshold exceeds 15 kb, we observed a consistent performance advantage of HadoopCNV over the other tools (Figure 15f-i). In addition, our approach of combining HadoopCNV and LUMPY achieves better performance than any individual tool in every scenario. The length distributions of the CNV calls show that these tools are comparable with one another (Figure 15a-d). We also generated an IGV (Integrative Genomics Viewer) snapshot of all the CNVs located on chromosome 21. From this snapshot, we noticed that CNVnator made a very large deletion call; after further investigation, we found that it comes from a region consisting entirely of Ns in the reference genome. In contrast, HadoopCNV and LUMPY successfully skipped this region (Figure 15e).

Figure 15. Comparison of HadoopCNV, CNVnator and LUMPY on the simulated samples with CNVs. The length distributions of the ground-truth CNVs used to generate the simulations (a), the HadoopCNV output (b), the CNVnator output (c) and the LUMPY output (d). A snapshot of all the deletions and duplications on chromosome 21 (e). The concordance between the output CNVs and the ground truth at different length thresholds is shown for deletions with the 50% reciprocal standard (f) and the touch standard (g), and for duplications with the 50% reciprocal standard (h) and the touch standard (i). A snapshot of the ground-truth LOHs and the ones called by HadoopCNV is also shown (j).

Detection of loss of heterozygosity (LOH)

Due to its unique use of allelic information in sequence alignments, HadoopCNV is able to identify copy-neutral loss of heterozygosity (LOH) in the genome by treating it as a separate state in the model. Large LOH events typically signify genomic segments that are identical by descent (for example, due to consanguinity) or created by uniparental disomy, and may be of clinical significance, especially when the LOH region harbors disease variants for recessive diseases [99]. In comparison, other tools such as CNVnator and LUMPY cannot identify LOH in their output. To demonstrate the ability of HadoopCNV to identify LOHs, we simulated two LOHs of varying sizes on each chromosome (see Methods). HadoopCNV was able to identify all these regions, although in one case on chromosome 4 it called two LOHs instead of one (Figure 15j). The overall concordance based on the 50% reciprocal overlap and touch standards is 91.3% and 95.6%, respectively. In summary, through the use of allelic information, HadoopCNV is capable of identifying copy-neutral LOH events, which may be missed by other CNV callers that ignore such information.

Evaluation on the NA12878 sample

To evaluate the performance of HadoopCNV on real data, we first used the NA12878 sample from the 1000 Genomes Project, generated on the Illumina HiSeq, for comparison. We used this sample because it has been comprehensively studied in multiple previous CNV studies [100]. We also downloaded the two whole-genome samples for NA12878 that were sequenced on the HiSeq X Ten. The average concordance rate across these three samples is used for comparison. Moreover, a "gold standard" dataset has previously been compiled from NA12878 [97].
However, we acknowledge that although this "gold standard" set is widely used in the literature, it might miss many true CNV events, as most CNVs in the gold standard are required to be verified by multiple calling methods. Thus, we only conducted a recall comparison here, to evaluate how many gold standard CNVs can be captured by each method. We compared our method with LUMPY and CNVnator, which are among the most widely used CNV calling methods for WGS data. Our results indicate that CNVnator generally has slightly better recall, while HadoopCNV performs solidly and is consistently better than LUMPY for both deletions and duplications (Figure 16a-d). The combination of HadoopCNV and LUMPY is less informative in this comparison, because calling more CNVs necessarily yields better recall, in contrast to the concordance comparison. The CNV length distributions are also shown for the gold standard and the outputs of the three tools; LUMPY has a relatively narrow distribution, owing to its distinctive split-read and read-pair method (Figure 16e-h). Combining the results from the simulated data and the NA12878 data, we conclude that HadoopCNV has performance comparable to the commonly used read-depth based CNV caller CNVnator, and even better performance for large CNVs.

Figure 16. The recall comparison for the gold standard CNV calls of NA12878 among HadoopCNV, CNVnator and LUMPY, calculated by averaging the results of three different NA12878 sequencing samples. The recall comparison is shown for deletions with the 50% reciprocal overlap standard (a) and the touch standard (b), and for duplications with the 50% reciprocal standard (c) and the touch standard (d). The CNV length distributions are shown as histograms for the gold standard (e), the HadoopCNV output (f), the CNVnator output (g) and the LUMPY output (h). For better display, CNVs larger than 100 kb are categorized into the 100 kb bin. Error bars represent the standard deviation across the three samples.

Comparison of CNV calling accuracy on a 10-member pedigree

To further compare the performance of HadoopCNV against other existing software programs, we analyzed a WGS dataset generated from a 10-member pedigree, previously sequenced at ~50X coverage per individual. We used this pedigree to evaluate the Mendelian precision of the different CNV callers, as a measure of false positive/negative rates for CNV calls. We stress that NA12878 has been used by almost all CNV publications to date, so it is likely that many other software tools have already been tuned on NA12878 to achieve good performance against the published "gold standard". It is therefore important to have a different dataset for an unbiased evaluation of the performance of different CNV callers. Additionally, since the vast majority of CNVs are inherited, Mendelian precision, the percentage of a child's CNVs that are also found in the parents, is a reliable measure of the precision of CNV calling algorithms. To evaluate Mendelian precision, we selected all six trios in the pedigree (Figure 17e) and calculated the proportion of CNVs not detected in the parents, based on the 50% reciprocal overlap and touch standards. For each comparison, we filtered CNV calls with different minimal length thresholds. We hypothesized that the higher the length threshold, the more likely a Mendelian-inconsistent CNV call reflects a false positive in the offspring or a false negative in the parents.
We acknowledge the potential existence of de novo CNVs; however, given the very low prior probability of a de novo event, we ignore this possibility in this evaluation. We calculated the Mendelian precision as one minus the Mendelian error rate and compared HadoopCNV, CNVnator and LUMPY (Figure 17a-d). This comparative analysis demonstrated that HadoopCNV is comparable with CNVnator and LUMPY. At low length thresholds, LUMPY generally has slightly better Mendelian precision: for example, at the 200 bp threshold, LUMPY has 80% precision for duplications, compared with around 50% for HadoopCNV and CNVnator, based on 50% reciprocal overlap (Figure 17c). At high length thresholds, LUMPY performs slightly worse than HadoopCNV and CNVnator, probably because of the different underlying algorithms.

Figure 17. Comparison of the average Mendelian precision (1 - Mendelian error rate) based on the 10-member family CNV calls. The Mendelian precision is shown for deletions with the 50% reciprocal standard (a) and the touch standard (b), and for duplications with the 50% reciprocal standard (c) and the touch standard (d). The 10-member family tree and the six trios are shown (e). The CNV length distributions for all CNVs from these 10 individuals are shown for HadoopCNV (f), CNVnator (g) and LUMPY (h). For better display, CNVs larger than 100 kb are categorized into the 100 kb bin. Error bars represent the standard deviation across the six trios.

Discussion

In this study, we applied the Hadoop framework to implement a highly parallel and scalable CNV calling tool called HadoopCNV. We compared the performance of HadoopCNV with other CNV calling tools in terms of both computational speed and calling accuracy. For speed, HadoopCNV running on a 32-node Hadoop cluster is at least 6 times faster than CNVnator. On simulated CNVs, HadoopCNV performs slightly better than CNVnator, especially for large deletions. Against the NA12878 gold standard, HadoopCNV has comparable performance. For Mendelian precision, CNVnator, LUMPY and HadoopCNV perform similarly.

Compared to other CNV calling methods, HadoopCNV has several advantages. First, we utilize the Hadoop framework in our software, enabling us to call CNVs in parallel, which substantially speeds up the CNV calling process. From our timestamp evaluation, with 32 data nodes our CNV caller is at least 6 times faster than CNVnator, which to our knowledge is one of the most commonly used CNV callers for WGS data. Based on our time benchmarking, the more nodes available, the faster HadoopCNV runs, and the speedup appears to be linearly scalable. To the best of our knowledge, no other tool uses a Hadoop framework for highly parallel CNV calling, so the advantage of HadoopCNV will be even more obvious when handling large-scale datasets such as thousands of whole genomes. Second, we emphasize the importance of a scalable solution in the big data era of human genomics. Like CNV calling, other tasks such as SNV calling, indel calling, read alignment and even plain data storage are difficult to scale under current practices in genome analysis (i.e., local file storage and SGE-based parallelization), especially as we face thousands, tens of thousands, or even hundreds of thousands of WGS samples in the future. A highly scalable big data framework and standard are therefore urgently needed, and such a framework must be adopted by the scientific community to prepare for the future.
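As a concrete illustration of the Mendelian precision measure described above, the sketch below computes, for one trio, the fraction of a child's CNV calls that overlap a call in at least one parent under a given overlap standard (for example, the touch or reciprocal_50 functions from the earlier sketch). The interval representation and the either-parent rule are simplifying assumptions.

```python
# A minimal sketch of the Mendelian precision computation for one trio,
# assuming CNV calls are (start, end) intervals on one chromosome and
# that overlap with either parent counts as inherited.
def mendelian_precision(child, father, mother, standard, min_len=0):
    """Fraction of the child's calls (of length >= min_len) found in at
    least one parent; the Mendelian error rate is 1 minus this value."""
    calls = [c for c in child if c[1] - c[0] >= min_len]
    parental = list(father) + list(mother)
    inherited = sum(any(standard(c, p) for p in parental) for c in calls)
    return inherited / len(calls) if calls else float("nan")
```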
Third, HadoopCNV uses both read depth information and alternative allele frequency information, integrating them into a single coherent model for more powerful detection of CNVs. To our knowledge, this is among the first computational approaches to integrate allelic intensity ratio information into CNV calling from WGS data.

HadoopCNV was developed on Apache Hadoop, an open-source software infrastructure for distributed, scalable and reliable computing. One potential improvement is that HadoopCNV could be further sped up by storing the variant information from VCF files and the read depth of every site in the genome from BAM files in a NoSQL database, such as HBase, Cassandra or MongoDB. We previously built a variant annotation warehouse called SeqHBase [90] that performs a similar function in HBase, and we may implement this feature in a future release of HadoopCNV. However, we point out that although HadoopCNV achieves performance similar to LUMPY's, it is not adept at catching small CNVs, especially those of a few hundred base pairs, because of its 100 bp binning method, a limitation shared with CNVnator. We therefore developed a novel method to combine the HadoopCNV and LUMPY results, and demonstrated that the combined method performs better than any of the compared tools alone.

In summary, HadoopCNV is an efficient and reliable CNV calling program. It demonstrated satisfactory performance on WGS data from a 10-member pedigree and from the 1000 Genomes Project. With the rapid decline of WGS sequencing costs and the continued adoption of WGS in clinical settings, we believe that HadoopCNV will play an important role in genome analysis and can facilitate the implementation of genomic medicine.

Conclusion

During my PhD, I co-designed and developed the tool and system called Phenolyzer, which integrates more than 14 databases, including disease-gene, gene-gene, disease ontology, phenotype ontology and phenotype-disease databases. Based on Phenolyzer, I designed the NeuroComplex logic for mental disorder variant and gene analysis. With the help of our collaborators, we implemented one tool under this logic, called PsyPPRes, which integrates the PSD protein complex interaction data produced by our collaborators and also helped us discover differential clustering of PSD complexes during development. Lastly, I helped develop a tool for parallelized CNV calling within the Hadoop framework.

In summary, all my projects are informatics-centric, aimed at data analysis and information extraction from large-scale genomics data, especially NGS data. As mentioned in Chapter 1, NGS has become increasingly popular, leading to a large number of discoveries but also posing many challenges and difficulties for biologists. My PhD work mainly simplifies the information extraction process after a patient's variants have been identified, integrating a large amount of prior knowledge within a carefully designed framework and using machine learning to assign weights. My PhD work also demonstrates, for the first time, the feasibility of using the advanced parallelization framework Hadoop for CNV analysis, speeding the process up 16-fold or even more with additional compute nodes. NGS data provide a new vision for neuroscience as well.
A large number of GWAS loci and risk variants have been reported for autism, schizophrenia, depression and other conditions, gradually painting a more complicated picture of these major psychiatric disorders and creating a strong demand for large-scale data analysis. Thus, in addition to its general contribution to NGS methodology, our Phenolyzer-based NeuroComplex framework and logic could facilitate neuroscience research, especially mental disorder research; as demonstrated by the PsyPPRes implementation, it enables visualization of and intuitive access to our PSD complex protein interaction data. Phenolyzer itself can also be used for mental disorder gene prioritization, such as in autism, while HadoopCNV is well suited to large-scale CNV discovery in mental disorders. My PhD work thus provides a substantial resource and toolset for the neuroscience field, and will likely contribute to novel neuroscience discoveries in the future.

References

1. Markowetz F: All biology is computational biology. PLoS biology 2017, 15(3):e2002050.
2. Yang H, Robinson PN, Wang K: Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nature methods 2015, 12(9):841-843.
3. Yang H, Wang K: Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nature protocols 2015, 10(10):1556-1566.
4. Lyon GJ, Wang K: Identifying disease mutations in genomic medicine settings: current challenges and how to accelerate progress. Genome Med 2012, 4(7):58.
5. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research 2010, 38(16):e164.
6. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 2012, 6(2):80-92.
7. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F: Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 2010, 26(16):2069-2070.
8. Jäger M, Wang K, Bauer S, Smedley D, Krawitz P, Robinson PN: Jannovar: a Java library for exome annotation. Human mutation 2014, 35(5):548-555.
9. Habegger L, Balasubramanian S, Chen DZ, Khurana E, Sboner A, Harmanci A, Rozowsky J, Clarke D, Snyder M, Gerstein M: VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment. Bioinformatics 2012, 28(17):2267-2269.
10. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nature Reviews Genetics 2011, 12(11):745-755.
11. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent L-C, De Moor B, Marynen P, Hassan B: Gene prioritization through genomic data fusion. Nature biotechnology 2006, 24(5):537-544.
12. Schlicker A, Lengauer T, Albrecht M: Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics 2010, 26(18):i561-i567.
13. Singleton MV, Guthery SL, Voelkerding KV, Chen K, Kennedy B, Margraf RL, Durtschi J, Eilbeck K, Reese MG, Jorde LB: Phevor Combines Multiple Biomedical Ontologies for Accurate Identification of Disease-Causing Alleles in Single Individuals and Small Nuclear Families. The American Journal of Human Genetics 2014, 94(4):599-610.
14. Javed A, Agrawal S, Ng PC: Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nature Methods 2014.
15. Safran M, Dalah I, Alexander J, Rosen N, Stein TI, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H: GeneCards Version 3: the human gene integrator. Database 2010, 2010:baq020.
16. Makita Y, Kobayashi N, Yoshida Y, Doi K, Mochizuki Y, Nishikata K, Matsushima A, Takahashi S, Ishii M, Takatsuki T: PosMed: ranking genes and bioresources based on Semantic Web Association Study. Nucleic acids research 2013, 41(W1):W109-W114.
17. Yue P, Melamud E, Moult J: SNPs3D: candidate gene and SNP selection for association studies. BMC bioinformatics 2006, 7(1):166.
18. Köhler S, Schulz MH, Krawitz P, Bauer S, Dölken S, Ott CE, Mundlos C, Horn D, Mundlos S, Robinson PN: Clinical diagnostics in human genetics with semantic similarity searches in ontologies. The American Journal of Human Genetics 2009, 85(4):457-464.
19. Robinson PN, Köhler S, Oellrich A, Wang K, Mungall CJ, Lewis SE, Washington N, Bauer S, Seelow D, Krawitz P: Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome research 2014, 24(2):340-348.
20. Amberger J, Bocchini C, Hamosh A: A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Human mutation 2011, 32(5):564-567.
21. Rath A, Olry A, Dhombres F, Brandt MM, Urbero B, Ayme S: Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human mutation 2012, 33(5):803-808.
22. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR: ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research 2014, 42(D1):D980-D985.
23. Pagon RA, Adam MP, Bird TD, Dolan CR, Fong C-T, Smith RJ, Stephens K: GeneReviews™. 1993.
24. Hindorff LA, Junkins HA, Mehta J, Manolio T: A catalog of published genome-wide association studies. National Human Genome Research Institute 2011.
25. Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi T, Chandrika K, Deshpande N, Suresh S: Human protein reference database as a discovery resource for proteomics. Nucleic acids research 2004, 32(suppl 1):D497-D501.
26. Geer LY, Marchler-Bauer A, Geer RC, Han L, He J, He S, Liu C, Shi W, Bryant SH: The NCBI biosystems database. Nucleic acids research 2010, 38(suppl 1):D492-D496.
27. Seal RL, Gordon SM, Lush MJ, Wright MW, Bruford EA: genenames.org: the HGNC resources in 2011. Nucleic acids research 2011, 39(suppl 1):D514-D519.
28. Bovolenta LA, Acencio ML, Lemke N: HTRIdb: an open-access database for experimentally verified human transcriptional regulation interactions. BMC genomics 2012, 13(1):405.
29. Davis AP, Wiegers TC, Rosenstein MC, Mattingly CJ: MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database. Database: The Journal of Biological Databases & Curation 2012, 2012.
30. Schriml LM, Arze C, Nadendla S, Chang Y-WW, Mazaitis M, Felix V, Feng G, Kibbe WA: Disease Ontology: a backbone for disease semantic integration. Nucleic acids research 2012, 40(D1):D940-D946.
31. Robinson PN, Mundlos S: The human phenotype ontology. Clinical genetics 2010, 77(6):525-534.
32. Burren OS, Adlem EC, Achuthan P, Christensen M, Coulson RM, Todd JA: T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research. Nucleic Acids Res 2011, 39(Database issue):D997-1001.
33. Lim JE, Hong K-W, Jin H-S, Kim YS, Park HK, Oh B: Type 2 diabetes genetic association database manually curated for the study design and odds ratio. BMC medical informatics and decision making 2010, 10(1):76.
34. Elding H, Lau W, Swallow DM, Maniatis N: Refinement in localization and identification of gene regions associated with Crohn disease. The American Journal of Human Genetics 2013, 92(1):107-113.
35. Liu H, Liu W, Liao Y, Cheng L, Liu Q, Ren X, Shi L, Tu X, Wang QK, Guo A-Y: CADgene: a comprehensive database for coronary artery disease genes. Nucleic acids research 2011, 39(suppl 1):D991-D996.
36. Chial H: Rare genetic disorders: Learning about genetic disease through gene mapping, SNPs, and microarray data. Nature education 2008, 1(1):192.
37. Saunders CJ, Miller NA, Soden SE, Dinwiddie DL, Noll A, Alnadi NA, Andraws N, Patterson ML, Krivohlavek LA, Fellis J: Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Science translational medicine 2012, 4(154):154ra135.
38. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research 2011, 39(suppl 1):D945-D950.
39. De Rubeis S, He X, Goldberg AP, Poultney CS, Samocha K, Cicek AE, Kou Y, Liu L, Fromer M, Walker S: Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 2014, 515(7526):209-215.
40. Zhang R, Luan M, Shang Z, Duan L, Tang G, Shi M, Lv W, Zhu H, Li J, Lv H: RADB: a database of rheumatoid arthritis-related polymorphisms. Database 2014, 2014:bau090.
41. van der Harst P, Zhang W, Leach IM, Rendon A, Verweij N, Sehmi J, Paul DS, Elling U, Allayee H, Li X: Seventy-five genetic loci influencing the human red blood cell. Nature 2012, 492(7429):369-375.
42. Brastianos PK, Taylor-Weiner A, Manley PE, Jones RT, Dias-Santagata D, Thorner AR, Lawrence MS, Rodriguez FJ, Bernardo LA, Schubert L: Exome sequencing identifies BRAF mutations in papillary craniopharyngiomas. Nature genetics 2014, 46(2):161-165.
43. Chudasama KK, Winnay J, Johansson S, Claudi T, König R, Haldorsen I, Johansson B, Woo JR, Aarskog D, Sagen JV: SHORT syndrome with partial lipodystrophy due to impaired phosphatidylinositol 3 kinase signaling. The American Journal of Human Genetics 2013, 93(1):150-157.
44. Yang T-L, Chen X-D, Guo Y, Lei S-F, Wang J-T, Zhou Q, Pan F, Chen Y, Zhang Z-X, Dong S-S: Genome-wide Copy-Number-Variation Study Identified a Susceptibility Gene, UGT2B17, for Osteoporosis. The American Journal of Human Genetics 2008, 83(6):663-674.
45. Chang X, Wang K: wANNOVAR: annotating genetic variants for personal genomes via the web. Journal of medical genetics 2012, 49(7):433-436.
46. Lyon GJ, Jiang T, Van Wijk R, Wang W, Bodily PM, Xing J, Tian L, Robison RJ, Clement M, Lin Y et al: Exome sequencing and unrelated findings in the context of complex disease research: ethical and clinical implications. Discov Med 2011, 12(62):41-55.
47. Bragin E, Chatzimichali EA, Wright CF, Hurles ME, Firth HV, Bevan AP, Swaminathan GJ: DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic acids research 2013:gkt937.
48. Calderone A, Castagnoli L, Cesareni G: mentha: a resource for browsing integrated protein-interaction networks. Nature methods 2013, 10(8):690-691.
49. Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nature genetics 2004, 36(5):431-432.
50. Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI: DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics 2010, 26(22):2924-2926.
51. Cross-national comparisons of the prevalences and correlates of mental disorders. WHO International Consortium in Psychiatric Epidemiology. Bull World Health Organ 2000, 78(4):413-426.
52. Kessler RC, Berglund P, Demler O, Jin R, Merikangas KR, Walters EE: Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Arch Gen Psychiatry 2005, 62(6):593-602.
53. Akil H, Brenner S, Kandel E, Kendler KS, King M-C, Scolnick E, Watson JD, Zoghbi HY: The future of psychiatric research: genomes and neural circuits. Science (New York, NY) 2010, 327(5973):1580.
54. Coba MP, Komiyama NH, Nithianantharajah J, Kopanitsa MV, Indersmitten T, Skene NG, Tuck EJ, Fricker DG, Elsegood KA, Stanford LE: TNiK is required for postsynaptic and nuclear signaling pathways and cognitive function. Journal of Neuroscience 2012, 32(40):13987-13999.
55. Schizophrenia Working Group of the Psychiatric Genomics Consortium: Biological insights from 108 schizophrenia-associated genetic loci. Nature 2014, 511(7510):421-427.
56. Kirov G, Pocklington A, Holmans P, Ivanov D, Ikeda M, Ruderfer D, Moran J, Chambert K, Toncheva D, Georgieva L: De novo CNV analysis implicates specific abnormalities of postsynaptic signalling complexes in the pathogenesis of schizophrenia. Molecular psychiatry 2012, 17(2):142.
57. Sullivan PF: The psychiatric GWAS consortium: big science comes to psychiatry. Neuron 2010, 68(2):182-186.
58. Fromer M, Pocklington AJ, Kavanagh DH, Williams HJ, Dwyer S, Gormley P, Georgieva L, Rees E, Palta P, Ruderfer DM: De novo mutations in schizophrenia implicate synaptic networks. Nature 2014, 506(7487):179-184.
59. Yang H, Chen G, Lima L, Fang H, Jimenez L, Li M, Lyon GJ, He M, Wang K: HadoopCNV: A Dynamic Programming Imputation Algorithm To Detect Copy Number Variants From Sequencing Data. bioRxiv 2017:124339.
60. Zhao M, Wang Q, Wang Q, Jia P, Zhao Z: Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC bioinformatics 2013, 14(Suppl 11):S1.
61. Zarrei M, MacDonald JR, Merico D, Scherer SW: A copy number variation map of the human genome. Nat Rev Genet 2015, 16(3):172-183.
62. Estivill X, Armengol L: Copy number variants and common disorders: filling the gaps and exploring complexity in genome-wide association studies. PLoS Genet 2007, 3(10):1787-1799.
63. Girirajan S, Campbell CD, Eichler EE: Human copy number variation and complex genetic disease. Annu Rev Genet 2011, 45:203-226.
64. Stankiewicz P, Lupski JR: Structural variation in the human genome and its role in disease. Annu Rev Med 2010, 61:437-455.
65. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome research 2007, 17(11):1665-1674.
66. Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J: QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007, 35(6):2013-2025.
67. Winchester L, Yau C, Ragoussis J: Comparing CNV detection methods for SNP arrays. Brief Funct Genomic Proteomic 2009, 8(5):353-366.
Zhang X, Du R, Li S, Zhang F, Jin L, Wang H: Evaluation of copy number variation detection for a SNP array platform. BMC Bioinformatics 2014, 15:50. 69. Marenne G, Rodriguez-Santiago B, Closas MG, Perez-Jurado L, Rothman N, Rico D, Pita G, Pisano DG, Kogevinas M, Silverman DT et al: Assessment of copy number variation using the Illumina Infinium 1M SNP-array: a comparison of methodological approaches in the Spanish Bladder Cancer/EPICURO study. Hum Mutat 2011, 32(2):240-248. 70. Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, Lionel AC, Thiruvahindrapuram B, Macdonald JR, Mills R et al: Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 2011, 29(6):512-520. 64 / 66 71. Zhao M, Wang Q, Jia P, Zhao Z: Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 2013, 14 Suppl 11:S1. 72. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods 2009, 6(9):677-681. 73. Abyzov A, Gerstein M: AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics 2011, 27(5):595-603. 74. Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics 2009, 25(12):i222-i230. 75. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009, 25(21):2865-2871. 76. Layer RM, Chiang C, Quinlan AR, Hall IM: LUMPY: a probabilistic framework for structural variant discovery. Genome biology 2014, 15(6):R84. 77. Xie C, Tammi MT: CNV-seq, a new method to detect copy number variation using high- throughput sequencing. BMC Bioinformatics 2009, 10:80. 78. Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA et al: Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acac Sci U S A 2011, 108(46):E1128-1136. 79. Magi A, Benelli M, Yoon S, Roviello F, Torricelli F: Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res 2011, 39(10):e65. 80. Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome research 2011, 21(6):974-984. 81. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics 2012, 44(2):226-232. 82. Chen K, Chen L, Fan X, Wallis J, Ding L, Weinstock G: TIGRA: a targeted iterative graph routing assembler for breakpoint assembly. Genome Res 2014, 24(2):310-317. 83. Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, Handsaker RE, McCarroll SA, O'Donovan MC, Owen MJ et al: Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 2012, 91(4):597-607. 84. Jiang Y, Oldridge DA, Diskin SJ, Zhang NR: CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic acids research 2015, 43(6):e39-e39. 65 / 66 85. 
Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011, 27(19):2648-2654. 86. Love MI, Mysickova A, Sun R, Kalscheuer V, Vingron M, Haas SA: Modeling read counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol 2011, 10(1). 87. Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjonnfjord GE, Stadheim B, Stray-Pedersen A, Rodningen OK, Lyle R: Identification of copy number variants from exome sequence data. BMC Genomics 2014, 15:661. 88. Nordberg H, Bhatia K, Wang K, Wang Z: BioPig: a Hadoop-based analytic toolkit for large- scale sequence data. Bioinformatics 2013:btt528. 89. Zhang Z, Lange K, Ophoff R, Sabatti C: Reconstructing DNA Copy Number by Penalized Estimation and Imputation. The annals of applied statistics 2010, 4(4):1749-1773. 90. He M, Person TN, Hebbring SJ, Heinzen E, Ye Z, Schrodi SJ, McPherson EW, Lin SM, Peissig PL, Brilliant MH et al: SeqHBase: a big data toolset for family based sequencing data analysis. J Med Genet 2015. 91. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491(7422):56-65. 92. Huang W, Li L, Myers JR, Marth GT: ART: a next-generation sequencing read simulator. Bioinformatics 2012, 28(4):593-594. 93. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760. 94. Niemenmaa M, Kallio A, Schumacher A, Klemela P, Korpelainen E, Heljanko K: Hadoop- BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 2012, 28(6):876-877. 95. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078- 2079. 96. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome research 1998, 8(3):175-185. 97. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK: Mapping copy number variation by population-scale genome sequencing. Nature 2011, 470(7332):59-65. 98. Layer RM, Chiang C, Quinlan AR, Hall IM: LUMPY: a probabilistic framework for structural variant discovery. Genome biology 2014, 15(6):R84. 99. Young T-L, Ives E, Lynch E, Person R, Snook S, MacLaren L, Cator T, Griffin A, Fernandez B, Lee MK: Non-syndromic progressive hearing loss DFNA38 is caused by heterozygous missense mutation in the Wolfram syndrome gene WFS1. Human molecular genetics 2001, 10(22):2509-2514. 66 / 66 100. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome research 2009, 19(9):1586-1592.
Abstract
Before presenting my thesis, I want to acknowledge the importance and contributions of computational biologists, whose efforts are at times difficult for traditional biologists to appreciate. As next-generation sequencing has become increasingly widespread, it has brought the promise that a series of difficult but vital questions in basic biology can be answered and that several severe diseases can be treated. In 2015, the Obama administration launched the Precision Medicine Initiative, promoting the research and development of treatments tailored to an individual's genomic information, disease history, and physical condition. However, this strong reliance on genomic sequencing poses new challenges for most traditional biologists and clinicians, given the unprecedented scale of the data, the programming and data-processing skills required, and the technical difficulties of each individual analysis. Bioinformatics and computational biology have therefore never been as important as they are in the genomic era. ❧ At the same time, computational biologists are not always appreciated by traditional biologists, especially when their work produces no original data but instead consists of meta-analysis or mining of public datasets. For example, after the editor-in-chief of the New England Journal of Medicine described computational biologists who make sense of published data as “research parasites”, one author responded with a PLOS Biology paper titled “All biology is computational biology”, arguing that “the next modern synthesis in biology will be driven by mathematical, statistical, and computational methods being absorbed into mainstream biological training, turning biology into a quantitative science.” ❧ In my view, claiming that “all biology is computational biology” may be an overstatement, but there is no doubt that computational biology is a necessary and vital component of modern biology and deserves growing attention, encompassing not only bioinformatics, especially genomic data analysis, but also computational neuroscience, synthetic biology, biological systems simulation, and more. My PhD accomplishments contribute to genomic data analysis, mainly gene and variant discovery and prioritization, and benefit research scientists and clinicians facing an overwhelming amount of information. ❧ My thesis is divided into three chapters. The first focuses on Phenolyzer, a tool developed largely by myself that helps clinicians and scientists harness phenotype information for gene and variant discovery. The second chapter is a collaborative work unveiling the interactomes of postsynaptic density (PSD) proteins across developmental stages in mouse, together with a web server that lets users explore the protein-interaction data interactively. The third chapter describes a copy number variation (CNV) detection tool built on the Hadoop framework, addressing the scale-up problem pervasive in modern population-scale genomic analysis.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Big data analytics in metagenomics: integration, representation, management, and visualization
Novel statistical and computational methods for analyzing genome variation
Two-step study designs in genetic epidemiology
Sharpening the edge of tools for microbial diversity analysis
Understanding the characteristic of single nucleotide variants
Computational analysis of the spatial and temporal organization of the cellular environment
Computational transcranial magnetic stimulation (TMS)
Computational analysis of genome architecture
Computational methods for translation regulation analysis from Ribo-seq data
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
Computational tools for large-scale analysis of brain function and structure
Essays on bioinformatics and social network analysis: statistical and computational methods for complex systems
Breaking the plateau in de novo genome scaffolding
Profiling transcription factor-DNA binding specificity
Automatic tracking of flies and the analysis of fly behavior
Data modeling approaches for continuous neuroimaging genetics
Predicting virus-host interactions using genomic data and applications in metagenomics
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
Computational analysis of DNA replication timing and fork dynamics in S. cerevisiae
Asset Metadata
Creator
Yang, Hui (author)
Core Title
Novel computational methods of disease gene and variant discovery, parallelization and applications
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Neuroscience
Publication Date
06/22/2017
Defense Date
03/17/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatics, Computational Biology, OAI-PMH Harvest, variant discovery
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Campbell, Daniel (committee chair), Lu, Wange (committee member), Thomas, Paul (committee member), Wang, Kai (committee member)
Creator Email
yanghui@alumni.usc.edu, yanghui@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-389456
Unique identifier
UC11265772
Identifier
etd-YangHui-5450.pdf (filename), usctheses-c40-389456 (legacy record id)
Legacy Identifier
etd-YangHui-5450.pdf
Dmrecord
389456
Document Type
Dissertation
Rights
Yang, Hui
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA