Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
(USC Thesis Other)
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Orthogonal Shared Basis Factorization: Cross-species gene expression analysis using a common expression subspace by Amal Thomas A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY & BIOINFORMATICS) December 2022 Copyright 2022 Amal Thomas Perfection is not attainable, but if we chase perfection we can catch excellence. Vince Lombardi ii Acknowledgements Throughout my journey in graduate school, I have been immensely blessed to have the care and support of many wonderful people. Whenever I needed help, there was always someone to lend a helping hand and point me in the right direction. Without the support and guidance I received from many, this Ph.D. would not have been possible. First and foremost, I would like to thank my advisor Dr. Andrew Smith for his continuous guidance, patience, and immense knowledge, which helped me in my research and completing this dissertation. His drive to pursue the highest quality science has pushed me to be a better scientist. Working and studying under your guidance was a great privilege and honor. I would also like to thank Dr. Min Yu, Dr. Mark Chaisson, Dr. Michael “Doc” Edge, and Dr. Matt Dean for kindly agreeing to be part of my dissertation committee and providing helpful suggestions for my research. I want to thank Dr. Peter Calabrese for participating in my qualifying committee and providing me the opportunity to assist him in teaching QBIO classes. I am also grateful to Dr. Remo Rohs and Dr. Fengzhu Sun for their support throughout my graduate studies at USC. I am very grateful to my collaborators Dr. Min Yu and members of the Yu Lab at the Keck School, for providing me the opportunity to be part of circulating tumor cell projects. I would es- pecially like to thank Dr. Remi Klotz, Dr. Oihana Iriondo, Dr. Diane Kang, Dr. Mohamed Saleh, Dr. Desirea Mecenas, and Dr. Lin Li for generating cutting-edge data, providing me with valu- able insights regarding experimental techniques, and patiently going through my analysis results. Thanks to Yonatan Amzaleg, Sathish Kumar Ganesan, Dr. Veronica Ortiz, Dr. Pratishtha Rawat, and Grace Lee for the countless interactions and scientific discussions. The projects I worked on iii with the Yu Lab members fundamentally shaped the rest of my Ph.D.; I will forever be grateful for that. I want to thank current and former members of the Smith lab. Your excellent interactions and supportive and critical environment helped me grow as a scientist. I want to especially thank Dr. Guilherme De Sena Brandine, Dr. Saket Choudhary, Dr. Ben Decato, and Dr. Rishvant Prabakar for always being supportive, for endless science conversations, and beyond all, for being great friends. I am grateful to Masa Nakajima for his critical reviews of my writing and helpful suggestions. Thanks to the Smith lab members Meng Zhou, Wenzheng Li, Terence Li, Jianghan Qu, Chao Deng, Tim Daley, Egor Dolzhenko, Hoda Mirsafian, and Xiaojing Ji for their continued support and all the happy times we have spent together. I also want to thank Chaisson lab members Tony Lu, Jingwen Ren, Jianzhi Yang, and Keon Rabbani for their critical input during our group meetings. I would also like to acknowledge QBIO/MCB staff members Luigi Manna, Katie Boeck, Dough Burleson, Adolfo Dela Rosa, and Ashley Tozzi for the immense service you provide to help graduate students. Thank you very much for everything you do! My time in graduate school was made much easier by the supportive group of colleagues and friends I made early in my Ph.D. I thank my CBB colleagues Jitin Singla, Guilherme De Sena Brandine, and Liana Engie; my MCB colleagues Anne Nguyen, Michael Lough-Stevens, and Lisa Welter for helping me to reach the finish line. A special thanks to my Catholic Trojan Family: Kevin Vaz, Rinu Sebastian, Audrey Moore, Arathi Sunder, Venil Noronha, Dona Dmello, Fleur Lobo, and Suryakiran George for the fantastic time we spent together and helping me to adapt to a new life in the USA. Above all, I would like to thank my loving wife, Manju, for her endless love and support over these past five years. Thank you for saying yes and joining me on this adventure. I would not have gotten to where I am now without you. I am deeply indebted to my parents for always encouraging me to pursue my interest in science, always believing in me, and for endless prayers. Special thanks to Aby and Meena for showering me with lots of love and keeping me motivated. Thank you also to Thomas Sebastian, Solly Thomas, Ann Maria Thomas, Arun Johny, Davis Thomas, iv and Jeffin Thomas for welcoming me to the family and for your continuous and unparalleled love and support. I am very grateful to Nithin C for being a great friend and mentor. My heart fills with gratitude whenever I think about you. I have been fortunate to have an extended family here in LA. Aravind, Indu, Ragesh, Aishwarya, Mythili, Chethan, Jayakrishna, and Manasi have been integral to my Ph.D. life. I will forever cherish the time I spent with you at 2677. I am blessed and thankful to have so many incredible people in my life. It has made the work toward my Ph.D. more rich and meaningful. v Table of Contents Epigraph ii Acknowledgements iii List of Tables ix List of Figures xi Abstract xii 1 Introduction 1 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Gene expression profiling and analysis 7 2.1 Gene expression profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 RNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.2 Gene expression datasets in public repositories . . . . . . . . . . . . . . . 12 2.2 Comparative analysis of gene expression across species . . . . . . . . . . . . . . . 12 2.2.1 Homology in cross-species gene expression comparisons . . . . . . . . . . 16 2.2.2 Orthology-function conjecture . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.3 Matrix algebra models for gene expression analysis . . . . . . . . . . . . . 20 2.3 Common expression subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 Orthogonal and linear properties of genetic networks . . . . . . . . . . . . 24 3 Orthogonal shared basis factorization 26 3.1 Method definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.1.1 Estimating species-independent common expression subspace . . . . . . . 27 3.1.2 Minimizing total factorization error . . . . . . . . . . . . . . . . . . . . . 29 3.1.3 Method derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.1.4 Iterative algorithm to minimize factorization error . . . . . . . . . . . . . . 35 3.2 Properties of orthogonal shared basis factorization . . . . . . . . . . . . . . . . . . 38 3.3 Comparison with other joint matrix factorization methods . . . . . . . . . . . . . . 39 3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 vi 4 Identifying functionally relevant genes using common expression subspace 42 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.1 Gene annotations and pre-processing data . . . . . . . . . . . . . . . . . . 44 4.2.2 Quality checks of RNA-Seq profiles . . . . . . . . . . . . . . . . . . . . . 45 4.2.3 Estimating inter-tissue correlation relationship . . . . . . . . . . . . . . . 45 4.2.4 Gene expression data analysis . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.1 Inter-tissue correlation relationship across species . . . . . . . . . . . . . . 47 4.3.2 Species-independent phenotypic subspace for joint analysis . . . . . . . . 49 4.3.3 Properties of common expression subspace dimensions . . . . . . . . . . . 49 4.3.4 Identifying tissue-relevant genes . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.5 Homology-independent identification of tissue-specific expressed genes . . 54 4.3.6 Expression divergence of orthologs . . . . . . . . . . . . . . . . . . . . . 54 4.3.7 Discovery of non-homologs contributing to the phenotype . . . . . . . . . 55 4.3.8 Identifying tissue-relevant differentially expressed genes . . . . . . . . . . 56 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5 Cross-species gene expression query 61 5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.3.1 Scoring search results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.3.2 Species clade definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.3.3 Cross-species gene expression search based on orthology . . . . . . . . . . 65 5.3.4 Clustering and gene expression analysis . . . . . . . . . . . . . . . . . . . 66 5.3.5 OSBF-based cross-species analysis . . . . . . . . . . . . . . . . . . . . . 67 5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.4.1 Database and query setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.4.2 Within-species gene expression query . . . . . . . . . . . . . . . . . . . . 68 5.4.3 Cross-species gene expression query based on orthologous genes . . . . . 68 5.4.4 Limitation of orthology-based cross-species gene expression search . . . . 71 5.4.5 Cross-species gene expression search using OSBF . . . . . . . . . . . . . 72 5.4.6 Dimension of the common expression subspace . . . . . . . . . . . . . . . 76 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6 Summary and Future work 79 6.1 Summary of OSBF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2 Summary of cross-species gene expression analysis . . . . . . . . . . . . . . . . . 81 6.3 Additional applications of the OSBF . . . . . . . . . . . . . . . . . . . . . . . . . 83 Bibliography 85 vii Appendices 102 A Appendix Figures: Orthogonal shared basis factorization 103 B Appendix Figures: Identifying functionally relevant genes 104 C Appendix Tables: Identifying functionally relevant genes 115 D Appendix Analysis: Cross-species gene expression query 127 viii List of Tables 3.1 Comparison of joint matrix factorization approaches . . . . . . . . . . . . . . . . 40 5.1 List of species and shared orthologous genes in different clades . . . . . . . . . . . 65 C.1 Inter-tissue correlation information in different dimensions of the common subspace 115 C.2 Gene ontology terms enriched for top human dimension 1 genes . . . . . . . . . . 115 C.3 Gene ontology terms enriched for top mouse dimension 1 genes . . . . . . . . . . 116 C.4 Gene ontology terms enriched for top dimension 2 genes . . . . . . . . . . . . . . 116 C.5 Gene ontology terms enriched for top dimension 4 genes . . . . . . . . . . . . . . 117 C.6 Gene annotation distribution of testis and brain TRGs . . . . . . . . . . . . . . . . 117 C.7 Gene annotation distribution of heart and kidney TRGs . . . . . . . . . . . . . . . 117 C.8 Previously characterized non-coding marker genes . . . . . . . . . . . . . . . . . 118 C.9 Previously characterized tissue-specific expressed human marker genes . . . . . . 118 C.10 Percentage of TRGs with orthologs . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C.11 Previously characterized protein-coding marker genes in different clades . . . . . . 120 C.12 Number of homologous TRGs genes shared across different species clades . . . . . 122 C.13 Ortholog expression divergence example identified by OSBF analysis . . . . . . . 122 C.14 Mouse homologous TRGs with orthologs not identified in the other species . . . . 123 C.15 Number of TRGs with updated pairwise orthology mapping . . . . . . . . . . . . 124 C.16 TRGs and differential expression . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 C.17 Gene ontology analysis of DE and TRGs . . . . . . . . . . . . . . . . . . . . . . . 126 ix D.1 Gene ontology terms enriched for shared orthologous genes in 17 species . . . . . 135 D.2 OSBF-based cross-species search results for Drosophila and Zebrafish profiles . . . 135 D.3 Search performance using OSBF vs. orthologous genes . . . . . . . . . . . . . . . 136 D.4 Examples of orthology-based cross-species gene expression search results . . . . . 136 D.5 Search performance in common subspace for different tissue types . . . . . . . . . 136 D.6 Search performance for Drosophila and Zebrafish profiles . . . . . . . . . . . . . . 137 D.7 Comparison of cross-species gene-expression search methods . . . . . . . . . . . . 137 D.8 Accession IDs for cross-species gene expression search results . . . . . . . . . . . 138 x List of Figures 2.1 RNA-Seq profiles currently available at GEO/SRA . . . . . . . . . . . . . . . . . 13 2.2 Species-divergence time and the number of shared orthologous genes . . . . . . . 16 3.1 Graphical summary of OSBF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Steps involved in the estimation of common expression subspace . . . . . . . . . . 30 4.1 Inter-tissue gene expression correlation relationship across species . . . . . . . . . 48 4.2 Analysis in the OSBF common subspace . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Properties of dimension 2 of the OSBF common subspace . . . . . . . . . . . . . 51 4.4 Tissue-relevant genes identified by the OSBF . . . . . . . . . . . . . . . . . . . . 53 4.5 Differentially expressed and tissue-relevant genes . . . . . . . . . . . . . . . . . . 57 5.1 Mean average precision (mAP) score for orthology-based cross-species search . . . 69 5.2 Cross-species search performance based on orthologous genes . . . . . . . . . . . 70 5.3 Proportion of gene expression variation explained by tissues and species factors . . 71 5.4 Cross-species gene expression search performance in the common expression sub- space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5 Cross-species gene expression search performance for distant species in common expression subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.6 Cross-species gene expression search results of additional tissue profiles . . . . . . 75 5.7 Cross-species search performance of GTEx expression profiles . . . . . . . . . . . 75 5.8 Search performance comparison with HO GSVD . . . . . . . . . . . . . . . . . . 76 xi A.1 Individual factors of OSBF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B.1 Inter-tissue gene expression correlation relationship across species . . . . . . . . . 105 B.2 Inter-tissue gene expression correlation relationship across species for shuffled counts106 B.3 Properties of the dimension 1 of the common subspace . . . . . . . . . . . . . . . 107 B.4 Properties of the dimension 2 of the common subspace . . . . . . . . . . . . . . . 107 B.5 Properties of the dimensions 3 and 4 of the common subspace . . . . . . . . . . . 108 B.6 Properties of the dimensions 5 and 6 of the common subspace . . . . . . . . . . . 109 B.7 Steps involved in the estimation of the OSBF common subspace and analysis . . . 110 B.8 Gene annotation classification of non-coding genes . . . . . . . . . . . . . . . . . 111 B.9 Distribution of species-specific and shared tissue-relevant genes . . . . . . . . . . 112 B.10 Human TRGs genes with orthologs not being tissue-specific expressed . . . . . . . 113 B.11 Tissue-relevant genes with no pairwise orthologs . . . . . . . . . . . . . . . . . . 114 D.1 Within species gene-expression search results of the curated profiles . . . . . . . . 128 D.2 Proportion of orthologous housekeeping genes . . . . . . . . . . . . . . . . . . . . 129 D.3 Clustering of gene expression profiles based on shared orthologous genes . . . . . 130 D.4 Clustering analysis of gene expression profiles of two divergent species . . . . . . 131 D.5 Cross-species gene expression search using OSBF for additional tissues . . . . . . 132 D.6 Search results using OSBF common subspace for GTEx profiles . . . . . . . . . . 133 D.7 Effect of the number of functionally similar tissues on OSBF-based cross-species gene expression search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 xii Abstract Cell phenotypes are supported by specific functions or pathways that involve the coordinated ac- tivity of multiple genes. When distantly related species share conserved cell phenotypes, those phenotypes depend on conserved multi-gene functions. Evolution does not require that all genes working to implement conserved functions have unique orthology between the species. Pathway function may adapt, incorporating the activity of additional genes along a lineage while retaining essential behaviors in development or cellular responses. In comparative studies of cell pheno- types, exclusive dependence on orthology will limit those functions that can be examined across long evolutionary time scales, and obscure adaptation. In this dissertation, we developed a novel joint matrix factorization that we call the orthog- onal shared basis factorization (OSBF) to enable orthology-independent comparative analyses of expression profiles from multiple species. OSBF decomposes expression profiles of functionally similar phenotypes from different species to estimate interpretable matrix factors. Our method uses a similar correlation structure within individual datasets to estimate a biologically meaningful expression subspace shared by all species. We show that the OSBF expression subspace provides a reduced dimensional space shared by different organisms that can be used to summarize the con- served gene expression patterns. The species-specific matrix factors in our method quantify the role of all genes in the phenotype without relying on the homology relationships of genes between species. The strategy of estimating different matrix factors, the interpretation of individual ma- trix factors, and the procedure to ensure that the reconstructed profile using the estimated factors closely recapitulates the original expression profiles are discussed. xiii Cross-species gene expression analysis of different tissue types from eight vertebrate species using OSBF identified functionally relevant genes for tissue phenotype, including known gene markers, non-coding genes, pseudogenes, and other less-studied genes. We find that more than 40% of the genes important for tissue phenotype are not one-to-one orthologs shared in all species. About 6-10% of these functionally essential genes do not have orthologs in other species. In addition, the method identifies several examples of gene expression divergence of orthologs and complements differential expression analysis to detect genes vital for the phenotype. The common expression subspace also provides a universal coordinate system for querying gene expression pro- files from multiple species. Using public RNA-seq datasets from 17 species, including vertebrates and invertebrates, we show that the OSBF-based cross-species search outperforms conventional methods in identifying functionally similar expression profiles across species. xiv Chapter 1 Introduction Comparative genomics, founded on models of molecular evolution, has impacted almost all ar- eas of biomedical science. Sequence similarity search in multi-species databases has long been an essential step in the functional inference of genes and annotating newly assembled genomes. Determining function is easier in model organisms, and validating function is far easier than de- termining function de novo. This principle is a pillar of comparative genomics. Comparative approaches integrating gene expression profiling data from multiple species hold equal promise in understanding cellular phenotypes. Gene expression profiling provides a snapshot of how a cell function by measuring the expres- sion level of thousands of genes at a specific time. The expression levels indicate the gene activity in a cell: when and where genes are turned on or off and their relative abundances. By collecting and comparing gene expression profiles of different tissues or cell types across various experimen- tal conditions, researchers can gain a deeper understanding of factors determining the identity of a cell and bridge the gap between genetic code and cellular phenotypes. Different species have different genomes and hence different transcriptomes. The number of genes annotated varies from species to species; hence the expression profiles from different species are high-dimensional vectors of varying lengths. In studies co-analyzing gene expression profiles from multiple species, comparing expression profiles of two species requires mapping genes be- 1 tween species. The usual strategy is to identify a subset of homologous genes (i.e., genes evolved from the same ancestor) and then construct a combined gene expression profile from the values associated with these shared genes. The assumption is that the shared genes, usually orthologs, perform similar functional roles in different species. The relationships between the gene expres- sion profiles of multiple species and their association with the phenotype are then investigated based on the shared genes between species. Orthology definition does not guarantee the conservation of sequence, structure, or function (Gabald´ on & Koonin, 2013). Independent strategies exist in nature to achieve the same molec- ular function. Although many cellular processes are conserved across species, some or even all components of a pathway can be replaced by genes with independent evolutionary origin (Koonin et al., 1996). There is evidence for homologous genes and regulatory networks adapted for dif- ferent functions in plants and animals (Hirano et al., 2007; Liu et al., 2010; Kajala et al., 2021; Carelli et al., 2018). The re-purposing of genes, proteins, and regulatory networks implies that orthology is not a necessary criterion to achieve equivalent functions. Moreover, information from all genes is essential for cross-species analysis, and homology-based approaches fail to account for the contribution of non-homologous genes to the phenotype. Homology-independent analysis requires mathematical methods to extract biological signals in gene expression profiles of different species with varying numbers of genes. Joint matrix factoriza- tion approaches offer a potential solution by factorizing gene expression data into species-specific matrix factors and a low-dimensional latent variable matrix factor. The latent variable matrix is shared in all species and represents a common expression subspace. The similarities in the bi- ological processes responsible for conserved phenotypes in different species can be leveraged to construct the common expression subspace. The common expression subspace has an identical di- mension in all species and cross-species gene expression profiles can be compared in this subspace without relying on any orthology mapping. Efforts have already been made to develop homology-independent cross-species comparative frameworks based on the idea of shared subspace (Alter et al., 2003; Ponnapalli et al., 2011). 2 Although these joint matrix factorization methods can be used for orthology-independent compar- isons, these approaches have several shortcomings. The common subspace in these methods is not designed to represent biological processes conserved across species. The species-specific factors and the shared subspace allow redundancy, making it challenging to differentiate the contribution of genes to different phenotypes. As a result, the biological interpretation of the species-specific factors and common subspace is difficult in these approaches. These drawbacks limit their appli- cation in cross-species gene expression analysis. This thesis addresses these limitations and defines a novel joint matrix factorization method that we call the orthogonal shared basis factorization (OSBF) to compare functionally similar phe- notypes across species. This homology-independent approach places cellular phenotypes in a com- mon coordinate system that can summarize gene expression patterns shared by different organisms and quantifies the role of all genes in the phenotype independent of their homology relationships and annotation. 1.1 Contributions 1. A method to exploit the similar correlation structure in multiple high-dimensional datasets: Though mathematical tools exist to compare multiple high-dimensional datasets with varying numbers of rows, they are not designed to model the correlation structure present in the data. We developed OSBF, which utilizes a similar correlation structure within individual datasets to estimate interpretable matrix factors. Utilizing the conserved correla- tion relationship between cellular phenotypes, the method provides a novel strategy to com- pare gene expression profiles across species. 2. Characterizing functionally important genes from cross-species gene expression datasets: Conservation of co-expression patterns in multiple species is a robust criterion for distin- guishing functionally important genes from a set of co-regulated genes. OSBF factorizes gene expression profiles of functionally similar phenotypes from multiple species to esti- 3 mate a common expression subspace and species-specific matrix factors. The different di- mensions of the common expression subspace represent one or more biological processes related to the phenotypes. The species-specific loading matrix quantifies the contribution of all genes to different subspace dimensions. Using gene loadings in the species-specific matrix, OSBF-based cross-species gene expression analysis identifies genes driving the phe- notype independent of their annotation and homology classification. 3. Evaluation of cross-species gene expression search using orthologous genes: Most cross- species comparative gene expression analyses have used orthology relationships between genes to map expression values from one species to another. While this approach has several theoretical shortcomings, it remains unclear how those shortcomings impact outcomes in practice, particularly in matching expression profiles of functionally equivalent phenotypes from different species. We systematically investigate how species divergence time influences the performance of cross-species gene expression search based on one-to-one orthologous genes. Using public RNA-Seq datasets from 17 species with divergence times ranging from <10 million years to >750 million years, we find that the accuracy of the orthology-based search approach for querying gene expression profiles decreases with species divergence time. Finding matching phenotypic profiles from distantly related species using existing orthology-based methods is ineffective. 4. An orthology-independent framework to query gene expression profiles across species: Projecting gene expression profiles from different species to the OSBF-common expression subspace transforms the high-dimensional expression vectors with different lengths to have identical dimensions. By comparing expression profiles in the OSBF-common expression subspace, we accurately identify matching expression profiles of functionally similar pheno- types in close and distant species without relying on gene mappings between species. Using gene expression datasets from 17 species, Genotype-Tissue Expression (GTEx) consortium, and other publically available expression profiles from distant species, we show that OSBF- 4 based cross-species gene expression search outperforms conventional approaches. 1.2 Outline In Chapter 2, we provide the background of gene expression profiling and how cross-species com- parative analysis of gene expression profiles is performed. We start with the history of the de- velopment of major assays to measure gene expression, primarily focussing on RNA-Seq assay and its applications. We discuss how high-throughput sequencing techniques have led to the ac- cumulation of massive amounts of RNA-Seq datasets in public databases. Next, we examine the different strategies to analyze cross-species gene expression profiles. We give a detailed history of the use of homology in cross-species comparisons and its shortcomings. We then discuss ma- trix algebra models, the potential of joint-matrix factorization approaches, and the limitations of existing methods. We finish the chapter by introducing the concept of common expression sub- space for cross-species gene expression comparisons, which is the foundation of the joint matrix factorization we develop. Chapter 3 introduces a novel computational method, OSBF, to jointly factorize gene expres- sion datasets from multiple species with varying numbers of rows. We discuss how the common expression subspace is estimated using similar inter-sample correlation relationships. We then provide the details of minimizing the total factorization error and critical mathematical properties of the estimated factors. The algorithm for iteratively updating the matrix factors and how the procedure estimates interpretable matrix factors are discussed. Finally, we compare our method with other matrix factorization approaches and discuss the advantages of our method over existing approaches. In Chapter 4, we use OSBF to compare expression profiles of six functionally similar tissue types from eight vertebrate species. The properties and interpretations of different dimensions of OSBF common expression subspace are demonstrated in a cross-species gene expression analysis context. We show that more than 40% of the genes important for tissue phenotype are not one- 5 to-one orthologs shared in all species. Furthermore, 6-10% of these genes do not have orthologs in other species. Without relying on homology and annotation classification, we identified func- tionally relevant genes for tissue phenotype, including known gene markers, non-coding genes, pseudogenes, and other less-studied genes. In Chapter 5, we define the problem of querying gene expression profiles across species and discuss existing computational approaches developed for cross-species gene expression search and their limitations. We discuss our strategy to set up the database and query set using publically avail- able expression profiles and scoring the search results. Next, we evaluate how species-divergence time influences the accuracy of orthology-based gene expression profile searches. We then discuss the cross-species search strategy using the OSBF-common expression subspace and its influencing factors. Finally, in Chapter 6, we summarize the thesis and propose directions for for future applica- tions of our method. We highlight the main results of Chapters 3-5 and describe the potential impact of the research produced in this dissertation. We discuss the scope of integrated analysis of other genomic datasets using OSBF and how the method could help to characterize cell type identification in single-cell RNA-Seq datasets. 6 Chapter 2 Gene expression profiling and analysis In this chapter, we discuss the background of gene expression profiling and how cross-species gene expression analyses are performed using high-throughput sequencing data. This chapter is divided into two parts. First, we will describe how gene expression is quantified, the growth of gene expression datasets in public repositories, and some biologically relevant questions that can be answered using gene expression studies. Second, we will discuss the importance and applications of cross-species gene expression studies. This section discusses two broad categories of cross- species gene expression analysis methods: homology-based and joint-matrix factorization-based approaches. The challenges of cross-species expression analysis and the limitations of the existing approaches are also introduced. 2.1 Gene expression profiling Gene expression is the fundamental process in which the information encoded in the DNA is con- verted into functional products. This process involves two key stages: transcription and translation. In the first step of gene expression, the genetic code is decoded to synthesize messenger RNAs (mRNAs) that carry information for protein synthesis and non-coding RNAs that serve other func- tions. During translation, proteins are manufactured using the information stored in the mRNAs. In unicellular organisms and multi-cellular organisms, gene expression is regulated and modulated 7 at several levels to determine when and how much a gene is expressed (Epstein & Beckwith, 1968; Gibney & Nolan, 2010). Quantifying the relative expression level of genes and how the expression level changes in normal and pathological conditions are essential for understanding the fundamen- tals of cell biology through to the targeted treatment of diseases. The amount of RNA transcript of a gene indicates how strongly a gene of interest is expressed in a cell. Depending on the physiological state and type of cell, a vast number of transcripts are produced in every cell at a given time. These RNA molecules reflect the state and identity of a cell and regulate biological activities within the cell (Young, 2011). Collectively defined as the transcriptome, the RNA transcripts help us to understand the relationship between gene expression and cellular phenotypes. Although transcription/translation programs are notoriously complex, measuring RNA abundance is considered a robust approach for quantifying gene expression. In general, the methods to measure transcript levels are based on transcript visualization, hy- bridization, and sequencing. Initially, studies were limited to detecting a few transcripts using low-throughput assays such as Northern blot and polymerase chain reaction (PCR). Northern blot assay (Alwine et al., 1977) involves gel electrophoresis to separate RNA molecules by size, de- tection of specific RNA molecules using complementary hybridization probes, and subsequent autoradiographic detection of the labeled RNA. The assay requires relatively large samples of total RNA (5-50µ g), and a limited number of genes can be detected using a combination of differently labeled probes. While revolutionary at the time of its inception, the technique is currently used for detecting transcript size, studying alternate splice products, and RNA degradation. In the early 90th of the last century, Kary Banks Mullis revolutionized molecular biology by introducing polymerase chain reaction (PCR) (Mullis, 1990). For the first time, scientists could perform accurate detection and amplification of a specific DNA fragment from a com- plex pool of DNA. With the advent of quantitative PCR (qPCR), even a small amount of RNA could be detected by converting to complementary DNA (cDNA) using reverse-transcription re- action followed by amplification using specific primers. The introduction of the concept of mon- itoring DNA amplification in real-time through fluorescence monitoring (Holland et al., 1991; 8 Holland et al., 1991) further improved the technology by decreasing post-processing steps and minimizing experimental error. Real-time reverse-transcription quantitative PCR is the method of choice for measuring gene expression of a moderate number of genes and remains the gold standard for validating discoveries. Since the early 1990s, a large amount of effort has focused on measuring mRNA abundance on a global scale. The development of hybridization-based DNA arrays allowed quantifying the expression of thousands of genes in parallel at a relatively low-cost (Schena et al., 1995). DNA microarrays consist of thousands of microscopic spots of oligonucleotides, each containing spe- cific DNA sequences called ‘probes’ arrayed on a solid support. Transcripts from the sample of interest are converted into cDNA and typically labeled with fluorophores. Labeled cDNAs with complementary sequences are then hybridized into the probes. Probe-target hybridization is de- tected and quantified based on fluorescent intensity. Using microarray technology, researchers could perform large-scale quantitative experiments to compare expression profiles across different tissues, experimental conditions, disease states, and genetic backgrounds. While powerful, microarrays do have several limitations (Hurd & Nelson, 2009). Microarray design requires prior knowledge of the sequences being interrogated. This affects array effective- ness, especially in the case of a species with incomplete or outdated genome annotations. A major limitation in microarray analysis is the cross-hybridization between similar sequences. This com- plicates the analysis of related features and restricts the analysis to the non-repetitive fraction of the genome. The signal-to-noise ratio and competitive hybridization are also relative measures. Thus the detection of low-abundance sequences and highly represented sequences is challenging using DNA microarrays. In recent years, sequencing-based approaches have emerged as the major gene expression pro- filing technique. The fact that the genetic material is directly sequenced removes cross-hybridization issues. Sequencing-based approaches offer single-nucleotide resolution and have higher sensitiv- ity in quantifying genes with low and high expression. Initially, tag-based approaches based on Sanger sequencing were used in gene expression studies, but these approaches were relatively low- 9 throughput. The development of massively parallel short-read DNA sequencing: next-generation sequencing (NGS) has provided a new method for quantifying transcriptomes. 2.1.1 RNA Sequencing Accurate RNA-abundance quantification, gene discovery, and robust transcript assembly can be achieved in a single high-throughput sequencing assay called RNA-Sequencing (RNA-Seq, Na- galakshmi et al., 2008). RNA-Seq provides a detailed and quantitative view of the whole tran- scriptome and has become an indispensable tool for genome-wide analysis of differential gene expression and splicing. A standard RNA-Seq experiment begins with isolating RNA from a bi- ological sample, fragmenting RNA molecules, cDNA synthesis, and preparing an adaptor-ligated sequencing library. The library, with or without amplification, is then sequenced to obtain short reads from one end (single-end sequencing) or both ends (pair-end sequencing) of the fragments. Depending on the DNA-sequencing technology, the reads are typically 30—400 bp, and 10—30 million reads are generated per sample. Currently, the Illumina HiSeq platform is the most com- monly applied high-throughput sequencing technology for RNA-Seq. Following sequencing, the reads are aligned to a reference genome or reference transcripts. If the organism does not have a known genome sequence, reads are first assembled into longer contigs to create a transcriptome, and the reads are aligned back for quantification. For example, the transcriptome of the Glanville fritillary butterfly is assembled using RNA-Seq reads (Vera et al., 2008). RNA-Seq provides information about how two exons are connected and the precise location of transcription boundaries. This allows us to unambiguously align reads to unique re- gions of the genome and quantify gene/transcript levels. Using qPCR and spike-in RNA controls of known concentrations, RNA-Seq has shown to be highly accurate for quantifying expression levels (Nagalakshmi et al., 2008; Mortazavi et al., 2008; Shi & He, 2014). The assay does not have an upper limit for quantification and can measure expression across a larger dynamic range (Wang et al., 2009). For example, the transcript detection range spanned five orders of magnitude in the RNA-Seq dataset of mouse tissues (Mortazavi et al., 2008). Considering all these advan- 10 tages, RNA-Seq is a robust assay to survey the entire transcriptome in a very high-throughput and quantitative manner. Over the past decade, RNA sequencing has become the primary method for genome-wide quantification of gene expression. Transcriptome profiling using RNA-Seq is used to investigate the molecular mechanism of several human diseases, such as Huntington’s disease (Lin et al., 2016), muscle disorders (Cummings et al., 2017), and cancer (Li et al., 2017; Aversa et al., 2016; Klotz et al., 2020). Recently, RNA-Seq is emerging as an important diagnostic tool, comple- menting other genomic assays in detecting different Mendelian disorders (Kremer et al., 2017; Murdock et al., 2021). The assay has been instrumental in characterizing novel non-coding tran- script molecules, including long non-coding RNA, miRNA, piRNA, and other small RNA classes (Trapnell et al., 2010). The technology provides valuable information to create novel transcript annotations for an uncharacterized organism (Grabherr et al., 2011) and to improve existing an- notations (Boley et al., 2014). The single-base resolution of RNA-Seq has been used to pre- cisely map5 ′ and3 ′ boundaries of annotated genes in different species (Nagalakshmi et al., 2008; Wilhelm et al., 2008). The comparison of gene expression profiles across different developmental stages, treatments, or disease conditions is a primary application of RNA-Seq. Using RNA-Seq data from multi- ple species, researchers have compared gene expression across species to study alternate splicing, and exon usage (Gelfman et al., 2012; Merkin et al., 2012), identify co-expressed gene modules (Gerstein et al., 2014), and detect novel transcripts (Guttman et al., 2010). Differential gene ex- pression (DGE) has been widely used to characterize biological processes in different organisms and systems, including Saccharomyces cerevisae (Nagalakshmi et al., 2008), Zea mays (Emrich et al., 2007), Arabiodopsis thaliana (Lister et al., 2008), Drosophila melanogaster (Graveley et al., 2011), Mus musculus (Mortazavi et al., 2008), and Homo sapiens (Mercer et al., 2012). 11 2.1.2 Gene expression datasets in public repositories The decreasing cost per base of sequencing has propelled the rapid accumulation of RNA-Seq data in public repositories. Massive databases of gene expression data from research labs, hospitals, and sequencing cores are accumulating in public databases such as Gene Expression Omnibus (GEO, Edgar et al., 2002), and ArrayExpress (Brazma et al., 2003). As of August 2022, more than 1.54 million RNA-Seq profiles are available in the GEO (Figure 2.1 a). Most of these profiles are generated from two species: human and mouse. More than two-thirds of the RNA-Seq datasets are from studies in human or mouse, and the remaining comes from more than 1500 species (Figure 2.1 b). Large consortium projects such as the Genotype-Tissue Expression (GTEx, Lonsdale et al., 2013) and The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) also provide a tremendous amount of RNA-Seq data for healthy and cancer samples. This massive amount of publically avail- able expression datasets provides an unprecedented opportunity for data-driven discovery of co- expressed genes and characterization of cellular phenotypes that would be difficult to achieve from individual projects. Efforts to produce databases with uniformly processed gene expression pro- files for the fast retrieval of relevant datasets are already progressing (Collado-Torres et al., 2017; Lachmann et al., 2018). Soon the task of identifying the distinct phenotypes in a cell population will involve a standard step of directly querying expression profiles in databases. 2.2 Comparative analysis of gene expression across species Cross-species studies have an important role in understanding human biology, including our evo- lutionary history, developmental processes, and physiology. A wide range of organisms, from primates and rodents to invertebrates, has been used in experimental studies because of their sim- ilarities to humans regarding anatomy and genetics. Animal models have greatly improved our understanding of human diseases mechanisms and play a vital role in discovering targets for ther- apeutic drugs (Nestler & Hyman, 2010; Sharpless & DePinho, 2006; Pandey & Nichols, 2011; 12 a) b) 0e+00 1e+05 2e+05 3e+05 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 66 2,212 1,059 2,897 5,364 11,029 28,210 41,522 119,010 156,222 238,606 202,935 277,432 286,879 171,827 Number of RNA−Seq profiles in GEO 10 100 1,000 10,000 Number of GSE studies in GEO H.sapiens M.musculus A.thaliana D.melanogaster R.norvegicus D.rerio S.cerevisiae C.elegans G.gallus S.scrofa B.taurus E.coli O.sativa Z.mays M.mulatta S.pombe P.aeruginosa C.lupus C.albicans G.max O.aries X.laevis S.aureus S.lycopersicum M.tuberculosis s.construct P.falciparum X.tropicalis C.reinhardtii P.troglodytes C.hircus M.fascicularis N.furzeri N.crassa O.cuniculus S.mediterranea V.vinifera E.caballus T.aestivum B.subtilis S.enterica S.pneumoniae M.auratus A.mellifera S.pyogenes B.napus G.hirsutum D.discoideum C.glabrata A.baumannii 37.7% 36.2% 26.2% other species M.musculus H.sapiens Figure 2.1: RNA-Seq profiles currently available at GEO/SRA: a) Publicly available RNA-Seq profiles currently available in NCBI GEO as of 08/08/2022. b) The distribution of studies (GSE series) from different species in GEO that generated the RNA-Seq profiles. Only the top 50 species are shown in the barplot. 13 Zon & Peterson, 2005). Information derived from curated pathways, and gene ontology databases of model organisms is widely used to infer the molecular functions in other species (Ashburner et al., 2000; Kanehisa & Goto, 2000). Identifying and characterizing conserved and divergent biological features by comparing genomic and functional data across species has deepened our understanding of molecular evolution. Comparative analysis of gene expression profiles across species has several important appli- cations in biology and medicine. Cross-species gene expression studies have addressed many biomedical research questions, including characterizing cellular phenotype, disease states, and ag- ing (Ginis et al., 2004; Sweet-Cordero et al., 2005; De Magalh˜ aes et al., 2009; Mueller et al., 2017). Comparing gene expression data of different tissues from multiple species has identified conserved transcriptional programs driving the development and evolution of organs (Brawand et al., 2011; Merkin et al., 2012; Cardoso-Moreira et al., 2019). Conserved co-expression modules across or- ganisms and similarity of expression patterns in developmental stages were discovered across ani- mal phyla (Gerstein et al., 2014). The natural representation of the data table from a gene expression experiment is a matrix. The rows of a gene expression matrix correspond to genes annotated in a species, and the columns represent different experimental conditions of the biological sample. For example, the expression of m genes across n different experimental conditions is represented as a m× n matrix. The i-th row in this matrix corresponds to the expression values of a gene i across n experimental conditions. Thej-th column represents the expression profile of all m genes in thej-th condition. The number of genes expressed or detected for the same experimental condition can differ within and across species. In addition, different species have a different number of genes in the genome. In studies co-analyzing gene expression profiles across species, the usual strategy is to identify orthologs, almost always focusing on one-to-one mappings, and then construct a gene expression profile from the associated values. Studies involving multiple species use one-to-one orthologs shared across all species (all-way orthologs) to construct a combined gene expression profile from the values associated with those orthologs (Brawand et al., 2011; Merkin et al., 2012; 14 Barbosa-Morais et al., 2012; Cardoso-Moreira et al., 2019). The assumption is that the shared genes, usually restricted to one-to-one orthologs, perform similar roles in different species. The relationships between the transcriptomic profiles of multiple species and their association with the phenotype are then investigated based on the shared genes between species. The concept of orthology is often misused in comparative genomics as a description of func- tional equivalence of genes in different species (Theißen, 2002). Orthology definition does not guarantee the conservation of sequence, structure, or genomic context (Gabald´ on & Koonin, 2013). Even if all genes have unique orthologs between species, the corresponding expression values need not be equivalent. Species-divergence time also dramatically influences the number of shared or- thologous genes between species. As the divergence time between two species increases, the num- ber of orthologous genes shared decreases. For example, on an average∼ 72 % of human protein- coding genes share a one-to-one orthologous relationship with other primate species. In contrast, only 13.86 % of the human protein-coding genes are one-to-one orthologous with Drosophila (Fig- ure 2.2 a). One-to-many and many-to-many orthology relationships (paralogs) also increase with evolutionary distance resulting in a fewer number of one-to-one orthologous genes between evolu- tionarily distant species (Figure 2.2 a). Similarly, the number of genes we can assign to mutually orthologous sets becomes smaller as more species are added. Compared to the single-copy genes, genes with multiple copies are shown to have a high ex- pression diversity (Gu et al., 2004; Huminiecki & Wolfe, 2004), underlying that the information from all genes is essential for cross-species analysis. Unless the species are closely related, the homology-based approaches fail to account for a significant amount of information from non- orthologous genes. These observations led us to reconsider the use of orthology in cross-species gene expression analysis. In the following section, we explore the concept of homology, the re- lationship between orthology and function, and its implications in cross-species gene expression analysis. 15 b) a) 0 300 600 900 0 20 40 60 one2many many2many % coding orthologs in human Phylogenetic distance with human (Mya) 0 20 40 60 0 300 600 900 % coding orthologs in human Phylogenetic distance with human (Mya) Figure 2.2: Species-divergence time influences the number of shared orthologous genes: a) The percentage of protein-coding human genes that are one-to-one orthologous with 107 species in the Ensembl database (release v94). b) The percentage of human protein-coding with one-to-many and many-to-many orthologous relationships with 107 different species available in the Ensembl database (release v94). 2.2.1 Homology in cross-species gene expression comparisons Homology is a fundamental concept in comparative and evolutionary biology. Two features are considered homologs if they have a common evolutionary origin. Richard Owen first introduced the term homolog in 1843 to define the “same organs in different species under every variety of form and function” (Owen, 1843). Owen’s definition was based on structure and location rather than the common origin of homologous organs. Later Darwin’s work on natural selection led to the reinterpretation of homology as evidence of evolution (Darwin, 1859; Huxley, 1860). The idea of inferring phylogenetic relationships between species based on comparing anatomical structures became prominent. Moving forward a century, homology-based comparisons extended to other biological entities, such as sequences, protein, and genes (Watson & Kendrew, 1961; Margoliash, 1963; Pauling & Zuckerkandl, 1963; Walsh & Neurath, 1964). Closely related protein sequences were compared to reconstruct the ancestor sequence and phylogenetic trees (Fitch & Margoliash, 1967; Dayhoff, 1969). To distinguish how protein sequences diverged from their last common 16 ancestor, Walter Fitch introduced the concept of orthologs and paralogs, the two different classes of homologous relationship (Fitch, 1970). The notion of orthologs and paralogs started to receive extensive attention after whole-genome bacterial sequences were completed in 1995 (Fleischmann et al., 1995; Fraser et al., 1995). The availability of the complete genome opened a new era of studying the evolutionary processes at a molecular level. An accurate relationship between corresponding genes in two genomes needs to be established to understand the evolution of two species. Inferring the homology relationship of genes pro- vides clues to determine the species phylogeny. Suppose the genes in two species are derived from a single gene in the last common ancestor. In that case, their evolutionary history reflects the true phylogeny of the taxa from which they are obtained. These are orthologous genes arising from a speciation event: vertical descent from the last common ancestor. When pairs of genes start to diverge via non-speciation events, the evolutionary history of the genes needs to be dis- tinguished from the evolutionary history of species. For example, duplication events give rise to paralog genes within a genome while the ancestor gene is transmitted to two daughter lineage in a speciation event. Distinguishing orthologs and paralogs is essential to infer evolutionary ori- gins of two species. Orthologs can be further sub-classified into one-to-one, one-to-many, and many-to-many orthologs based on the extent of duplications after the speciation event. One- to-one pairwise orthology means that both genes in the pair have only one copy in the other species. A one-to-many relationship occurs when one gene in one species has more than one ortholog. A many-to-many relationship occurs when multiple orthologs are found in both species. Various combinations of speciation and duplication events often lead to intricate evolutionary relationships. In addition, different evolutionary processes, including gene fusion/fission, loss of genes, gene births, and horizontal gene transfers, complicate the relationship between genes and tend to blur the identification of true orthologs among a set of homologs (Koonin, 2005; Kuzniar et al., 2008). Homology relationships are usually inferred using sequence similarity searches. When two sequences or structures share more similarities than expected by chance, the possibility that they 17 originated from a common ancestor is high (Pearson, 2013). Identifying homologous sequences using sequence similarity searches is one of the first steps in characterizing new sequences. Se- quence similarity search tools (Altschul et al., 1990; Pearson & Lipman, 1988; Finn et al., 2011) detect homologs with significant sequence similarity scores. The detection of homologs of a pro- tein provides clues about structure, function, and evolution (Chen et al., 2018). However, ho- mologous features do not always share identical sequences. Consequently, as evolutionary dis- tance increases, homology detection becomes difficult as sequences may diverge beyond statistical recognition (Doolittle, 1981; Elhaik et al., 2006). While sequence similarity is a reliable strategy for inferring homology relationships of recently diverged sequences, the most similarity search seeks to answer a more challenging question: Do homologous sequences have similar functions? 2.2.2 Orthology-function conjecture Inferring function based on orthology is a vital aspect of determining homology relationships. Functional information from orthologs in the model organisms is commonly used to character- ize genes and proteins in newly sequenced organisms (Dolinski & Botstein, 2007). Although the original definition of orthology is based on how features are derived from an ancestor and does not imply the conservation of sequences, structure, and function, the “orthology-function” conjecture assumes orthologs tend to retain equivalent functions in different species (Gabald´ on & Koonin, 2013). The assumption of conserved function was initially supported when similar molecular effects were observed when mutations in orthologs were introduced in other model or- ganisms to study the role of genes in human diseases (Quiring et al., 1994; Franco et al., 2011; Goh et al., 2011). With the availability of complete genome sequences and advancements in se- quence similarity searches to identify homologous features (Altschul et al., 1997; Henikoff & Henikoff, 1992), researchers started to explore homology-phenotype association for different genes and regulatory sequences. The relationship between gene expression and phenotype, a relationship that involves several types of molecules, signaling patterns, and interactions, is deeply complex. Comparative genomics 18 has revealed many examples in which some or even all components of a pathway can be replaced by a functional equivalent that differs in its evolutionary origin (Gabald´ on & Koonin, 2013). The situation where unrelated or paralogous proteins perform the same critical functions in different species is referred to as non-orthologous gene displacement (Koonin et al., 1996). Indeed sev- eral cases of human disease-associated mutation are known to present no phenotypic effect in mouse (Gao & Zhang, 2003). There is evidence of functional divergence between human-mouse orthologs, including genes involved in DNA repair, inflammation pathways, and development (Hi- rano et al., 2007; Yashiro et al., 2000; Liu et al., 2010). Differences in splicing patterns and copy number variations across species are known to cause expression divergence between orthologs (Nurtdinov et al., 2003; Nguyen et al., 2006). Similarly, genes and regulatory networks adapted for a different function are observed in plants (Kajala et al., 2021). Liao & Zhang (2008) showed that more than 20% of the essential human genes have non-essential orthologs in mouse, resulting in different phenotypes. In fact, out of the 500-600 genes that are estimated to be present in the last common ancestor of all cellular life, only∼ 60 genes are present in all modern-day genomes (Koonin, 2003). Similarly, studies on enzymatic evolution have identified structurally distinct non-homologous enzymes catalyzing the same biochemical reaction (Omelchenko et al., 2010; Galperin & Koonin, 2012). Furthermore, many of the genes performing essential functions in bacterial groups do not have a shared evolutionary origin (Soucy et al., 2015). Independent strate- gies exist in nature to achieve the same molecular function. Concerted evolution and phenotypic convergence can lead to similar cellular functions among cell types without a common precursor (Arendt et al., 2016). Transcription factor binding sites may evolve de novo or exaptively to bring forth new traits (Wray, 2007). For example, a new transcription factor binding site at the locus of retrotransposon integration of mRNA leads to a variant or novel function (Boer et al., 1987; Soares et al., 1985). Similarly, the rewiring of the regulatory sequences: enhancer-to-promoter transformation leads to the origination of new genes and exons (Carelli et al., 2018). Notably, some of these new genes code for novel proteins and become functionally important (Chen et al., 2010). Collectively, the re-purposing of genes, proteins, and regulatory networks imply that 19 orthology is not a necessary criterion to achieve equivalent functions. 2.2.3 Matrix algebra models for gene expression analysis A different category of methods for gene expression analysis is based on matrix factorization (MF). MF, also called matrix decomposition, is a class of unsupervised techniques that infer low- dimensional structure from high-dimensional data while preserving as much information as pos- sible from the original data. In general, these approaches use generalizations of matrix algebra frameworks to dissect genome-scale biological signals into mathematically defined patterns. Us- ing these approaches, gene expression data can be formulated as the outcome of a simple linear network: superpositions of the effects of different sources, experimental or biological, that drive the expression of genes in the dataset. The linear superpositions are estimated such that they cor- relate with biological processes that affect the measured signal and cellular states that compose the signals. Singular value decomposition (SVD) is a prominent MF approach that has been widely ap- plied in the analysis of microarray and RNA-Seq datasets. SVD decomposes high-dimensional gene expression profiles into mathematically decorrelated and decoupled patterns. Given a gene expression matrix with genes in rows and sample phenotypes in columns, SVD decomposes the matrix into three factor matrices: a right basis matrix summarizing gene expression patterns across different samples, a diagonal matrix with non-negative singular values representing the relative importance of the corresponding patterns, and a left basis matrix containing uncorrelated lin- ear combinations of genes as the columns. The model provides a mathematical framework to distinguish similar features from dissimilar ones among genome-wide expression datasets (Al- ter et al., 2000). SVD is often used to implement the principal component analysis (PCA), a method for dimensional reduction and data visualization. In addition, SVD has wide-ranging applications in gene expression analysis, including identifying and filtering experimental arti- facts, estimating missing data, identifying transcript length distribution, clustering, and charac- terizing cellular phenotypes (Lukk et al., 2010; Li & Klevecz, 2006; Bertagnolli et al., 2013; 20 Ringn´ er, 2008). The other prominent MF approaches include independent component analysis (ICA) and non- negative matrix factorization (NMF). In ICA (Comon, 1994), the goal is to extract the original signals from a mixture of signals. ICA decomposes an input data matrix as a linear combination of statistically independent latent feature variables or components. ICA has been used in gene expression context as a dimension reduction and gene-functional discovery tool (Liebermeister, 2002; Engreitz et al., 2010). NMF (Lee & Seung, 1999) models the data matrix as the product of two non-negative matrices. The non-negative constraints ensure that only additive combinations of non-negative vectors reconstruct the data. NMF also outputs sparse results, providing a compact and local representation of the data (Lee & Seung, 2000). NMF has been successfully applied in a variety of applications in computational biology. Examples include dimensionality reduction, molecular pattern discovery, clustering analysis, and classifying tumor subtypes (Devarajan, 2008). A natural extension of MF methods to simultaneously analyze multiple high-dimensional data is joint matrix factorization (JMF). Given a set ofk data matrices (D i ), a joint decomposition fac- torizes individual data matrices into a data-specific factor matrix W i and a common latent variable matrixH. Formally, JMF can be summarized as D = D 1 D 2 . . . D k =W i H fori=1,...,k. Depending upon the initial datasets and the properties of factor and latent matrices, individual decomposition can be exact (D i = W i H) or approximate (D i = W i H + ϵ i ). In non-exact decomposition, theϵ i represents the reconstruction error for thei-th dataset, and the factorization error is usually minimized by minimizing the Frobenius norm of the error, represented as∥ϵ i ∥ F . In cases where exact decomposition is not possible, JMF approaches attempt to find a solution by minimizing the total reconstruction error. 21 JMF-based methods have been applied in gene expression analysis. Integrative analysis of gene expression, DNA methylation, and miRNA expression datasets using joint NMF (jNMF) is used to identify clinically distinct patient subgroups with shared genomic features (Zhang et al., 2012). Tamayo et al. (2007) used a non-negative matrix (NMF) factorization to learn a low-dimensional approximation of the different microarray expression datasets and used the reduced space to com- pare leukemia and lung cancer expression profiles. Similarly, the simultaneous decomposition of multiple transcriptomics data is used to find biomarker genes and differentially expressed genes (Wang et al., 2015; Fujita et al., 2018). In the final section of this chapter, we will explore the idea of a common expression subspace in the context of JMF and the critical properties of this subspace that will enable efficient cross- species gene expression analysis. 2.3 Common expression subspace The natural representation of gene expression data is a matrix of the measured values (gene/transcript counts) in rows and individual samples in columns. The columns correspond to different ex- perimental conditions or cell states of the sample. One or more biological processes are shared across samples resulting in related structures in the data. The experimental replicates or samples with similar phenotypes have correlated expression profiles. The relationship between biologi- cal processes and the similarities of gene expression patterns in these processes constrain high- dimensional gene expression datasets to have a low-dimensional structure. The number of mRNA, proteins, and concurrently active pathways within any cell is constrained by several factors, such as the cell-cycle stages, cell volume, energy, and free-molecule limitation (Klumpp et al., 2013; Heyn et al., 2015). Most notably, the presence of transcriptional modules, where genes are co- regulated, indicates an effective low dimensionality in gene expression data (Eisen et al., 1998; Segal et al., 2003). Thus, for a gene expression dataset where columns share biological processes, most of the shared signals can be summarized by a small number of basis vectors relative to the 22 original measurement dimension defined by the number of genes in the genome. The idea of low-dimensional representations of gene expression profiles can also be extended to analyze gene expression profiles from multiple species. The global transcriptional patterns are more similar between homologous organs of different species than between different organs from the same species (Brawand et al., 2011; Merkin et al., 2012). The similarities in the biological processes responsible for conserved phenotypes in different species can be used to construct a low-dimensional common expression subspace. Similar to the JMF definition, cross-species gene expression matrices can be factorized to esti- mate a low-dimensional common expression subspace. Consider a set ofk species where species i has a gene-expression matrix D i with m i genes in the rows and n columns. The corresponding column of each gene-expression matrix represents an expression profile of a functionally similar phenotype. Given this setup, joint factorization is applied to obtain D i = W i H + ϵ i , where W i is a species-specific factor matrix and H represents a common expression subspace. The species- specific factor has the same dimensions as that of the original data matrix, while the shared sub- space has identical dimensions for all k factorizations. The different directions of the common expression subspace capture one or more biological processes related to a conserved phenotype in all k species. The common expression subspace provides a universal co-coordinate system to analyze gene expression profiles across species without relying on orthology mapping. In this JMF setup, thep-th column of eachD i matrix corresponds to the gene expression of a functionally similar profile representing a similar phenotype across species. Matching expression profiles of functionally similar tissues and cell types can be used. For example, Malpighian tubules in Drosophila and kidneys in mammals are shown to perform similar excretory functions (Jung et al., 2005). Kidney development and essential functions are implemented through a conserved genetic architecture, and the basic structural units of kidneys, the nephron, are also conserved in vertebrates (Lechner & Dressler, 1997; Lindstr¨ om et al., 2018). Many invertebrate species have excretory organs composed of nephron-like structures that perform similar functions to those of the vertebrate kidney. 23 The two main JMF approaches where the concept of the common subspace is used are general- ized singular value decomposition (GSVD) (Van Loan, 1976) and higher-order GSVD (HO GSVD) (Ponnapalli et al., 2011). Both of these approaches have been used to analyze cross-species gene expression data. Using GSVD, Alter et al. (2003) compared cell cycle expression profiles of human and yeast. GSVD was initially developed to factorize two matrices simultaneously, and the method was further extended to more than two matrices to develop HO GSVD. These methods decompose expression profiles of functionally similar phenotypes to estimate a shared factor: a common ex- pression subspace. These methods showed that orthology-independent cross-species analysis could be performed using this common subspace. Using cell-cycle gene expression datasets, these ap- proaches have shown examples of genes with highly conserved sequences across species but with significantly different cell-cycle peak times. Although these methods offer a potential solution for orthology-independent analysis, the esti- mated common expression subspace in GSVD/HOGSVD does not necessarily represent biological processes conserved across species. Most common expression subspace dimensions represent bio- logical processes specific to individual species. As a result, identifying functionally relevant genes corresponding to the column phenotypes is not possible. Moreover, the steps involved in estimating the common subspace and comparing the expression profiles using these methods require complex procedures. This makes the biological interpretation of the species-specific factors and common subspace difficult in these approaches. We discuss the limitations of these methods in Chapter 3. Next, we discuss some important properties of the common expression subspace and the factor matrices for cross-species gene expression analysis. 2.3.1 Orthogonal and linear properties of genetic networks Orthogonality and linearity are two critical conditions for modeling genome-scale biological sig- nals. High-dimensional data is reduced to a low-dimensional subspace by a linear transformation, and orthogonality is usually desired for the basis of this subspace. Orthogonality allows an effi- cient design of genetic networks by reducing redundancy. Evidence of orthogonality is found in 24 core metabolic pathways. Integrating gene expression data with the metabolic network structure in yeast, Ihmels et al. (2004) investigated the role of isozymes in metabolic pathways. Isozymes are chemically distinct but functionally similar enzymes. They found that most isozyme pairs were separately co-regulated with distinct processes to reduce crosstalk between pathways that use a common chemical reaction. Using orthogonality and linearity, the cross-species gene expression dataset is formulated as linear superpositions of mathematical patterns (common expression subspace), which correlate with independent biological processes. In this design, the columns of the common-expression sub- space and that of the species-factor matrices are mutually independent. Using decorrelated math- ematical patterns, the overlap or crosstalk between the genome-scale effects of the corresponding cellular elements or modules is negligible. Thus the model separates the cellular elements respon- sible for different biological processes. In the next chapter, we discuss our strategy to estimate biologically meaningful common ex- pression subspace and species-specific factors. We discuss how the limitations of the existing ap- proaches are addressed using our novel joint matrix factorization approach and discuss the details of minimizing the reconstruction error. 25 Chapter 3 Orthogonal shared basis factorization This chapter discusses the details of the joint matrix factorization we developed, how the different factors are estimated, and their interpretation. The method factorizes the gene expression profiles of functionally similar phenotypes from multiple species into a species-specific loading factor, a species-specific diagonal factor, and a common expression subspace. For each species, the load- ings for independent linear combinations of genes are stored in different columns of the loading factor. The amount of correlation information represented by different dimensions of the common expression subspace is stored in the diagonal factor. Finally, these factors are estimated such that the reconstructed expression profiles using these factors are close to the original gene expression profiles. 3.1 Method definition Consider a set of k real matrices D i ∈ R m i × n , each with full column rank. Each D i represents the expression ofn profiles (columns) in species i withm i genes (rows) annotated for that species. The p-th column of each D i matrix corresponds to the gene expression profile of a functionally similar phenotype across species. The number of rows is different for eachD i matrix, but they all 26 have the same number of columns. We define orthogonal shared basis factorization (OSBF) as D 1 =U 1 ∆ 1 V T +ϵ 1 , D 2 =U 2 ∆ 2 V T +ϵ 2 , . . . D k =U k ∆ k V T +ϵ k . EachU i ∈R m i × n is a species-specific loading matrix with orthonormal columns ( U T i U i = I) and ∆ i ∈ R n× n is a diagonal matrix with positive values. The right basis matrix V ∈ R n× n is an orthogonal matrix and identical in all thek matrix factorizations (Appendix Figure A.1). In OSBF, the right basis matrixV is a subspace of the row space (R n ) shared by allk matrices representing correlation relationships between the columns. An orthonormal basis of this shared subspace is used as the axes of the common space. We estimate the components of the matrix factorization by minimizing the total reconstruction error given by P k i=1 ∥ϵ i ∥ 2 F = P k i=1 ∥D i − U i ∆ i V T ∥ 2 F . We designed an alternating least square update strategy to minimize the total factor- ization error (section 3.1.2). In OSBF, data-specific factor matrix is U i ∆ i and the common latent variable matrix is represented by V T . A schematic summarizing the joint factorization of gene expression matrices from different species using OSBF is shown in Figure 3.1. 3.1.1 Estimating species-independent common expression subspace In OSBF, the initial estimate of the common expression subspace V is computed based on the inter-sample correlations within each species. The gene expression profiles are first standardized to compute the inter-sample correlation relationship (R i ). Let X i be the gene expression profiles of speciesi with mean-centered columns and unit variance. We can write X i =C i D i S i − 1 , 27 ≈ ≈ ≈ ≈ Dim 1 Dim 2 Dim 3 gene 1 gene 2 gene m 1 gene 1 gene 2 gene m 4 gene 1 gene 2 gene 2 gene m 2 gene 1 gene m 3 gene loadings Figure 3.1: Graphical summary of OSBF: The cartoon shows the joint factorization of four (k = 4) D i matrices each with three tissues (n = 3) as columns. OSBF is a joint matrix factorization ofk D i ∈R m i × n matrices. EachD i matrix is decomposed into a species-specific loading matrix U i ∈ R m i × n , a species-specific diagonal matrix ∆ i ∈ R n× n , and a shared orthogonal right basis matrixV . The columns ofU i are orthonormal, and∆ i has positive entries. The columns of theV matrix represent the different dimensions of the common subspace shared by all species. where C i =I m i − m i − 1 1 m i 1 m i T 28 is a centering matrix andS i =diag(s 1 ,...,s n ) is a diagonal scaling matrix. The standard deviation of j-th column of D i is represented by s j . Within species i, the correlation between expression profiles for the n profile is R i =X T i X i /m i . Each R i matrix has a dimension of n× n independent of the number of genes annotated in the species. Since the corresponding column of each D i matrix represents a similar phenotype, we define an expected correlation matrix across the species: E =E(R)= 1 k P k i=1 R i . The eigenvalue decomposition of E is determined, where E = VΛ V T and the eigenvectors of the E matrix form the initial estimates for the different dimensions of the common expression subspace. The right basis matrix V is identical in all the k species, and the dimensions of V represent the correlation relationship between the phenotypes independent of the species. The steps in estimating the common expression subspace are shown in Figure 3.2. Given the initial estimate of the common expression subspace, we find the best configuration of V ,U i , and∆ i that minimize the total reconstruction error P k i=1 ∥ϵ i ∥ 2 F . The columns of the U i matrix store the loadings for independent linear combinations of genes that transform gene expression profiles to the V space. The genes with significant contributions to the individual dimensions of the common expression subspace can be identified using coefficients (loadings) values in the U i matrices. The diagonal∆ i matrix indicates the amount of correlation information represented by different dimensions of the common expression subspace. In summary, our methodology has easily interpretable factors and employs a systematic strategy to estimate the expression subspace shared by all species. 3.1.2 Minimizing total factorization error In OSBF, we define our shared factor as a subspace capturing inter-sample correlation relation- ships. We also require the independence of the columns of the species-specific factors as it helps 29 Dim4 Dim5 Dim2 Dim1 Figure 3.2: Steps involved in the estimation of common expression subspace: Estimating com- mon expression subspace (V ) based on gene expression profiles of five tissue types ( n = 5) for three (k = 3) species. The D i matrices are first standardized to obtain the X i matrices. The inter-tissue correlation within a species (R i ) and the expectedR i matrixE is then computed. The eigenvectors of the E matrix span different dimensions of the expression subspace shared by all species. to differentiate the contribution of genes/features across different dimensions of the shared factor. The diagonal matrices represent the amount of correlation information represented by individual dimensions. Given these conditions, an exact factorization is not always possible. This introduces errors in individual factorizations, and we want to minimize the total factorization error. We estimate the shared factor V from the expected inter-sample correlation matrix across species. The different dimensions of V represent the correlation relationship between tissues in- dependent of the species. An orthonormal basis of this shared subspace is used as the axes of the common space. GivenV , we compute initial estimates ofU i and∆ i by solving the linear systems D i V = U i ∆ i . This gives us an exact solution for all k factorizations. The initial estimates of U i ’s do not have independent columns. Orthonormalizing U i ’s introduces errors and depending 30 upon the D i matrices, an exact factorization is not possible. Our goal in the optimization is that, given our initial estimate of meaningful common expression subspace, we want to find the closest configuration of U i , and∆ i , which minimizes the total factorization error. In the following sections, we will discuss how to achieve the best configuration of V , U i , and ∆ i that minimizes the total reconstruction error P k i=1 ∥ϵ i ∥ 2 F . First, we will discuss our objective function, the constraints, and the general strategy to minimize the error. In OSBF, we want to solve the following optimization problem: minimize: P k i=1 ∥D i − U i ∆ i V T ∥ 2 F , subject to: U T i U i =I, for1≤ i≤ k, V T V =VV T =I, ∆ i =diag(δ i,1 ,...,δ i,n ), δ i,j ≥ 0, for1≤ i≤ k. The objective function can be reformulated as P k i=1 tr(D T i D i )− 2tr D T i U i ∆ i V T +tr(∆ 2 i ) LetΦ ,Ψ i , andΘ i be matrices with Lagrange multipliers for constraintsV T V− I =0,U T i U i − I =0 andδ i,j ≥ 0. The LagrangeL is L(U,∆ ,V)= k X i=1 tr(D T i D i − 2D T i U i ∆ i V T +∆ 2 i )+tr(Φ( V T V− I))+tr(Ψ i (U T i U i − I))+tr(Θ i ∆ i ). Solving for an individualU i using the partial derivative ofL with respect toU i gives ∂ ∂U i L(U,∆ ,V)=− 2D i V∆ i +U i (Ψ T i +Ψ i ) U i =D i V∆ i ((D i V∆ i ) T (D i V∆ i )) − 1/2 . (3.1) 31 SolvingV using the partial derivative ofL with respect toV gives ∂ ∂V L(U,∆ ,V)= k X i=1 − 2D T i U i ∆ i +V(Φ T +Φ) V = k X i=1 D T i U i ∆ i (( k X i=1 D T i U i ∆ i ) T ( k X i=1 D T i U i ∆ i )) − 1/2 . (3.2) Solving for an individual∆ i using the partial derivative ofL with respect to∆ i gives ∂ ∂∆ i L(U,∆ ,V)=− 2U T i D i V +2∆ i +Θ i ∆ i =U T i D i V. (3.3) 3.1.3 Method derivation The section 3.1.2 shows that given the other two factors, we can find the optimal third factor: left basis matrices (equation 3.1), the right basis matrix (equation 3.2), and the diagonal matrices (equation 3.3). Given any two factors, we will discuss how to efficiently update the third factor and minimize the total factorization error. In practice, we do not compute the square root inverse to update U i and V . We can make the optimal update of our matrix in each iteration based on the singular value decomposition (SVD) of other matrices. The first two results below show that, given either the left or right basis matrix, we can find the other matrix (optimal) with the required properties. Proposition 1. Consider matrices A∈R m× n with rank n and Q∈R n× n . If XΣ Y T is the SVD of AQ T , then among matrices B ∈R m× n with orthonormal columns,∥A− BQ∥ F is minimized whenB =XY T . Proof. The norm and its square have the same minimum value, so we consider the minimum of ∥A− BQ∥ 2 F =tr(A T A)+tr(Q T Q)− 2tr(AQ T B T ), 32 which coincides with the maximum value oftr(AQ T B T ). WithXΣ Y T the SVD ofAQ T , define Z = Y T B T X. Since B T B = I, we have ZZ T = I. We can bound the maximum value of tr(AQ T B T ) as follows: tr(AQ T B T )=tr(XΣ Y T B T )=tr(ZΣ)= P n i=1 Z ii σ i ≤ P n i=1 σ i , and this bound is attained whenZ =I, and thusB =XY T . In OSBF updates, we use the proposition 1 to find the optimal left basis matrices. In proposition 1, let us substitute A = D i ,B = U i , and Q = ∆ i V T . Now we have AQ T = D i V∆ i = XΣ Y T . XY T gives the U i with orthonormal columns. The same can also be obtained by substituting D i V∆ i =XΣ Y T in equation 3.1: U i =D i V∆ i ((D i V∆ i ) T (D i V∆ i )) − 1/2 U i =XΣ Y T ((XΣ Y T ) T (XΣ Y T )) − 1/2 =XY T . Proposition 2. Consider matricesA∈R m× n with rankn andB∈R m× n . IfXΣ Y T is the SVD of A T B, then among orthogonal matricesQ∈R n× n ,∥A− BQ T ∥ F is minimized whenQ=XY T . Proof. The norm and its square have the same minimum value, so we consider the minimum of ∥A− BQ T ∥ 2 F =tr(A T A)+tr(B T B)− 2tr(A T BQ T ), which coincides with the maximum value oftr(A T BQ T ). WithXΣ Y T the SVD ofA T B, define orthogonal matrixZ =Y T Q T X. We bound the maximum oftr(A T BQ T ) as tr(A T BQ T )=tr(XΣ Y T Q T )=tr(ZΣ)= P n i=1 Z ii σ i ≤ P n i=1 σ i , and this bound is attained whenZ =I and thusQ=XY T . Using proposition 2, we can find the orthogonal right matrix for each factorization ( Ω i ) involved 33 in the OSBF. In proposition 2, let us substitute A = D i ,B = U i ∆ i , and Q = Ω i . Now we have A T B = D T i U i ∆ i = XΣ Y T and orthogonal Ω i is given by XY T . The computed matrices using proposition 2 are not shared and also does not guarantee that the total factorization error involved P k i=1 ∥D T i U i ∆ i − Ω i ∥ 2 F is minimized. We will further expand the proposition 2 to find the optimal shared right basis matrix. Proposition 3. For a set of n matrices A 1 ,A 2 ,...,A n , where A i ∈ R k× k , the orthogonal matrix Q∈R k× k minimizing P n i=1 ∥A i − Q∥ 2 F is given byQ=XY T , where SVD of P n i=1 A i =XΣ Y T . Proof. We seek to minimize P n i=1 ∥A i − Q∥ 2 F subject to Q T Q=I. Defining A= P n i=1 A i , this objective function can be rewritten tr( P n i=1 A T i A i )− 2tr(A T Q)+ntr(Q T Q). LetΦ be a matrix of Lagrange multipliers forQ T Q− I =0. The LagrangianL is then L(Q,Φ)=tr( P n i=1 A T i A i )− 2tr(A T Q)+tr(Q T Q)+tr(Φ( Q T Q− I)). SinceQ is orthogonal, the partial derivatives ofL(Q,Φ) with respect toQ are as follows: ∂L(Q,Φ) ∂Q =− 2A+Q(Φ T +Φ) . Setting this equation to0 yields: Q=A Φ T +Φ 2 − 1 and A=Q Φ T +Φ 2 . 34 Again becauseQ is orthogonal, A T A= Φ T +Φ 2 2 and Φ T +Φ 2 =(A T A) 1/2 , which impliesQ=A(A T A) − 1/2 . SinceXΣ Y T is the SVD ofA, we conclude Q=XΣ Y T (YΣ 2 Y T ) − 1/2 =XY T . In OSBF, V is shared across the k factorizations. To find the closest shared V , we will use proposition 3. In proposition 3, let us substitute A i = D T i U i ∆ i and Q = V . Now we have P n i=1 A i = P k i=1 D T i U i ∆ i = XΣ Y T and the optimal shared right basis in OSBF is given by V =XY T . 3.1.4 Iterative algorithm to minimize factorization error Based on the equations 3.1-3.3 and propositions 1-3, we devised an iterative algorithm to minimize the total decomposition error. We make the optimal update in each iteration, and since the objective function value is bounded below by zero, our procedure will converge to a stable local optimum. Steps for the iterative updates of the factorization are shown below: 35 Algorithm 1: Input: D i ,i=1,...,k 2: Output: U i ,∆ i , andV , whereU T i U i =I andV T V =VV T =I 3: InitializeU j i ,∆ j i , andV j . Initializeϵ =∞. Let∥ϵ j,j,j ∥ F = P k i=1 ∥D i − U j i ∆ j i V j T ∥ F 4: UpdateU i : U j+1 i ← ZY T , where SVD ofD i V j ∆ j i =ZΣ Y T . Compute∥ϵ j+1,j,j ∥ F If∥ϵ j+1,j,j ∥ F <ϵ : U i ← U j+1 i ϵ ←∥ ϵ j+1,j,j ∥ F 5: Update ∆ i : ∆ j+1 i ← diag((U i ) T D i V j ). Compute∥ϵ j,j+1,j ∥ F If∥ϵ j,j+1,j ∥ F <ϵ : ∆ i ← ∆ j+1 i ϵ ←∥ ϵ j,j+1,j ∥ F 6: UpdateV: V j+1 ← MQ T , where SVD of P k i=1 D T i U i ∆ i =MΦ Q T . Compute∥ϵ j,j,j+1 ∥ F If∥ϵ j,j,j+1 ∥ F <ϵ : V ← V j+1 ϵ ←∥ ϵ j,j,j+1 ∥ F 7: Repeat steps 4-6 until convergence. For a gene expression profile Y i from species i, minimizing the factorization error provides us the configuration of U i ’s and ∆ i ’s such that when we project a gene expression profile using Y T i U i ∆ − 1 i , the projected profiles lie in a space close to the estimated common expression subspace V . In addition, our update strategy reduces the initial factorization error by maintaining the prop- erties of the factors. For the common expression subspace, let the initial estimate of the common expression subspace using the eigenvalue decomposition of the expected E(R) matrix be V . Let the optimized shared subspace be ˜ V . We have R θ V = ˜ V, where R θ ∈ R n× n is a matrix. In OSBF, based on proposition 2 and 3, we express all V ’s, including our initial estimate as the product of a XY T , where X is a left singular matrix and Y is a right singular matrix. For a givenV , this is, in fact, the SVD with the diagonal matrix storing singular values being the identity matrix. So the singular values are 1; hence, all the eigenvalues are 1. 36 Thus the determinant ofV and ˜ V is 1. This makesR θ a pure rotational matrix. We can show that optimized subspace ˜ V represents the expected inter-tissue correlation relationship in this rotated space. For instance, if we apply the same rotation to our standardized gene expression matrices, we haveXR T θ . Now we have, E(R)= 1 k k X i=1 1 m i (X i R T θ ) T X i R T θ = 1 k k X i=1 1 m i R θ X T i X i R T θ =R θ 1 k k X i=1 1 m i X T i X i R T θ =(R θ V)Λ( R θ V) T = ˜ VΛ ˜ V In summary, the optimization strategy can be summarized as a procedure where we apply rotation to the initial estimate of common expression subspace and finding the corresponding optimal U i ’s and ∆ i ’s. We also show that the optimized V has similar properties to that of the initial estimate and optimized U i ’s and ∆ i ’s are easily interpretable using cross-species gene expression analysis (Chapter 4). Optimal learning rate in each update In gradient descent, we have the general update rule: θ t+1 =θ t − η t ∇F(θ t ), where t is the iteration counter, η is the learning rate at iteration t, and∇F is the gradient of the objective function. In our algorithm, the gradient of the objective function with respect to U is 37 − 2DV∆ . So we have the update rule as U t+1 =U t +η t ′ DV∆ . By settingη t ′ = XY T − Ut DV∆ , we haveU t+1 = DV∆ , where SVD ofDV∆ = XΣ Y T . From propo- sition 3.1, this is the optimal value forU, for a givenD i ,V , and∆ i . The learning rate determines the size of the steps. If the value of U t is very different from XY T , η will be high, and we will take larger steps. As it becomes closer toXY T , the learning rate also decreases. We have a similar case for updatingV and∆ i . 3.2 Properties of orthogonal shared basis factorization In this section, we discuss some critical mathematical properties of the factors estimated in the OSBF. Property 1. The matrix E is non-defective, the eigenvectors and eigenvalues (λ i ) are real with λ i >0. Proof. We haveE = E(R) = 1 k P k i=1 R i , whereR i = X T i X i /m i . LetA i = X T i X i . Now for any non-zero vectory, we have y T A i y =y T X T i X i y =(X i y) T (X i y)=∥X i y∥ 2 . Columns of X i are linearly independent, then y T A i y > 0 for y ̸= 0. So A i is a positive definite matrix. Also, the sum ofA i is positive definite. Thus E(R) is a non-singular matrix with positive eigenvalues. Property 2. The shared right basis matrixV is an orthogonal matrix. Proof. E is a symmetric non-defective matrix. Eigenvectors ofE are orthogonal. 38 Property 3. The species-specific U i matrices are orthonormal. Orthonormal columns ofU i have the following advantages. • The species-specific matrix U m× n = u 1 u 2 ... u n is a non-square matrix withm≫ n. To project expression profiles to the common expression subspace V , computation of the inverse ofU i matrices is required. Since the column vectors of theU i matrix are orthonormal, we can achieve this without computing a generalized inverse. Thus, a gene expression profile Y j from speciesj can be projected to the common expression subspace usingY T i U i ∆ − 1 i . • The orthonormal property of the U i matrices summarizes the column space of D using un- correlated vectors. Each u i corresponds to loadings for an independent linear combination of genes that transforms D to the common expression subspace. The genes with signif- icant contributions to the individual dimensions of the common expression subspace can be identified using coefficients (loadings) values. The un-correlated u i ’s help to minimize the redundancy of the biological processes represented by the common subspace and easily differentiate the contribution of genes/features across different dimensions of the common expression subspace. 3.3 Comparison with other joint matrix factorization methods Comparison of OSBF with Joint non-negative factorization (jNMF, Zhang et al., 2012) and HO GSVD (Ponnapalli et al., 2011) is summarized in Table 3.1. In OSBF, we achieve the following goals. 1. Biologically meaningful common expression subspace: The OSBF common expression sub- space is a low-dimensional space capturing conserved inter-sample correlation relationships. In OSBF, how individual dimensions of the common subspace relate to the columns of the gene expression matrices from different species and what conserved biological processes individual dimensions represent can be directly answered. 39 2. Interpretable factors: The independence of the columns of the species-specific factors and the common subspace helps to differentiate the contribution of genes/features across differ- ent dimensions of the common subspace. The diagonal matrices represent the amount of correlation information represented by individual dimensions of the common subspace. 3. Reducing the factorization error: Like many other joint matrix factorization approaches, OSBF is not an exact factorization. The total factorization error is minimized using an iterative algorithm. Table 3.1: Comparison of joint matrix factorization approaches OSBF HO GSVD jNMF Definition P k i=1 D i ≈ U i ∆ i V T P N i=1 D i =U i Σ i V T P i X i ≈ WH i Factorization type Not exact Exact Not exact Left basis Orthonormal Not orthonormal Not orthonormal,W ≥ 0 Right basis Orthogonal Not orthogonal Not orthogonal,H i ≥ 0 Shared basis estimation E =VΛ V T , where E = 1 k P k i=1 R i S =VΛ V T , where S = 1 N(N− 1) P N i=1 P j>i (A i A − 1 j +A j A − 1 i ) W ia = Wia( P i XiH T i )ia W( P i (HiH T i )ia H i = HiW T Xi W T WHi Compared with jNMF-based approaches, the gene loading matrices in OSBF can have positive and negative loadings. In our analysis, we take advantage of the signs of genes to distinguish whether a gene is contributing to the phenotype represented by the common expression subspace or not. In jNMF, the shared factor cannot be associated with a general mathematical property related to the columns of the data matrix. In OSBF, independent of data, the common expression subspace dimensions represent directions maximizing the inter-sample correlation relationships. In HO GSVD, the steps in estimating the shared factor require complex procedures. As shown in Table 3.1, estimating the HO GSVD shared factor requires the computation of all pairwise quotients and their arithmetic mean. The pairwise quotient is defined as A i A − 1 j , whereA i =D T i D i . Compared to the inter-sample correlation relationship in OSBF, the pairwise quotients do not have any biological interpretations. Also, the estimated shared factor in HO GSVD is non-orthogonal and different dimensions of the HO GSVD shared factor do not necessarily represent properties 40 related to conserved phenotypes. As a result, the biological interpretation of the shared factor is difficult in HO GSVD. Moreover, the independence of the columns of the species-specific factors is not guaranteed, making it challenging to differentiate the contribution of genes/features across different dimensions of the shared factor. The non-independence also requires the computation of generalized inverses to project expression profiles to the HO GSVD common subspace. These limitations restrict its application in cross-species gene expression analysis. A comparison of cross- species analysis of OSBF with HO GSVD using expression profiles of different tissue types is demonstrated in R package vignette. 3.4 Implementation The OSBF factorization algorithm is implemented as an R package and is available at https: //github.com/amalthomas111/SBF under GNU General Public License. The package includes several vignettes demonstrating examples of cross-species gene expression analysis us- ing OSBF. Using public gene expression datasets from different species, the properties of factors before and after applying the optimization procedure, and analysis comparisons of cross-species expression profiles with other matrix factorization approaches are also shown. 41 Chapter 4 Identifying functionally relevant genes using common expression subspace 4.1 Introduction Genes participating in the same pathway or part of the same protein complex are often co-regulated and exhibit correlated expression profiles (Kim et al., 2001; Segal et al., 2003). However, similar expression patterns do not necessarily imply that genes are functionally related. For instance, apparent co-expression of genes can occur by chance as a result of transcriptional regulation. As a result, in studies limited to single species, it would be challenging to distinguish functionally important genes from those accidentally regulated (Stuart et al., 2003). Conservation of co-expression patterns in multiple species is a powerful criterion to distinguish functionally important genes from a set of co-regulated genes. Several studies have used gene co- expression information across different species to infer gene functions (Bergmann et al., 2004; Lefebvre et al., 2005; van Dam et al., 2015; Obayashi et al., 2019). In studies co-analyzing gene expression profiles across species, the usual analysis strategy starts with selecting a set of homol- ogous genes shared by the species. The assumption is that the shared genes, usually restricted to one-to-one orthologs, perform similar roles in different species. The relationships between the 42 transcriptomic profiles of multiple species and the genes contributing to phenotypes are then in- vestigated. The relationship between gene expression and phenotype is rarely simple, and predicting phe- notypic relationships based on orthologs is not always straightforward. Although many cellular functions are conserved across species, one, more, or even all components of a pathway can be replaced by genes with independent evolutionary origin (Koonin et al., 1996). Studies on enzy- matic evolution have identified structurally distinct non-homologous enzymes catalyzing the same biochemical reaction (Omelchenko et al., 2010; Galperin & Koonin, 2012). Many genes per- forming essential functions in bacterial groups do not have a shared evolutionary origin (Soucy et al., 2015). Functional divergence of human-mouse orthologs has been observed, including genes involved in DNA repair, inflammation pathways, and development (Hirano et al., 2007; Yashiro et al., 2000; Liu et al., 2010). There is evidence for homologous genes and regula- tory networks adapted for different functions in both plants and animals (Kajala et al., 2021; Carelli et al., 2018). Species or lineage-specific genes potentially coding for novel proteins may also arise by partial gene duplication, shuffling of domains between genes, or completely de novo from non-coding regions (Zhang et al., 2004; Toll-Riera et al., 2009; Tautz & Domazet-Loˇ so, 2011; Chen et al., 2010). Collectively, the re-purposing of genes, proteins, and regulatory networks im- plies that independent strategies exist in nature to achieve the same molecular function, and or- thology does not guarantee equivalent functions. Hence, cross-species comparative approaches focussing just on orthologs fail to identify a complete set of functionally relevant genes related to a phenotype. In this chapter, we discuss how orthogonal shared basis factorization (OSBF) can be used to identify functionally important genes from cross-species gene expression datasets. OSBF estimates an expression subspace common to all species representing conserved correlation relationships between the phenotypes. The method factorizes the gene expression profiles of functionally similar phenotypes from multiple species into a species-specific loading factor, a species-specific diagonal factor and a common expression subspace. For each species, independent linear combinations of 43 genes are stored in different columns of the species-specific loading factor. The loadings are used to quantify the contribution of all genes to different phenotypes associated with the gene expression profiles, allowing us to identify functionally relevant genes. Using OSBF, cross-species gene expression analysis of six tissue types from eight vertebrate species shows that more than 40% of the genes important for tissue phenotype are not one-to- one orthologs shared in all species. Furthermore, 6-10% of these genes do not have orthologs in other species. Several examples of gene expression divergence of orthologs that can be identi- fied by comparing homologs are also detected in our analysis. Without relying on homology and annotation classification, our approach identifies functionally relevant genes for tissue phenotype, including known gene markers, non-coding genes, pseudogenes, and other less-studied genes. 4.2 Methods 4.2.1 Gene annotations and pre-processing data The genome assembly (FASTA file), GTF files, and orthology annotation were obtained from Ensembl v94 (Cunningham et al., 2019). Orthology mapping information was obtained using Ensembl biomaRt (v2.48.3, Durinck et al., 2009). The raw RNA-Seq FASTQ files for eight species were collected from NCBI GEO and EBI ArrayExpress (Additional file 3). Reads were mapped to the corresponding genome using STAR (v2.7.1a) (Dobin et al., 2013) after trimming adapters using Cutadapt (v1.16) (Martin, 2011). The library preparation protocol strand type of each library was inferred using infer experiment.py module in RSeQC (v4.0, Wang et al., 2012). Genes annotated in the GTF files were quantified using HTSeq-count (v0.13.5, Anders et al., 2015). Project-level and species-level quality checks were performed to detect outlier profiles and fix mislabeling, if any. 44 4.2.2 Quality checks of RNA-Seq profiles For each project, a species-wise exploratory analysis based on principal component analysis and hi- erarchical clustering were performed to detect outlier profiles and fix mislabeling. In each project, both principal component analysis (PCA) and hierarchical clustering heatmaps for all samples be- longing to different species were generated separately for three normalization approaches: variance stabilizing transformation after DESeq2 (v1.36.0) normalization (Love et al., 2014), TMM normal- ization, and upper-quartile normalization. TMM and upper-quartile normalization were performed using edgeR package (Robinson et al., 2010). For projects where data from multiple species are present, one-to-one orthologous protein-coding genes shared by all species were used to combine the gene expression matrix for quality checks. Profiles from different projects belonging to the same species were analyzed to check any project-specific biases due to different experimental conditions. After appropriate normalization, samples clustered by tissue type in hierarchical clustering and PCA analysis indicated no strong biases. The analysis was repeated by performing batch correction at the project level using sva v3.34.0 (Leek et al., 2012), and similar results were observed. 4.2.3 Estimating inter-tissue correlation relationship The gene expression profiles are first standardized to compute the inter-tissue correlation relation- ship (R i ). Let X i be the gene expression profiles of species i with mean-centered columns and unit variance. We can write X i = C i D i S i − 1 , where C i = I m i − m i − 1 1 m i 1 m i T is a centering matrix and S i = diag(s 1 ,...,s n ) is a diagonal scaling matrix. The standard deviation of j-th column of D i is represented by s j . The profiles from functionally similar tissue types present in all the species are used to calculate the pairwise R i relationship between species. Inter-tissue gene expression correlation within a species is calculated based on the mean expression of tis- sue types, and only genes that are expressed in both tissues (mean log 2 (TPM) ≥ 1) are used to calculate the correlation between two tissues in a species. The scatter plot between two species is generated based on corresponding R i entries (e.g., human brain-heart correlation with mouse 45 brain-heart correlation) from each species. The symmetric and diagonal entries of R i are ex- cluded. The Mantel test is used to check the correlation between two R i matrices. The percent- age of correlation information (p ij ) represented by a common subspace dimension is defined as p ij =δ 2 ij / P 6 j=1 δ 2 ij × 100, where∆ i =diag(δ i1 ,...,δ i6 ). 4.2.4 Gene expression data analysis The OSBF common subspace is estimated using the mean expression profiles of six tissues (brain, heart, kidney, liver, lung, and testis) present in all eight species. When we perform dimension reduction using OSBF, we attempt to associate the resulting eigengenes back to some biological process. We do this using functional annotations of gene sets (Gene Ontology Consortium, 2019; Kanehisa et al., 2017). Throughout our analysis, we assume pathways are concerted activities of multi-gene sets that contribute to the phenotype of a cell; we allow the same pathway to exist in distinct species independent of orthology among the comprising genes. The joint-matrix factorization is computed using the ‘SBF’ function call in our R package with parameters ‘orthogonal = TRUE’ and ‘transform matrix = TRUE’. After projecting the profiles in the common subspace, the heatmap is generated based on clustering (centroid linkage) using the ComplexHeatmap package (v2.8, Gu et al., 2016). Euclidean distance between the projected profiles is used to score the similarity. GO enrichment of top genes is performed using the goseq Bioconductor package (v1.44 Young et al., 2010) with shared orthologous genes in all eight species as the background. For each species, a custom mapping of Ensembl genes to Ensembl GO terms was used for GO enrichment analysis. The human housekeeping gene list is obtained from Eisenberg & Levanon (2013) and Hounkpe et al. (2021). The p-value for overlap between gene sets is calculated using the hypergeometric test. For dimension 1, the genes are ranked based on a score given by the absolute value of the U i coefficient. For dimensions 2-6, the genes in the corresponding U i are ranked based on a score (S) computed using the U i coefficient and expression specificity ( τ ) (Yanai et al., 2005). The score for a geneg in dimensioni is defined as S g i =|U g i |τ g , where theτ g is the tissue expression 46 specificity index. τ g is computed based on the mean expression profile of tissue types. The rank is calculated separately for the genes with positive and negative U i coefficients. Since the D i values are non-negative, genes contributing to the tissue of interest will have the same sign in the U i coefficients as the tissue centroid in the V space. The location of the tissue centroids in the common subspace depends on the dataset and the estimation parameters. The GO enrichment and overlap with housekeeping genes are performed based on the top 100 ranked genes. The tissue-relevant genes are identified as follows. Using the null distribution of the score gen- erated using the OSBF factorization of permuted expression profiles, an empirical p-value is calcu- lated for genes in different dimensions. The genes with significant p-value ( ≤ 0.001) are identified as TRG genes. The intersection plots are generated using UpSetR package (v1.4, Lex et al., 2014). Tissue-enriched genes from The human protein atlas (https://www.proteinatlas.org) are used for the previously reported tissue-specific markers (Uhl ´ en et al., 2015). In each species, pairwise differential analysis between tissues (testis, brain, heart, and kidney) is performed using DESeq2 (v1.36.0, (Love et al., 2014)) after controlling for the project variable. For each tissue, dif- ferentially expressed (DE) genes were identified (log fold change ≥ 2, FDR≤ 0.01) with respect to the other tissue. Those genes identified as DE in all three analyses were used as the common DE genes. All analysis are performed using R (v4.1.3, R Core Team, 2022). Using different cross-species RNA-Seq datasets, the R package vignettes demonstrate the analysis steps. 4.3 Results 4.3.1 Inter-tissue correlation relationship across species We first examine the inter-tissue correlation within a species ( R i for speciesi) and their relationship across species. We curated a total of 1,018 bulk RNA-Seq profiles from public databases consisting of data from eight vertebrate species, including nine different tissue types. All eight species have profiles of six different tissue types (brain, heart, kidney, liver, lung, and testis) in common. We used public RNA-Seq datasets to compare the R i relationship across species pairs. The pairwise 47 R i relationship among species is calculated based on the expression profiles of tissue types present in all the species. 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 H.sapiens M.mulatta P.troglodytes M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0.00 0.25 0.50 0.75 1.00 R = 0.88 R = 0.86 R = 0.97 R = 0.95 R = 0.95 R = 1 R = 0.93 R = 0.93 R = 0.98 R = 0.90 R = 0.98 R = 0.90 R = 0.93 R = 0.97 R = 0.98 R = 0.95 R = 0.90 R = 0.93 R = 0.94 R = 0.94 R = 0.97 R = 0.88 R = 0.90 R = 0.77 R = 0.76 R = 0.79 R = 0.66 R = 0.74 Figure 4.1: Inter-tissue gene expression correlation (R i ) relationship across species: Pair- wise scatter plot shows theR i relationship between eight species: human, chimpanzee, macaque, mouse, rat, cow, pig, and chicken. For each species, R i is calculated based on six tissues: brain, heart, kidney, liver, lung, and testis. A data point in the scatter plot represents the correlation be- tween the same tissues within a species for two species. When comparing expression profiles of six tissues present in all eight different species, R i within a 48 species were highly correlated across species (meanρ = 0.90; Figure 4.1). Similarly, a correlated pattern (mean ρ = 0.83) is observed for the pairwise analysis of five species with nine common tissue types (Appendix Figure B.1). The correlated relationship is not observed when the RNA-Seq counts are permuted (Appendix Figure B.2). These observations show that the inter-tissue gene expression relationship is similar across species. 4.3.2 Species-independent phenotypic subspace for joint analysis We applied our joint matrix factorization approach to analyze the RNA-Seq expression profiles of the eight (k =8) species: human, chimp, macaque, mouse, rat, cow, pig, and chicken. TheV space common to the eight species is estimated using the expression profiles of six tissues ( n = 6). The estimated V space is six-dimensional. The columns of V , the right basis vectors (eigenvectors), represent the different dimensions of the common subspace, providing a shared summary of the expression of genes in the six tissues of the eight species. Once the species-specific U i and ∆ i matrices are estimated, the joint analysis of gene expression profiles from the eight species is performed by projecting individual expression profiles into the V space. A gene expression profile Y i ∈ R m i × 1 from species i is projected to the shared subspace by computing Y T i U i ∆ − 1 i . The gene expression profiles from different species projected into the common subspace have identical dimensions (n× 1), independent of the number of genes annotated in individual species. Also, the projected profiles in the V space cluster by tissue type, independent of species of origin and species-divergence time (Figure 4.2 a). The species independence and the identical dimension of gene expression profiles in the V space enable efficient comparisons across species. 4.3.3 Properties of common expression subspace dimensions We next investigated the properties of different dimensions of theV space and the genes contribut- ing to these dimensions. Along dimension 1, the inter-tissue correlation is the highest, resulting in close positioning of the projected expression profiles of different tissue types. (Appendix Table C.1, Figure 4.2 b). The genes with high coefficients in the U 1 are ubiquitously expressed in all 49 a) distance 0 0.5 1 1.5 2 tissue species brain heart kidney liver lung testis species cow chicken human macaque mouse chimp rat pig Gene expression profiles in common subspace Gene expression profiles in common subspace -0.8 -0.4 0.0 0.4 0.2 0.3 0.4 0.5 Dim 1 brain heart kidney liver lung testis human chimp macaque mouse rat cow pig chicken b) Dim2 Figure 4.2: Analysis in the OSBF common subspace: a) Heatmap showing clustering of pro- jected expression profiles in the common subspace. b) The projected expression profiles along dimensions 1 and 2 of the common subspace. Gene expression profiles of different tissue types lie close to each other along dimension 1 of the common subspace. tissues and have low tissue specificity scores (expression specificity: τ ≤ 0.25, Appendix Figures B.3 a-b). The ontology analysis of genes with high U 1 loadings shows enrichment for essential cellular functions for different species (Appendix Tables C.2, C.3). Also, a significant fraction of human housekeeping genes is observed among the human genes with high U 1 loadings (hy- pergeometric p-value = 7.29e-12). These observations confirm that dimension 1 of the common expression subspace is dominated by housekeeping genes and, thus, represents essential cellular processes related to translation, cellular transport, and cell cycle shared in all species. In dimension 2, the testis profiles separate from the rest of the tissues (Figure 4.3a). The centroid of the testis profiles is located in the negative axis of the dimension 2, and the genes that show testis-specific expression tend to have high negative loadings in the corresponding loading matrix of each species (Figure 4.3 b-c, Appendix Figures B.4 a-b). The gene ontology (GO) analysis confirms the enrichment of similar testis-related functions for the top U 2 genes in different species (Figure 4.3 d, Appendix Figures B.4 a-b, Table C.4). We similarly identified genes contributing to brain, heart, and kidney phenotype from dimen- 50 a) d) Dim 2 brain heart kidney liver lung testis human chimp macaque mouse rat cow pig chicken -0.8 -0.4 0.0 0.4 -0.5 0.0 0.5 1.0 0.02 0.04 0.06 0.08 padj spermatogenesis flagellated sperm motility cell differentiation meiotic cell cycle spermatid development piRNA metabolic process homologous chromosome pairing at meiosis male meiotic nuclear division 0 0.08 0.17 0.25 0.33 Ratio of genes in GO category Dim 3 b) 0.00 0.25 0.50 0.75 1.00 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 H.sapiens -1 0 1 2 Z−score Expression specificity ( ) coefficient c) U 2 Coefficient Expression specificity ( τ) -2 -1 0 1 2 Z-score -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.00 0.25 0.50 0.75 1.00 G.gallus Figure 4.3: Dimension 2 of the OSBF common subspace: a) The projected expression profiles along dimensions 2 and 3 of the common subspace. b-c) Scatter plot showing the relationship be- tween gene loadings ofU 2 and gene expression specificity ( τ ) for human and chicken, respectively. Each gene is colored using its testis z-score expression value. d) GO enrichment analysis for top human genes in dimension 2. sions 3, 5, and 6, respectively (Appendix Figures B.5 a-c, and B.6 a-d). Along dimension 4, we observed a weak signature for lung-related genes (Appendix Figures B.5 a, d and Table C.5). These observations suggest that individual dimensions among the 2-6 of theV space represents a similar inter-tissue relationship in all species. Taken together, for all species, each dimension of the com- 51 mon expression subspace represents a similar property related to the column phenotypes of the gene expression matrices, and the contribution of each gene to this property can be measured using the gene loadings. A summary of cross-species gene expression analysis steps using the OSBF is shown in Appendix Figure B.7. 4.3.4 Identifying tissue-relevant genes Using the loadings in OSBF species-specific matrices, we next identified genes with a significant contribution to the tissue phenotype for each species: tissue-relevant genes (TRGs). We focused on the testis, brain, heart, and kidney-specific genes identified from the OSBF’s four dimensions (2, 3, 5, and 6). We first examined the annotation classification of the TRGs, and how many of these genes were characterized by previous studies. The identified genes include protein-coding, non- coding, and pseudogenes (Additional file 4). For all the four tissues, nearly 90% of the TRGs are protein-coding in all eight species (Appendix Tables C.6, C.7). Compared to the fraction of protein- coding genes annotated in different species, the protein-coding genes are over-represented in our TRG list (min hypergeometric p-value≤ 1.34e-80, Appendix Tables C.6, C.7). Among the non- coding TRGs, long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and other non-coding RNAs comprise the majority in the brain, heart, and kidney. In contrast, testis non-coding TRGs are mainly lncRNAs and pseudogenes (Appendix Figure B.8). Our analysis identified several non-coding genes previously characterized as tissue-specific (Appendix Table C.8). The testis TRGs include 56 genes out of the 62 testis-specific genes identified using deep sequencing and immunohistochemistry-based analysis in the Djureinovic et al. (2014) study (hypergeometric p- value = 2.82e-129). Similarly, a significant number of organ-specific genes reported in the human protein atlas (Uhl´ en et al., 2015) are identified among our brain, heart, and kidney TRGs (Appendix Table C.9). In summary, OSBF-based analysis identifies coding as well as non-coding TRGs in different species, including many organ-specific genes identified by previous studies. 52 a) b) c) d) 66 31 11 10 7 6 6 6 5 5 4 4 4 4 4 4 3 3 3 3 0 20 40 60 0 100 200 human chimp rhesus mouse rat cow pig chicken Genes common Human testis-specific genes 0 10 20 30 40 50 ncRNA protein-coding pseudogene H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus Testis # of non orthologous genes H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0 25 50 75 100 % of allway orthologous genes testis brain heart kidney brain heart kidney testis brain heart kidney testis H.sapiens G.gallus 0 2 4 6 8 10 12 log 2 TPM Cox8c Figure 4.4: Tissue-relevant genes identified by the OSBF: a) Distribution of testis TRGs unique to human and shared with other species. A human testis-TRG is considered to be shared with a species if the corresponding ortholog is also identified as testis-specific. Only the top 20 in- tersection sets are shown. b) Among the four tissues, the Cox8c gene is highly expressed in the testis in human, while the corresponding ortholog is chicken is over-expressed in the heart. c) The percentage of the OSBF-identified TRGs that are shared in all eight species (allway orthologs) in the Ensembl database (v94). More than 40% (on an average testis: 80.15%, brain 41.78%, heart 40.68%, and kidney 51.99%) of the TRGs are not allway orthologous. d) Gene annotation classi- fication of TRGs that do not have any pairwise orthologs in the other seven species in the Ensembl database. 53 4.3.5 Homology-independent identification of tissue-specific expressed genes The estimation of the OSBF common subspace does not rely on gene orthology mapping be- tween species. Thus, the OSBF-based analysis should be able to identify genes with prominent tissue-specific expression patterns independent of their orthology classification. We first investi- gated whether our method can identify homologous tissue-specific expressed genes that can be detected with routine cross-species analysis using ortholog mapping information between genes. The OSBF TRGs include species-specific genes as well as orthologs conserved across different species (Figure 4.4a, Appendix Figure B.9). We observed a higher number of orthologs that show tissue-specific expression shared between close species, and chicken has fewer orthologs identi- fied as TRGs (Appendix Table C.10). We identified several homologous protein-coding genes that show tissue-specific expression in different species clades and all eight species (Appendix Tables C.11 and C.12). Among the well-studied non-coding genes, the homologs of the host gene of miR- 124a are identified in both human and mouse brain. MicroRNA miR-124a is known to be highly expressed in the brain and plays a vital role in hippocampal axonogenesis (Sanuki et al., 2011). The cardiac myocytes and skeletal muscle-specific miR-133 (Liu et al., 2008) is identified in rhe- sus, chimpanzee, pig, and chicken heart. MicroRNA-196, a known miRNA enriched in kidney (Meng et al., 2016) is also identified in mouse, pig, and cow. These examples show that the OSBF can detect the homologous TRGs without using orthology-mapping information between species. 4.3.6 Expression divergence of orthologs We examined whether OSBF-based analysis could detect expression divergence of orthologs. We explored those cases where orthologs are identified as TRGs by OSBF but in different organs. In our analysis, Cox8c gene coding for one of the isoforms of the cytochrome c oxidase subunit is found to be a TRG in the heart of mouse, rat, cow, pig, and chicken. The human ortholog of the Cox8c is identified as a TRG in the testis but not in the heart (Figure 4.4b). Hallmann et al. (2016) observed that Cox8c transcript is expressed in the human testis and three other tissues but not in the heart or muscle. Similarly, C1qtnf9 is identified to be heart-specific in mouse (Su et al., 2013), 54 while the ortholog in chicken is kidney-specific. We identified three other cases of orthologs with different expression patterns (Appendix Table C.13). We also examined the cases where a gene is identified as TRG in one species but not the ortholog in other species. When comparing each species with human, we found that on average, 52% of the brain, 44% of the testis, 36% of the kidney, and 22% of the heart homologous TRGs in human belong to this category (Appendix Figure B.10). Similarly, when comparing each species with the mouse, a lower percentage of heart homologous TRGs seems to show expression divergence (Appendix Table C.14). 4.3.7 Discovery of non-homologs contributing to the phenotype Next, we investigated the non-homologous genes identified by the OSBF that cannot be detected using homology-based analysis. Notably, more than 40% of the TRGs are not one-to-one or- thologous (allway orthologous) genes shared in all eight species (Figure 4.4c). Thus any joint cross-species analysis based on the allway orthologous genes would not be able to identify these genes. Next, we examined whether the identified TRGs have at least one ortholog in the other seven species in the Ensembl database. More than 89% of the TRGs have at least one ortholog in the other seven species. In the category of genes with no reported orthologs with the other seven species, we identified a total of 427 TRGs in eight species (testis 238, brain 55, heart 45, and kidney 88) (Appendix Figure B.11 a). Most of these genes are less studied, and any pairwise analysis between species based on orthology mapping would not be able to detect these genes. These genes belong to protein-coding, non-coding RNAs as well as pseudogenes and to be exper- imentally confirmed annotation classification (Figure 4.4d, Appendix Figures B.11 b-d). Among these genes include previously characterized human testis-specific lncRNA SPATA42 (Xie et al., 2020) and miRNA miR-202 (Ro et al., 2007). In mouse testis, genes with no ortholog in the other seven species include protein-coding gene Cypt4 (Kato & Nozaki, 2012), lncRNAs Rbakdn (Liu et al., 2021), and pseudogene 1700019M22Rik (Kato & Nozaki, 2012). The brain-specific genes include scRNA BCYRN1 (Mu et al., 2020) and lncRNA MIR124-1HG (Liu et al., 2016) in human, lncRNAs Meg3 (Tan et al., 2017), Miat (He et al., 2020), and Mir124a-1hg (Sanuki et al., 2011) 55 in mouse brain. Protein-coding CEND1 involved in neuron differentiation (Gaitanou et al., 2019) is identified in the chicken brain. Chicken CEND1 does not have a reported ortholog in primates, rodents, and mammals. We identified lncRNAs LINC00881 (Li et al., 2015), TRDN-AS1 (Zhang et al., 2018), and pseudogene NPY6R (Burkhoff et al., 1998) in human heart. The mouse heart- specific genes with no ortholog in the other seven species include lncRNAs Mhrt (Han et al., 2014) and C130080G10Rik (Ponnusamy et al., 2019). The kidney-specific non-orthologs include lncR- NAs LINC01187 (Skala et al., 2020), EMX2OS (Jiang et al., 2020), and pseudogene RPS24P17 (Tomaszewski et al., 2015) in the human and mouse lncRNA D630029K05Rik (Sanchez-Martin et al., 2021). In addition, 61 organ-specific genes were identified by the OSBF (Ensembl release 94) that did not have any pairwise ortholog in the other seven species have at least one ortholog in the latest annotation release 105 (Appendix Table C.15). Taken together, these findings show that OSBF can identify genes contributing to the phenotype independent of their annotation and homology classification. 4.3.8 Identifying tissue-relevant differentially expressed genes Further, we investigated how TRGs identified by the OSBF can complement within-species differ- ential expression analysis. For each tissue in a species, pairwise differential expression analyses with other tissues were performed, and common differentially expressed (cDE) genes were iden- tified (log fold change ≥ 2, FDR≤ 0.01). On average, 93% (brain 97%, heart 94%, testis 94%, kidney 88%) of the TRGs identified by OSBF are cDE genes across eight species (Appendix Table C.16). These are the DE genes with relatively high expression, tissue specificity, and high loadings (Figure 4.5 a-c). The remaining TRGs that are not cDE have high expression in two tissues, thus not appearing in all pairwise DE analyses. Interestingly, the DE genes that are also identified as TRGs only constitute a small portion (13%) of the DE genes (Appendix Table C.16). So we com- pared the properties of the DE genes that are also TRGs with those DE genes that are not TRGs and have low loadings. The ontology analysis of non-TRG DE genes with low loadings for all four tissues shows enrichment of biological processes unrelated to the tissue phenotype (Appendix 56 Table C.17). On the contrary, the DE genes that are also TRGs show tissue-related functions (Ap- pendix Table C.17). This confirms that OSBF’s loadings help to identify DE genes with significant contributions to the phenotype. a) b) c) −0.04 −0.03 −0.02 −0.01 0.00 U 2 Coefficient U 3 Coefficient 0.00 0.01 0.02 0.03 0.04 U 5 Coefficient U 6 Coefficient −0.06 −0.04 −0.02 0.00 0.00 0.02 0.04 0.06 DE DE+TRG DE DE+TRG DE DE+TRG DE DE+TRG testis brain heart kidney H.sapiens −0.04 −0.02 0 0.02 0.00 0.25 0.50 0.75 1.00 Dim2 Coefficient Expression specificity DE + TRG DE Not DE DE + TRG DE Not DE 0 5 10 15 0.00 0.25 0.50 0.75 1.00 Expression specificity testis log2(TPM) H.sapiens Figure 4.5: Differentially expressed and tissue-relevant genes: Pairwise DE analysis of the human testis with other four human tissues is performed (LFC≥ 2, FDR≤ 0.01), and common DE genes were identified. a) Scatter plot showing the relationship between gene loadings of U 2 and gene expression specificity ( τ ) in human. The genes are color based on the DE and TRG status. b) Same as (a) with the x-axis representing actual gene expression values in the human testis. c) Distribution of corresponding gene loadings of DE and DE + TRGs in the human testis, brain, heart, and kidney, respectively. 57 4.4 Discussion Cross-species analysis of gene expression is challenging as different species have varying num- bers of genes, and complex relationships exist between genes of two species – a problem dra- matically exacerbated when co-analyzing more than two species. The usual strategy to compare gene expression profiles across species is to rely on orthology relationships between genes, as- suming functional equivalence of genes across species. The accumulation of evolutionary events such as gene expansions, losses, and other genomic rearrangements complicate the homology re- lationships between the genes (Lynch & Conery, 2003; Koonin, 2005; Kuzniar et al., 2008) and restricts orthology-based comparisons to evolutionary close species. Furthermore, independent strategies exist in nature to achieve the same molecular function. In this study, we developed a novel mathematical approach called the orthogonal shared basis factorization (OSBF) to analyze gene expression profiles across species. Our method estimates a shared biological space capturing conserved properties across species and enables cross-species comparisons of expression profiles without relying on gene mapping between species. The conventional cross-species gene expression analysis aims to identify genes important for the phenotypes using a “genes selection - compare phenotype - identify genes” strategy. These approaches first select a subset of shared genes across species and then compares the expression profiles to identify genes essential for the phenotype. In contrast, the OSBF cross-species anal- ysis employs a “phenotype to genes” approach. Our joint matrix factorization starts with gene expression profiles of functionally matching phenotypes (columns) from different species. The correlation relationships between the phenotypes in individual species and their similarity across species are leveraged to estimate a species-independent expression subspace. The common ex- pression subspace can be considered a subspace representing correlation relationships between functionally similar gene expression profiles in a pseudo-ancestor species. Cross-species analysis based on OSBF uses information from all genes and does not require prior selection of shared genes. Importantly, no prior assumption of functional equivalence of genes across species is made in our analysis. In our estimation procedure, the relative weights of genes in the species-specific 58 matrix are distributed such that the genes with high loadings have similar biological functions in all species. Thus, the genes with significant contributions to the different dimensions of the common subspace, and hence to the phenotype can be easily identified. Using OSBF, we performed a joint analysis of gene expression profiles of six tissues from eight species, including primates, rodents, mammals, and bird. Without relying on orthology mapping between species, we estimated a species-independent six-dimensional expression subspace com- mon for all species. OSBF offers a common coordinate system for cross-species comparisons as gene expression profiles from different species projected into the shared subspace have identical dimensions. We showed that a similar biological property is represented along individual dimen- sions of the common expression subspace, independent of the species and species-divergence time (Figures 3c-d, S5-S8). Genes relevant to organ functions were determined using loading values in the species-specific matrix. The identified genes include well-studied genes as well as less-studied genes from diverse annotation classifications. Besides the genes that can be identified using routine cross-species analysis based on orthologs, we identified several tissue-relevant genes that do not have orthologs. In summary, cross-species analysis using OSBF identifies a comprehensive list of genes driving the phenotype independent of their annotation and homology classification. The cross-species gene expression analysis using the OSBF requires the availability of func- tionally similar profiles in different species. Finding similar functional profiles from multiple species is challenging as most of the existing publicly available expression datasets are from a few model organisms. Recently single-cell genomics has made complete transcriptomic profiles of dif- ferent species available (Fincher et al., 2018; Farrell et al., 2018). This provides an unprecedented opportunity to curate a rich collection of matching cell types across a wide array of species. Ex- tending our approach from tissue to cell type resolution will increase the number of functionally similar profiles and allows cross-species analysis of diverse species using the OSBF. In multi-species studies, data associated with the individual species are not independent – a consequence of their shared evolutionary history (Felsenstein, 1985). In our OSBF analysis in- volving expression profiles of different tissue types from eight species, we have not yet been able 59 to detect any drastic phylogenetic artifact in the projection plots of each species’ tissues within the common expression subspace. The projection plots clearly reflect tissue organization within the data. However, for certain phylogenies, we expect that within tissue-derived clusters, we will be able to see a residual effect of the phylogeny. In the future, incorporating the phylogenetic effect on the inter-tissue relationships across species could be a potential improvement. Our study uses previously published RNA-Seq datasets from eight species. Our cross-species gene expression analysis is limited due to the differences in tissue extraction and profiling method- ology used by different laboratories, confounding effects due to age, sex, and differences in cell population between species. 4.5 Conclusion Our novel matrix factorization approach enables joint analysis of cross-species gene expression profiles without relying on orthology mapping between species. Genes/transcripts of multiple species driving the phenotype can be identified using OSBF independent of their annotation and homology classification. Homology-independent cross-species comparisons will become very use- ful as new cell types and tissue types of less-studied species are characterized. We believe that OSBF will be a valuable tool for researchers for cross-species gene expression comparisons. 60 Chapter 5 Cross-species gene expression query 5.1 Background Similarity search in multi-species databases has long been essential in functional inference for genes and annotating newly assembled genomes (Altschul et al., 1990; Finn et al., 2011). Com- parative approaches integrating gene expression data from multiple species hold equal promise in characterizing cell phenotypes and their interrelations within cell populations. Nevertheless, compared to sequence similarity-based searches, finding similar expression profiles in the public domain matching a newly profiled cell phenotype in a gene expression dataset is not prominent. The first requirement for leveraging existing public gene-expression data to help interpret new data is the availability of the repositories. Fortunately, biological data repositories such as Gene Expression Omnibus (GEO; Edgar et al., 2002) and ArrayExpress (Brazma et al., 2003) are grow- ing at astounding rates, fueled mainly by advances in sequencing technology. A massive amount of gene expression datasets from various species (> 1500 species) is available in the public domain, including cell-type atlases of many tissues and transcriptomic profiles of whole organisms from diverse phyla (Achim et al., 2018; Cao et al., 2017; Fincher et al., 2018; Schaum et al., 2018). The steady growth of public databases in size and diversity presents us with an unprecedented opportu- nity for data-driven biological discoveries and characterization of cellular phenotypes that would 61 be difficult to achieve from individual projects (Costa, 2014). The second requirement for effectively reusing public data is a search algorithm that efficiently finds similar expression profiles within and across species. Different groups have proposed algo- rithms based on gene set enrichment and co-expression analysis to find similar expression profiles within a species. The correlation structure between genes is used to find genes co-regulated with the query in the SEEK approach (Zhu et al., 2015). ProfileChaser (Engreitz et al., 2011) is a web server designed for querying microarray expression profiles by comparing differential expression profiles of the query and database. CellMontage (Fujibuchi et al., 2007) is another web-based tool that finds similar microarray expression profiles based on rank correlation. GEO and ArrayExpress also allow users to query based on metadata and filter results by species. Compared to within-species search, limited advancements have been made to identify similar profiles from a different species by querying the gene expression profile of an unknown sample against a database. Usual cross-species search strategies are based on assigning higher weights to orthologous genes expressed at very high or shallow levels (Le et al., 2010), determining the similarity using top-expressed orthologous genes (Zinman et al., 2013) and gene set enrichment analysis for cell-type identification (Kabir et al., 2018). Most existing approaches and web servers designed to query gene expression profiles are developed for microarray datasets, focusing on a sin- gle species or limited to very few model organisms, primarily human and mouse. For cross-species comparisons, almost all methods rely on a subset of genes shared between species, primarily focus- ing on one-to-one orthologs and thus discarding information from the rest of the genes. Similarly, metadata and gene set enrichment-based approaches fail to utilize the information provided by the actual expression values. The lack of a robust cross-species gene expression query strategy hinders our ability to efficiently use publicly available data to expedite discoveries. One of the major challenges in implementing a cross-species search is that the number of genes annotated varies from species to species. As a result, the expression profiles from different species are high-dimensional vectors of varying lengths. In this chapter, we discuss how orthogonal shared basis factorization offers a common coordinate system for cross-species comparisons. Without 62 relying on gene mapping between species, our approach provides a mathematical framework to utilize prior knowledge available in existing public datasets to analyze and query transcriptomic profiles across species. First, we systematically evaluate the effect of phylogenetic distance in identifying similar expression profiles across species using one-to-one orthologous genes. Using public RNA-Seq datasets from 17 species with divergence times ranging from<10 million years to >750 million years, we show that our novel strategy identifies similar expression profiles in close and distant species. 5.2 Problem statement Consider a set of k species S = {s 1 ,s 2 ,...,s k } with m j genes annotated in species s j . A col- lection of gene expression profiles from k species available in the database is represented by C, where C ={y s 1 1 ,y s 1 2 ,...,y s 2 1 ,y s 2 2 ,...,y s k 1 ,...}. Here y s j i ∈ R m j × 1 is the expression profile i from species s j . y s j i is a column from the gene expression matrix of species s j , where the rows are genes, and the columns represent phenotype. For a gene expression profile of interest x q ∈R mq× 1 withs q ∈S, the cross-species gene expression query against the database comparesx q with profiles from other species in the C. The best cross- species hit, the profile with the highest similarity, will be given by argmin l,s ′ ̸=q f T q (x q )− T s ′ (y s ′ l ) . Here T is any transformation function applied to the genes of individual species so that the di- mensions of the transformed query and transformed database profiles are identical. The similarity between query and database profiles is calculated using any distance function represented by f. We use one-to-one orthology mapping as the transformation function for orthology-based cross- species queries. 63 5.3 Methods 5.3.1 Scoring search results To calculate the similarity between query and database expression profiles, five different distance metric scoring were used. They are Euclidean distance, cosine distance, 1 - Pearson correlation, 1 - Spearman correlation, and the square root of the Jensen-Shannon divergence. For each search result, average precision was calculated, and the mean average precision (mAP) score was deter- mined for the entire query set. IfQ is the set of queries and set of all relevant results forq j ∈Q is {d 1 ,...,d m j } such thatR jk is the set of ranked results from the top until you get a relevant result d k , then mAP(Q)= 1 |Q| |Q| X j=1 1 m j m j X k=1 precision(R jk ) mAP score (ranging from 0 to 1) is a widely used performance measure in information retrieval to summarize a set of query results. mAP score of 1 indicates all the relevant hits in the database appear as the best hits. Average precision can be calculated only when at least one true positive is available in the database. 5.3.2 Species clade definition Hierarchically organized clades were defined based on the divergence time with humans. The clades listed from most recent to ancient/topmost are primates, treeshrew, rodents, mammals, bird, tetrapods, vertebrate, and fly (Table 5.1). Each clade is a superset consisting of all species that are evolutionary recent to that clade. E.g., clade rodents consists of species from rodents, treeshrew, and primates. For each clade, log2 transformed transcripts per million (TPM) normalized counts based on the protein-coding orthologous genes shared by all species in that clade were used for clustering and gene expression search. The phylogenetic distance of a clade is taken as the average divergence time of all species in the clade with humans. 64 Table 5.1: List of species and shared orthologous genes in different clades Species Clade #species at each clade Clade-wise shared protein-coding genes Homo sapiens primates 6 12,359 Pan troglodytes primates 6 12,359 Macaca mulatta primates 6 12,359 Gorilla gorilla primates 6 12,359 Pan paniscus primates 6 12,359 Callithrix jacchus primates 6 12,359 Tupaia belangeri treeshrew 7 9,550 Mus musculus rodents 10 8,046 Rattus norvegicus rodents 10 8,046 Oryctolagus cuniculus rodents 10 8,046 Bos taurus mammals 12 7,465 Sus scrofa mammals 12 7,465 Gallus gallus bird 13 5,879 Xenopus tropicalis tetrapods 15 4,671 Anolis carolinensis tetrapods 15 4,671 Danio rerio vertebrate 16 3,303 Drosophila melanogaster fly 17 964 5.3.3 Cross-species gene expression search based on orthology For cross-species search in each clade, gene expression profiles from that clade were randomly split into a database (70%) and query set (30%) based on each tissue type. For each clade, sampling of profiles into the database and query set was performed n (=100) times, and queries were carried out each time to determine mAP. A median mAP score was used for all analyses and plots. For cross- species queries, only search results (hits) from a species different from the query were evaluated. To check the effect of a varying number of orthologous genes, profiles from primates clades (7 species) were queried using shared orthologous genes corresponding to different clades. We also repeated the search using only profiles from other clades and observed a similar trend as that shown by primate profiles. The impact of a varying number of orthologous genes on search per- formance was also evaluated in a lower-dimensional space defined by principal components (PC). For each clade, principal components directions and rotation matrix were estimated by factorizing the database profile matrix (70% data) using singular value decomposition. A reduced dimension 65 space based on top PCs was determined such that selected PCs explain at least 80% of the to- tal variance. The database and query profiles were then projected into this lower dimension, and the nearest neighbors were identified using different distance metrics. For cross-species search evaluation, different parts of the brain (cerebellum, frontal cortex, etc.) were considered as brain tissue, and different parts of the intestine (ileum, duodenum, etc.) were considered as the intestine. Drosophila profiles of malpighian tubules and oenocytes were considered functionally similar to kidney and liver, respectively, as these tissues have been reported to have high similarities with the vertebrate counterparts (Gutierrez et al., 2007; Jung et al., 2005). 5.3.4 Clustering and gene expression analysis Gene ontology (GO) biological process enrichment for shared orthologous genes at the fly clade was calculated using the goseq Bioconductor package v1.44 (Young et al., 2010). For GO analysis, the shared orthologous gene set at the primate clade that is not present in the fly clade was taken as the background. Principal component analysis (PCA) was performed using the prcomp R function by choosing subsets of genes capturing the most variance across the entire samples (row-wise variance of the gene expression matrix: genes× samples) and maximizing the projection score (Fontes & Soneson, 2011). We fitted a two-way ANOV A model to calculate the relative proportion of gene expression variance explained by tissue and species factors for the shared genes across k species. The fraction of variance explained by species and tissue levels is calculated similarly to the variance decomposition approach in Breschi et al. (2016). Precision-Recall curves were generated using the PRROC package (v1.3) in R (Grau et al., 2015). For the search results, AUCPR plots were created based on cosine distance. The null distribution of distance scores was obtained by querying random profiles, and the empirical p-value of search results was obtained by comparing it to the null. 66 5.3.5 OSBF-based cross-species analysis The common expression subspace (V ) was estimated using the mean expression of five tissues (brain, heart, kidney, liver, and testis) present in k species. Profiles from individual species were randomly split into training (database, 70%) and testing (query 30%) set and the OSBF is esti- mated based on the training datasets. Using species-specific U i and∆ i matrices, individual profiles from speciesi is projected to the common expression subspace. The sampling of profiles into the database and query set was repeatedn=100 times, and the cross-species search was performed in each case. The median mAP score is used for all analyses and plots. Using the median expression of tissues instead of the mean expression for estimating the common subspace produced a similar search performance. The heatmap plots are generated based on the sample distance of projected profiles in the V space. For comparison of HO GSVD-based search with OSBF, profiles were projected onto common subspace using generalized inverse for HO GSVD. To calculate tissue specificity, the top 1% genes with the highest coefficient in the left basis matrix were identified for each species. Tissue specificity metrics: Tau ( τ ) (Yanai et al., 2005) and Gini coefficient (Ceriani & Verme, 2012) were calculated for each species based on the av- erage expression of five shared tissue types. GO enrichment of top genes was performed with shared orthologous genes in all species as the background. The top 100 genes with the highest coefficient in the different columns of the left basis were used for ontology analysis and heatmap. Heatmap based on mean log2 TPM expression values in different tissues was generated using the ComplexHeatmap package v2.8 (Gu et al., 2016). 5.4 Results 5.4.1 Database and query setup We curated 1,750 bulk RNA-Seq profiles from 88 projects consisting of data from 17 different species, including 38 different tissue types. Previous cross-species gene expression studies have 67 shown that the transcriptional patterns between homologous organs of different species are more similar than other organs of the same species (Brawand et al., 2011; Merkin et al., 2012; Barbosa- Morais et al., 2012; Cardoso-Moreira et al., 2019). A search result entry (hit) is considered a true positive if the tissue type matches the query. 5.4.2 Within-species gene expression query We first evaluated the performance of within-species gene-expression search by restricting the query and database to profiles from the same species. For within-species gene expression search, profiles from one species were queried against all other profiles from the same species using a complete set of genes. A high accuracy (mean average precision (mAP) score≥ 0.95) is observed (Appendix Figure D.1 a) across all species. To check for project-specific biases, we repeated the search by removing the same tissue type profiles from the database set belonging to the same project as the query. This was to verify whether the search could identify functionally similar tissue types instead of hits from other tissue types from the same project. No project-specific biases for the search results are observed in our curated datasets (Appendix Figure D.1 b). 5.4.3 Cross-species gene expression query based on orthologous genes We define seven hierarchically organized clades based on the species divergence time to study the impact of a varying number of orthologous genes shared across different species in finding similar cross-species gene expression profiles. The number of one-to-one protein-coding genes shared by species belonging to different clades ranged from 12359 genes for the most recent clade (primates) to 964 genes for the ancient clade (fly) (Table 5.1). The gene expression profiles in each clade are split into the database (70%) and query (30%) set, and the cross-species gene expression search is carried out based on the shared number of orthologous genes. Matching tissue types were correctly identified as the top hits for profiles from the recent clades (primates 0.97 mAP and rodents 0.93 mAP) but the accuracy declined for the ancient clades (vertebrates mAP 0.83 and fly 0.55 mAP) (Figure 5.1 a). The mAP scores for the search results decreased linearly with the phylogenetic 68 distance (Figure 5.1 b). Euclidean Cosine mAP 0.6 0.7 0.8 0.9 1.0 primates mammals bird vertebrate fly primates mammals bird vertebrate fly a) mAP Phylogenetic distance (Mya) Euclidean Cosine Pearson Spearman JSD 1/2 0.5 0.6 0.7 0.8 0.9 1.0 200 400 600 800 b) Figure 5.1: Cross-species gene expression search performance based on orthologous genes: a) The distribution of mean average precision (mAP) score for orthology-based search based on cosine and Euclidean distance. For each clade, the database and queries consist of gene-expression profiles from species belonging to that clade, and queries are performed based on the shared orthol- ogous genes corresponding to that clade. The search is carried out n (n=100) times by sampling the entire gene expression profiles into the database (70%) and query (30%). b) Median mAP scores of the cross-species search results at different clades based on orthologous genes. The mean phylogenetic distance of different species in each clade with the human is shown on the x-axis. Identifying similar gene expression profiles across species became less effective in ancient clades compared to the recent clades, and the accuracy drops (Figure 5.2 a). Similarly, a drop in the accuracy is observed for ancient tiers when mAP is evaluated based on top hits Figure 5.2 b). A search in the principal component space based on the orthologous genes also showed a similar trend for the ancient clades. Additionally, when only profiles from primate clade are searched at different clade levels, using subsets of orthologous genes corresponding to respective clades, a low precision at ancient clades is observed (Figure 5.2 c). This suggests that the decreased performance in the ancient clades is not due to the broader database and query sets. We also evaluated pairwise search between distant species and observed lower accuracy in finding similar expression profiles (Figure 5.2 d). 69 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 Recall Precision primates (AUC 0.93) mammals (AUC 0.78) bird (AUC 0.73) vertebrates (AUC 0.66) fly (AUC 0.39) a) 0.8 0.6 0.4 0.2 0.0 primates mammals bird vertebrate fly primates mammals bird vertebrate fly top30 top50 mAP Euclidean Cosine b) mAP Euclidean Cosine Human Human Mouse Mouse 1.0 0.8 0.6 0.4 0.2 0.0 Drosophila Zebrafish 1.0 0.8 0.6 0.4 0.2 0.0 Recall 1.0 0.8 0.6 0.4 0.2 primates (AUC 0.93) mammals (AUC 0.93) bird (AUC 0.93) vertebrates (AUC 0.9) fly (AUC 0.64) random (AUC 0.86) Precision c) d) Figure 5.2: Cross-species gene expression search performance based on 1-to-1 orthologous genes: a) Precision-recall curve for orthology-based search for different clades. b) Cross-species orthology-based search performance when the average precision is evaluated for each query’s top 30 and 50 search hits. c) Precision-Recall curve showing the performance when primate profiles are searched at different clades. Random indicates when the same number of orthologous genes at the fly clade (n=964) are randomly selected from the shared orthologous genes at the primate clade, and the search is performed using these random sets of orthologous genes. d) Pairwise cross-species search accuracy when human and mouse profiles are searched against drosophila and zebrafish profiles. Pairwise 1-to-1 coding orthologs are used to perform the query. 70 5.4.4 Limitation of orthology-based cross-species gene expression search We observed that the proportion of housekeeping orthologous genes increases with the species divergence time, resulting in fewer shared tissue-specific orthologous genes among distant species (Appendix Figure D.2 a). Almost half of the orthologous genes shared in all 17 species (fly clade) are housekeeping genes (Appendix Figure D.2 b). Proportion of variance across tissues 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 Proportion of variance across species Figure 5.3: Proportion of gene expression variation explained by tissues and species fac- tors: The proportion of gene expression variation explained by tissues and species factors in a two- way ANOV A model for the shared orthologous 964 genes acrossk =17 species. For the combined gene expression pro- file constructed based on shared orthologous genes across17 species, we calculated the rela- tive proportion of variance explained by tissue and species factors using a two-way ANOV A model. The analysis showed that the expres- sion of the shared orthologous genes mainly varies across species (Figure 5.3). This shows that most shared genes do not show tissue- specific expression and are likely to be house- keeping genes. Gene ontology analysis also in- dicates that genes shared in all species are pri- marily associated with essential cellular func- tions (Appendix Table D.1). Comparative transcriptomic studies have observed organ-dominated clustering patterns suggesting that the expression profile of func- tionally similar tissues of different species are more similar than different tissues of the same species (Barbosa-Morais et al., 2012; Merkin et al., 2012; Brawand et al., 2011). We also observed clustering by tissue types is observed for profiles belonging to clades that share a substantial num- ber of orthologous genes (Appendix Figure D.3 a). In contrast, no clear separation of tissue types is noticed when a limited number of orthologous genes are shared between species, especially in 71 ancient clades (Appendix Figures D.3 b-c). Clear separation for all tissue types is also not observed when primate profiles are clustered using shared orthologous genes corresponding to the fly clade (Appendix Figure D.3 d). Similarly, pairwise clustering analysis between distant species showed no grouping of similar tissues as profiles from the same species tend to cluster together (Appendix Figure D.4). Sequence similarity is the sole criterion to infer orthology. For distant species, sequences may diverge beyond statistical recognition. Accumulation of evolutionary events such as gene expan- sions, losses, and other genomic rearrangements tends to blur the recognition of true orthologs among homologs (Lynch & Conery, 2003; Koonin, 2005; Kuzniar et al., 2008). These limitations also influence the accuracy of cross-species gene expression queries using orthologous genes. As species divergence time increases, identifying conserved transcriptomic patterns becomes increas- ingly difficult. Only partial information is used by restricting the comparison to a subset of shared genes. When profiles of distant species are involved, a significant fraction of information is lost by restricting the analysis to orthologous genes. In summary, cross-species gene expression search based on one-to-one orthologous genes often fails to identify similar expression profiles of distant species accurate; hence not an effective strategy to query expression profiles. 5.4.5 Cross-species gene expression search using OSBF When the query (testing) set consists of profiles from tissue types that are shared across all species, cross-species search using OSBF achieved high accuracy across all clades (mAP≥ 0.9) (Figure 5.4a). Matching tissue types are the top hits for queries from all species. A notable increase in accuracy is observed when profiles from all 17 species are queried using the common expression subspace compared to the one-to-one orthologous gene-based cross-species search at the fly clade (Figure 5.4b). Similarly, when the profiles from Drosophila and Zebrafish are queried, accurate gene-expression profiles matching query tissue types are identified (Figure 5.5 a-d; Appendix Table D.2) and achieved higher accuracy than orthology-based search (Appendix Tables D.3 and D.4). 72 primates treeshrew rodents mammals bird tetrapods vertebrate fly primates treeshrew rodents mammals bird tetrapods vertebrate fly Euclidean Cosine Euclidean Cosine 0.0 0.2 0.4 0.6 0.8 1.0 mAP mAP primates fly primates fly 0.75 0.80 0.85 0.90 0.95 1.00 OSBF 1-to-1 orthologs a) b) Figure 5.4: Cross-species gene expression search performance in the common expression sub- space: a) Performance of cross-species search in OSBF common subspace across different clades defined in this study. b) Comparison of performance of OSBF and orthology-based cross-species search at the primate and fly clade. At the fly clade, profiles from all 17 species are present. When profiles from 38 tissue types are queried, including tissue types that are not involved in the estimation of the shared subspace, an overall mAP of 0.75 is achieved compared to 0.55 for orthology-based search at the fly clade (Appendix Figure D.5 a). Although the common expression subspace is estimated using a small subset of shared tissues: brain, heart, liver, kidney, and liver (n = 5, 13.16%) across 17 species, queries from 14 tissue types (36.84%) showed an mAP score ≥ 0.9 (Appendix Table D.5, Appendix Figures D.5 b-d). For example, when the muscle profile (GSM2464051) from chicken is searched, the top results consist of muscle profiles from primates, rodents, and other mammals (Figure 5.6 a). When profiles from distant species are searched against the rest of the species, the accuracy improves for the OSBF-based query compared to the shared orthologs-based query (Appendix Table D.6). This suggests that OSBF-based cross-species search is not restricted to tissue types involved in the estimation of the common subspace, and compared to the shared gene-based approach, the accuracy improves, especially for distantly related species. 73 a) b) c) d) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 S.scrofa M.musculus M.musculus M.musculus H.sapiens M.musculus S.scrofa S.scrofa M.musculus M.musculus 0.0025 0.0035 0.0036 0.0042 0.0044 0.0045 0.0046 0.0046 0.0050 0.0050 0.0093 0.0097 0.0010 0.0101 0.0101 0.0102 0.0102 0.0103 0.0103 0.0103 D.melanogaster oenocyte (GSM3058508) liver liver liver liver liver liver liver liver liver liver Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta H.sapiens M.mulatta M.mulatta R.norvegicus M.musculus C.jacchus S.scrofa S.scrofa R.norvegicus 0.0061 0.0076 0.0100 0.0116 0.0117 0.0120 0.0121 0.0127 0.0128 0.0130 0.0108 0.0115 0.0126 0.0134 0.0135 0.0136 0.0137 0.0140 0.0140 0.0141 D.melanogaster malpighian tubule (GSM1414456) kidney kidney kidney kidney kidney kidney kidney kidney kidney kidney Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 O.cuniculus D.melanogaster T.belangeri M.musculus R.norvegicus M.musculus M.musculus M.musculus M.musculus M.musculus 0.0002 0.0004 0.0006 0.0006 0.0007 0.0007 0.0007 0.0007 0.0007 0.0009 0.0084 0.0085 0.0086 0.0086 0.0086 0.0086 0.0086 0.0086 0.0086 0.0087 heart heart heart heart heart heart heart heart heart heart D.rerio heart (GSM1215619) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta P.troglodytes M.mulatta C.jacchus M.mulatta G.gallus M.mulatta B.taurus O.cuniculus C.jacchus 0.0006 0.0007 0.0008 0.0009 0.0010 0.0011 0.0011 0.0011 0.0011 0.0011 0.0086 0.0087 0.0087 0.0087 0.0087 0.0088 0.0089 0.0089 0.0088 0.0088 brain brain brain brain brain brain brain brain brain brain D.rerio brain (GSM1620933) Figure 5.5: Cross-species gene expression search performance for distant species using OSBF common subspace: a,b) Top search results for a Malpighian and oenocyte profile from Drosophila. c,d) Top search results for a heart and brain profile from Zebrafish. The accession ids for the search hits in (a) -(d) can be found in Appendix Table S12. 74 a) G.gallus muscle (GSM2454051) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 B.taurus P.troglodytes P.troglodytes S.scrofa M.musculus M.musculus M.mulatta M.musculus M.mulatta S.scrofa muscle muscle muscle muscle muscle muscle muscle muscle muscle muscle 0.0018 0.0056 0.0058 0.0060 0.0061 0.0068 0.0069 0.0073 0.0074 0.0075 0.0090 0.0106 0.0106 0.0107 0.0108 0.0111 0.0111 0.0113 0.0113 0.0114 b) kidney (GTEX.13112.2126.SM.5GCO4) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 P.troglodytes S.scrofa M.musculus M.mulatta M.musculus R.norvegicus S.scrofa D.rerio B.taurus R.norvegicus kidney kidney kidney kidney kidney kidney kidney kidney kidney kidney 0.0076 0.0118 0.0124 0.0146 0.0147 0.0148 0.0149 0.0154 0.0161 0.0168 0.0115 0.0135 0.0150 0.0151 0.0152 0.0152 0.0155 0.0159 0.0163 0.0164 Figure 5.6: Cross-species gene expression search using common expression subspace: a) Top cross-species search results when a test dataset muscle expression profile (GSM2464051) from chicken is searched in our database. b) Top search results when a GTEx kidney expression profile is queried against our database. The accession ids for search hits in (a) and (b) can be found in Appendix Table D.8. 0.0 0.2 0.4 0.6 0.8 1.0 brain heart liver kidney testis 2642 861 226 89 361 GTEx tissues mAP Figure 5.7: Cross-species search perfor- mance of GTEx expression profiles: Cross- species search performance for different tissue type profiles from the GTEx study. The num- ber of profiles belonging to each tissue in the GTEx study is shown in the bars. We evaluated the performance of the OSBF cross-species search strategy on additional pub- lic datasets. We queried 12537 bulk RNA-Seq profiles of 17 tissues from the GTEx consortium against our curated datasets. The OSBF common subspace is estimated based on the training pro- files ( n = 5 shared tissue types) in our curated datasets, and the GTEx profiles are projected to the common subspace for querying. Consider- ing GTEx profiles belonging to the shared tis- sue types, an overall precision (mAP) ≥ 0.98 is achieved (Figure 5.7). Matching tissue types from close and distant species were correctly identified for different GTEx profiles (Figure 5.6 75 b; Appendix Figure D.6). We also compared our cross-species search performance with the HO GSVD algorithm (Ponnapalli et al., 2011). Our approach achieved a higher accuracy for differ- ent species sets (Appendix Figures 5.8 a-b). A summary of the OSBF-based cross-species query strategy compared to existing gene expression search methods is shown in Appendix Table D.7. a) b) 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 primates mammals bird vertebrates fly OSBF HO GSVD Recall Precision 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 primates mammals bird vertebrates fly False positive rate True positive rate OSBF HO GSVD Figure 5.8: Comparison of cross-species search performance with HO GSVD: Comparison of search performance for OSBF vs. HO GSVD approach. a) The precision-recall curve for the cross-species search for the two methods. For OSBF, AUC values for the primates, mammals, bird, vertebrates, and fly are 0.92, 0.89, 0.87, 0.81 and 0.79, respectively; while for HO GSVD the AUC values are 0.71, 0.60, 0.60, 0.48 and 0.40. b) The ROC curve for the cross-species search for the two methods. For OSBF, AUC values for primates, mammals, bird, vertebrates, and fly are 0.99, 0.99, 0.99, 0.98, and 0.97, respectively; while for HO GSVD the AUC values are 0.94, 0.92, 0.92, 0.86, and 0.83. 5.4.6 Dimension of the common expression subspace For the joint analysis of gene expression profiles from a divergent set of species, a limited number of functionally similar tissue types are available in the public databases. The number of shared functionally equivalent tissues (n) among k species determines the dimension of the OSBF com- mon subspace and, thus, the dimension of the projected expression profiles ( n× 1). We observed that the proportion of true positives retrieved as the top hits increases withn (Appendix Figures D.7 76 a-b), and the overall performance of the cross-species search also improves (Appendix Figures D.7 c-d). Thus capturing diverse inter-tissue relationships provides a higher resolution in accurately detecting the phenotype of new profiles in the common subspace. 5.5 Discussion In this study, we introduced a cross-species gene expression query strategy based on OSBF. Using expression values of complete genes from individual species, we can query profiles across multiple species without relying on gene mappings information between species. Our procedure does not require prior selection of a subset of genes such as differentially expressed genes, highly expressed genes, and ad hoc cutoffs. Instead, the OSBF estimates the common expression subspace shared by all species and provides a co-ordinate system to compare expression profiles from multiple species without gene mapping between species. By using the actual expression values, our approach is a “content-based” search rather than being based on a set of gene names or metadata information. The OSBF-based search uses simple distance metrics such as Euclidean distance and cosine dis- tance to calculate the similarity between expression profiles. We have shown that our approach achieves high accuracy, especially for comparisons involving distant species, using testing datasets from 17 species in our curated datasets, GTEx consortium, and other publicly available expression profiles from distant species. The approach is also helpful for phenotypes (like tissue types) that are not involved in estimating the species-independent space. In summary, our method effectively uses prior knowledge available in existing public datasets to accurately identify invariant biological features in close and distant species. It can be employed for joint analysis of expression datasets from multiple species and pairwise analysis. The limiting factor for joint analysis of gene expression data from multiple species in our ap- proach is the availability of functionally similar tissues. The study shows that the accuracy of cross- species queries using OSBF increases with the number of shared tissue types. Recently, single-cell genomics has made complete transcriptomic profiles of different species available (Fincher et al., 77 2018; Farrell et al., 2018). This provides an unprecedented opportunity to curate a rich collection of equivalent cell types across a wide array of species. Extending our approach from tissue to cell type resolution could increase the number of functionally equivalent profiles across different species. We believe the cross-species query using OSBF will be a valuable tool for researchers to characterize new expression profiles with further improvements and additions. 5.6 Conclusion We have demonstrated that OSBF common subspace summarizes gene expression patterns shared by different species and maps cellular phenotypes into a common coordinate system. Projecting individual expression profiles to the OSBF common subspace allows efficient comparison of cross- species gene expression data without using any gene mapping between species. In summary, OSBF has the potential to be used as an effective computational approach for querying gene expression datasets generated from diverse species. 78 Chapter 6 Summary and Future work The overarching goal of this dissertation is to enable efficient gene expression comparisons across species. In this thesis, we developed a novel joint-matrix factorization algorithm that compares multiple high-dimensional datasets tabulated as large-scale matrices with different numbers of rows. In chapter 2, we discussed that the principal component analysis using singular value de- composition is a common dimensionality reduction approach for high-dimensional datasets. The PCA approach is readily interpretable as individual principal components are estimated to maxi- mize the variance in the data. In this study, we extend this idea to develop a novel joint matrix factorization to reduce multiple high-dimensional matrices to estimate a shared subspace jointly. 6.1 Summary of OSBF Given a set of real matrices D i with the same number of columns and, in general, with different numbers of rows, OSBF decomposes individual matrices as D i = U i ∆ i V T + ϵ i . The left basis matrixU i is a data-specific matrix with orthonormal columns. Each ∆ i is a positive definite diag- onal matrix. The right basis matrix V is an orthogonal matrix identical in all factorizations. The shared V is estimated using the eigensystem EV = VΛ of the expected inter-sample correlation matrixE. Theϵ i is the decomposition error associated with each factorization. The right basis matrix V is a subspace of row space (R n ) shared by all input matrices rep- 79 resenting the correlation relationships between the columns. The factors are estimated such that dimension 1 of the common subspace represents the maximum correlation relationship between the columns of the input matrices, dimension 2 with the next highest correlation relationship and independent of the first dimension, and so on. The entries of the diagonal matrix indicate the amount of correlation information represented by dimensions of the common subspace. OSBF is usually estimated using a set of input matrices where there is a one-to-one correspondence between columns of input matrices. The fact that the V is shared in all species and represents the corre- lation relationship between the columns guarantees that the individual dimensions of this space represents a similar property across all input matrices. Each left basis matrix stores the loading values associated with different dimensions of the common subspace for all the features (genes) present in the rows of the input matrix. The columns of the left basis matrix correspond to load- ings for an independent linear combination of genes that transforms input matrices to the common subspace. Using the loading values, the features that are important for the individual dimensions of the common subspace can be determined. In OSBF, the factors are estimated such that the reconstructed matrices lie close to the input data matrices. We employ an iterative algorithm to estimate the unknown factors: U i , ∆ i , and, V that minimize the total factorization error. Given the input matrices, we update one of the unknown factors using the other two unknown factors in each iteration. The process is repeated until the factorization error cannot be decreased further. In our update strategy, we make the optimal update for each factor in each iteration, reducing the error, and guaranteeing convergence. In the case of a single matrix, the procedure reduces to a singular value decomposition of the column standardized data. To summarize, OSBF utilizes the inter-sample correlation structure within the data to estimate a shared reduced dimensional subspace and data-specific factors. The estimated factors are in- terpretable and allow efficient comparisons of multiple high-dimensional datasets represented as matrices with a varying number of rows. 80 6.2 Summary of cross-species gene expression analysis The second part of this dissertation illustrates the application of OSBF in analyzing gene expression profiles from multiple species without requiring the mapping of genes of one species onto those of another. To identify genes important for the phenotype from cross-species gene expression data, the traditional homology-based approaches use a “orthologous gene selection - compare phenotype - gene selection” strategy. In these approaches, an inherent assumption of functional equivalence be- tween orthologs of two species is made. There are many known examples of functional divergence of orthologs, cases of non-orthologous genes performing similar functions, and rewiring and re- purposing of genes and proteins (Koonin et al., 1996; Gabald´ on & Koonin, 2013; Liu et al., 2010; Kajala et al., 2021). These evidences imply that many independent strategies exist in nature to achieve the same function, and orthology is not a necessary criterion to perform equivalent func- tions. Species-divergence time also dramatically influences the number of shared orthologous genes between species. As the divergence time between two species increases, the number of orthologous genes shared decreases. Unless the species are closely related, the homology-based approaches fail to account for a significant amount of information from non-orthologous genes. In contrast, the OSBF-based cross-species analysis employs a “phenotype to genes” approach to identify functionally relevant genes. Identifying functionally similar phenotypes across species, such as cell type, tissue, or a matching developmental stage, is comparatively easier than determin- ing conserved molecular features. In OSBF, the gene expression profiles of functionally matching phenotypes from different species are jointly factorized to estimate species-specific factors and a shared factor. No prior assumption of functional equivalence of genes across species is made. The relative weights of genes in the species-specific factors are distributed such that the genes with high loadings represent similar biological functions in all species. Thus, the genes with significant contributions to the phenotype can be easily identified without using orthology mapping between species. Using OSBF, we compared gene expression profiles of six tissues from eight species, includ- 81 ing primates, rodents, mammals, and bird. The analysis identified a comprehensive list of genes essential for the tissue phenotype, independent of their annotation and homology classification. The tissue-relevant genes include known markers, non-coding genes, pseudogenes, and other less- studied genes. More than 40% of the genes important for tissue phenotype are not one-to-one orthologs shared in all species. Furthermore, 6-10% of these genes do not have orthologs in other species. Several examples of gene expression divergence of orthologs that can be identified by comparing homologs are also detected in our analysis. The similarity of conserved biological processes across species allows gene expression profiles to be defined by a species-independent lower dimensional subspace represented by the shared factor. This shared factor, which we call a common expression subspace, is estimated based on the mean correlation relationship between the phenotypes across species. Independent of the number of genes annotated in a species, the common expression subspace is an-dimensional space, where n is the number of functionally similar expression profiles we start with. The gene expression profiles from different species projected into the common subspace have identical dimensions (n× 1). The species independence and the identical dimension of the projected profiles in the common subspace enable efficient cross-species gene expression search. The cross-species query using OSBF does not require prior selection of a subset of genes such as differentially expressed genes, highly expressed genes, and ad hoc cutoffs. We have shown that OSBF-based query achieves high accuracy, especially for comparisons involving distant species, using testing datasets from 17 species, GTEx consortium, and other publicly available expression profiles from distant species. We also showed that the cross-species gene expression query in OSBF common subspace outper- forms the search performed using the HO GSVD common factor. In summary, the method provides us an opportunity to utilize the massive gene expression data accumulating in public repositories for data-driven discovery. 82 6.3 Additional applications of the OSBF Cell type identification in single-cell RNA sequencing (scRNA-Seq) data: scRNA-Seq is trans- forming our ability to characterize and profile diverse cell populations in different non-model or- ganisms. Determining the state and identity of cells from a complex cell mixture is one of the core applications of scRNA-Seq. The common approach for cell type assignment is to first apply an unsupervised clustering to partition cells based on the similarity of their gene expression patterns and assign a cell type to each cluster based on the differentially expressed marker genes (Pasquini et al., 2021). Traditional marker genes are insufficient as the basis for defining cell types. The molecular classification of cells into sub-groups (i.e.: cell types) is based on a limited number of characteris- tic marker genes, which are often used for ontology or pathway enrichment tests or by experts for identification. Identification of cell type-specific gene markers from literature and curated knowl- edge databases requires manual strategies. Moreover, determining cell type identity based on a few marker genes affects annotation accuracy due the following; a) Many marker genes being expressed in more than one cell type b) Marker gene discovery can often be protocol and technology-specific. For example, droplet technologies often fail to amplify small genes. The OSBF common expression subspace provides a uniform coordinate system to compare global expression profiles across species. Using functionally similar cell types from different species available in the public databases, the species-specific factors can be estimated to project expression profiles to the common expression subspace. The common expression subspace effec- tively summarizes the expression of all genes. Performing queries in this space could be a potential improvement compared to cell type annotation using a handful of marker genes. Integrated analysis of multiple genomics datasets: Next-generation sequencing technologies have facilitated the characterization of biological systems at multiple levels, e.g., epigenomics, transcriptomics, proteomics, and metabolomics. Different consortium projects have generated genome-wide data from multiple assays for the same samples. For example, The Cancer Genome 83 Atlas (TCGA) project has profiled single-nucleotide polymorphisms, DNA methylation, gene ex- pression, and micro-RNA expression for the same set of tumor samples. As sequencing cost de- creases, genome-wide characterization of multiple aspects of the same set of samples will become standard practice. Integrated analysis of different genomics data requires mathematical frameworks that can han- dle differences in the units of the data and the number of features. As data are generated from the same set of samples, the global correlation relationship between the biological samples will remain similar across different assays. We can leverage this property to analyze data from multiple assays using OSBF jointly. The method does not require the matching of features between datasets. It can be used to identify a subset of features from different types of measurements that are important for the phenotype. 84 Bibliography Achim K, Eling N, Vergara HM, Bertucci PY , Musser J, V opalensky P, Brunet T, Collier P, Benes V , Marioni JC et al. (2018) Whole-body single-cell sequencing reveals transcriptional domains in the annelid larval body. Molecular Biology and Evolution 35:1047–1062. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sci- ences 97:10101–10106. Alter O, Brown PO, Botstein D (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences 100:3351–3356. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215:403–410. Altschul SF, Madden TL, Sch¨ affer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Re- search 25:3389–3402. Alwine JC, Kemp DJ, Stark GR (1977) Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proceedings of the National Academy of Sciences 74:5350–5354. Anders S, Pyl PT, Huber W (2015) HTSeq a Python framework to work with high-throughput sequencing data. Bioinformatics 31:166–169. Andersen PS, Hedley PL, Page SP, Syrris P, Moolman-Smook JC, McKenna WJ, Elliott PM, Christiansen M (2012) A novel myosin essential light chain mutation causes hypertrophic car- diomyopathy with late onset and low expressivity. Biochemistry Research International 2012. Arber S, Hunter JJ, Ross Jr J, Hongo M, Sansig G, Borg J, Perriard JC, Chien KR, Caroni P (1997) Mlp-deficient mice exhibit a disruption of cardiac cytoarchitectural organization, dilated cardiomyopathy, and heart failure. Cell 88:393–403. Arendt D, Musser JM, Baker CV , Bergman A, Cepko C, Erwin DH, Pavlicev M, Schlosser G, Widder S, Laubichler MD et al. (2016) The origin and evolution of cell types. Nature Reviews Genetics 17:744–757. 85 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al. (2000) Gene ontology: tool for the unification of biology. Nature Genet- ics 25:25–29. Aslani F, Modarresi MH, Soltanghoraee H, Akhondi MM, Shabani A, Lakpour N, Sadeghi MR (2011) Seminal molecular markers as a non-invasive diagnostic tool for the evalua- tion of spermatogenesis in non-obstructive azoospermia. Systems Biology in Reproductive Medicine 57:190–196. Aversa R, Sorrentino A, Esposito R, Ambrosio MR, Amato A, Zambelli A, Ciccodicola A, D’Apice L, Costa V (2016) Alternative splicing in adhesion-and motility-related genes in breast cancer. International Journal of Molecular Sciences 17:121. Baker K, Gordon SL, Melland H, Bumbak F, Scott DJ, Jiang TJ, Owen D, Turner BJ, Boyd SG, Rossi M et al. (2018) Syt1-associated neurodevelopmental disorder: a case series. Brain 141:2576–2591. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY , Gueroussov S, Lee LJ, Slobodeniuc V , Kutter C, Watt S, C ¸ olak R et al. (2012) The evolutionary landscape of alternative splicing in vertebrate species. Science 338:1587–1593. Bergmann S, Ihmels J, Barkai N, Eisen M (2004) Similarities and differences in genome-wide expression data of six organisms. PLOS Biology 2:e9. Bertagnolli NM, Drake JA, Tennessen JM, Alter O (2013) SVD identifies transcript length dis- tribution functions from DNA microarray data and reveals evolutionary forces globally affecting GBM metabolism. PLOS One 8:e78913. Boer PH, Adra CN, Lau Y , McBURNEY MW (1987) The testis-specific phosphoglycerate kinase gene pgk-2 is a recruited retroposon. Molecular and Cellular Biology 7:3107–3112. Boison D, Bussow H, D’Urso D, Muller H, Stoffel W (1995) Adhesive properties of prote- olipid protein are responsible for the compaction of cns myelin sheaths. Journal of Neuro- science 15:5502–5513. Boley N, Stoiber MH, Booth BW, Wan KH, Hoskins RA, Bickel PJ, Celniker SE, Brown JB (2014) Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nature Biotechnology 32:341–346. Borges K, Dingledine R (1998) Ampa receptors: molecular and functional diversity. Progress in Brain Research 116:153–170. Brawand D, Soumillon M, Necsulea A, Julien P, Cs´ ardi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M et al. (2011) The evolution of gene expression levels in mammalian organs. Nature 478:343. Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG et al. (2003) ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research 31:68–71. 86 Breschi A, Djebali S, Gillis J, Pervouchine DD, Dobin A, Davis CA, Gingeras TR, Guig´ o R (2016) Gene-specific patterns of expression variation across organs and species. Genome Biol- ogy 17:151. Brown AC, Hallouane D, Mawby WJ, Karet FE, Saleem MA, Howie AJ, Toye AM (2009) RhCG is the major putative ammonia transporter expressed in the human kidney, and RhBG is not expressed at detectable levels. American Journal of Physiology-Renal Physiol- ogy 296:F1279–F1290. Burkhoff AM, Linemeyer DL, Salon JA (1998) Distribution of a novel hypothalamic neuropep- tide Y receptor gene and its absence in rat. Molecular Brain Research 53:311–316. Byrne LM, Rodrigues FB, Blennow K, Durr A, Leavitt BR, Roos RA, Scahill RI, Tabrizi SJ, Zetterberg H, Langbehn D et al. (2017) Neurofilament light protein in blood as a potential biomarker of neurodegeneration in Huntington’s disease: a retrospective cohort analysis. The Lancet Neurology 16:601–609. Cao J, Packer JS, Ramani V , Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ et al. (2017) Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357:661–667. Cardoso-Moreira M, Halbert J, Valloton D, Velten B, Chen C, Shao Y , Liechti A, Ascenc ¸˜ ao K, Rummel C, Ovchinnikova S et al. (2019) Gene expression across mammalian organ development. Nature 571:505–509. Carelli FN, Liechti A, Halbert J, Warnefors M, Kaessmann H (2018) Repurposing of promoters and enhancers during mammalian evolution. Nature Communications 9:4066. Ceriani L, Verme P (2012) The origins of the Gini index: extracts from Variabilit` a e Mutabilit` a (1912) by Corrado Gini. The Journal of Economic Inequality 10:421–443. Chen J, Guo M, Wang X, Liu B (2018) A comprehensive review and comparison of differ- ent computational methods for protein remote homology detection. Briefings in Bioinformat- ics 19:231–244. Chen S, Zhang YE, Long M (2010) New genes in drosophila quickly become essential. sci- ence 330:1682–1685. Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Lang- mead B, Leek JT (2017) Reproducible RNA-seq analysis using recount2. Nature Biotechnol- ogy 35:319–321. Comon P (1994) Independent component analysis, a new concept? Signal Process- ing 36:287–314. Costa FF (2014) Big data in biomedicine. Drug Discovery Today 19:433–440. Cummings BB, Marshall JL, Tukiainen T, Lek M, Donkervoort S, Foley AR, Bolduc V , Waddell LB, Sandaradura SA, O’Grady GL et al. (2017) Improving genetic diagnosis in mendelian disease with transcriptome sequencing. Science Translational Medicine 9. 87 Cunningham F, Achuthan P, Akanni W, Allen J, Amode MR, Armean IM, Bennett R, Bhai J, Billis K, Boddu S et al. (2019) Ensembl 2019. Nucleic Acids Research 47:D745–D751. Darwin C (1859) On the origin of species london. J. Murray . Dayhoff MO (1969) Computer analysis of protein evolution. Scientific American 221:86–95. De Magalh˜ aes JP, Curado J, Church GM (2009) Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics 25:875–881. Devalla HD, G´ elinas R, Aburawi EH, Beqqali A, Goyette P, Freund C, Chaix MA, Tadros R, Jiang H, Le B´ echec A et al. (2016) TECRL, a new life-threatening inherited arrhythmia gene associated with overlapping clinical features of both LQTS and CPVT. EMBO Molecular Medicine 8:1390–1408. Devarajan K (2008) Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLOS Computational Biology 4:e1000029. Djureinovic D, Fagerberg L, Hallstr¨ om B, Danielsson A, Lindskog C, Uhl´ en M, Pont´ en F (2014) The human testis-specific proteome defined by transcriptomics and antibody-based profiling. Molecular Human Reproduction 20:476–488. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21. Dolinski K, Botstein D (2007) Orthology and functional conservation in eukaryotes. Annual Review of Genetics 41:465–507. Dong L, Pietsch S, Tan Z, Perner B, Sierig R, Kruspe D, Groth M, Witzgall R, Gr¨ one HJ, Platzer M et al. (2015) Integration of cistromic and transcriptomic analyses identifies nphs2, mafb, and magi2 as wilms’ tumor 1 target genes in podocyte differentiation and maintenance. Journal of the American Society of Nephrology 26:2118–2128. Doolittle RF (1981) Similar amino acid sequences: chance or common ancestry? Sci- ence 214:149–159. Durinck S, Spellman PT, Birney E, Huber W (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols 4:1184. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30:207–210. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome- wide expression patterns. Proceedings of the National Academy of Sciences 95:14863–14868. Eisenberg E, Levanon EY (2013) Human housekeeping genes, revisited. Trends in Genet- ics 29:569–574. 88 Elhaik E, Sabath N, Graur D (2006) The “inverse relationship between evolutionary rate and age of mammalian genes” is an artifact of increased genetic distance with rate of evolution and time of divergence. Molecular Biology and Evolution 23:1–3. Emrich SJ, Barbazuk WB, Li L, Schnable PS (2007) Gene discovery and annotation using LCM- 454 transcriptome sequencing. Genome Research 17:69–73. Engreitz JM, Chen R, Morgan AA, Dudley JT, Mallelwar R, Butte AJ (2011) ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression. Bioinformatics 27:3317–3318. Engreitz JM, Daigle Jr BJ, Marshall JJ, Altman RB (2010) Independent component analysis: mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics 43:932–944. Epstein W, Beckwith JR (1968) Regulation of gene expression. Annual Review of Biochem- istry 37:411–436. Fagerberg L, Hallstr¨ om BM, Oksvold P, Kampf C, Djureinovic D, Odeberg J, Habuka M, Tah- masebpoor S, Danielsson A, Edlund K et al. (2014) Analysis of the human tissue-specific expres- sion by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & Cellular Proteomics 13:397–406. Faggioni M, Knollmann BC (2012) Calsequestrin 2 and arrhythmias. American Journal of Physiology-Heart and Circulatory Physiology 302:H1250–H1260. Farrell JA, Wang Y , Riesenfeld SJ, Shekhar K, Regev A, Schier AF (2018) Single-cell recon- struction of developmental trajectories during zebrafish embryogenesis. Science 360:eaar3131. Fearn A, Allison B, Rice SJ, Edwards N, Halbritter J, Bourgeois S, Pastor-Arroyo EM, Hilde- brandt F, Tasic V , Wagner CA et al. (2018) Clinical, biochemical, and pathophysiological analysis of slc 34a1 mutations. Physiological Reports 6:e13715. Felsenstein J (1985) Phylogenies and the comparative method. The American Natural- ist 125:1–15. Fincher CT, Wurtzel O, de Hoog T, Kravarik KM, Reddien PW (2018) Cell type transcriptome atlas for the planarian schmidtea mediterranea. Science 360:eaaq1736. Finn RD, Clements J, Eddy SR (2011) Hmmer web server: interactive sequence similarity search- ing. Nucleic Acids Research 39:W29–W37. Fitch WM (1970) Distinguishing homologous from analogous proteins. Systematic Zool- ogy 19:99–113. Fitch WM, Margoliash E (1967) Construction of phylogenetic trees. Science 155:279–284. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512. 89 Fontes M, Soneson C (2011) The projection score-an evaluation criterion for variable subset selection in PCA visualization. BMC Bioinformatics 12:307. Franco AT, Malaguarnera R, Refetoff S, Liao XH, Lundsmith E, Kimura S, Pritchard C, Marais R, Davies TF, Weinstein LS et al. (2011) Thyrotrophin receptor signaling dependence of braf-induced thyroid tumor initiation in mice. Proceedings of the National Academy of Sci- ences 108:1615–1620. Frank D, Rangrez AY , Friedrich C, Dittmann S, Stallmeyer B, Yadav P, Bernt A, Schulze-Bahr E, Borlepawar A, Zimmermann WH et al. (2019) Cardiac α -actin (ACTC1) gene mutation causes atrial-septal defects associated with late-onset dilated cardiomyopathy. Circulation: Genomic and Precision Medicine 12:e002491. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397–404. Fujibuchi W, Kiseleva L, Taniguchi T, Harada H, Horton P (2007) CellMontage: similar expres- sion profile search server. Bioinformatics 23:3103–3104. Fujita N, Mizuarai S, Murakami K, Nakai K (2018) Biomarker discovery by integrated joint non-negative matrix factorization and pathway signature analyses. Scientific Reports 8:1–10. Gabald´ on T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nature Reviews Genetics 14:360–366. Gaitanou M, Segklia K, Matsas R (2019) CEND1, a story with many tales: from regulation of cell cycle progression/exit of neural stem cells to brain structure and function. Stem Cells International 2019. Galperin MY , Koonin EV (2012) Divergence and convergence in enzyme evolution. Journal of Biological Chemistry 287:21–28. Gao L, Zhang J (2003) Why are some human disease-associated mutations fixed in mice? Trends in Genetics 19:678–681. Gelfman S, Burstein D, Penn O, Savchenko A, Amit M, Schwartz S, Pupko T, Ast G (2012) Changes in exon-intron structure during vertebrate evolution affect the splicing pattern of exons. Genome Research 22:35–50. Gene Ontology Consortium (2019) The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Research 47:D330–D338. Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ et al. (2014) Comparative analysis of the transcriptome across distant species. Na- ture 512:445–448. Gibney E, Nolan C (2010) Epigenetics and gene expression. Heredity 105:4–13. 90 Ginis I, Luo Y , Miura T, Thies S, Brandenberger R, Gerecht-Nir S, Amit M, Hoke A, Carpenter MK, Itskovitz-Eldor J et al. (2004) Differences between human and mouse embryonic stem cells. Developmental Biology 269:360–380. Goh AM, Coffill CR, Lane DP (2011) The role of mutant p53 in human cancer. The Journal of Pathology 223:116–126. Gong K, Xie T, Luo Y , Guo H, Chen J, Tan Z, Yang Y , Xie L (2021) Comprehensive analysis of lncRNA biomarkers in kidney renal clear cell carcinoma by lncRNA-mediated ceRNA network. PLOS ONE 16:e0252452. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Ray- chowdhury R, Zeng Q et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29:644–652. Grau J, Grosse I, Keilwagen J (2015) PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 31:2595–2597. Graveley BR, Brooks AN, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri CG, van Baren MJ, Boley N, Booth BW et al. (2011) The developmental transcriptome of Drosophila melanogaster. Nature 471:473–479. Gu Z, Rifkin SA, White KP, Li WH (2004) Duplicate genes increase gene expression diversity within and between species. Nature Genetics 36:577–579. Gu Z, Eils R, Schlesner M (2016) Complex heatmaps reveal patterns and correlations in multidi- mensional genomic data. Bioinformatics 32:2847–2849. Gutierrez E, Wiggins D, Fielding B, Gould AP (2007) Specialized hepatocyte-like cells regulate drosophila lipid metabolism. Nature 445:275. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C et al. (2010) Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincrnas. Nature Biotechnology 28:503–510. Hallmann K, Kudin AP, Zsurka G, Kornblum C, Reimann J, St¨ uve B, Waltz S, Hattingen E, Thiele H, N¨ urnberg P et al. (2016) Loss of the smallest subunit of cytochrome c oxidase, COX8A, causes Leigh-like syndrome and epilepsy. Brain 139:338–345. Han P, Li W, Lin CH, Yang J, Shang C, Nurnberg ST, Jin KK, Xu W, Lin CY , Lin CJ et al. (2014) A long noncoding RNA protects the heart from pathological hypertrophy. Nature 514:102–106. Hart PE, Glantz JN, Orth JD, Poynter GM, Salisbury JL (1999) Testis-specific murine centrin, cetn1: genomic characterization and evidence for retroposition of a gene encoding a centrosome protein. Genomics 60:111–120. He J, Xue Y , Wang Q, Zhou X, Liu L, Zhang T, Shang C, Ma J, Ma T (2020) Long non- coding RNA MIAT regulates blood tumor barrier permeability by functioning as a competing endogenous RNA. Cell Death & Disease 11:1–18. 91 Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proceed- ings of the National Academy of Sciences 89:10915–10919. Herr JC, Thomas D, Bush LA, Coonrod S, Khole V , Howards SS, Flickinger CJ (1999) Sperm mitochondria-associated cysteine-rich protein (SMCP) is an autoantigen in Lewis rats. Biology of Reproduction 61:428–435. Heyn P, Kalinka AT, Tomancak P, Neugebauer KM (2015) Introns and gene expression: cellular constraints, transcriptional regulation, and evolutionary consequences. Bioessays 37:148–154. Hirano R, Interthal H, Huang C, Nakamura T, Deguchi K, Choi K, Bhattacharjee MB, Arimura K, Umehara F, Izumo S et al. (2007) Spinocerebellar ataxia with axonal neuropathy: consequence of a Tdp1 recessive neomorphic mutation? The EMBO journal 26:4732–4743. Holland PM, Abramson RD, Watson R, Gelfand DH (1991) Detection of specific polymerase chain reaction product by utilizing the 5’—-3’exonuclease activity of Thermus aquaticus DNA polymerase. Proceedings of the National Academy of Sciences 88:7276–7280. Hounkpe BW, Chenou F, de Lima F, De Paula EV (2021) HRT Atlas v1. 0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Research 49:D947–D955. HPA (v21.0) The Human Protein Atlas. Huminiecki L, Wolfe KH (2004) Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Research 14:1870–1879. Hurd PJ, Nelson CJ (2009) Advantages of next-generation sequencing versus the microarray in epigenetic research. Briefings in Functional Genomics and Proteomics 8:174–183. Huxley TH (1860) The origin of species. Westminster Review 17:1862. Iguchi T, Niino N, Tamai S, Sakurai K, Mori K (2017) Comprehensive analysis of circulating microRNA specific to the liver, heart, and skeletal muscle of cynomolgus monkeys. International Journal of Toxicology 36:220–228. Ihmels J, Levy R, Barkai N (2004) Principles of transcriptional control in the metabolic network of saccharomyces cerevisiae. Nature Biotechnology 22:86–92. Jiang H, Chen H, Wan P, Song S, Chen N (2020) Downregulation of enhancer RNA EMX2OS is associated with poor prognosis in kidney renal clear cell carcinoma. Aging (Albany NY) 12:25865. Jung AC, Denholm B, Skaer H, Affolter M (2005) Renal tubule development in Drosophila: a closer look at the cellular level. Journal of the American Society of Nephrology 16:322–328. Kabir MH, Djordjevic D, O’Connor MD, Ho JW (2018) C3: An R package for cross-species compendium-based cell-type identification. Computational Biology and Chemistry 77:187–192. 92 Kajala K, Gouran M, Shaar-Moshe L, Mason GA, Rodriguez-Medina J, Kawa D, Pauluzzi G, Reynoso M, Canto-Pastor A, Manzano C et al. (2021) Innovation, conservation, and repurposing of gene function in root cell type development. Cell 184:3333–3348. Kanehisa M, Furumichi M, Tanabe M, Sato Y , Morishima K (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45:D353–D361. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28:27–30. Kato Y , Nozaki M (2012) Distinct DNA Methylation Dynamics of Spermatogenic Cell-Specific Intronless Genes Is Associated with CpG Content. PLOS ONE 7:1–10. Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM, Eizinger A, Wylie BN, Davidson GS (2001) A gene expression map for Caenorhabditis elegans. Science 293:2087–2092. Kirita Y , Wu H, Uchimura K, Wilson PC, Humphreys BD (2020) Cell profiling of mouse acute kidney injury reveals conserved cellular responses to injury. Proceedings of the National Academy of Sciences 117:15874–15883. Kiyonaga-Endou K, Oshima M, Sugimoto K, Thomas M, Taketani S, Araki M (2016) Localiza- tion of neurensin1 in cerebellar purkinje cells of the developing chick and its possible function in dendrite formation. Brain Research 1635:113–120. Klotz R, Thomas A, Teng T, Han SM, Iriondo O, Li L, Restrepo-Vassalli S, Wang A, Izadian N, MacKay M et al. (2020) Circulating tumor cells exhibit metastatic tropism and reveal brain metastasis drivers. Cancer Discovery 10:86–103. Klumpp S, Scott M, Pedersen S, Hwa T (2013) Molecular crowding limits translation and cell growth. Proceedings of the National Academy of Sciences 110:16754–16759. Konrad M, Nijenhuis T, Ariceta G, Bertholet-Thomas A, Calo LA, Capasso G, Emma F, Schling- mann KP, Singh M, Trepiccione F et al. (2021) Diagnosis and management of bartter syndrome: executive summary of the consensus and recommendations from the european rare kidney disease reference network working group for tubular disorders. Kidney International 99:324–335. Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Reviews Microbiology 1:127–136. Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics. Annual Review of Genet- ics 39:309–338. Koonin EV , Mushegian AR, Bork P (1996) Non-orthologous gene displacement. Trends in Genetics 12:334–336. Kremer LS, Bader DM, Mertes C, Kopajtich R, Pichler G, Iuso A, Haack TB, Graf E, Schwarz- mayr T, Terrile C et al. (2017) Genetic diagnosis of Mendelian disorders via RNA sequencing. Nature Communications 8:1–11. 93 Kuzniar A, van Ham RC, Pongor S, Leunissen JA (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends in Genetics 24:539–551. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A (2018) Massive mining of publicly available rna-seq data from human and mouse. Nature Communications 9:1–10. Le HS, Oltvai ZN, Bar-Joseph Z (2010) Cross-species queries of large gene expression databases. Bioinformatics 26:2416–2423. Lechner MS, Dressler GR (1997) The molecular basis of embryonic kidney development. Mech- anisms of Development 62:105–120. Lee D, Seung HS (2000) Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems 13. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The sva package for remov- ing batch effects and other unwanted variation in high-throughput experiments. Bioinformat- ics 28:882–883. Lefebvre C, Aude JC, Gl´ emet E, N´ eri C (2005) Balancing protein similarity and gene co- expression reveals new links between genetic conservation and developmental diversity in in- vertebrates. Bioinformatics 21:1550–1558. Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H (2014) UpSet: visualization of inter- secting sets. IEEE Transactions on Visualization and Computer Graphics 20:1983–1992. Li CM, Klevecz RR (2006) A rapid genome-scale response of the transcriptional oscillator to perturbation reveals a period-doubling path to phenotypic change. Proceedings of the National Academy of Sciences 103:16254–16259. Li MX, Hwang PM (2015) Structure and function of cardiac troponin C (TNNC1): Implications for heart failure, cardiomyopathies, and troponin modulating drugs. Gene 571:153–166. Li Y , Lin B, Yang L (2015) Comparative transcriptomic analysis of multiple cardiovascular fates from embryonic stem cells predicts novel regulators in human cardiogenesis. Scientific Reports 5:1–16. Li Y , Sun N, Lu Z, Sun S, Huang J, Chen Z, He J (2017) Prognostic alternative mrna splicing signature in non-small cell lung cancer. Cancer Letters 393:40–51. Liao BY , Zhang J (2008) Null mutations in human and mouse orthologs frequently result in different phenotypes. Proceedings of the National Academy of Sciences 105:6987–6992. Liebermeister W (2002) Linear modes of gene expression determined by independent component analysis. Bioinformatics 18:51–60. 94 Lin L, Park JW, Ramachandran S, Zhang Y , Tseng YT, Shen S, Waldvogel HJ, Curtis MA, Faull RL, Troncoso JC et al. (2016) Transcriptome sequencing reveals aberrant alternative splicing in Huntington’s disease. Human Molecular Genetics 25:3454–3466. Lindstr¨ om NO, Tran T, Guo J, Rutledge E, Parvez RK, Thornton ME, Grubbs B, McMahon JA, McMahon AP (2018) Conserved and divergent molecular and anatomic features of human and mouse nephron patterning. Journal of the American Society of Nephrology 29:825–840. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR (2008) Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133:523–536. Liu N, Bezprozvannaya S, Williams AH, Qi X, Richardson JA, Bassel-Duby R, Olson EN (2008) microRNA-133a regulates cardiomyocyte proliferation and suppresses smooth muscle gene ex- pression in the heart. Genes & Development 22:3242–3254. Liu SJ, Nowakowski TJ, Pollen AA, Lui JH, Horlbeck MA, Attenello FJ, He D, Weissman JS, Kriegstein AR, Diaz AA et al. (2016) Single-cell analysis of long non-coding RNAs in the developing human neocortex. Genome Biology 17:1–17. Liu W, Zhao Y , Liu X, Zhang X, Ding J, Li Y , Tian Y , Wang H, Liu W, Lu Z (2021) A Novel Meiosis-Related lncRNA, Rbakdn, Contributes to Spermatogenesis by Stabilizing Ptbp2. Fron- tiers in Genetics p. 1963. Liu Y , McKenna E, Figueroa D, Blevins R, Austin C, Bennett P, Swanson R (2000) The hu- man inward rectifier K+ channel subunit kir5. 1 (KCNJ16) maps to chromosome 17q25 and is expressed in kidney and pancreas. Cytogenetic and Genome Research 90:60–63. Liu Y , Jiang M, Li C, Yang P, Sun H, Tao D, Zhang S, Ma Y (2011) Human t-complex protein 11 (TCP11), a testis-specific gene product, is a potential determinant of the sperm morphology. The Tohoku Journal of Experimental Medicine 224:111–117. Liu Z, Miner JJ, Yago T, Yao L, Lupu F, Xia L, McEver RP (2010) Differential regulation of human and murine p-selectin expression and function in vivo. Journal of Experimental Medicine 207:2975–2987. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N et al. (2013) The genotype-tissue expression (GTEx) project. Nature Genetics 45:580–585. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550. Lukk M, Kapushesky M, Nikkil¨ a J, Parkinson H, Goncalves A, Huber W, Ukkonen E, Brazma A (2010) A global map of human gene expression. Nature Biotechnology 28:322–324. Lynch M, Conery JS (2003) The evolutionary demography of duplicate genes. Genome Evolu- tion pp. 35–44. 95 Mali P, Kaipia A, Kangasniemi M, Toppari J, Sandberg M, Hecht N, Parvinen M (1989) Stage- specific expression of nucleoprotein mRNAs during rat and mouse spermiogenesis. Reproduction, Fertility and Development 1:369–382. Margoliash E (1963) Primary structure and evolution of cytochrome c. Proceedings of the Na- tional Academy of Sciences 50:672. Markadieu N, Delpire E (2014) Physiology and pathophysiology of SLC12A1/2 transporters. Pfl ¨ ugers Archiv-European Journal of Physiology 466:91–105. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10–12. Mearini G, Stimpel D, Geertz B, Weinberger F, Kr¨ amer E, Schlossarek S, Mourot-Filiatre J, Stoehr A, Dutsch A, Wijnker PJ et al. (2014) Mybpc3 gene therapy for neonatal cardiomyopathy enables long-term disease prevention in mice. Nature Communications 5:1–10. Meng J, Li L, Zhao Y , Zhou Z, Zhang M, Li D, Zhang CY , Zen K, Liu Z (2016) MicroRNA- 196a/b mitigate renal fibrosis by targeting TGF- β receptor 2. Journal of the American Society of Nephrology 27:3006–3021. Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL (2012) Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nature Biotechnology 30:99–104. Merkin J, Russell C, Chen P, Burge CB (2012) Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 338:1593–1599. Milardi D, Grande G, Vincenzoni F, Pierconti F, Pontecorvi A (2019) Proteomics for the identi- fication of biomarkers in testicular cancer–review. Frontiers in Endocrinology 10:462. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods 5:621–628. Mu M, Niu W, Zhang X, Hu S, Niu C (2020) LncRNA BCYRN1 inhibits glioma tumorigenesis by competitively binding with miR-619-5p to regulate CUEDC2 expression and the PTEN/AKT/p21 pathway. Oncogene 39:6879–6892. Mueller AJ, Canty-Laird EG, Clegg PD, Tew SR (2017) Cross-species gene modules emerge from a systems biology approach to osteoarthritis. NPJ Systems Biology and Applications 3:1–15. Mullis KB (1990) The unusual origin of the polymerase chain reaction. Scientific Ameri- can 262:56–65. Murdock DR, Dai H, Burrage LC, Rosenfeld JA, Ketkar S, M¨ uller MF, Y´ epez V A, Gagneur J, Liu P, Chen S et al. (2021) Transcriptome-directed analysis for Mendelian disease diagnosis overcomes limitations of conventional genomic testing. The Journal of Clinical Investigation 131. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The tran- scriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349. 96 Nemes AD, Ayasoufi K, Ying Z, Zhou QG, Suh H, Najm IM (2017) Growth associated protein 43 (GAP-43) as a novel target for the diagnosis, treatment and prevention of Epileptogenesis. Scientific Reports 7:1–13. Nestler EJ, Hyman SE (2010) Animal models of neuropsychiatric disorders. Nature Neuro- science 13:1161–1169. Nguyen DQ, Webber C, Ponting CP (2006) Bias of selection on human copy-number variants. PLOS Genetics 2:e20. Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Human Molecular Genetics 12:1313–1320. Obayashi T, Kagaya Y , Aoki Y , Tadaka S, Kinoshita K (2019) Coxpresdb v7: a gene coexpression database for 11 animal species supported by 23 coexpression platforms for technical evaluation and evolutionary inference. Nucleic Acids Research 47:D55–D62. Omelchenko MV , Galperin MY , Wolf YI, Koonin EV (2010) Non-homologous isofunctional en- zymes: a systematic analysis of alternative solutions in enzyme evolution. Biology direct 5:1–20. Osio A, Tan L, Chen SN, Lombardi R, Nagueh SF, Shete S, Roberts R, Willerson JT, Marian AJ (2007) Myozenin 2 is a novel gene for human hypertrophic cardiomyopathy. Circulation Research 100:766–768. Owen R (1843) Lectures on the comparative anatomy and physiology of the invertebrate animals. London . Pandey UB, Nichols CD (2011) Human disease models in Drosophila melanogaster and the role of the fly in therapeutic drug discovery. Pharmacological Reviews 63:411–436. Pasquini G, Arias JER, Sch¨ afer P, Busskamp V (2021) Automated methods for cell type annota- tion on scrna-seq data. Computational and Structural Biotechnology Journal 19:961–969. Pauling L, Zuckerkandl E (1963) Molecular restoration studies of extinct forms of life. Acta Chemica Scandinavica 17:S9–S16. Pearson WR (2013) An introduction to sequence similarity (“homology”) searching. Current protocols in Bioinformatics 42:3–1. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences 85:2444–2448. Platzer K, Lemke JR (2019) Grin1-related neurodevelopmental disorder. Ponnapalli SP, Saunders MA, Van Loan CF, Alter O (2011) A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLOS ONE 6:e28072. 97 Ponnusamy M, Liu F, Zhang YH, Li RB, Zhai M, Liu F, Zhou LY , Liu CY , Yan KW, Dong YH et al. (2019) Long noncoding RNA CPR (cardiomyocyte proliferation regulator) regulates cardiomyocyte proliferation and cardiac repair. Circulation 139:2668–2684. Quiring R, Walldorf U, Kloter U, Gehring WJ (1994) Homology of the eyeless gene of drosophila to the small eye gene in mice and aniridia in humans. Science 265:785–789. R Core Team (2022) R: A Language and Environment for Statistical Computing R Foundation for Statistical Computing, Vienna, Austria. Ringn´ er M (2008) What is principal component analysis? Nature Biotechnology 26:303–304. Ro S, Park C, Sanders KM, McCarrey JR, Yan W (2007) Cloning and expression profiling of testis-expressed microRNAs. Developmental Biology 311:592–602. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140. Sanchez-Martin I, Magalh˜ aes P, Ranjzad P, Fatmi A, Richard F, Manh TPV , Saurin AJ, Feuillet G, Denis C, Woolf AS et al. (2021) Haploinsufficiency of the mouse Tshz3 gene leads to kidney defects. Human Molecular Genetics . Sanuki R, Onishi A, Koike C, Muramatsu R, Watanabe S, Muranishi Y , Irie S, Uneo S, Koyasu T, Matsui R et al. (2011) miR-124a is required for hippocampal axogenesis and retinal cone survival through Lhx2 suppression. Nature Neuroscience 14:1125–1134. Schaum N, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, Darmanis S, Batson J, Botvinnik O, Chen MB et al. (2018) Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: The Tabula Muris Consortium. Nature 562:367. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470. Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N (2003) Module net- works: identifying regulatory modules and their condition-specific regulators from gene expres- sion data. Nature Genetics 34:166–176. Sharpless NE, DePinho RA (2006) The mighty mouse: genetically engineered mouse models in cancer drug development. Nature Reviews Drug Discovery 5:741–754. Sheikh F, Lyon RC, Chen J (2015) Functions of myosin light chain-2 (myl2) in cardiac muscle and disease. Gene 569:14–20. Shi Y , He M (2014) Differential gene expression identified by RNA-Seq and qPCR in two sizes of pearl oyster (Pinctada fucata). Gene 538:313–322. Skala SL, Wang X, Zhang Y , Mannan R, Wang L, Narayanan SP, Vats P, Su F, Chen J, Cao X et al. (2020) Next-generation RNA sequencing–based biomarker characterization of chromophobe re- nal cell carcinoma and related oncocytic neoplasms. European Urology 78:63–74. 98 Soares MB, Schon E, Henderson A, Karathanasis S, Cate R, Zeitlin S, Chirgwin J, Efstratiadis A (1985) RNA-mediated gene duplication: the rat preproinsulin I gene is a functional retroposon. Molecular and Cellular Biology 5:2090–2103. Soucy SM, Huang J, Gogarten JP (2015) Horizontal gene transfer: building the web of life. Nature Reviews Genetics 16:472–482. Stover E, Borthwick K, Bavalia C, Eady N, Fritz D, Rungroj N, Giersch A, Morton C, Axon P, Akil I et al. (2002) Novel ATP6V1B1 and ATP6V0A4 mutations in autosomal recessive distal re- nal tubular acidosis with new evidence for hearing loss. Journal of Medical Genetics 39:796–803. Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302:249–255. Su H, Yuan Y , Wang XM, Lau WB, Wang Y , Wang X, Gao E, Koch WJ, Ma XL (2013) Inhibition of CTRP9, a novel and cardiac-abundantly expressed cell survival molecule, by TNFα -initiated oxidative signaling contributes to exacerbated cardiac injury in diabetic mice. Basic Research in Cardiology 108:1–12. Su W, Cao R, Zhang Xy, Guan Y (2020) Aquaporins in the kidney: physiology and pathophysi- ology. American Journal of Physiology-Renal Physiology 318:F193–F203. Sweet-Cordero A, Mukherjee S, Subramanian A, You H, Roix JJ, Ladd-Acosta C, Mesirov J, Golub TR, Jacks T (2005) An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nature Genetics 37:48–55. Tafoya LC, Mameli M, Miyashita T, Guzowski JF, Valenzuela CF, Wilson MC (2006) Expres- sion and function of snap-25 as a universal snare component in gabaergic neurons. Journal of Neuroscience 26:7826–7838. Tamayo P, Scanfeld D, Ebert BL, Gillette MA, Roberts CW, Mesirov JP (2007) Metagene projec- tion for cross-platform, cross-species characterization of global transcriptional states. Proceed- ings of the National Academy of Sciences 104:5959–5964. Tan MC, Widagdo J, Chau YQ, Zhu T, Wong JJL, Cheung A, Anggono V (2017) The activity- induced long non-coding RNA Meg3 modulates AMPA receptor surface expression in primary cortical neurons. Frontiers in Cellular Neuroscience 11:124. Tautz D, Domazet-Loˇ so T (2011) The evolutionary origin of orphan genes. Nature Reviews Genetics 12:692–702. Theißen G (2002) Orthology: secret life of genes. Nature 415:741–741. Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, Mar Alba M (2009) Origin of primate orphan genes: a comparative genomics approach. Molecular Biology and Evolu- tion 26:603–612. Tomaszewski M, Eales J, Denniff M, Myers S, Chew GS, Nelson CP, Christofidou P, Desai A, B¨ usst C, Wojnar L et al. (2015) Renal mechanisms of association between fibroblast growth factor 1 and blood pressure. Journal of the American Society of Nephrology 26:3151–3160. 99 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28:511–515. Uhl´ en M, Fagerberg L, Hallstr¨ om BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson ˚ A, Kampf C, Sj¨ ostedt E, Asplund A et al. (2015) Tissue-based map of the human proteome. Sci- ence 347:1260419. van Dam S, Craig T, de Magalh˜ aes JP (2015) Genefriends: a human rna-seq-based gene and transcript co-expression database. Nucleic Acids Research 43:D1124–D1132. Van Loan CF (1976) Generalizing the singular value decomposition. SIAM Journal on Numerical Analysis 13:76–83. Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molec- ular Ecology 17:1636–1647. Walsh KA, Neurath H (1964) Trypsinogen and chymotrypsinogen as homologous proteins. Pro- ceedings of the National Academy of Sciences 52:884. Wang HQ, Zheng CH, Zhao XM (2015) jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 31:572–580. Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformat- ics 28:2184–2185. Wang P, Sun B, Hao D, Zhang X, Shi T, Ma D (2010) Human TMEM174 that is highly expressed in kidney tissue activates AP-1 and promotes cell proliferation. Biochemical and Biophysical Research Communications 394:993–999. Wang Z, Gerstein M, Snyder M (2009) Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10:57–63. Warren JS, Tracy CM, Miller MR, Makaju A, Szulik MW, Oka Si, Yuzyuk TN, Cox JE, Kumar A, Lozier BK et al. (2018) Histone methyltransferase smyd1 regulates mitochondrial energetics in the heart. Proceedings of the National Academy of Sciences 115:E7871–E7880. Watson H, Kendrew J (1961) Comparison between the amino-acid sequences of sperm whale myoglobin and of human haemoglobin. Nature 190:670–672. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM (2013) The Cancer Genome Atlas Pan-Cancer project. Nature Genet- ics 45:1113–1120. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V , Goodhead I, Penkett CJ, Rogers J, B¨ ahler J (2008) Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolu- tion. Nature 453:1239–1243. 100 Wray GA (2007) The evolutionary significance of cis-regulatory mutations. Nature Reviews Genetics 8:206–216. Wykes SM, Nelson JE, Visscher DW, Djakiew D, Krawetz SA (1995) Coordinate expres- sion of the PRM1, PRM2, and TNP2 multigene locus in human testis. DNA and Cell Biol- ogy 14:155–161. Xie Y , Yao J, Zhang X, Chen J, Gao Y , Zhang C, Chen H, Wang Z, Zhao Z, Chen W et al. (2020) A panel of extracellular vesicle long noncoding RNAs in seminal plasma for predicting testicular spermatozoa in nonobstructive azoospermia patients. Human Reproduction 35:2413–2427. Yan W, Ma L, Burns KH, Matzuk MM (2004) Haploinsufficiency of kelch-like protein ho- molog 10 causes infertility in male mice. Proceedings of the National Academy of Sci- ences 101:7793–7798. Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V , Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E et al. (2005) Genome-wide midrange transcription profiles reveal ex- pression level relationships in human tissue specification. Bioinformatics 21:650–659. Yang K, Meinhardt A, Zhang B, Grzmil P, Adham IM, Hoyer-Fender S (2012) The small heat shock protein ODF1/HSPB10 is essential for tight linkage of sperm head to tail and male fertility in mice. Molecular and Cellular Biology 32:216–225. Yang Z, Wang KK (2015) Glial fibrillary acidic protein: from intermediate filament assembly and gliosis to neurobiomarker. Trends in Neurosciences 38:364–374. Yashiro K, Saijoh Y , Sakuma R, Tada M, Tomita N, Amano K, Matsuda Y , Monden M, Okada S, Hamada H (2000) Distinct transcriptional regulation and phylogenetic divergence of human lefty genes. Genes to Cells 5:343–357. Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11:R14. Young RA (2011) Control of the embryonic stem cell state. Cell 144:940–954. Yuan S, Qin W, Riordan CR, Mcswiggin H, Zheng H, Yan W (2015) Ubqln3, a testis-specific gene, is dispensable for embryonic development and spermatogenesis in mice. Molecular Repro- duction and Development 82:266. Yuzwa SA, Borrett MJ, Innes BT, V oronova A, Ketela T, Kaplan DR, Bader GD, Miller FD (2017) Developmental emergence of adult neural stem cells as revealed by single-cell transcriptional profiling. Cell Reports 21:3970–3986. Zhang J, Dean AM, Brunet F, Long M (2004) Evolving protein functional diversity in new genes of drosophila. Proceedings of the National Academy of Sciences 101:16246–16250. Zhang L, Salgado-Somoza A, Vausort M, Leszek P, Devaux Y et al. (2018) A heart-enriched antisense long non-coding RNA regulates the balance between cardiac and skeletal muscle triadin. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research 1865:247–258. 101 Zhang S, Liu CC, Li W, Shen H, Laird PW, Zhou XJ (2012) Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Research 40:9379–9391. Zhu Q, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN et al. (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nature Methods 12:211. Zinman GE, Naiman S, Kanfi Y , Cohen H, Bar-Joseph Z (2013) ExpressionBlast: mining large, unstructured expression databases. Nature Methods 10:925. Zon LI, Peterson RT (2005) In vivo drug discovery in the zebrafish. Nature Reviews Drug Discovery 4:35–44. 102 Appendix A Appendix Figures: Orthogonal shared basis factorization ≈ Figure A.1: Individual components of OSBF: A cartoon summarizing the joint matrix factoriza- tion using the OSBF. The dimensions of different matrices are shown using smaller fonts. 103 Appendix B Appendix Figures: Identifying functionally relevant genes 104 H.sapiens M.mulatta M.musculus B.taurus G.gallus H.sapiens M.mulatta M.musculus B.taurus G.gallus 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.3 0.5 0.7 0.9 0.00 0.25 0.50 0.75 1.00 testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain R = 0.83 R = 0.70 R = 0.74 R = 0.59 R = 0.94 R = 0.93 R = 0.89 R = 0.89 R = 0.93 R = 0.90 Figure B.1: Inter-tissue gene expression correlation (R i ) relationship across species: Heatmap shows gene expression inter-tissue correlation (R i ) within a species for nine different tissue types. Pairwise scatter plot shows the R i relationship between five species: human, macaque, mouse, cow, and chicken. A data point in the scatter plot represents the correlation between the same tissues within a species for two species. 105 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 0.00 0.15 0.30 H.sapiens M.mulatta M.musculus B.taurus G.gallus H.sapiens M.mulatta M.musculus B.taurus G.gallus 0.00 0.25 0.50 0.75 1.00 testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain testis spleen muscle lung liver kidney intestine heart brain R = -0.44 R = 0.32 R = -0.20 R = -0.08 R = -0.03 R = 0.03 R = -0.15 R = -0.01 R = 0.05 R = 0.21 Figure B.2: Inter-tissue gene expression correlation (R i ) relationship across species for shuf- fled counts: Pairwise inter-tissue gene expression correlation relationship (R i ) between five species human, macaque, mouse, cow, and chicken, for nine different tissue types when counts of the tissues are permuted to create random expression profiles. 106 0.00 0.25 0.50 0.75 1.00 0.00 0.01 0.02 0.03 H.sapiens a) Expression specificity ( ) Coefficient 0.00 0.25 0.50 0.75 1.00 0.00 0.01 0.02 0.03 M.musculus b) Expression specificity ( ) Coefficient Figure B.3: Properties of the dimension 1 of the common subspace: a-b) Scatter plot showing the relationship between gene loadings ofU 1 and gene expression specificity ( τ ) for three different species. All the gene coefficients along dimension 1 are positive values as the centroid of all the tissue types from different species lie on the positive coordinates along dimension 1 of the common subspace. a) b) 0.00 0.25 0.50 0.75 1.00 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 U 2 Coefficient Expression specificity ( τ) M.musculus Top genes Ratio of genes in GO category 0.05 0.10 0.15 padj spermatogenesis flagellated sperm motility biological process spermatid development fertilization chromosome condensation binding of sperm to zona pellucida penetration of zona pellucida 0.00 0.19 0.38 0.56 0.75 Figure B.4: Properties of the dimension 2 of the common subspace: In (a), the top-ranked genes selected for GO analysis (d) for the mouse are highlighted in red. b) The GO ontology analysis of the top mouse genes in dimension 2 shows enrichment for testis-related functions. 107 a) b) c) d) Dim 4 Dim 3 brain heart kidney liver lung testis human chimp macaque mouse rat cow pig chicken -0.5 0.0 0.5 1.0 -0.5 0.0 0.5 Ratio of genes in GO category chemical synaptic transmission regulation of neuronal synaptic plasticity learning positive regulation of dendrite extension neurotransmitter secretion negative regulation of neuron apoptotic process memory synaptic transmission, glutamatergic 0.01 0.02 0.03 0.04 padj 0.00 0.09 0.19 0.28 0.38 Ratio of genes in GO category neurotransmitter transport ionotropic glutamate receptor signaling pathway chemical synaptic transmission regulation of neuronal synaptic plasticity gamma−aminobutyric acid transport calcium ion−regulated exocytosis of neurotransmitter memory regulation of synaptic plasticity 0.00 0.25 0.50 0.75 1.00 0.004 0.008 0.012 padj Coefficient -2 -1 0 1 2 Zscore G.gallus 0 0.25 0.50 0.75 1.00 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 Expression specificity ( ) Figure B.5: Properties of the dimensions 3 and 4 of the common subspace: a) The projected expression profiles along dimensions 3 and 4 of the common subspace. Along dimension 3, a clear separation of the brain tissue from the rest of the tissues is observed. Along dimension 4, a weak separation of the lung tissue is observed. The GO ontology analysis of the top human genes (b) and top mouse genes (c) in dimension three show enrichment for brain-related functions. d) Scatter plot showing the relationship between gene loadings ofU 4 and gene expression specificity ( τ ) for chicken. The normalized lung gene expression value for each gene is shown using the z-score values. 108 a) b) c) d) Dim 6 Dim 5 brain heart kidney liver lung testis human chimp macaque mouse rat cow pig chicken -1.0 -0.5 0.0 0.5 -0.5 0.0 0.5 1.0 Ratio of genes in GO category regulation of heart rate striated muscle contraction skeletal muscle thin filament assembly cardiac myofibril assembly cardiac muscle tissue morphogenesis cardiac muscle contraction sarcomere organization muscle contraction 0.00 0.17 0.33 0.50 0.67 0.00 7.6e-20 1.5e-19 2.3e-19 3.0e-19 padj Ratio of genes in GO category muscle contraction cardiac myofibril assembly cardiac muscle tissue morphogenesis cardiac muscle contraction sarcomere organization skeletal muscle thin filament assembly regulation of heart rate heart contraction 0.0 5.0e-08 1.0e-07 1.5e-07 2.0e-07 padj 0.00 0.16 0.32 0.48 0.64 Ratio of genes in GO category ion transport excretion transmembrane transport metanephric distal convoluted tubule development chloride transmembrane transport anterior/posterior pattern specification inorganic anion transport sodium ion transport 2e−04 4e−04 6e−04 8e−04 padj 0.00 0.25 0.50 0.75 1.00 Figure B.6: Properties of the dimensions 5 and 6 of the common subspace: a) The projected expression profiles along dimensions 4 and 5 of the common subspace. The GO ontology anal- ysis of the top human genes (b) and top mouse genes (c) identified from dimension five show enrichment for heart-related functions. d) The GO ontology analysis of the top human genes in dimension six shows enrichment for kidney-related functions. 109 Estimate OSBF Species 1 Species 2 Species ≈ ≈ ≈ Clustering Visualization Classification Analysis in common subspace Mean expression of profiles from tissue types shared in species Gene expression profiles ( species) Project profiles to common subspace of dim species Profile from { estimated factors Identify genes contributing to different dimensions of the common subspace Figure B.7: Steps involved in the estimation of the OSBF common subspace and analysis: The OSBF matrix factorization is estimated using the mean expression profile of functionally similar tissues present in allk species. Allk factorizations share a common basisV , a conserved subspace representing the column phenotypes. Individual gene expression profiles from k species are pro- jected on the common subspace for joint analysis where each projected profile has a dimension of n× 1. 110 a) b) c) d) 0 10 20 30 H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non−coding genes lncRNA pseudogene miRNA other ncRNAs testis H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non−coding genes 0 2 4 6 brain lncRNA pseudogene miRNA other ncRNAs H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus S.scrofa G.gallus # of non−coding genes heart lncRNA pseudogene miRNA TEC 0.0 2.5 5.0 7.5 10.0 H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non−coding genes 0 5 10 15 20 kidney lncRNA pseudogene miRNA other ncRNAs TEC Figure B.8: Gene annotation classification of non-coding genes: a-d) The number of different non-coding TRGs identified by the OSBF. 111 a) b) c) d) 74 32 32 14 9 9 7 7 7 5 5 5 5 4 4 4 4 4 4 3 mouse rat cow pig chimp human rhesus chicken 0 100 200 300 0 20 40 60 80 Genes common 12 9 7 6 6 5 5 3 3 3 3 3 3 2 2 2 2 2 2 2 human chimp rhesus mouse rat cow pig chicken Genes common 0 5 10 0 30 60 90 19 11 4 4 4 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 human chimp rhesus mouse rat cow pig chicken 0 25 50 75 100 0 5 10 15 20 Genes common 31 22 12 10 5 4 4 4 4 4 3 3 2 2 2 2 2 2 2 2 human chimp rhesus mouse rat cow pig chicken 0 50 100 150 0 10 20 30 Genes common H.sapiens brain H.sapiens heart H.sapiens kidney M.musculus testis Figure B.9: Distribution of species-specific and shared tissue-relevant genes: a) Distribution of testis-specific genes unique to the mouse and shared with other species. A mouse gene is shared with a species if the corresponding ortholog is identified as testis-specific in the other species. b-d) Human organ-specific genes for the brain, heart, and kidney. Similar to a), the orthology relationship between human is considered for b-d. Only the top 20 intersection sets are shown for a-d. 112 a) b) c) d) P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0 20 40 60 80 Human−specific orthologs % P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus testis P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0 20 40 60 80 Human−specific orthologs % P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus brain P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0 20 40 60 80 Human−specific orthologs % P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus heart P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 0 20 40 60 80 Human−specific orthologs % P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus kidney Figure B.10: Human TRGs genes with orthologs not being tissue-specific expressed: a-d)The percentage of human TRGs identified by OSBF that have orthologs, and the orthologs are not identified as TRGs in the other species. 113 a) b) c) d) 0 10 20 30 40 50 ncRNA protein coding pseudogene H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non orthologous genes Brain H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non orthologous genes 0 10 20 30 40 50 ncRNA protein coding pseudogene TEC Heart H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non orthologous genes 0 10 20 30 40 50 ncRNA protein coding pseudogene TEC Kidney 0 10 20 30 40 50 H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus # of non orthologous genes testis brain heart kidney Figure B.11: Tissue-relevant genes with no pairwise orthologs: a) The number of identified TRGs for different species with no pairwise orthologs with the other seven species in the Ensemble database b-d) Gene annotation classification of TRGs with no pairwise orthologs for the brain, heart, and kidney. 114 Appendix C Appendix Tables: Identifying functionally relevant genes Table C.1: The percentage of correlation information (p ij ) represented by the common subspace dimensions. p ij Dimension H.sapiens P.troglodytes M.mulatta M.musculus R.norvegicus B.taurus S.scrofa G.gallus 1 87.22 88.86 87.77 81.64 84.09 87.37 85.53 88.21 2 5.05 4.64 4.87 7.99 6.48 4.94 5.39 4.15 3 2.99 2.7 2.98 4.43 3.86 2.96 3.38 2.98 4 2.14 1.59 1.8 2.66 2.51 1.78 2.42 1.86 5 1.41 1.24 1.36 1.68 1.62 1.71 1.81 1.55 6 1.18 0.96 1.22 1.59 1.43 1.23 1.47 1.26 p ij =δ 2 ij / P 6 j=1 δ 2 ij × 100, where∆ i =diag(δ i1 ,...,δ i6 ) Table C.2: Top gene ontology (GO) biological process terms enriched for genes with a high U 1 coefficient in human. Category Term InCat Total p-adj GO:0006614 SRP-dependent cotranslational protein targeting to membrane 19 27 9.86E-16 GO:0000184 mRNA catabolism, nonsense-mediated 19 39 8.54E-15 GO:0006413 translational initiation 20 51 9.33E-15 GO:0006412 translation 22 130 1.17E-11 GO:0002181 cytoplasmic translation 6 9 1.19E-3 InCat: number of genes of interest in the GO category Total: total number of genes annotated in the GO category p-adj: FDR adjusted p-value for over-representation 115 Table C.3: Top gene ontology (GO) biological process terms enriched for genes with a high U 1 coefficient in mouse. Category Term InCat Total p-adj GO:0006412 translation 14 107 2.15E-06 GO:0002181 cytoplasmic translation 6 10 3.99E-05 GO:0022900 electron transport chain 7 26 9.79E-05 GO:1902600 proton transmembrane transport 7 34 4.97E-04 GO:0006457 protein folding 8 49 4.97E-04 InCat: number of genes of interest in the GO category Total: total number of genes annotated in the GO category p-adj: FDR adjusted p-value for over-representation Table C.4: Gene ontology (GO) biological process terms enriched for top genes with high U 2 coefficient in different species. Category Term MacacaMulatta GO:0007286 spermatid development GO:0030317 flagellated sperm motility GO:0007283 spermatogenesis GO:0051983 regulation of chromosome segregation GO:0007341 penetration of zona pellucida Susscrofa GO:0007283 spermatogenesis GO:0007286 spermatid development GO:0030317 flagellated sperm motility GO:0007342 fusion of sperm to egg plasma membrane involved in single fertilization Gallusgallus GO:0060294 cilium movement involved in cell motility GO:0030317 flagellated sperm motility GO:0003341 cilium movement GO:0007343 egg activation GO:0007286 spermatid development 116 Table C.5: Top gene ontology (GO) biological process terms enriched for genes with a high U 4 coefficient in humans. Category Term InCat Total p-adj GO:0044267 cellular protein metabolic process 11 100 0.0005 GO:0007585 respiratory gaseous exchange by respiratory system 5 19 0.0184 GO:0043129 surfactant homeostasis 4 10 0.0184 GO:0010811 positive regulation of cell-substrate adhesion 5 25 0.0361 InCat: number of genes of interest in the GO category Total: total number of genes annotated in the GO category p-adj: FDR adjusted p-value for over-representation Table C.6: Annotation distribution of TRGs identified for testis and brain Testis Brain species n protein coding% non-coding RNAs% pseudogene% n protein coding% non-coding RNAs% pseudogene% H.sapiens 278 87.05 (7.90e-37) 7.91 5.04 111 97.30 (2.58e-45) 2.70 NA P .troglodytes 343 96.79 (1.11e-39) 0.29 2.92 135 96.30 (1.28e-15) 2.96 0.74 M.mulatta 291 95.19 (1.88e-35) 3.09 1.72 162 98.77 (2.22e-27) 1.23 NA M.musculus 316 91.14 (1.34e-80) 5.06 3.80 138 97.83 (6.65e-49) 2.17 NA R.norvegicus 302 94.70 (4.70e-31) 3.31 1.99 82 91.46 (3.18e-07) 7.32 1.22 B.taurus 336 97.62 (1.01e-20) 0.60 1.79 106 99.06 (6.49e-09) 0.94 NA S.scrofa 358 99.44 (1.61e-20) 0.28 0.28 127 96.85 (6.09e-05) 3.15 NA G.gallus 253 96.44 (2.45e-22) 3.56 NA 85 92.94 (6.15e-06) 5.88 1.18 n: Total number of genes identified. Enrichment p-value (hyper-geometric) shown in brackets for protein-coding genes. Table C.7: Annotation distribution of TRGs identified for heart and kidney Heart Kidney n protein coding% non-coding RNAs% pseudogene% TEC% n protein coding% non-coding RNAs% pseudogene% TEC% H.sapiens 97 90.72 (1.89e-30) 8.25 1.03 NA 153 86.93 (1.69e-40) 10.46 1.96 0.65 P .troglodytes 94 96.81 (9.86e-12) 3.19 NA NA 164 98.17 (2.63e-22) 1.83 NA NA M.mulatta 110 96.36 (1.59e-15) 3.64 NA NA 167 96.41 (4.81e-23) 3.59 NA NA M.musculus 122 90.98 (1.07e-31) 7.38 0.82 0.82 191 93.19 (4.73e-54) 5.76 0.52 0.52 R.norvegicus 92 96.74 (3.46e-12) 3.26 NA NA 140 94.29 (1.50e-14) 5 0.71 NA B.taurus 98 100 (1.35e-09) NA NA NA 156 98.72 (5.03e-12) 1.28 NA NA S.scrofa 106 98.11 (2.62e-05) 0.94 0.94 NA 157 99.36 (2.29e-09) 0.64 NA NA G.gallus 75 97.33 (4.38e-08) 2.67 NA NA 155 97.42 (1.08e-15) 2.58 NA NA n: Total number of genes identified. TEC: To be experimentally confirmed biotype. Enrichment p-value (hyper-geometric) shown in brackets for protein-coding genes. 117 Table C.8: Table showing a few (n=25) previously studied tissue-specific expressed non-coding genes identified by the OSBF Gene Biotype Species Reference Testis SPATA8 lncRNA human (Djureinovic et al., 2014) MIR202 miRNA human (Ro et al., 2007) SPATA42 lncRNA human (Xie et al., 2020) 1700019M22Rik pseudogene mouse (Kato & Nozaki, 2012) Rbakdn lncRNA mouse (Liu et al., 2021) 4930571K23Rik lncRNA mouse (Kato & Nozaki, 2012) 1700008K24Rik lncRNA mouse (Kato & Nozaki, 2012) Brain MIR124-1HG lncRNA human (Liu et al., 2016) BCYRN1 scRNA human (Mu et al., 2020) Meg3 lncRNA mouse (Tan et al., 2017) Miat lncRNA mouse (He et al., 2020) Mir124a-1hg lncRNA mouse (Sanuki et al., 2011) Heart LINC00881 lncRNA human (Li et al., 2015) NPY6R pseudogene human (Burkhoff et al., 1998) TRDN-AS1 lncRNA human (Zhang et al., 2018) Mhrt lncRNA mouse (Han et al., 2014) C130080G10Rik lncRNA mouse (Ponnusamy et al., 2019) mml-mir-133a miRNA rhesus (Iguchi et al., 2017) Kidney RPS24P17 pseudogene human (Tomaszewski et al., 2015) LINC01187 lncRNA human (Skala et al., 2020) EMX2OS lncRNA human (Jiang et al., 2020) LINC01874 lncRNA human (Fagerberg et al., 2014) AP005432.2 lncRNA human (Gong et al., 2021) D630029K05Rik lncRNA mouse (Sanchez-Martin et al., 2021) Mir196a-1 miRNA mouse (Meng et al., 2016) Table C.9: The number of previously characterized tissue-specific expressed human marker genes identified by the OSBF Tissue OSBF # of genes Study Study # of genes # genes common p-value Testis 278 Djureinovic et. al. 2014 62 56 2.82E-129 Testis 278 Human protein atlas 911 216 0 Brain 111 Human protein atlas 517 46 1.07E-67 Heart 97 Human protein atlas 34 17 2.17E-40 Kidney 153 Human protein atlas 56 38 3.30E-89 Human protein atlas: (Uhl´ en et al., 2015) (HPA, v210) p-value: Overlap hyper-geometric p-value 118 Table C.10: Percentage of tissue-specific genes with at least one ortholog in other seven species in the Ensembl database identified as TRGs by the OSBF testis brain heart kidney H.sapiens 75.90 90.99 88.66 80.39 P .troglodytes 79.01 82.96 91.49 85.37 M.mulatta 84.54 74.07 86.36 76.05 M.musculus 79.11 78.99 80.33 67.54 R.norvegicus 79.47 89.02 91.30 81.43 B.taurus 82.74 88.68 87.76 77.56 S.scrofa 81.28 66.93 83.96 85.35 G.gallus 35.57 57.65 70.67 48.39 119 Table C.11: Table showing a few (n = 40) well-studied protein-coding marker genes identified by the OSBF Gene Biotype Species Reference Testis ODF1 protein coding 7 mammals + chicken (Yang et al., 2012) KLHL10 protein coding 7 mammals + chicken (Yan et al., 2004) TPPP2 protein coding 7 mammals + chicken (Milardi et al., 2019) PRM2 protein coding 7 mammals (Wykes et al., 1995) TNP1 protein coding 7 mammals (Mali et al., 1989) AKAP4 protein coding 7 mammals (Aslani et al., 2011) SMCP protein coding 7 mammals (Herr et al., 1999) TCP11 protein coding 7 mammals (Liu et al., 2011) UBQLN3 protein coding 7 mammals (Yuan et al., 2015) CETN1 protein coding 7 mammals (Hart et al., 1999) Brain GFAP protein coding 7 mammals + chicken (Yang & Wang, 2015) SNAP25 protein coding 7 mammals + chicken (Tafoya et al., 2006) GAP43 protein coding 7 mammals + chicken (Nemes et al., 2017) GRIN1 protein coding 7 mammals + chicken (Platzer & Lemke, 2019) SYT1 protein coding 7 mammals + chicken (Baker et al., 2018) PLP1 protein coding 7 mammals + chicken (Boison et al., 1995) STMN4 protein coding 7 mammals (Yuzwa et al., 2017) GRIA2 protein coding 7 mammals (Borges & Dingledine, 1998) NEFL protein coding 7 mammals (Byrne et al., 2017) NRSN1 protein coding 7 mammals (Kiyonaga-Endou et al., 2016) 120 Heart MYL2 protein coding 7 mammals + chicken (Sheikh et al., 2015) CSRP3 protein coding 7 mammals + chicken (Arber et al., 1997) ACTC1 protein coding 7 mammals + chicken (Frank et al., 2019) MYBPC3 protein coding 7 mammals + chicken (Mearini et al., 2014) MYOZ2 protein coding 7 mammals + chicken (Osio et al., 2007) TECRL protein coding 7 mammals + chicken (Devalla et al., 2016) MYL3 protein coding 7 mammals + chicken (Andersen et al., 2012) SMYD1 protein coding 7 mammals + chicken (Warren et al., 2018) TNNC1 protein coding 7 mammals + chicken (Li & Hwang, 2015) CASQ2 protein coding 7 mammals + chicken (Faggioni & Knollmann, 2012) Kidney AQP2 protein coding 7 mammals + chicken (Su et al., 2020) SLC12A1 protein coding 7 mammals + chicken (Markadieu & Delpire, 2014) KCNJ1 protein coding 7 mammals + chicken (Konrad et al., 2021) NPHS2 protein coding 7 mammals + chicken (Dong et al., 2015) ATP6V0A4 protein coding 7 mammals + chicken (Stover et al., 2002) SLC5A12 protein coding 7 mammals + chicken (Kirita et al., 2020) SLC34A1 protein coding 7 mammals + chicken (Fearn et al., 2018) RHCG protein coding 7 mammals + chicken (Brown et al., 2009) KCNJ16 protein coding 7 mammals + chicken (Liu et al., 2000) TMEM174 protein coding 7 mammals + chicken (Wang et al., 2010) 121 Table C.12: The number of homologous TRGs genes shared across different species clades primates rodents mammals all 8 species testis 81 147 32 3 brain 47 57 11 7 heart 49 67 23 17 kidney 73 83 30 20 Table C.13: Expression divergence of orthologs identified by OSBF analysis Gene Species Tissue Orthology Gene Species Tissue ENSG00000187581(COX8C) H. Sapiens testis 1 to 1 ENSMUSG00000025488 M.musculus heart ENSRNOG00000014656 R.norvegicus heart ENSBTAG00000002046 B.taurus heart ENSSSCG00000014560 S.scrofa heart 1 to many ENSGALG00000045362 G.gallus heart ENSMMUG00000019818(CRISP3) M.mulatta testis 1 to many ENSGALG00000034474 G.gallus kidney ENSPTRG00000023956(SULT1C3) P .troglodytes kidney 1 to 1 ENSSSCG00000028691 S.scrofa testis ENSSSCG00000011277(CCK) S.scrofa heart 1 to 1 ENSPTRG00000014791 P .troglodytes brain ENSMUSG00000032532 M.musculus brain ENSBTAG00000013027 B.taurus brain ENSMUSG00000071347(C1qtnf9) M.musculus heart 1 to 1 ENSGALG00000017117 G.gallus kidney 122 Table C.14: Mouse homologous TRGs with orthologs not identified in the other species tissue reference #total reference #orthologs species # common reference-specific ortholog % testis 316 235 H.sapiens 108 54.04 221 P .troglodytes 121 45.25 209 M.mulatta 123 41.15 259 R.norvegicus 182 29.73 230 B.taurus 159 30.87 240 S.scrofa 155 35.42 84 G.gallus 34 59.52 brain 138 134 H.sapiens 54 59.70 126 P .troglodytes 55 56.35 126 M.mulatta 62 50.79 129 R.norvegicus 58 55.04 129 B.taurus 61 52.71 133 S.scrofa 49 63.16 109 G.gallus 23 78.90 heart 122 108 H.sapiens 70 35.19 108 P .troglodytes 69 36.11 101 M.mulatta 73 27.72 105 R.norvegicus 74 29.52 106 B.taurus 69 34.91 105 S.scrofa 75 28.57 91 G.gallus 48 47.25 kidney 191 160 H.sapiens 73 54.38 156 P .troglodytes 80 48.72 147 M.mulatta 83 43.54 165 R.norvegicus 91 44.85 155 B.taurus 74 52.26 163 S.scrofa 81 50.31 118 G.gallus 51 56.78 reference #total: # of mouse TRGs reference #orthologs: # of mouse homologous TRGs with an ortholog in other species # common: # of orthologs identified as TSGEs in both mouse and other species reference-specific ortholog%: % of mouse-specific homologous TRGs 123 Table C.15: The number of OSBF TRGs (Ensembl v94) that do not have any pairwise orthologs in the other seven species but have at least one in Ensembl v105 Tissue species #updated in v105 Tissue species #updated in v105 testis H.sapiens 3 heart H.sapiens 0 testis P .troglodytes 1 heart P .troglodytes 0 testis M.mulatta 7 heart M.mulatta 2 testis M.musculus 4 heart M.musculus 1 testis R.norvegicus 10 heart R.norvegicus 1 testis B.taurus 5 heart B.taurus 1 testis S.scrofa 4 heart S.scrofa 0 testis G.gallus 6 heart G.gallus 1 Tissue species #updated in v105 Tissue species #updated in v105 brain H.sapiens 0 kidney H.sapiens 1 brain P .troglodytes 2 kidney P .troglodytes 0 brain M.mulatta 0 kidney M.mulatta 1 brain M.musculus 0 kidney M.musculus 1 brain R.norvegicus 2 kidney R.norvegicus 1 brain B.taurus 0 kidney B.taurus 0 brain S.scrofa 2 kidney S.scrofa 0 brain G.gallus 1 kidney G.gallus 4 124 Table C.16: TRGs and differential expression species tissue nDE nTRG nDE+TRG % TRG % DE H.sapiens brain 1853 111 110 99.10 5.94 H.sapiens heart 567 97 95 97.94 16.75 H.sapiens kidney 1006 153 138 90.20 13.72 H.sapiens testis 6578 278 278 100.00 4.23 P .troglodytes brain 1219 135 130 96.30 10.66 P .troglodytes heart 375 94 84 89.36 22.40 P .troglodytes kidney 561 164 126 76.83 22.46 P .troglodytes testis 2310 343 279 81.34 12.08 M.mulatta brain 1395 162 157 96.91 11.25 M.mulatta heart 542 110 106 96.36 19.56 M.mulatta kidney 864 167 153 91.62 17.71 M.mulatta testis 3346 291 290 99.66 8.67 M.musculus brain 2357 138 137 99.28 5.81 M.musculus heart 890 122 118 96.72 13.26 M.musculus kidney 1224 191 176 92.15 14.38 M.musculus testis 6207 316 273 86.39 4.40 R.norvegicus brain 1571 82 79 96.34 5.03 R.norvegicus heart 630 92 85 92.39 13.49 R.norvegicus kidney 970 140 132 94.29 13.61 R.norvegicus testis 3449 302 296 98.01 8.58 B.taurus brain 1136 106 100 94.34 8.80 B.taurus heart 694 98 91 92.86 13.11 B.taurus kidney 752 156 142 91.03 18.88 B.taurus testis 2200 336 312 92.86 14.18 S.scrofa brain 1332 127 120 94.49 9.01 S.scrofa heart 486 106 90 84.91 18.52 S.scrofa kidney 730 157 127 80.89 17.40 S.scrofa testis 2647 358 356 99.44 13.45 G.gallus brain 1379 85 84 98.82 6.09 G.gallus heart 543 75 74 98.67 13.63 G.gallus kidney 777 155 140 90.32 18.02 G.gallus testis 1840 253 235 92.89 12.77 mean 93.21 12.75 nDE: common DE identified for each tissue in a species % TRG: % of TRGs that are also DE % DE: % of non-TRG DE genes 125 Table C.17: GO enrichment analysis for human DE genes with low coefficients and DE + TRGs GO category term GO category term testis DE with low coefficients DE + TRGs GO:0007608 sensory perception of smell GO:0007283 spermatogenesis GO:0050896 response to stimulus GO:0030317 flagellated sperm motility GO:0050911 detection of chemical stimulus involved in sensory perception of smell GO:0051321 meiotic cell cycle GO:0070268 cornification GO:0030154 cell differentiation GO:0031424 keratinization GO:0007338 single fertilization brain DE with low coefficients DE + TRGs GO:0060041 retina development in camera-type eye GO:0007268 chemical synaptic transmission GO:0034765 regulation of ion transmembrane transport GO:0048168 regulation of neuronal synaptic plasticity GO:0050896 response to stimulus GO:0007612 learning GO:0001523 retinoid metabolic process GO:1903861 positive regulation of dendrite extension GO:0009416 response to light stimulus GO:0007269 neurotransmitter secretion heart DE with low coefficients DE + TRGs GO:0086069 bundle of His cell to Purkinje myocyte communication GO:0030240 skeletal muscle thin filament assembly GO:0033034 positive regulation of myeloid cell apoptotic process GO:0055003 cardiac myofibril assembly GO:0070543 response to linoleic acid GO:0055008 cardiac muscle tissue morphogenesis GO:0070994 detection of oxidative stress GO:0060048 cardiac muscle contraction GO:0090317 negative regulation of intracellular protein transport GO:0030049 muscle filament sliding kidney DE with low coefficients DE + TRGs GO:0001658 branching involved in ureteric bud morphogenesis GO:0006811 ion transport GO:0015730 propanoate transport GO:0055085 transmembrane transport GO:0015913 short-chain fatty acid import GO:0007588 excretion GO:0015802 basic amino acid transport GO:0006814 sodium ion transport GO:0003339 regulation of mesenchymal to epithelial transition involved in metanephros morphogenesis GO:0016338 calcium-independent cell-cell adhesion via plasma membrane cell-adhesion molecules 126 Appendix D Appendix Analysis: Cross-species gene expression query 127 mAP Human Frog Zebrafish Drosophila 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Euclidean Cosine mAP 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Human Frog Zebrafish Drosophila Human Frog Zebrafish Drosophila a) b) Euclidean Spearman Cosine Pearson JSD 1/2 Figure D.1: Within species gene-expression search results of the curated profiles: a) Mean average precision (mAP) of the within-species search results is shown for four different species. For within-species queries, the individual profiles of each species are searched against the rest of the profiles from that species. We observed a similar high search accuracy (mAP ≥ 0.95) for the remaining 13 species. Five metrics were used to calculate the distance between the query and database profiles: Euclidean distance (Euclidean), cosine distance (Cosine), Pearson correlation distance (Pearson), Spearman correlation distance (Spearman), and the square root of the Jensen- Shannon divergence (JSD 1/2 ). b) mAP score for within species-search based on Euclidean and Cosine distance. Red bars show the search performance for within-species queries (same as (a)). The blue bar shows the performance when the search is carried out after removing the profiles from the same project with the same tissue type as the queries. When querying gene-expression profiles, the high mAP score values for the blue bars, almost identical to the red bars, indicate no strong project-specific biases in our curated compendium. 128 0 300 600 900 Phylogenetic distance with human (Mya) Fraction of house-keeping human orthologs 0.6 0.5 0.4 0.3 0.2 a) Fraction of house-keeping genes 0.5 0.4 0.3 0.2 0.1 0.0 primates treeshrew rodents mammals bird tetrapods vertebrate fly b) Figure D.2: Proportion of housekeeping orthologous genes: a) The fraction of 1-to-1 protein- coding orthologous genes in humans that are housekeeping genes. The pairwise orthology rela- tionship of 107 species with humans from the Ensembl database is used to generate this plot. b) The proportion of shared orthologous genes that are housekeeping in each clade. 129 tissue species tissue species tissue species marmoset gorilla rhesus human mouse rabbit bonobo chimpanzee rat treeshrew 0 50 100 150 0 100 200 300 400 0 100 200 300 400 brain heart kidney liver testis brain heart kidney liver muslce testis brain heart kidney liver muslce testis cow drosophila zebrafish chicken human mouse treeshrew frog PC1: 37% variance PC2: 23% variance -4 -2 0 2 4 -2.5 0.0 2.5 cow chicken frog zebrafish drosophila human treeshrew mouse marmoset gorilla rhesus bonobo chimpanzee human brain heart kidney liver testis a) c) d) b) Figure D.3: Clustering of gene expression profiles based on shared orthologous genes: a) Heatmap showing the clustering of rodent clade profiles using shared orthologous genes at the rodents clades (n=8046 genes). b,c) Heatmap and PCA clustering of gene expression profiles of different species (one representative species per clade) based on the shared orthologous genes at the fly clade (n=964 genes). Mean TPM expression of tissues is used for all heatmap and PCA analyses. d) Heatmap shows gene expression profiles of primates clustering using shared orthologous genes at the fly clade (n=964 genes). 130 -10 -5 0 5 -5 -3 3 0 0 5 15 10 -10 -5 0 5 5 -5 -5 -4 -5 0 0 0 0 5 4 8 15 10 PC1: 53% variance PC1: 58% variance PC1: 49% variance PC1: 40% variance PC2: 18% variance PC2: 19% variance PC2: 32% variance PC2: 19% variance Human Drosophila brain heart kidney liver muscle ovary testis Mouse Drosophila brain heart kidney liver muscle ovary skin testis Human Zebrafish brain heart intestine kidney liver muscle ovary testis brain heart intestine kidney liver muscle ovary skin testis Mouse Zebrafish a) c) d) b) Figure D.4: Clustering analysis of gene expression profiles of two divergent species based on orthologous genes: a,b) PCA clustering of Drosophila gene expression profiles with human and mouse profiles, respectively. b,c) PCA clustering of Zebrafish gene expression profiles with human and mouse profiles, respectively. The mean TPM expression of tissues is used for heatmap and PCA analysis. 131 a) b) c) d) primates treeshrew rodents mammals bird tetrapods vertebrate fly primates treeshrew rodents mammals bird tetrapods vertebrate fly 0.0 0.2 0.4 0.6 0.8 mAP Euclidean Cosine Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta S.scrofa R.norvegicus S.scrofa S.scrofa S.scrofa S.scrofa S.scrofa H.sapiens R.norvegicus 0.0010 0.0049 0.0087 0.0096 0.0102 0.0104 0.0104 0.0106 0.0110 0.0123 0.0088 0.0103 0.0120 0.0124 0.0127 0.0128 0.0128 0.0131 0.0131 0.0138 ovary ovary ovary ovary ovary ovary ovary ovary adrenal lung M.musculus ovary (GSM1196046) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.musculus M.musculus M.musculus M.musculus M.musculus M.musculus B.taurus M.musculus M.musculus M.mulatta 0.0022 0.0037 0.0046 0.0069 0.0076 0.0096 0.0102 0.0122 0.0129 0.0147 0.0093 0.0098 0.0101 0.0112 0.0115 0.0124 0.0127 0.0137 0.0141 0.0151 intestine intestine intestine intestine intestine intestine intestine intestine intestine intestine R.norvegicus intestine (ERR2130670) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 M.mulatta M.mulatta M.mulatta M.mulatta M.mulatta M.mulatta M.mulatta M.mulatta 0.0012 0.0012 0.0015 0.0018 0.0023 0.0024 0.0029 0.0036 0.0088 0.0088 0.0089 0.0090 0.0092 0.0093 0.0095 0.0097 retina retina retina retina retina retina retina retina B.taurus retina (GSM1453522) Figure D.5: Cross-species gene expression search using OSBF for additional tissues: a) mAP precision based on the cross-species search results for different clades when the search is performed using the OSBF common expression subspace. b-d) Top cross-species search results for ovary, intestine, and retina profiles from mouse, rat, and cow, respectively. The accession ids for search hits in (b) -(d) can be found in Appendix Table D.8. 132 a) b) c) d) Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta M.mulatta M.mulatta M.mulatta G.gallus M.mulatta M.mulatta M.mulatta M.mulatta B.taurus 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0005 0.0084 0.0084 0.0084 0.0085 0.0085 0.0085 0.0085 0.0085 0.0086 0.0086 brain.GTEX.111FC.3126.SM.5GZZ2 brain brain brain brain brain brain brain brain brain brain Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta O.cuniculus C.jacchus O.cuniculus O.cuniculus G.gallus P.paniscus O.cuniculus T.belangeri O.cuniculus 0.0004 0.0004 0.0005 0.0006 0.0006 0.0006 0.0006 0.0007 0.0007 0.0007 0.0085 0.0085 0.0086 0.0086 0.0086 0.0086 0.0086 0.0087 0.0087 0.0087 heart.GTEX.111FC.0826.SM.5GZWO heart heart heart heart heart heart heart heart heart heart Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 M.mulatta B.taurus M.mulatta M.musculus R.norvegicus M.musculus S.scrofa B.taurus P.paniscus S.scrofa 0.0005 0.0011 0.0013 0.0014 0.0018 0.0019 0.0019 0.0021 0.0021 0.0023 0.0086 0.0088 0.0088 0.0089 0.0090 0.0091 0.0091 0.0091 0.0092 0.0092 testis.GTEX.117XS.2026.SM.5GID1 testis testis testis testis testis testis testis testis testis testis Rank Species Tissue Distance p-value 1 2 3 4 5 6 7 8 9 10 P.troglodytes P.troglodytes S.scrofa P.troglodytes S.scrofa O.cuniculus P.troglodytes P.troglodytes S.scrofa S.scrofa 0.0005 0.0006 0.0010 0.0012 0.0012 0.0012 0.0012 0.0013 0.0013 0.0014 0.0085 0.0086 0.0087 0.0088 0.0088 0.0088 0.0088 0.0089 0.0089 0.0089 liver GTEX.11DXY.0526.SM.5EGGQ liver liver liver liver liver liver liver liver liver liver Figure D.6: Top search results using OSBF common subspace for GTEx profiles: a-d) Top search results for GTEx profiles from different tissue types using the OSBF approach. The acces- sion ids for search hits in (a) -(d) can be found in Appendix Table D.8. 133 a) b) c) d) 0.2 0.4 0.6 0.8 1.0 0 10 20 30 40 50 60 70 80 90 100 mAP evaluated at rank mAP evaluated at rank mAP n 0 10 20 30 40 50 60 70 80 90 100 0.2 0.4 0.6 0.8 1.0 mAP 1 2 3 4 5 6 7 8 9 10 n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 mAP 0.2 0.4 0.6 0.8 1.0 1 2 3 4 5 6 7 8 9 10 Cosine # of functionally similar tissues # of functionally similar tissues Euclidean Cosine Euclidean 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 mAP 0.2 0.4 0.6 0.8 1.0 Figure D.7: The effect of the number of functionally similar tissues on OSBF-based cross- species gene expression search: a) Using a different number of functionally similar tissues (n), the mAP is evaluated for different ranks (top hits 1-100) for a database and query set consisting of only human, macaque, mouse, and rat datasets. A total ofn = 10 tissue types are shared between these four species in our curated datasets. b) Similar to a) the mAP evaluated for different ranks (top hits 1-100) for a database and query set consisting of only human and mouse datasets. A total ofn = 15 tissue types are shared between human and mouse in our curated datasets. c) Different numbers of functionally similar tissues affect the overall search performance for a database and query set consisting of only human, macaque mouse, and rat profiles. A varying number of shared tissue types (1-10) is used to estimate the OSBF factorization, and the search is carried out based on this. The mAP score is evaluated based on all the hits. d) Similar to c) overall search performance for a database and query set consisting of only human and mouse profiles. 134 Table D.1: Gene ontology (GO) biological process terms enriched for orthologous genes (n=964 genes) shared across 17 species. category term numDEInCat numInCat padj GO:0006396 RNA processing 147 579 5.50E-36 GO:0034660 ncRNA metabolic process 102 355 2.52E-29 GO:0034641 cellular nitrogen compound metabolic process 504 4176 2.52E-29 GO:0006807 nitrogen compound metabolic process 524 4488 6.17E-27 GO:0022613 ribonucleoprotein complex biogenesis 80 259 2.33E-25 GO:0034470 ncRNA processing 73 226 1.52E-24 GO:0006725 cellular aromatic compound metabolic process 460 3881 2.83E-23 GO:0006139 nucleobase-containing compound metabolic process 444 3739 2.86E-22 GO:0046483 heterocycle metabolic process 454 3860 2.95E-22 GO:1901360 organic cyclic compound metabolic process 469 4029 2.95E-22 GO:0090304 nucleic acid metabolic process 401 3334 4.08E-20 GO:0044237 cellular metabolic process 684 6862 2.59E-18 GO:0042254 ribosome biogenesis 52 163 2.52E-17 GO:0044260 cellular macromolecule metabolic process 576 5536 3.97E-17 GO:0006399 tRNA metabolic process 45 121 7.84E-17 GO:0044238 primary metabolic process 672 6816 4.06E-16 numDEInCat: Number of orthologous genes in the GO category numInCat: Total number of genes annotated in the GO category padj: FDR adjusted p-value for over-representation Table D.2: Top OSBF-based cross-species search results for selected Drosophila and Zebrafish profiles SRP051100.dmelanogaster.heart.GSM1564445 SRP074433.dmelanogaster.brain.GSM2143981 Hits cosine.distance pvalue Hits cosine.distance pvalue SRP136499.mmulatta.heart.GSM3068308 0.0011 0.0088 SRP081121.ggallus.brain.GSM2264632 0.0106 0.0129 SRP102989.acarolinensis.heart.GSM2563012 0.0012 0.0088 SRP017611.ocuniculus.brain.GSM1055115 0.0112 0.0132 SRP058740.mmusculus.heart.GSM1695929 0.0012 0.0088 SRP007412.ptroglodytes.brain.GSM752665 0.0139 0.0147 SRP016501.rnorvegicus.heart.GSM1020686 0.0013 0.0089 SRP007412.ggorilla.brain.GSM752656 0.014 0.0147 SRP136499.mmulatta.heart.GSM3068312 0.0013 0.0089 SRP032451.sscrofa.brain.GSM1256176 0.0143 0.0149 SRP015997.mmusculus.heart.GSM1015154 0.0013 0.0089 SRP065882.drerio.brain.GSM1931768 0.0145 0.015 SRP124265.mmusculus.heart.GSM2842958 0.0014 0.0089 SRP007412.hsapiens.brain.GSM752692 0.0149 0.0152 SRP136499.ptroglodytes.heart.GSM3068276 0.0014 0.0089 SRP007412.hsapiens.brain.GSM752696 0.015 0.0153 SRP102989.acarolinensis.heart.GSM2563011 0.0016 0.009 SRP007412.ppaniscus.brain.GSM752679 0.0151 0.0153 SRP097223.ggallus.heart.GSM2464055 0.0017 0.009 SRP145002.rnorvegicus.brain.GSM3137383 0.0158 0.0157 SRP055573.drerio.liver.GSM1620954 SRP135774.drerio.testis.GSM3043290 Hits cosine.distance pvalue Hits cosine.distance pvalue SRP081121.ggallus.liver.GSM2264648 5.00E-04 0.0086 SRP141481.sscrofa.testis.GSM2461283 0.0035 0.0097 SRP058740.mmulatta.liver.GSM1695924 5.00E-04 0.0086 SRP141481.sscrofa.testis.GSM2461285 0.0038 0.0099 SRP124265.mmulatta.liver.GSM2842952 5.00E-04 0.0086 SRP141481.sscrofa.testis.GSM2461296 0.005 0.0103 SRP007412.ppaniscus.liver.GSM752688 6.00E-04 0.0087 SRP141481.sscrofa.testis.GSM2461300 0.0051 0.0104 SRP016501.mmulatta.liver.GSM1020706 6.00E-04 0.0087 SRP141481.sscrofa.testis.GSM2461294 0.0071 0.0112 SRP007412.hsapiens.liver.GSM752706 6.00E-04 0.0087 SRP092553.dmelanogaster.testis.GSM2367703 0.0072 0.0113 SRP057683.tbelangeri.liver.SRR1997916 6.00E-04 0.0087 SRP018524.sscrofa.testis.GSM1080027 0.0075 0.0114 SRP017959.xtropicalis.liver.GSM1064863 7.00E-04 0.0087 SRP092553.dmelanogaster.testis.GSM2367697 0.0078 0.0115 SRP016501.mmulatta.liver.GSM1020715 7.00E-04 0.0086 SRP141481.sscrofa.testis.GSM2461256 0.0078 0.0116 SRP008743.mmusculus.liver.SRX104354 7.00E-04 0.0086 SRP141481.sscrofa.testis.GSM2461295 0.008 0.0117 135 Table D.3: Cross-species mAP scores when Drosophila and Zebrafish profiles are queried based on common expression subspace and 1-to-1 ortholog genes. Shared tissue type profiles 1-to-1 orthologs OSBF Euclidean Cosine Euclidean Cosine Zebrafish 0.87 0.83 0.99 1.00 Drosophila 0.47 0.50 0.99 1.00 Table D.4: Few examples of cross-species query results for orthology-based search for Drosophila and Zebrafish profiles. Orthologous gene-based search is carried out using n = 964 coding genes shared across 17 species. Orthology based search SRP043319.dmelanogaster.malpighian.tubules.GSM1414456 SRP136174.dmelanogaster.oenocyte.GSM3058508 Hits cosine.distance Hits cosine.distance SRP156282.drerio.intestine.GSM3318287 0.03905 SRP044781.drerio.bones.SRR1524244 0.03894 SRP156282.drerio.intestine.GSM3318286 0.04018 SRP022612.drerio.skin.GSM1141100 0.03926 SRP156282.drerio.intestine.GSM3318285 0.04031 SRP156282.drerio.intestine.GSM3318285 0.03963 SRP156282.drerio.intestine.GSM3318282 0.04037 SRP016501.rnorvegicus.intestine.GSM1020685 0.03972 SRP017611.btaurus.kidney.GSM1055040 0.04039 SRP156282.drerio.intestine.GSM3318287 0.04004 SRP156282.drerio.intestine.GSM3318284 0.04072 SRP016501.rnorvegicus.intestine.GSM1020667 0.04005 SRP156282.drerio.intestine.GSM3318283 0.04073 SRP055573.drerio.skin.GSM1620982 0.04010 SRP145002.rnorvegicus.kidney.GSM3137392 0.04084 SRP016501.ggallus.intestine.GSM1020748 0.04033 SRP068653.btaurus.kidney.GSM2042596 0.04112 EMTAB6081.rnorvegicus.intestine.ERR2130671 0.04041 SRP068653.btaurus.kidney.GSM2042598 0.04119 EMTAB6081.rnorvegicus.intestine.ERR2130657 0.04047 SRP051100.dmelanogaster.heart.GSM1564439 SRP044781.drerio.kidney.SRR1524243 Hits cosine.distance Hits cosine.distance SRP156282.drerio.intestine.GSM3318280 0.02544 SRP102989.acarolinensis.ovary.GSM2563029 0.01770 SRP156282.drerio.intestine.GSM3318283 0.02579 SRP015997.xtropicalis.kidney.GSM1015167 0.01780 SRP156282.drerio.intestine.GSM3318279 0.02625 SRP058740.hsapiens.testis.GSM1695912 0.01957 SRP156282.drerio.intestine.GSM3318286 0.02850 SRP016501.ggallus.intestine.GSM1020757 0.01969 SRP150379.drerio.head.GSM3188314 0.03047 SRP016501.ggallus.intestine.GSM1020748 0.01988 SRP150379.drerio.head.GSM3188315 0.03066 SRP102989.acarolinensis.ovary.GSM2563030 0.01990 SRP007412.ppaniscus.liver.GSM752689 0.03067 SRP102989.acarolinensis.liver.GSM2563026 0.02021 SRP028336.ptroglodytes.kidney.GSM1198498 0.03076 SRP016501.ggallus.intestine.GSM1020766 0.02032 SRP150379.drerio.head.GSM3188316 0.03107 SRP102989.acarolinensis.ovary.GSM2563031 0.02093 SRP136499.ptroglodytes.kidney.GSM3068281 0.03107 SRP121212.ocuniculus.lung.GSM2829105 0.02093 Table D.5: Cross-species search performance in common subspace for different tissue types tissue mAP tissue mAP liver 0.9907 fin 1 brain 0.9507 pineal 0.9850 kidney 0.9684 pituitarygland 0.9333 testis 0.9996 gills 1 endometrium 0.9729 salivaryglands 1 breast 1 betacells 1 carcass 1 leukocyte 1 136 Table D.6: Cross-species mAP scores when Drosophila and Zebrafish profiles are queried using OSBF and orthologs. Complete Profiles 1-to-1 orthologs OSBF Euclidean Cosine Euclidean Cosine Zebrafish 0.39 0.37 0.45 0.43 Drosophila 0.22 0.26 0.67 0.55 Table D.7: Comparison of cross-species gene-expression search methods OSBF C3 ExpressionBlast ProfileChaser Method matrix factorization, Uses all genes. Gene set enrichment using highly expressed genes Comparison of top differential expression (DE) genes, Web based Comparison of differential expression profiles in reduced space, Web based Orthology mapping required No Yes Yes Yes Datatype RNA-Seq RNA-Seq/Microarray Microarray Microarray Input Gene expression matrix Gene expression matrix DE genes (max 100) with expression values Gene expression matrix Analysis type Pairwise comparison and Multiple species together Pairwise Pairwise Pairwise # of species Unrestricted Unrestricted NA 6 Dependency Cross-species datasets with similar tissues Curated compendium DE genes Reduced set of gene expression features 137 Table D.8: Accession IDs for OSBF-based cross-species gene expression search Query Hits Accession IDs (In rank order) G.gallus GSM2464051 GSM2463993,GSM1198492,GSM1198494,GSM2464012,GSM1020654, ERR2130644,GSM848934,ERR2130643,GSM1198524,GSM2464026 GTEX.13112.2126.SM.5GCO4 GSM1198495,GSM2842964,GSM752623,GSM3137354,GSM752625, GSM3137392,GSM2464063,SRR1524243,GSM1020723,GSM1278053 D.melanogaster GSM1414456 GSM3137354,GSM3068293,ERS2508670,ERS2508660,ERR2130673, GSM752623,GSM3137369,GSM2464120,GSM2464065,ERR213067 D.melanogaster GSM3058508 GSM1080042,GSM752627,SRX104356,GSM1069684,SRX102937, SRX104354,GSM1080038,GSM1080033,GSM1020661,GSM75262 D.rerio GSM1215619 GSM3137404,GSM1564440,GSM2842954,GSM1015154,GSM3137386, ERX012378,ERX012374,ERX012344,ERX012363,ERX012362 D.rerio GSM1620933 ERS2508666,GSM1695914,GSM1020711,GSM3137362,ERS2508654, GSM1020756,ERS2508674,GSM1020720,GSM2829086,GSM3137361 M.musculus GSM1196046 ERS2508607,GSM2905243,GSM1278057,GSM2905245,GSM2905262, GSM2905274,GSM2905264,GSM2905263,ERS326954,GSM1020689 R.norvegicus ERR2130670 ERR2130632,ERR2130620,GSM1020641,ERR2130633,ERR2130622, ERR2130630,GSM1020739,ERR2130631,ERR2130628,GSM1020703 B.taurus GSM1453522 GSM2068594,GSM2068593,GSM2068602,GSM2068591,GSM2068600, GSM2068599,GSM2068601,GSM2068595 GTEX.11DXY .0526.SM.5EGGQ GSM432620,GSM432621,GSM1080035,GSM432614,GSM2461216, GSM3137411,GSM752677,GSM432615,GSM1080037,GSM1080034 GTEX.117XS.2026.SM.5GID1 GSM1064840,GSM1020746,GSM752642,GSM752629,GSM1278058, GSM752630,GSM2461288,GSM1020728,GSM752690,GSM2461297 GTEX.111FC.0826.SM.5GZWO GSM3137352,GSM3184855,GSM3137366,GSM3184854,GSM3184852, GSM752561,GSM752684,GSM3184856,GSM957059,GSM3184851 GTEX.111FC.3126.SM.5GZZ2 ERS2508655,GSM3137350,ERS2508674,ERS2508654,GSM1015156, ERS2508666,GSM3137348,GSM752632,GSM1020702,GSM1020738 138
Abstract (if available)
Abstract
Cell phenotypes are supported by specific functions or pathways that involve the coordinated activity of multiple genes. When distantly related species share conserved cell phenotypes, those phenotypes depend on conserved multi-gene functions. Evolution does not require that all genes working to implement conserved functions have unique orthology between the species. Pathway function may adapt, incorporating the activity of additional genes along a lineage while retaining essential behaviors in development or cellular responses. In comparative studies of cell phenotypes, exclusive dependence on orthology will limit those functions that can be examined across long evolutionary time scales, and obscure adaptation.
In this dissertation, we developed a novel joint matrix factorization that we call the orthogonal shared basis factorization (OSBF) to enable orthology-independent comparative analyses of expression profiles from multiple species. OSBF decomposes expression profiles of functionally similar phenotypes from different species to estimate interpretable matrix factors. Our method uses a similar correlation structure within individual datasets to estimate a biologically meaningful expression subspace shared by all species. We show that the OSBF expression subspace provides a reduced dimensional space shared by different organisms that can be used to summarize the conserved gene expression patterns. The species-specific matrix factors in our method quantify the role of all genes in the phenotype without relying on the homology relationships of genes between species. The strategy of estimating different matrix factors, the interpretation of individual matrix factors, and the procedure to ensure that the reconstructed profile using the estimated factors closely recapitulates the original expression profiles are discussed.
Cross-species gene expression analysis of different tissue types from eight vertebrate species using OSBF identified functionally relevant genes for tissue phenotype, including known gene markers, non-coding genes, pseudogenes, and other less-studied genes. We find that more than 40% of the genes important for tissue phenotype are not one-to-one orthologs shared in all species. About 6-10% of these functionally essential genes do not have orthologs in other species. In addition, the method identifies several examples of gene expression divergence of orthologs and complements differential expression analysis to detect genes vital for the phenotype. The common expression subspace also provides a universal coordinate system for querying gene expression profiles from multiple species. Using public RNA-seq datasets from 17 species, including vertebrates and invertebrates, we show that the OSBF-based cross-species search outperforms conventional methods in identifying functionally similar expression profiles across species.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
DBSSC: density-based searchspace-limited subspace clustering
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Integrative analysis of gene expression and phenotype data
PDF
Characterizing synonymous variants by leveraging gene expression and GWAS datasets
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Signaling cascades: a functional characterization of cone arrestin and a differential gene expression analysis of developing retinal ganglion cells
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
PDF
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
PDF
Identification and characterization of cancer-associated enhancers
PDF
An analysis of conservation of methylation
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Gene-set based analysis using external prior information
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Cell crosstalk in the healthy knee influences remodeling and inflammatory gene expression
PDF
Differential role of two coactivators, CCAR1 and CARM1, for dysregulated beta-catenin activity in colorectal cancer cell growth and gene expression
PDF
Correcting for shared measurement error in complex dosimetry systems
Asset Metadata
Creator
Thomas, Amal
(author)
Core Title
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/18/2022
Defense Date
10/31/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
cross-species gene expression,expression subspace,joint matrix factorization,OAI-PMH Harvest,orthogonal shared basis factorization,OSBF
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew (
committee chair
), Chaisson, Mark (
committee member
), Dean, Matt (
committee member
), Edge, Michael Doc (
committee member
), Yu, Min (
committee member
)
Creator Email
amalthom@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112485551
Unique identifier
UC112485551
Identifier
etd-ThomasAmal-11322.pdf (filename)
Legacy Identifier
etd-ThomasAmal-11322
Document Type
Dissertation
Format
theses (aat)
Rights
Thomas, Amal
Internet Media Type
application/pdf
Type
texts
Source
20221121-usctheses-batch-992
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
cross-species gene expression
expression subspace
joint matrix factorization
orthogonal shared basis factorization
OSBF