SCHEMA EVOLUTION FOR SCIENTIFIC ASSET MANAGEMENT

by

Robert Edward Schuler

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2022

Copyright 2022 Robert Edward Schuler

Acknowledgments

First, I would like to thank my advisor Carl Kesselman for his guidance throughout my Ph.D. studies and contributions to this thesis. He has consistently guided me towards addressing research problems through a balance of scholarly contributions grounded in critical issues impacting scientific progress while providing solutions of practical utility. Likewise, I am grateful for the expertise and advice given by members of my qualification and dissertation committees, including José Luis Ambite, Yolanda Gil, Daniella Meeker, Aiichiro Nakano, and Cyrus Shahabi.

I am also thankful for my many colleagues and collaborators in the Informatics Systems Research Division of the Information Sciences Institute: Alejandro Bugacov, Joshua Chudy, Karl Czajkowski, Mike D'Arcy, Laura Pearlman, Aref Shafaeibejestan, Hongsuda Tangmunarunkit, Serban Voinea, and Cristina Williams. Specifically, Karl for many conversations and contributions to the characterization of data-centric discovery, conceptualization of micro-publication, principles of data-oriented architecture, and early designs of scientific digital asset management. Alejandro, Hongsuda, Karl, Laura, and Mike for contributions to the analysis of and reflections on lessons learned from long-running use cases of scientific asset management. Alejandro, Cris, and Laura for participating in FaceBase use cases. Serban for his participation in the CIRM use case, implementation of the data acquisition agent, and early development of the model-driven user interface. Aref, Hongsuda, and Josh for their continued development of the model-driven user interface. Aref, Hongsuda, Josh, Karl, Laura, Mike, Serban, and many other past students and staff for development of the Deriva data management platform and for providing descriptions of their respective components of Deriva.

I would also like to thank Jitin Singla and Brinda Vallat for their contributions to the PBCC and Cell-Lab use cases, including descriptions and illustration of their scientific application and database schema. Seth Ruffins for his support on the CIRM microscopy use case. Kyle Chard for reviewing and offering critical feedback on many of the early drafts of the original published articles.

I am grateful to Ann Chervenak and Michael Arbib, who served as my initial thesis advisors as I explored early interests at the intersection of data management and computational neuroscience. While ultimately my research focus changed directions, many of the inspirations from that time influenced the work contained here. Certainly, their generous advice and mentoring greatly enriched my research experience. I am also thankful for the tremendous encouragement and stimulating discussions with my early lab mates, Victor Barrès, James Bonaiuto, Brad Gasser, Jinyong Lee, and Arthur Simons. I also thank Prasad Ram along with my past colleagues at Xerox Research for encouraging and inspiring me to pursue a career in research.

Last but not least, I would like to thank my family and in-laws, who patiently and kindly encouraged me during my Ph.D. studies. My parents of course laid the foundation on which my whole education rests, and my sister was the exemplary student in the family.
I have always felt fortunate to have a school teacher as a mother to give me that initial confidence that carried me along throughout my lifelong learning. My wife Chiachun (Carol) has been unfailing and unparalleled in her support, and my son Maximilian has always been the most meaningful and worthwhile respite from my studies. I would not have traded one moment with them.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Motivating Example
1.2 Summary of Thesis Work
1.3 Thesis Outline
1.4 Related Publications
2 Data-Centric Science
2.1 Introduction
2.2 Data-Centric Discovery
2.3 Publication
2.4 Data-Oriented Architecture
2.5 A Data-Oriented Architecture for Scientific Digital Asset Management
2.6 Prototype for Scientific Digital Asset Management
2.7 Related Work
2.8 Conclusions
3 Accelerating Data-Centric Discovery With Scientific Asset Management
3.1 Introduction
3.2 Characteristics and Requirements of Scientific Asset Management
3.2.1 Acquisition and Characterization of Scientific Assets
3.2.2 Models and Evolution for Scientific Asset Management
3.2.3 Storing and Accessing Scientific Assets
3.2.4 Aggregation and Exchange of Assets
3.2.5 Policies for Managing Assets
3.3 Scientific Asset Management Architecture
3.4 Deriva: Platform for Scientific Asset Management
3.4.1 Chaise: Model-Driven Web Applications
3.4.2 ERMrest: Model-Neutral Relational Metadata Store
3.4.3 Hatrac: Version-Tracking Object Store
3.4.4 IObox: Import and Export Agent
3.4.5 BDBag: Big Data Exchange Format
3.4.6 WebAuthN: Authentication Provider Framework
3.5 Case Studies
3.5.1 FaceBase Research Consortium
3.5.2 CIRM Microscopy Core
3.6 Lessons Learned
3.6.1 Spreadsheets Are Poor Data Entry Tools
3.6.2 Detect Errors While You Have the User's Attention
3.6.3 Give Users Incentives to Change Their Ways
3.6.4 Users Want a Bird's Eye View of Their Data
3.6.5 Control Vocabulary But Not Too Much
3.6.6 Human Factors are Critical to Success
3.7 Related Work
3.8 Conclusions
4 User-oriented Framework for Database Evolution
4.1 Introduction
4.2 Motivation
4.3 Schema Evolution Algebra
4.3.1 Characteristics of a SMO
4.3.2 Preliminaries
4.3.3 Primitive SMOs
4.3.4 Composite SMOs
4.4 Programming Model
4.4.1 Key Abstractions
4.4.2 Program Flow
4.5 Planning and Optimization
4.5.1 Expression Trees and Rewrite Rules
4.5.2 Logical Planning and Optimization
4.5.3 Subexpression Consolidation
4.5.4 Physical Planning and Execution
4.6 System Implementation
4.6.1 Data Sources
4.6.2 Data Source Federation
4.6.3 Integrating Third-Party Libraries
4.6.4 Interactive Environment
4.7 Evaluation Methodology
4.7.1 Rubric for Measuring the Cost of Schema Evolution Expressions
4.7.2 Comparative Costs of Schema Evolution Operations
4.8 Case Studies
4.8.1 Data Commons Consortium
4.8.2 Genomic Enhancers
4.8.3 Microscopy Core
4.8.4 Summary of Results
4.9 Performance Evaluation
4.9.1 Real-World Experiments
4.9.2 Schema Evolution Benchmark Experiments
4.10 Related Work
4.11 Conclusions
5 Co-Evolution of Data-Centric Ecosystems
5.1 Introduction
5.2 Related Work
5.3 Design Criteria
5.4 Architecture Pattern for Co-Evolving Data-Centric Ecosystems
5.4.1 Role of Models, Mappings, and Boundary Objects
5.4.2 Database-Client Interaction
5.5 Schema Evolution With Model Management
5.5.1 Model Mappings in the Deriva Platform
5.5.2 Motivating Example
5.5.3 Model Management Operators
5.5.4 Extending the Schema Modification Operators
5.5.5 Implementation
5.6 Evaluation
5.6.1 Schema Mappings in Scientific Databases
5.6.2 Comparison Between CHiSEL and EF Core Migrate
5.7 Case Studies
5.7.1 Bioscience Data Hub
5.7.2 Boundary Objects for Bioinformatics
5.7.3 Epidemic-Type Aftershock Sequence Models
5.7.4 Pancreatic β-Cell Consortium
5.8 Conclusions
6 Database Evolution, by Scientists, for Scientists: A Case Study
6.1 Introduction
6.2 Related Work
6.3 Schema Evolution Methodology
6.3.1 Policies
6.3.2 Processes
6.3.3 Practices
6.4 Case Study: Evolution of PBCC
6.4.1 PBCC Database
6.4.2 Evolving the PBCC Database
6.4.3 Characteristics of the Case Study
6.5 Analysis of the Evolution
6.5.1 Recorded Actions
6.5.2 Observed Process
6.5.3 Quantifying Actions and Processes
6.6 Conclusions
7 Conclusion and Future Work
7.1 Summary of Contributions
7.2 Future Work
Bibliography

List of Tables

3.1 Examples of annotations used to describe models.
3.2 Examples of heuristics used to interpret models.
3.3 Summary of representative Deriva deployments.
4.1 Database table of experiments.
4.2 Summary of primitive SMOs.
4.3 Summary of composite SMOs.
4.4 Summary of scoring rubric for measuring cost of schema evolution expressions.
4.5 Cost comparison matrix for CHiSEL and SQL expressions for each SMO.
4.6 Comparison of operation cost required per listing.
4.7 Comparison of operation cost required per listing.
4.8 Comparison of operation cost required per listing.
4.9 SMO operator usage per use case.
4.10 Test environment for real-world-based experiments.
4.11 Parameters of the benchmark data generation utility.
4.12 Parameter settings for dataset generation.
4.13 Test environment for benchmark-based experiments.
5.1 Summary of requirements for conceptual model evolution instigated by an evolution of the database schema.
5.2 Summary of model management operators for coupled evolution of model mappings.
5.3 Revised definitions of primitive SMOs to address changes to Integrity Constraints (ICs) and Model Mappings (MMs).
5.4 Summary of integrity constraint modification and model management primitives for SMOs.
5.5 Formal Definitions for DDL Operations.
5.6 Summary of deployments evaluated.
5.7 Evaluation of Schema (S), Data (D), and Model (M) evolution between CHiSEL and EF Core migrate.
6.1 Types of actions recorded in the executable notebook.
6.2 Relationship between actions and process.

List of Figures

2.1 Primitives of Data-Oriented Architecture.
2.2 Reference architecture for Digital Asset Management based on Data-Oriented Architecture principles depicting a deployment of its primary components (white) and complementary components (gray).
2.3 Screenshot of the Chaise interface to the asset management system deployed for FaceBase.
3.1 Scientific Asset Management Architecture.
3.2 Chaise layered architecture.
3.3 Faceted search application.
3.4 Record details application.
3.5 ERMrest query processing example, showing the query URI (top), model definition (middle right), conceptual processing stages (middle left), and generated SQL statement (bottom).
3.6 WebAuthN layered architecture.
3.7 FaceBase ERM. Metadata are organized broadly as investigation, biosample, bioassay, and asset entities with relationships indicated by arrows.
3.8 FaceBase Data Curation Pipeline. Shaded boxes indicate Spoke responsibilities versus clear boxes for the Hub's activities.
3.9 The CIRM microscopy and data management workflow, showing before (A) and after (B) integration with Deriva; with manual tasks (trapezoids), automated tasks (rectangles) and major out-of-band activities (hexagons). The dash-outlined boxes call out sequences of steps for both manual and automated tasks respective to each workflow.
3.10 Screenshot of an early Deriva user interface; with search (top), browse collections (left), results (middle), and details (right).
4.1 Flow of operations in the framework.
4.2 Overview of the planning and execution stages.
4.3 High-level architecture.
4.4 Screenshot of CHiSEL usage in a Jupyter Notebook.
4.5 Comparison of the required user effort (cost) of CHiSEL vs SQL schema evolution operations per use case.
4.6 Results by experiment and atomicity. Execution time of optimized expressions (Composite) shown in light gray and non-optimized expression execution (Atomic) shown in dark gray with error bars in red.
4.7 Reify N concepts.
4.8 Reify N subconcepts.
4.9 Reify concept and N subconcepts.
4.10 Reify N subconcepts and merge.
4.11 Create N domains from N columns.
4.12 Create N vocabularies from N columns.
4.13 Create N relations from nested values.
4.14 Reify N subconcepts and create domain from columns.
5.1 A typical data-centric ecosystem based on Deriva including (clockwise from top-right) web clients, command-line clients, export bundles, and visualization.
5.2 Example ER model (left) with schema annotations (middle) specifying model mappings for a model-adaptive interactive application (top-right) and a model-neutral command-line client (bottom-right).
5.3 Syntactic structure of expressions used for mappings.
5.4 Anatomy of a boundary object.
5.5 Overview of the three primary patterns of database-client interaction.
5.6 Conceptual steps enacted by model-adaptive database clients.
5.7 Conceptual steps enacted by model-neutral database clients.
5.8 Decoupling and mediation of model-bound clients from the data service via a model-neutral service layer for producing and consuming boundary objects.
5.9 Example of a table schema (left-hand side) mapped to multiple views or conceptual entities (right-hand side). Arrows indicate schema mappings or correspondences between schema elements.
5.10 Simple example of a relational schema for experiments, mappings from schema to conceptual model (dashed lines with arrowhead), and the conceptual model of experiments. Three separate scripts (1, 2, 3) show operations performed to co-evolve the schema and its mappings to the conceptual model.
5.11 Overview of mappings used by deployments.
5.12 Types of mapping fragments in use.
5.13 Non-trivial mappings.
5.14 Usage of arbitrary joins in mappings.
5.15 Usage of multiple "contexts" in mappings.
5.16 Errors found in model mappings.
5.17 Major schema evolution events in the FaceBase Data Hub: initial schema (S0), revised for greater experimental details (S1), and evolved for better reproducibility (S2).
5.18 Illustration of a bioinformatics pipeline supported by "bag" (boundary object) collections in the decoupled interaction model.
6.1 High level view of the initial database schema showing tables (boxes) and relationships (arrows).
6.2 Evolution of the PBCC database schema. A subset of the original schema is shown on the left and the revised schema after evolution on the right.
6.3 Observed macro process. An overall loop can be seen beginning with Conceptualization and Design, proceeding through several Schema Evolution phases (A–E), pausing for Review, then repeating until complete.
6.4 Observed iterative process within evolutions. An outer loop began with Review and Plan Next Changes; based on the type of change needed, an inner loop (Schema Modification Loop or Data Transformation Loop) was executed, until the current evolution was Done.
6.5 Totals of actions per category.
6.6 Breakdown of database (DQL, DML, DDL, SMO) actions.
6.7 Actions taken per phase of process.
6.8 Ratio of actions taken per phase of process.
6.9 Lines of Code (LoC) per phase of process.

Abstract

In order to drive discovery from data, scientists rely on accurate, up-to-date descriptions of data. Management of scientific data is often split between storage systems for large, bulk collections of data and specialized databases for detailed descriptive information about data, known as metadata, organized in a database schema. Without such schema and descriptive metadata, scientific data would be lost.
Yet scientific data is among the most challenging to manage due to schema evolution brought on by changing experiments, new technology and instruments, changing methods, emerging standards, and different conceptualizations between investigators. In this thesis, we study the problem of evolving schemas for scientific data. First, we outline a "data-centric" approach to scientific discovery where data rather than processes are the central artifacts of scientific discovery and actors coordinate their activities through interactions on the data. Based on these principles, we then define a general framework called "scientific asset management" in which we may structure the organization and interactions with data and metadata. Through real-world deployments, we evaluated the requirements for schema evolution for applications of scientific asset management, and we characterized the semantics of the transformations on the database schema.

Next, we introduce a user-oriented framework for schema evolution based on an algebra of schema modification operators for simplifying the tasks needed to evolve schemas for scientific data. We present an implementation of our framework within an embedded domain specific language usable in an executable notebook environment familiar to scientists in order to reduce the effort for schema evolution. We also describe the algorithms for efficient planning and execution of the algebraic expressions. We then extend the framework with model management operators to facilitate coupled evolution of model mappings used by database-dependent applications.

We present an analysis and experimental evaluation on the effectiveness and efficiency of the schema evolution framework to reduce the time and effort needed to evolve schemas for scientific data. We also describe use cases of the framework that illustrate its utility in real-world applications. We show that these contributions can enable scientists to evolve database schemas with less effort and greater efficiency.

Chapter 1
Introduction

Curation covers a wide range of activities, starting with finding the right data structures to map into various stores. It includes the schema and the necessary metadata for longevity and for integration across instruments, experiments, and laboratories. Without such explicit schema and metadata, the interpretation is only implicit and depends strongly on the particular programs used to analyze it. Ultimately, such uncurated data is guaranteed to be lost. – Gordon Bell [13]

Over the past two decades, a fourth paradigm of scientific inquiry [73] driven by the capture, analysis and sharing of large data sets has emerged and spread throughout most domains of science. From genomics to astronomy, new generations of instruments are producing more data than ever, for example doubling the entire volume of data ever collected in astronomy each year [6], leading to new discoveries being made from extant data collections [99]. While these new data have opened up the opportunity to make new discoveries, the ability to produce data has vastly outpaced the ability of scientists to manage it all. The handling of data increasingly drains time from researchers that could be spent on core investigative activities, thus slowing the discovery process. A new approach to handling research data throughout the research life cycle is needed in order to truly exploit the potential of the data being collected across scientific disciplines today.
Scientific Asset Management is an approach to managing valuable scientific data (i.e., "assets") throughout the discovery process which follows data from initial acquisition and contextualization at the instrument or sensor, ingest and metadata extraction, annotation and metadata entry, organization and subsetting of assets into datasets, analysis via computational workflows that produce derived data, tracking of provenance, collaboration and sharing, and finally packaging and publication [125]. In order for data to be understood and interpreted by others, they must have a schema and be accompanied by descriptive metadata [73]. While researchers may begin with an initial conceptualization of their domain realized in a schema for managing metadata about assets, scientists will evolve the model numerous times over the course of their research [78], similar to enterprise case studies that show that a database schema is often outdated in a matter of months [141].

Although recent research has explored schema evolution, such as workbench [40] or application model mapping [148] approaches generally aimed at enterprise information systems, most available tools offer just data manipulation and data definition languages at a low level and basic graphical user interfaces based on these languages. Schema evolution remains a difficult, ad hoc, and labor-intensive activity. Scientists need to decide on the transformations for their models, what to normalize, how to map values to controlled vocabularies, what attributes should be constrained to domain tables, and what strategy to use to avoid data loss. These decisions slow the process of evolving databases and are potentially error prone. They also generally require technical staff who do not necessarily know the domain model well and introduce another level of coordination and communication which is slow and expensive. We illustrate the problem with a critical application we have worked with for kidney stem cell research.

1.1 Motivating Example

A microscopy laboratory, faced with the challenges of organizing and tracking valuable scientific resources, namely glass slides and the digitally scanned "whole slide images" of them, implements a database application for managing these valuable resources or "assets" produced through their research. Given the steep upfront cost of database schema design, they may begin with the simplest conceptualization of the domain: a single table of image metadata with references to the file objects on disk. As their conceptualization of the domain grows, they realize they can improve the organization and reduce repetitive data entry by restructuring the original image table into four related tables to represent scans (i.e., images), slides, specimens, and experiments.

Later, they notice that searching for "fluorescent in situ hybridization" experiments returns fewer results than they believe should be in the database and discover that the term has been spelled five different ways, including variations on common acronyms for it (e.g., "FISH"), because the methods column allowed free text entry. To remedy this, they create a table of methods terms (commonly called a "domain table" in database design practices), populate it with the distinct domain of terms found in the methods column of the experiments table, de-duplicate the terms, and apply key and foreign key constraints to ensure the uniqueness and referential integrity of the affected columns.
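Carried out by hand at the level of conventional data definition and manipulation statements, even this single refactoring requires a series of coordinated steps. The following minimal sketch (written against SQLite, with illustrative table and column names rather than the laboratory's actual schema) gives a sense of the work involved:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Starting point: a free-text "method" column that has accumulated variant spellings.
cur.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, method TEXT)")
cur.executemany("INSERT INTO experiment (method) VALUES (?)",
                [("FISH",), ("fluorescent in situ hybridization",), ("F.I.S.H.",)])

# 1. Create the domain table and populate it with the distinct terms found so far.
cur.execute("CREATE TABLE method_term (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("INSERT INTO method_term (name) SELECT DISTINCT method FROM experiment")

# 2. De-duplicate variant spellings by hand: the tedious, error-prone step.
canonical = {"fluorescent in situ hybridization": "FISH", "F.I.S.H.": "FISH"}
for variant, preferred in canonical.items():
    cur.execute("UPDATE experiment SET method = ? WHERE method = ?", (preferred, variant))
    cur.execute("DELETE FROM method_term WHERE name = ?", (variant,))

# 3. Re-point the experiments table at the domain table with a foreign key
#    (SQLite cannot add a foreign key in place, so the table must be rebuilt).
cur.execute("""CREATE TABLE experiment_new (
                   id INTEGER PRIMARY KEY,
                   method_id INTEGER REFERENCES method_term(id))""")
cur.execute("""INSERT INTO experiment_new (id, method_id)
               SELECT e.id, t.id FROM experiment e JOIN method_term t ON e.method = t.name""")
cur.execute("DROP TABLE experiment")
cur.execute("ALTER TABLE experiment_new RENAME TO experiment")
conn.commit()
```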
As the project progresses, new requirements stipulate that they submit data and metadata to community repositories, such as the database of Genotypes and Phenotypes (dbGaP) and the Gene Expression Omnibus (GEO), and must map these locally coded terms to standard ontologies, such as those maintained by the National Center for Biomedical Ontology (NCBO). They further transform the domain tables of local terms into tables that reflect the structure of controlled vocabularies, both improving the expressiveness of their database and streamlining the exchange of information with the external repositories. Along the way, the applications that their researchers have been using for data entry, display, browse, and search were also updated to reflect changes in the database schema. While these changes were necessary, significant effort was spent on schema changes and data migration.

1.2 Summary of Thesis Work

In this thesis, we address the problem of schema evolution for scientific asset management by presenting an algebraic approach to defining a database evolution language (DEL) tailored to the requirements and skills of scientists, which raises the level of abstraction of schema evolution operations on scientific database schemas to simplify their tasks.

Activities to evolve a scientific database are to be governed by algebraic schema modification operators (SMOs) that consume and produce relations. This yields a declarative representation for the evolution of scientific databases and eases the expression of transformations necessary to evolve the database from the current state to the next. To support such evolution, our algebra is delivered in a programming language (Python or R) and development environment (Jupyter Notebooks) that are familiar to many scientists in order to reduce the learning curve for adoption and efficient usage. We further extend the schema modification operators with supplementary model management operators (MMOs) that co-evolve database-dependent application models, i.e., the mappings from the database schema to the application concepts used for generating database entry forms, search interfaces, user interface views of data, and others. We show that a high-level algebra to facilitate schema evolution, delivered in a familiar programming environment, can enable scientists to evolve database schemas more effectively and efficiently.
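To give a flavor of this style of operator (the function below is a purely illustrative, self-contained sketch in plain Python; it is not the framework's actual interface, which is defined in Chapter 4), consider an operation that consumes an experiments relation with a free-text column and produces both a domain relation and a rewritten source relation in a single declarative step:

```python
# Purely illustrative sketch of a high-level operator that consumes a relation
# (here, a list of rows) and produces new relations; this is NOT the actual
# interface of the framework described in this thesis.
def to_domain(rows, column):
    """Extract a domain of distinct terms from `column` and rewrite the source
    relation so that the column references the new domain relation by key."""
    rows = list(rows)
    terms = sorted({row[column] for row in rows})
    key = {term: i for i, term in enumerate(terms, start=1)}
    domain = [{"id": i, "name": term} for term, i in key.items()]
    rewritten = []
    for row in rows:
        new_row = dict(row)
        new_row[column] = key[row[column]]
        rewritten.append(new_row)
    return domain, rewritten

experiments = [
    {"id": 1, "method": "FISH"},
    {"id": 2, "method": "immunohistochemistry"},
    {"id": 3, "method": "FISH"},
]
method_term, experiment = to_domain(experiments, "method")
```

Compared with the hand-written statements in the sketch accompanying the motivating example above, the intent (extract a domain, rewrite the references) is stated once and the mechanical details are left to the operator.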
Beyond satisfying the requirements of schema evolution for scientic asset management, we believe these contributions may further benet scientists including increased data and metadata capture, improvements in reproducibility of scientic work ows, more opportu- nities for data reuse by other investigators, higher likelihood of experimental validation, incremental or \agile" development of scientic databases, and perhaps most importantly shortened knowledge cycles. 1.3 Thesis Outline The primary contributions of this dissertation, the user-centered framework for schema evolu- tion and the extensions of the framework for coupled evolution of data-centric ecosystems, are presented in Chapters 4 and 5 respectively, while the former Chapters 2 and 3 on data-centric science and scientic asset management provide the context, motivation, and foundation on which the algebra and framework rest. In Chapter 2, we motivate the need for specialized systems to support data-centric science and address key problems that are nagging science today. We discuss the transformation of science toward a mode of data-centric discovery. We then discuss the social and philosophi- cal considerations for handling the massive accumulation of data in science. We present an 5 overarching data-oriented architecture on which to structure systems for managing the com- plexity and volume of scientic data. We outline the preliminaries that dene the context in which data management and hence schema evolution of information systems for science exist. Finally, we present a prototype system that embodies the data-oriented architecture approach in a form suitable for digital asset management for science. In Chapter 3, we build on the prototype outlined in Chapter 2 to a fully rened system and provide detailed case studies of its utilization in real-world science applications, and we motivate the need for an adaptive, exible approach to database evolution. We introduce the concept of scientic asset management as a targeted, specialized form of digital asset management tailored to the needs of the scientic discovery life cycle. We discuss broad lessons learned working with real-world scientic use cases that have applied the scientic asset management approach. We then reinforce the need for schema evolution of complex information systems with databases and database-dependent applications impacted by the evolution of the schema. In Chapter 4, we introduce a schema evolution framework that addresses the needs for scientists to evolve complex information systems in support of their research. We base the framework on a novel schema evolution algebra that is amenable to composing complex operations to provide simpler yet ecient expressions for evolving schema in a way that satises observed requirements. We describe the algebra and the algorithms for planning and optimizing expressions of the algebra. Because schema evolution languages take dierent forms, we present a methodology and rubric for comparing between languages, and we use the methodology to evaluate the comparative cost in terms of user eort between using our framework versus the various dialects of conventional structured query language for dening, manipulating, and querying data in support of schema evolution tasks. We then dene a schema evolution benchmark and tools for generating synthetic benchmark data and driving tests. Our experiments show that the algebraic approach can be optimized for improved eciency over a hypothetical monolithic approach. 
In Chapter 5, we build on the work described in Chapter 4, first by presenting architecture approaches for coupled evolution of data-centric ecosystems. We then extend the schema evolution framework presented in Chapter 4 with a definition and implementation of model management operators to evolve model mappings for model-driven applications so that they may seamlessly adapt to changes in the database schema. Finally, we present an analysis of schema mapping usage in scientific databases and a qualitative comparison of our approach versus a leading enterprise database migration utility, followed by case studies that have used the proposed architecture approaches.

In Chapter 6, we outline a concise methodology for scientists to use the overall framework effectively. The methodology consists of recommended policies, processes, and best practices for scientists to use our proposed approach to evolving scientific databases. We then present a detailed case study of the framework in use by a scientist in support of a real-world scientific application.

In Chapter 7, we summarize our contributions, state our conclusions, and outline potential future directions for extending the work presented in this thesis.

1.4 Related Publications

Parts of this thesis have been published in data management, distributed computing, bioinformatics, and eScience conferences. The list includes:

Related to Chapter 2

– Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "An Asset Management Approach to Continuous Integration of Heterogeneous Biomedical Data". In: Data Integration in the Life Sciences. Ed. by Helena Galhardas and Erhard Rahm. Cham: Springer International Publishing, 2014, pp. 1–15. isbn: 978-3-319-08590-6. doi: 10.1007/978-3-319-08590-6_1. url: https://doi.org/10.1007/978-3-319-08590-6_1

– Robert Schuler, Carl Kesselman, and Karl Czajkowski. "Data Centric Discovery with a Data-Oriented Architecture". In: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models. SCREAM '15. Portland, Oregon, USA: Association for Computing Machinery, 2015, pp. 37–44. isbn: 9781450335669. doi: 10.1145/2753524.2753532. url: https://doi.org/10.1145/2753524.2753532

Related to Chapter 3

– Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Digital asset management for heterogeneous biomedical data in an era of data-intensive science". In: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on. Nov. 2014, pp. 588–592. doi: 10.1109/BIBM.2014.6999226

– Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Accelerating Data-Driven Discovery With Scientific Asset Management". In: 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA, 2016, pp. 1–10. doi: 10.1109/eScience.2016.7870883

– Alejandro Bugacov et al. "Experiences with DERIVA: An Asset Management Platform for Accelerating eScience". In: 2017 IEEE 13th International Conference on e-Science (e-Science). 2017, pp. 79–88. isbn: 978-1-5386-2686-3. doi: 10.1109/eScience.2017.20

Related to Chapter 4

– Robert E. Schuler and Carl Kesselman. "Towards an Efficient and Effective Framework for the Evolution of Scientific Databases". In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management. SSDBM '18. Bozen-Bolzano, Italy: Association for Computing Machinery, 2018. isbn: 9781450365055. doi: 10.1145/3221269.3221300. url: https://doi.org/10.1145/3221269.3221300
– Robert E. Schuler and Carl Kesselman. "A High-Level User-Oriented Framework for Database Evolution". In: Proceedings of the 31st International Conference on Scientific and Statistical Database Management. SSDBM '19. Santa Cruz, CA, USA: Association for Computing Machinery, 2019, pp. 157–168. isbn: 9781450362160. doi: 10.1145/3335783.3335787. url: https://doi.org/10.1145/3335783.3335787

– Robert Schuler and Carl Kesselman. "CHiSEL: A User-Oriented Framework for Simplifying Database Evolution". In: Distrib. Parallel Databases 39.2 (June 2021), pp. 483–543. issn: 0926-8782. doi: 10.1007/s10619-020-07314-x. url: https://doi.org/10.1007/s10619-020-07314-x

Related to Chapter 5

– Robert Schuler et al. "Towards Co-Evolution of Data-Centric Ecosystems". In: 32nd International Conference on Scientific and Statistical Database Management. SSDBM 2020. Vienna, Austria: Association for Computing Machinery, 2020. isbn: 9781450388146. doi: 10.1145/3400903.3400908. url: https://doi.org/10.1145/3400903.3400908

– Robert E. Schuler and Carl Kesselman. "Managing Database-Application Co-Evolution in a Scientific Data Ecosystem". In: 2022 IEEE 18th International Conference on e-Science (e-Science). Salt Lake City, Utah, USA: IEEE, Oct. 2022

Related to Chapter 6

– Robert Schuler et al. "Database Evolution, by Scientists, for Scientists: A Case Study". In preparation.

Chapter 2
Data-Centric Science

The content of this chapter is based on the paper: Robert Schuler, Carl Kesselman, and Karl Czajkowski. "Data Centric Discovery with a Data-Oriented Architecture". In: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models. SCREAM '15. Portland, Oregon, USA: Association for Computing Machinery, 2015, pp. 37–44. isbn: 9781450335669. doi: 10.1145/2753524.2753532. url: https://doi.org/10.1145/2753524.2753532.

In this chapter, we outline the overall concepts and background of data-centric science and then ground the discussion in a reference architecture and prototype. We begin with a discussion of the changing landscape in science and its transformation toward a mode of data-centric discovery. We then continue these observations with a discussion of how the scholarly communications ecosystem must also transform from traditional publication to what we will term micro-publication centered on data. With that in mind, we then describe a data-oriented architecture style for designing and building systems to address the needs of data-centric discovery. We then acquaint the reader with the widespread enterprise information systems concept of digital asset management, and we present a blueprint for designing a digital asset management solution for data-centric discovery by applying a data-oriented architecture style and following the principles of micro-publication. Finally, we review related work and present concluding thoughts for this chapter.

2.1 Introduction

Scientific discovery is an increasingly data-intensive pursuit. New knowledge is not only the result of empirical experiments but also of theoretical models and complex data analysis. The growing dependence of scientific discovery on the usage of simulations and informatics has been described as the 4th paradigm of knowledge discovery [74], and it is driven by the ever growing mountain of research data accumulated from sensors and instruments as well as from simulation and analytic output.
While some data are generated with specific intent in mind, they may also come from repurposing of extant data, or may be generated with no specific intent other than to feed retrospective analysis and knowledge extraction. This leads us to a new era of scientific discovery centered on data, a methodology of developing and testing hypotheses from large, rich, and complex data where the initial observation may precede the hypothesis [55]. In this new era, not only are data the products of experiments, they may also be the object on which experiments are conducted. This does not, however, obviate the need for traditional hypothesis-driven research, but rather complements it with new insights into experimental observations and new tools with which to generate results.

Current systems and tools to support data-intensive research methods, such as data mining platforms [159] and pipeline systems [48, 61, 167], tend to focus on manipulating and transforming data. They address the composition of reusable tools and the orchestration of repeatable processes. While analysis is clearly important, we assert that data are the currency around which discovery is made, and we should take the perspective that discovery is largely data-centric, not process-centric. Yet currently no holistic solution exists that focuses on the role of data in data-driven discovery. This represents a surprising yet significant gap in the data-driven discovery ecosystem. The closest approximations we have found are asset management systems, document-workflow systems, and content-management systems, yet these are ill suited for use as tools to drive discovery due to their tight coupling to specific business processes.

2.2 Data-Centric Discovery

The process of discovery rarely follows a straight line or an established path. The formation of a repeatable process is often like a random walk, complete with blind alleys, random starts and stops, and periodic retracing of one's steps. Indeed, identifying the right question is as time consuming and important as determining the answer to the question. Unlike more formalized approaches such as repeatable business processes, computational pipelines, or randomized controlled trials that can be characterized by a codified set of rules or policies, it is unrealistic to assume that we can write down a specific set of procedures to be followed in the pursuit of extraction of knowledge from data.

Furthermore, we observe that discovery is often not a solitary process, but rather the result of collaborative efforts during which individuals may exchange data, analyses, or current hypotheses. Collaborations can take place asynchronously and at multiple levels: between collaborators in a laboratory; with colleagues distributed across institutions; within coordinated consortia; or among broad communities. These collaborations may all be occurring simultaneously, each proceeding at their own rate, with local tasks and processes evolving to address the nature of the collaboration. It is via the exchange of data that collaborators share knowledge and advance towards discovery.

Without the structure of ordered sequences of tasks to perform, we seek to identify common characteristics of discovery across the many domains of scientific inquiry so that we may build a common foundation on which data-centric discovery can take place. To gain insight into this question, it is useful to examine how discovery traditionally has taken place, and in particular, the role of research collections and libraries in that process.
Libraries represent stores of organized data, in which a researcher may enter, wander about, use catalogs to identify and retrieve books and papers, and add them to the growing stack of material in their private research collection (i.e., the piles on their desktop), organized in stacks that make sense to them and augmented with personal notes and papers from other libraries. Occasionally, material might be shared with collaborators, stacks of papers reshuffled, or thrown out. Eventually one reaches a point where information may need to be shared in a more formal setting, and a paper written, at which point it may be included in the main research library.

In light of these observations, we consider the question as to how we may accelerate the process of discovery in an environment in which the collection consists of distributed, large-scale data. To enable discovery, the focus should not be on streamlining the creation and execution of processes, as these are highly idiosyncratic and transient. Rather, building on our example of research collections, we propose a data-centric approach to discovery, which focuses attention on the points at which data are organized and exchanged while allowing the tasks and workflows to evolve at their own rates independently from the data.

Data Centrism

Business process management (BPM) and workflow management focus on understanding and optimizing the procedures and tasks of a group of actors to achieve an overall task. Unlike the process-centric approach, data centrism holds that the collection of data is the foundation around which processes are managed. A similar observation has been put forward in enterprise computing in a sub-field called Case Management where, unlike conventional BPM, the emphasis is on understanding a core collection of documents and identifying their life-cycles and the interactions of actors with the collection of data [18, 95]. We argue that methods of discovery that are now more data centric could benefit from a similar focus on the research data artifacts rather than the procedural activities as a foundation for organizing the overall research endeavour.

Even with hypothesis-driven research, where rigorous, repeatable protocols are expected, the protocol (i.e., the process-centric description for how to repeat the experiment) is itself discovered throughout the course of the investigation, until it can be codified and repeated. One rarely starts with an a priori process for how to get from hypothesis to (positive!) result in a straight line without many trials and errors. In fact, such a result would be highly suspicious, as rarely is a successful result found on the first try. While we may imagine that discovery is bound by rigorous processes (i.e., protocols), the reality is far messier and less deterministic at first. It may be only after a discovery is made that a process is subsequently discovered to repeat the finding. Clearly, a result is only of significance if it is repeatable, and the (process-centric) protocol is essential for describing the method for repeating and validating a result, but the initial discovery is driven by a series of trials and errors and is far more data-centric than the resulting process-centric description (i.e., protocol) for how to reproduce the results. As such, it is the data (e.g., lab notebooks, accompanying information, etc.) that drive the initial discovery.

Collaboration Over Data

As we have noted above, science is also becoming an even more collaborative pursuit.
It is not coincidental both that science has increasingly centered on data and that collaboration is more important than ever. New discoveries are enabled by complex instruments, computational analyses, integrative investigations spanning sub-disciplines, and studies that depend on subjects spanning geographic regions. In some cases, national-scale infrastructure such as the Large Hadron Collider (LHC), James Webb Space Telescope (JWST), or Earth System Grid (ESG) [165] necessitates wide-ranging collaborations and generates vast amounts of data requiring significant computational and storage infrastructure, as well as diverse expertise to facilitate investigations. Even at a smaller scale, novel instruments and computational approaches are bringing together multi-disciplinary investigations, such as studying in-vivo synaptic gain and loss in zebrafish to understand the mechanisms of memory formation [45]. In other cases, multi-site collaboration is forced by the need to recruit large patient cohorts not possible in any one geographic location, such as the study of rare neurological syndromes [70]. And in other cases, the integrative nature of research drives new collaborations, such as between craniofacial biologists and neuroscientists that are studying the impact of craniofacial malformation on neurodevelopmental problems [168]. In each of these scenarios, data (either the over-abundance of it or the need to acquire more of it) necessitates the formation of collaborative research teams.

The needs of collaborating researchers have been studied from an ecological perspective, leading to a characterization of the role of information described as "boundary objects" that serve as a means of encapsulating concepts needed to facilitate communication between researchers [139]. In the traditional sense, a scientific publication is one form of boundary object, as are a laboratory notebook or a data repository. Each of these boundary objects enables collaborating researchers to communicate ideas and results. Ultimately, data in its many forms are what facilitate discourse among collaborators.

Interactions With Data

Data-centric discovery focuses on the creation, organization, discovery, and access to data. It is the data that are long lived, and not the processes that produced the data, though it may be crucial to capture the provenance of data, including details about the processes which created them. While it is the case that creating new knowledge requires manipulation and analysis of input data, it is ultimately the exchange of the resulting data that is critical. Data-centric discovery focuses on this exchange of results. We can characterize the basic interactions with data as spanning the following classes of activities:

Model: capture the salient features of the problem domain, current knowledge state, etc.;

Map: take available data that has been collected and map it into the current model;

Manipulate: perform analysis and computation on the data; conduct data experiments;

Muse: think about what you have observed and formulate new conjectures or conclusions; and
2.3 Publication The centrality of data to the future of science challenges current models of publication and the traditional scholarly communications ecosystem devised around written manuscripts with mere illustrations of results. In this changing scientic landscape, these traditional publica- tions serve only as advertisements for the actual results that are embodied in collections of data that were acquired, produced, or analyzed to yield the distilled statements presented in the written work. Publication has traditionally been the nal stage of discovery, whether publishing manuscripts to share results and analysis in a compact form with the research community or fullling \data sharing" requirements by publishing data to institutional or community repositories. In the emerging landscape of data-centric discovery, publication is no longer a single event at the nal stage, but rather there are many incremental publication events throughout the discovery process. This is especially pronounced for multi-disciplinary collaborative research groups. For instance, a group of biologists may be conducting imaging experiments on a model organ- ism, sporadically producing new images, creating images and recording simple metadata characteristics in at les or spreadsheets. Computational scientists on the team then re- trieve the experimental data and run through a variety of steps from ltering to visualizing to generating analytic measures. The computational scientists then store new data derived from the raw experimental data and add more metadata. Further, instrument designers may then evaluate the results from both sides and through discussion with experimentalists and computational scientists determine changes to the instrument to better support the research goals. 16 This process may be repeated many times with many more experimental and derived data sets produced and \published" within the research collaboration. When the research produces sucient positive results and a certain level of maturity (repeatability) in the methods are achieved, the group will invariably publish papers to share research results with the broader research community. They may also share data with community repositories, as is becoming more common with requirements imposed by funding agencies and journals. The conference or journal paper may be assigned a permanent digital object identier, the data may be assigned a stable accession number by the repository, and both the data and manuscript will generally be archived online for long term preservation of the work. We may refer to the many data and metadata production events in the example above as micro-publication (Publication) [130] events in contrast to the nal stage of the tradi- tional publication (Publication) event. But what is the dierence between the Publication and Publication events? In many ways, it is merely a matter of scope and intent. For a Publication event, the scope is generally public for all, while the intent can vary from presen- tation of early research results to mature statements about general knowledge produced by the research, whereas thePublication scope may be restricted as narrowly as an individual researcher (e.g., like a personal lab notebook) or as broadly as an entire research team of a multi-site collaboration. 
The intent of the μPublication events is to capture, record, and communicate (within scope) the evolving research methods, counterexamples, negative results, and progress toward reinforcing the hypotheses throughout the discovery process.

μPublication Events

A μPublication event is associated with a specific piece of data. It may be initiated explicitly by an end user or triggered automatically by a computational agent. We can think of a μPublication event as occurring in three stages (a minimal sketch of these stages appears later in this section):

Harvest: the data are examined in order to harvest any directly accessible metadata associated with the data. This could include information coded in the file name or directory path, or, in the case of structured file formats (e.g., Hierarchical Data Format), attributes stored in the file header. Basic information such as the size of the object, its creation time, etc. may also be captured.

Register: the data are assigned a unique identifier, with a version number if appropriate. The storage system housing the data may be notified that edits are no longer allowed on them. The information about the data is inserted into a registry so that it may be found for subsequent use.

Annotate: additional attributes may be associated with the data by subsequent action. For example, the data producer may annotate the data with additional context known only to them, or quality assurance measures might be computed or assigned by other actors. Annotation can continue to amend the metadata about a datum indefinitely, as different actors introduce additional derived information.

μPublication of Digital Assets

A digital asset is a unit of data that will be managed and manipulated as part of the discovery process through μPublication events and that has evidential or other value for scientific inquiry. In the context of μPublication, not all data are digital assets but only those that satisfy these criteria. Technically, a digital asset is an arbitrary byte-stream, like a conventional digital file abstraction but without the notion of a mutable container. New digital assets are introduced into the overall system in the form of a μPublication event. One consequence of the μPublication event is that collaborating consumers may subsequently access the asset. The level of granularity at which we qualify data as an asset may be domain and use-case dependent, and a single set of data values may be modeled as several digital assets. For example, a confocal microscopy image may be represented as a single asset that is the 3D volume, or as a collection of assets, each of which is a 2D image.

We consider the asset to be immutable and permanently named; that is, once μPublished, any subsequent consumption of the asset will always see the same data value that was originally published. This is sometimes referred to as "data fixity". The data may become unavailable, but there will never be ambiguity as to what it means when an asset is referenced by name. If edits or modifications are needed, then a revised asset must be μPublished again. This immutability is consistent with existing object storage systems such as Amazon S3 or Dropbox if a version-qualified name is used to reference the asset. The only additional constraint we apply is that a particular version-qualified name should never be reused for different content after any sequence of μPublication and deletion events. This may require additional coordination if the storage system does not issue permanently unique version identifiers.
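To make the preceding notions concrete (the three μPublication stages and version-qualified, immutable naming), the following is a minimal sketch of a μPublication event expressed in Python. Everything here is an illustrative assumption rather than a component described elsewhere in this thesis: the in-memory registry stands in for a durable catalog, and the function and field names are hypothetical.

    import hashlib
    import os
    import time
    import uuid

    registry = {}  # stand-in for a durable, searchable registry of published assets

    def publish_asset(path):
        """Harvest and register a file-based asset; returns its version-qualified name."""
        # Harvest: gather directly accessible metadata from the file itself.
        stat = os.stat(path)
        with open(path, "rb") as f:
            content = f.read()
        metadata = {
            "filename": os.path.basename(path),
            "size_bytes": stat.st_size,
            "created": time.ctime(stat.st_ctime),
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        # Register: assign a permanent identifier with a version qualifier and
        # record the entry so the asset can be found for subsequent use.
        qualified_name = f"{uuid.uuid4().hex}:1"
        registry[qualified_name] = metadata
        return qualified_name

    def annotate(qualified_name, **attributes):
        """Annotate: amend descriptive metadata without touching the immutable asset."""
        registry[qualified_name].setdefault("annotations", {}).update(attributes)

    # Example usage (the file name is hypothetical):
    # name = publish_asset("scan_0042.tiff")
    # annotate(name, investigator="jdoe", qc_passed=True)

In a production setting, the register step would also notify the storage system that the asset's bytes are now read-only, as described above.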
An asset may be related to other assets via metadata referencing the assets or by embedded references between assets using their permanently unique identifiers. This relation may capture notions of derivation or provenance (e.g., this asset is a revision of another asset, or a transcoded representation of another asset, or this asset was produced as the output of a manipulation of one or more other assets) [102]. Alternatively, an asset may be related to others with respect to a collection. For example, an asset may be part of a higher-level collection of assets (e.g., image slices that make up a single volume), different encodings of the same original data, or different observations of the same subject.

Continuously "FAIR" μPublication

While data availability statements in publications are not new, the quality and organization of shared or supplemental data have been uneven at best, with errors as critical as incorrect gene naming [1] and inaccurate nucleotide sequences widespread [107]. In the past few years, momentum has built toward establishing best practices for the sharing of research data. Core to this movement are the so-called "FAIR" principles [164]. They argue that data should be Findable, identified by a unique identifier and characterized by rich metadata that describe the details of the data and its relation to other data; Accessible via standard protocols with access control, with the metadata accessible even when the data are not; Interoperable, by using standardized terms to describe it; and Reusable, by providing accurate and relevant attributes. Data-centric discovery can be accelerated and collaboration enhanced when the data associated with the discovery exemplify these principles.

FAIR principles are often thought of in the context of sharing final published data. However, there is no reason to assume that the benefits of FAIR data apply only to sharing and reuse of formally published data; they also apply to each stage of the research data life-cycle [91, 45], consistent with the μPublication events discussed previously. Creating FAIR data as part of the daily process of a scientific investigation has the potential to enhance a data-driven collaboration by streamlining the exchange and reuse of data across a research team, reducing the potential for error by maintaining accurate data context, enhancing provenance for reproducibility, and improving the quality of data that are ultimately published.

2.4 Data-Oriented Architecture

We now turn to the question of how one might structure a system that supports data-centric discovery with μPublication of digital assets. To address this, we propose a new architectural style, which we call Data-Oriented Architecture.

Data-orientation vs. Service-orientation

Data-orientation, like service-orientation, is a design paradigm for decomposing a complex system into modular pieces. These pieces, in turn, are meant to be put back together in novel combinations. However, the nature of the pieces and the means of recombination are different:

Service-orientation: a service encapsulates a (possibly hidden) state and a set of computational behaviors behind a message-passing interface, which can trigger computation and state mutation. Each service may have its own messages and state-transitions.

Data-orientation: data is a first-class citizen and data stores expose protocols for interacting with data. A universe of actors takes on roles of producing, searching, referencing, and/or consuming data objects via transactions which modify data stores.
Crucially, data itself is the main shared resource and point of integration in a data-oriented architecture. Data are managed through simple transactions on data stores, which realize storage resources provisioned to serve a community. Such data may include digests, indices, and other simple derivatives, as well as the wholly new results of synthesis and discovery. Over time, the set of available data evolves as a result of the combined activity of the community of agents and actors who interact with one another through the publication, discovery, and access of data.

From a data-oriented perspective, services with domain-specific message interfaces and "business logic" are considered transient, just like any other actor or agent. They cannot be relied upon to mediate access to data over the long term. The data are passed down through time, much like the body of literature and knowledge passed through traditional libraries. This requires different mechanisms for the collection and dissemination of data among a community while remaining agnostic to the changing methods and motivations of the actors.

Data-Oriented Architecture builds on the core concepts of Service-Oriented Architecture: loosely coupled, cohesive encapsulations of capabilities with well-defined protocols, potentially distributed and under different administrative domains [162]. It also builds on key observations that real-world transactions are nested and long-lived, that people are key participants in many or most transactions, and that policies govern many of the activities in real-world interactions; thus the architecture should support or allow for these characteristics [166]. The key distinction between service-oriented and data-oriented is that instead of encapsulating tools, processes, or business functions, the data-oriented approach encapsulates data, which may be coarse-grain data objects or intricate models of the domain.

Primitives of Data-Oriented Architecture

Figure 2.1 illustrates the primitives of Data-Oriented Architecture upon which more complex operations and higher-level semantics that fit a particular domain may be constructed. At the core is the universe of data over which interactions and integration take place.

Figure 2.1: Primitives of Data-Oriented Architecture.

These data have been specifically identified and are assumed to have a digital representation, which is contained in a data store. While the figure shows a single abstract data store, in practice there will be many data stores, which will be geographically and organizationally distributed. Interacting with the data store are one or more actors, who can be human or computational entities, and their behaviors are expected to be governed by local or global policies. Each actor can play one or more roles within the Data-Oriented Architecture environment: producer, consumer, or curator. A producer creates new data and makes it available within a data store, and provides a means for uniquely identifying that data. Introduction of data into the data store is mediated by the curator, who sets policies and standards for what is stored and by whom. Note that the data producer is the only role that can introduce new data into the environment, and once introduced, it cannot change. A consumer may locate data of interest based on specific criteria and retrieve that data.
Finally, in the curator role, an actor can manage a subset of data, adjusting access policies, deciding what should be kept or pruned, improving the ability to search, and even soliciting the production of new data by other actors. While a curator can change descriptions of the data, they cannot change the data itself. Such changes can only be done by assuming the producer role and creating a new data revision with its own unique identifier. Any actor can take on multiple roles, subject to the policy of a data store, and the roles that an actor may take can change over time. An actor may be simultaneously the producer of one data set and the consumer of other data sets, even straddling several data stores. For example, a single agent may retrieve data (consumer role), perform some analysis, and then publish that data (producer role) back into the originating store or into a completely different store. Finally, we note that while consumption must occur after publication, there is no other implied sequencing or dependency in the interactions with the data.

Data Models

A data model is a set of rules for the syntax or structuring of information. Models are sometimes expressed in a formal meta-modeling language that assumes certain universal data structuring conventions. For example, languages exist for tabular models, graph models, hierarchical object models, or hierarchical document models. Other models are expressed directly in human-language specifications or using low-level byte-stream parsing rules. Examples of data models include a particular database schema written in Structured Query Language (SQL), a particular eXtensible Markup Language (XML) document schema written in XML-Schema, a particular ontology written in Web Ontology Language (OWL), or a particular file format like Tag Image File Format (TIFF) image containers.

Data models often include some domain concepts contained in syntactic structure. For example: a database schema includes table and column names; an XML schema includes element and attribute names; an OWL ontology includes predicate names; and a file format like TIFF has file sections devoted to particular imaging concepts. However, many other aspects of the domain are not captured in the data model but instead require a domain model, discussed next.

Domain Models

A domain model is a predefined representation of a domain written as content in a particular data model. Examples of domain modeling content include: a set of vocabulary entities in a database table; a set of well-known constants or Uniform Resource Locators (URLs) in an XML schema; a set of well-known subjects and axiomatic statements extending an OWL ontology; or a set of extended attributes used for certain classes of TIFF image. In some cases, such as XML Schema, it can be difficult to draw a distinction between data model and domain model because both are usually mixed into one specification and one specification can extend another. In other cases, such as databases and TIFF files, it can be difficult to draw a distinction between domain model and data because both are mixed into the same generic data storage containers. The semantic web sometimes blends all three layers of data model, domain model, and data into a single graph database. Together, data and domain models guide and constrain how data are captured, organized and rendered. Using shared models improves our ability to exchange data, to reuse it, and to integrate data sets, both internal and external to our immediate research collections.
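To make the distinction concrete, the following minimal sketch uses Python's built-in sqlite3 module. The CREATE TABLE statements play the role of a data model, the vocabulary rows loaded into anatomy_term are domain-model content, and the final insert is ordinary data. All table, column, and term identifiers are illustrative assumptions, not drawn from any schema described in this thesis.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Data model: syntactic structure expressed as SQL DDL (tables, columns, constraints).
    conn.executescript("""
    CREATE TABLE anatomy_term (
        id   TEXT PRIMARY KEY,   -- placeholder identifiers; a deployment might use ontology accessions
        name TEXT NOT NULL
    );
    CREATE TABLE image (
        id           INTEGER PRIMARY KEY,
        filename     TEXT NOT NULL,
        anatomy_term TEXT REFERENCES anatomy_term(id)
    );
    """)

    # Domain model: predefined content (a controlled vocabulary) stored as rows
    # within the structures defined by the data model.
    conn.executemany(
        "INSERT INTO anatomy_term (id, name) VALUES (?, ?)",
        [("T0001", "head"), ("T0002", "mandible")],
    )

    # Data: an actual asset record contextualized by both layers.
    conn.execute(
        "INSERT INTO image (filename, anatomy_term) VALUES (?, ?)",
        ("scan_0042.tiff", "T0002"),
    )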
Model and Schema Evolution

Given the dynamic, rapidly evolving nature of discovery, it is difficult to define a priori the models to use for an experiment or other data-generating research activity. As the experimental methods mature, so must the models used to constrain the structure and organization of data captured by the experiments or other analyses. These changes may be syntactic (i.e., data model) or may change the set of predefined content (i.e., domain model). Similarly, we can expect data to outlive particular actors and experiments. Any particular digital asset has an intrinsic data model and may have embedded domain modeling content, which are all just as immutable as the asset itself. A data consumer may need to adapt to this frozen model long after a community has adopted newer models. This requires an ability to identify and distinguish models, and may also motivate derivation of new data products which simply convert or "migrate" older data into newer models.

Consequently, data and domain models are by definition incomplete, in constant flux, and a community or data collection may have multiple, simultaneously inconsistent models depending on what subset of content is considered. While one may seek an ultimate domain model that completely captures the knowledge produced by an experiment, for discovery, incomplete or partial models are often good enough. The utility of incremental model expansion has been explored in other contexts, such as Dataspaces [56] and SQLShare [77].

Data, Metadata, and Search

We observe that realistically complex scientific data collections tend to include significant metadata to contextualize a particular digital asset within the domain model. Sometimes this metadata is simply encoded in the name of a digital asset; sometimes it is encoded in embedded metadata within a complex digital asset; and other times it is included in an external manifest, which is in reality another digital asset that describes one or more digital assets. In order for data consumers to find digital assets of interest to them, a data store must offer a search capability, which indexes the contextual metadata of the stored data and, in some cases, the contents of the data too. This index or catalog is a communal resource that represents the entire data collection as it stands at one point in time.

The flow of metadata is not, however, one-way only. In addition to incorporating metadata about assets into a searchable catalog, we also expect that actors will want to capture or snapshot catalog content into new digital assets. For example, a particular summary of digital assets and their contextual metadata may be captured to record exactly what was used as input for another data-centric discovery process. Or, as another example, an actor may realize that there are errors in metadata and wish to author revisions that are then recorded as new manifests or revised assets with changes to the embedded metadata.

2.5 A Data-Oriented Architecture for Scientific Digital Asset Management

Data-Oriented Architecture presents a general design pattern for which there may be many alternative instantiations. Here we explore a specific application of data-oriented architecture principles in the context of digital asset management for supporting data-centric discovery.
Overview

Digital Asset Management (DAM) systems generally encompass storage, annotation, search, retrieval, repurposing, and transformation of digital assets [106], where a "digital asset" has been defined as a file or other digital object for which one has rights of use and access [152]. DAM systems are designed to streamline unstructured, dynamic, exploratory processes and highly idiosyncratic creative workflows, rather than support long-lived predefined processes, which are better addressed by business process management and workflow systems [66]. The requirements for data-centric discovery bear some resemblance to the target use cases of digital asset management. Like other creative activities, scientific discovery does not follow a predictable path or pattern, and the information acquired along the way is produced in different contexts with different descriptive properties. Prior to defining its architecture, we will first explore some key concepts that such a system should address.

Figure 2.2 illustrates a system architecture for digital asset management in a simple but representative deployment configuration. The primary components of the architecture (described below) operate according to the principles of Data-Oriented Architecture. Two specializations of the Data Store are introduced: Domain Catalog and Object Store. Domain Catalogs mediate interactions with the domain data, which contextualizes the digital assets maintained in Object Stores. Data may be ingested into Object Stores or exported from them using Acquisition Agents and Export Agents, respectively. External to the DAM system, scientific data are typically generated by an instrument at the acquisition site, and conversely, data are exported to analytics pipelines that in turn produce derived data subsequently reacquired as digital assets. Users interact with the system via flexible, model-driven user interfaces that adapt to the underlying data model manifested by the system. Next, we describe the key components of the system.

Figure 2.2: Reference architecture for Digital Asset Management based on Data-Oriented Architecture principles, depicting a deployment of its primary components (white) and complementary components (gray).

Domain Catalog

In Data-Oriented Architecture terms, the Domain Catalog is a type of data store that supports actors who create, search, store, retrieve and remove data models, domain models, and metadata, which may reference digital assets, typically via their permanent identifiers. The Domain Catalog tracks the assets and the descriptive information used to discover and organize them. Models defined in the Domain Catalog may be drawn from standard ontologies and controlled vocabularies or common data elements, or they may be defined by the researcher. It must support model and schema evolution by allowing the community both to change over time and to hold different models simultaneously.
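Because the community's models must change while the catalog remains in service, a schema change should be an ordinary catalog operation rather than a software release. The following hypothetical sketch, again using Python's built-in sqlite3 module with illustrative table and column names, shows both kinds of change discussed above: a syntactic data-model change and an extension of domain-model content, followed by a client re-discovering the revised structure by introspection.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, filename TEXT NOT NULL)")

    # Data-model (syntactic) evolution: add a column for a property that only
    # became relevant once the experimental protocol matured.
    conn.execute("ALTER TABLE image ADD COLUMN stain TEXT")

    # Domain-model evolution: introduce a new controlled vocabulary without
    # disturbing previously loaded records.
    conn.executescript("""
    CREATE TABLE stain_term (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO stain_term VALUES ('S1', 'DAPI'), ('S2', 'GFP');
    """)

    # A model-agnostic client adapts by re-introspecting the catalog instead of
    # relying on a compiled-in notion of the schema.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(image)")]
    print(columns)  # ['id', 'filename', 'stain']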
Model-driven User Interface

An effective User Interface (UI) for a scientific DAMS should be model-driven [3] in order to function in a system with model and schema evolution. That is, the UI should reflect the current state of the Domain Catalog without having assumptions and semantics about the data model or domain model hard-coded into UI programs. This rules out many conventional UI development paradigms, where programmer intervention is required to make changes to models and then to the UI views and data entry forms that depend on the data model.

Object Store

An Object Store is a type of Data Store that provides the interface for storing and retrieving digital assets as opaque objects. An Object Store must satisfy our Data-Oriented Architecture requirements for transactional update and permanent, immutable naming of data assets. The Object Store may support versioning so long as each version of the digital asset is stored as an immutable, though related, object in the Object Store.

Data Management Agents

Following the pattern of Data-Oriented Architecture, all other services, functionality, tools, etc. are provided by Agents that interact with the Domain Catalog and the Object Store. The Acquisition Agent watches an external storage device (e.g., a local file system attached to the instrument or analytics engine) and, upon detecting newly generated data, copies the data to the Object Store and may update the Domain Catalog with references to the newly created objects. Conversely, the Export Agent monitors the state of the data stores and, based on local policy decisions, retrieves data from the Object Store and maps it to a particular layout on the local file system. Data Manipulation Agents perform a variety of domain- and task-dependent operations. Some agents may be responsible for applying several Quality Control checks when new data are uploaded to a data store, some may be responsible for extracting metadata from digital assets, while still others may be responsible for format conversion, image stacking, image tiling, down-sampling, and many other data manipulation tasks.

Policy and Policy Services

Actors and data may be governed by policies that specify the allowable interactions and expected behaviors of system elements, including access control restrictions, usage auditing, and specialized agent activities. As in the real world, actors and agents must decide how to implement and enforce policies. In some deployments, data stores or agents may have self-contained policies, while in others the policies may be defined in a more centrally managed policy service. In the latter case, the agent's local policy is merely a scaffolding to retrieve and enact the centrally managed policy.

2.6 Prototype for Scientific Digital Asset Management

Based on the above blueprint, we have developed and deployed in production various components of the prototypical DAMS for data-centric discovery. In this section, we briefly introduce the system components of the prototype and how they map onto the general architecture of the DAMS. We revisit many of these system components in greater detail in Chapter 3.

Domain Catalog

ERMrest (Entity Relationship Model via Representational State Transfer) implements the Domain Catalog semantics of the architecture, which enables the evolving and dynamic domain models needed for multi-domain, multi-modal, heterogeneous research data. Among the top design priorities, we sought to eliminate the need for significant back-end systems development just for data model changes.
As such, ERMrest introspects a relational database schema, interprets its structure, and presents an ad hoc query interface which allows "CRUD" (create, read, update, delete) semantics to manipulate the data; it is implemented as a Representational State Transfer (REST) [54] Web service. In addition, users can alter the data model through its interface, and thus add or remove tables and columns. We chose to use a relational database on the backend (as opposed to an XML or graph store) because of the relative simplicity for users of understanding the table as a data structure and the interfaces to manipulate it [126]. It has been observed by others that scientists are familiar with the use of spreadsheets to collect and organize semi-structured information about experiments and their results [77]. After exploring alternatives such as triple-based models [122], we found that well-established entity-relationship models (ERM) were more easily understood and manipulated by a broad base of researchers, especially when mediated through the higher-level interfaces provided by ERMrest. One of the most common methods for researchers to model and record data is by using spreadsheets, and the relational model closely matches the data structure of a spreadsheet (i.e., tables and columns with the addition of type- and referential-constraints). ERMrest was developed to run on a conventional Linux environment (Apache, Python, PostgreSQL) and supports standalone deployments as well as multi-tenant hosted environments with multiple database back-ends serving distinct tenants of the system.

Model-driven User Interface

Chaise (Computer-Human Access Interface with Schema Evolution) is a Model-driven User Interface implemented in JavaScript to be loaded and executed in a web browser (see Figure 2.3). The UI communicates with the data stores over RESTful web protocols. The UI supports authentication of the user and makes all data store requests on behalf of the user. It retrieves the domain model from the domain catalog, interprets it, and then presents a user interface. While the UI is model-driven, the range of interactions is narrowly scoped to the semantics of digital asset management; thus its presentation and supported operations address the needs of searching, retrieving, and annotating assets based on descriptive domain models. The UI relies on schema annotations which provide hints on how to present the domain model. For example, an annotation may indicate that an attribute in the catalog should be interpreted as an asset reference, with hints on how to follow the URI to display previews of the asset or how to present options to download the asset.

Figure 2.3: Screenshot of the Chaise interface to the asset management system deployed for FaceBase.

Object Store

Hatrac (pronounced "hat rack") is a simple Object Store service for web-based, data-oriented collaboration. It presents a simple RESTful Web service model with: Hierarchical data naming; Atomic transactions; Trivial support for browser-based applications such as our Chaise UI; Referential stability for immutable data; Access control suitable for collaboration; and Consistent use of distributed data. Hatrac supports multiple hierarchical namespaces as zones, with local access control policy enforcement on operations. It supports the required interfaces specified for the Object Store and supports limited restore semantics. It also supports incremental and restartable transfer for large object creation or download.
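To illustrate the Object Store semantics required by the architecture (permanent names, immutable content, and version-qualified references), the following toy sketch implements an in-memory store in Python. It is not Hatrac's actual interface; the class, the naming scheme, and the content-derived version identifiers are assumptions made purely for illustration.

    import hashlib

    class ImmutableObjectStore:
        """Toy store: hierarchical names, immutable content, content-derived versions."""

        def __init__(self):
            self._objects = {}  # maps "name:version" -> bytes

        def put(self, name, content):
            # Deriving the version from the content means a version-qualified name
            # can never silently come to denote different bytes.
            version = hashlib.sha256(content).hexdigest()[:12]
            qualified = f"{name}:{version}"
            self._objects[qualified] = content
            return qualified

        def get(self, qualified_name):
            # Retrieval is unambiguous: either the denoted bytes or a KeyError.
            return self._objects[qualified_name]

    # Example usage (the object name is hypothetical):
    # store = ImmutableObjectStore()
    # ref = store.put("/images/scan_0042.tiff", b"raw bytes")
    # assert store.get(ref) == b"raw bytes"

A real object store would additionally enforce access control on each namespace and persist objects durably; the point here is only the naming and immutability contract.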
Data Acquisition Agent

IObox is an implementation of the Acquisition Agent. IObox can be configured to scan a hierarchical file system on a schedule or to use operating-system-specific calls to monitor a file system for changes and wake on asynchronous notifications of file modification events. In either scheduled or asynchronous mode, IObox inspects basic file statistics (e.g., file size, last modified date, etc.) and executes a pattern matching rule against incoming filenames to determine whether files should be imported into the object store. We intend to support Export Agent semantics through IObox as well.

Data Manipulation Agent

Automan is a suite of Data Manipulation Agent utilities for automating specific purpose-built tasks. We have developed several prototypes of the Automan and used them in production environments. One agent supports metadata extraction and harvesting based on pattern matching rules. It has a "pluggable" design with format support via third-party parser libraries for many common scientific data formats, including Hierarchical Data Format (HDF), Network Common Data Form (NetCDF), Digital Imaging and Communications in Medicine (DICOM), Neuroimaging Informatics Technology Initiative (NIfTI), Microsoft Excel (XLSX), Open Microscopy Environment TIFF (OME-TIFF), and text formats such as Sequence Alignment Map (SAM), Variant Call Format (VCF), and Comma-Separated Values (CSV) files. Another agent supports format conversion of large tiled pyramidal images into a web-accessible format. The Automan agents perform operations on the digital assets (e.g., metadata harvesting or format conversion) and update the domain catalog to reflect the state of the assets.

Policy Services

Our prototype services integrate a pluggable web authentication framework for role-based and attribute-based authorization. Alternative authentication providers determine client identity and a list of attributes, which may include group membership or other simple labels for the client. One supported interface is Globus Nexus [5], a SaaS identity and group management platform that supports self-managed user groups, allowing members of a community to manage membership in groups used as roles for access to other resources.

2.7 Related Work

Here we outline related work from a broad selection of categories that touch upon various aspects of data-centric discovery, data-oriented architecture, and digital asset management.

Service-Oriented Science

Service-oriented science [4] has been proposed as a means to improve the interoperability and accessibility of critical services needed by scientists and to democratize science by enabling access to resources on demand and offloading the burden of operating infrastructure that can be difficult and expensive to maintain by all but the largest institutes. Data-Oriented Architecture (DOA) is more narrowly scoped in its concern and argues that data should be treated as a first-class element of the architecture and that data is the point of integration and interaction for all actors (people or automated agents).

Digital Object Services

A distributed architecture for naming, identifying, registering, storing, and accessing "digital objects", defined as a data type consisting of data and key-metadata, where data may be bit-sequences, sets of bit-sequences, and other types, and may be declared as mutable or immutable, has been proposed under the name Digital Object Architecture (DO Architecture) [83].
The DO Architecture defines the roles and protocols for an identifier/resolution system, for registering and resolving digital object identifiers; a repository system, for storing and accessing digital objects; and a registry system, for storing and accessing metadata about digital objects. Unlike the DO Architecture, the data-oriented architecture (DOA) we propose is an architectural style meant as a design method and guiding principles to be followed when designing a specific system architecture. The DAMS we propose is unlike DO Services in the sense that a DAMS is more focused on the interactions with data during the creative phases of the research life cycle rather than the final publication of a finished work. The Digital Object Identifier (DOI) that emerged from the DO Architecture is one of many potential permanent identifiers (PIDs) that can be adopted within a DOA for DAMS, and in fact many of our deployments do use DOIs and interact with DOI naming authorities to register and resolve identifiers for digital assets and their metadata.

Digital Asset Management

Scientific databases and data management are an area of intense research. Managing scientific data within the discovery process, as data are being acquired, described, and organized, is one special class of scientific data management that most closely resembles what is known as digital asset management (DAM) in other fields [152, 106]. DAM systems are used throughout the content development process within creative fields, like digital arts and marketing. Within science, a similar need exists to manage scientific "assets" (i.e., data) throughout their formation. The problem of scientific asset management is notably different when compared to other data management issues such as integrating data from different sources, which assumes data exist; publishing data in specialized packaging formats, which generally concerns the final or intermediate output stages; archiving of data, which is for the long-term preservation of data; and many other aspects of scientific data management.

Imaging Management Systems

While Digital Asset Management Systems (DAMS) have been used widely by creative and business organizations, the closest comparisons supporting science may be imaging [94] and microscopy management systems [143]. These approaches, however, are modality specific (e.g., neuroimaging or microscopy, respectively) and do not facilitate the continuous refinement and schema evolution necessary to support a full life cycle of research from inception to publication. There is a lack of general-purpose asset management capabilities that can span a wide range of multi-domain, multi-modal scientific data and that support dynamic, rapidly evolving, heterogeneous research activities.

Storage Management Systems

Storage management systems, such as iRODS [68], provide facilities for lower-level data storage operations and storage resource management. These systems provide high-performance network file transfer protocols and abstract away the physical storage layout using an internal metadata catalog for tracking files and basic file attributes. They generally operate on data at a semantically lower level than digital asset management and have fewer facilities for comprehensive metadata management.

Digital Repository Systems

Digital repository systems, such as DSpace [42], SEAD Virtual Archive [109] and Globus Publish [24], may be used to develop institutional repositories which support preservation of digital works and enable open access to data.
They typically integrate the functionality of object storage, metadata services, and document workflow management to support the publication process. Digital repositories are primarily concerned with Publication, as opposed to μPublication, as defined earlier. As such, they are not designed to support the μPublication phases of discovery, where one's understanding of the domain model may evolve considerably and thus the domain knowledge used to contextualize the data may change many times over the course of the discovery process.

Metadata Catalogs

Scientists in many fields depend on specialized scientific databases known as metadata catalogs to store, manage and share descriptive information about data and datasets [82]. They use metadata as a means to catalog, organize, and find data throughout the scientific discovery process and to facilitate collaborations by sharing data. Researchers have identified key categories of requirements for metadata catalogs [43]: store and share metadata, organize metadata for publication and discovery, customize views of metadata, and support large-scale datasets. In addition, metadata catalogs are typically deployed as part of a distributed systems infrastructure and are used by clients across wide-area networks and geographic locations, who perform operations without central coordination. Metadata catalogs are expected to provide concurrent access, low latency, and minimal downtime. Further, the purpose of querying a metadata catalog is to identify data objects and collections or aggregations of objects. Therefore, a metadata catalog is not a general-purpose ad hoc query answering system, but one tailored to the identification of data sets.

Metadata catalogs must support flexible model definition, evolution, and discovery [43]. Users must be able to define metadata models for a wide range of scientific domains, different conceptualizations within a domain, and emerging standards for describing data. Given that metadata catalogs are typically shared resources, whether within a single lab or among large-scale consortia, users must be able to discover the structure and attributes of the metadata model within a catalog. Finally, users must be able to evolve metadata models and related contents of metadata catalogs due to the evolving nature of scientific discovery, changes in requirements, maturing knowledge of the domain, new experimental techniques and instruments, and changes in methods, among other demands [105, 33]. Research on metadata catalogs has considered issues of flexible modeling [81, 135], model evolution and discovery [43], dynamic model generation and integration [150], and incorporating semantic representations [58, 156] into metadata catalogs. However, the research has not explored the specific mechanisms that scientists may use in order to evolve the data models used by metadata catalogs.

Other research has explored topics of integrated metadata catalogs with key-value-unit models [9, 113], specialized file and replica catalogs [27], distributed metadata catalogs with key-value models [120], distributed relational database access underpinning metadata catalogs [7], and hybrid XML-relational metadata models [80].

Compute and Analytic Pipelines

Computational pipeline tools, such as Taverna [167], are used to automate complex scientific workflows. Scientists define a sequence or directed graph of computational tasks along with a specification for data input, output, and data flow between tasks (i.e., nodes) of the workflow.
Today, much of the current emphasis on handling the demands of data-intensive science has been on innovative work to improve these analytic pipelines and to develop pipelines tailored for specific domains of science, such as genetics [61] and imaging [48]. However, pipelines do not address the overall meta-process (as idiosyncratic as it may be) for data-centric discovery. So-called "Interactive Notebooks" [87], such as Zeppelin and Jupyter, are another approach to analyzing and visualizing scientific data. However, these approaches do not provide facilities to capture data from instruments, and they are not intended to manage large volumes of data throughout their many transformations.

Data Workspaces

Dataspaces [56] and SQLShare [77] also advocate the need for incrementally expanding and evolving data workspaces for individual or collaborative usage. SQLShare offers a specific implementation based on a relational database backend with a simplified, freeform SQL query interface. Users import their existing spreadsheets, and SQLShare automatically generates a data model based on the columns and values from the spreadsheet. While these approaches agree with our observations of rapidly evolving, heterogeneous scientific data, we have presented an overall context that addresses a broader view of data-centric discovery.

Semantic Models

The literature is replete with examples of semantic models for representing scientific data and experiments. Of particular relevance is the Chado [105] approach to mapping ontological representations onto a relational database. We leverage similar methods in our case study deployments to integrate standardized vocabularies for anatomy, gene nomenclature and others into our metadata (i.e., domain) catalogs, though our underlying platform is agnostic to the specific approach taken to mapping ontology to relational schema. In addition, a semantic model for Micropublication [28] has been introduced, which we believe is compatible with the architecture and system we have presented here.

2.8 Conclusions

We have offered a broad sketch of data-centric discovery and argued that it should be viewed not as a single elegant process but as a dynamic meta-process centered on the data. While solutions exist for various activities within the meta-process, none so far address the overarching problem of connecting the disparate processes into a collective whole. Next, we discussed the need to consider micro-publication events and presented a solution at two levels of abstraction. The first is a design pattern we call data-oriented architecture, which refocuses attention on data as the core product of research and knowledge generation. The pattern addresses only the most basic activities of producing, curating, and consuming data. We then showed how to apply the design pattern to an architecture for digital asset management to support the specific needs of data-centric discovery, and we presented a prototypical implementation.

Chapter 3

Accelerating Data-Centric Discovery With Scientific Asset Management

The content of this chapter is based on the papers: Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Digital asset management for heterogeneous biomedical data in an era of data-intensive science". In: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on. Nov. 2014, pp. 588–592. doi: 10.1109/BIBM.2014.6999226; Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Accelerating Data-Driven Discovery With Scientific Asset Management".
In: 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA, 2016, pp. 1–10. doi: 10.1109/eScience.2016.7870883; and Alejandro Bugacov et al. "Experiences with DERIVA: An Asset Management Platform for Accelerating eScience". In: 2017 IEEE 13th International Conference on e-Science (e-Science). 2017, pp. 79–88. isbn: 978-1-5386-2686-3. doi: 10.1109/eScience.2017.20.

In this chapter, we present the asset management approach to scientific data management and introduce Deriva, a system that extends from the prototype described in Section 2.5. We report on the use of Deriva in a number of substantial and diverse science applications. Finally, we describe the lessons we have learned, both from the perspective of the Deriva technology and with respect to the ability and willingness of scientists to incorporate Scientific Asset Management into their daily workflows.

3.1 Introduction

Traditionally, a knowledge cycle in scientific discovery has been regarded as the formation of new knowledge through repeated turns of hypothesis, prediction, observation, and analysis, which are punctuated by transmission of results in the form of publications [60]. In the new paradigm of data-centric discovery, these knowledge turns are increasingly dependent on a scientist's ability to acquire, curate, integrate, analyze, and share data well beyond the compact reports offered by traditional publication.

While the details of these cycles vary from domain to domain, and indeed across time within a single discovery process, they share common characteristics and face similar data-related problems regardless of the domain. Experimental protocols involve multiple steps with data capture from diverse, high-throughput, high-fidelity instruments, analytic pipelines, or computational simulation models. The results of one cycle of experiments are iteratively fed back into the system to refine the next series of experiments. Data must be contextualized within the protocol steps in which they are captured and may be related to physical specimens and other details of the experimental stage that produced the data. Data will also be produced by computational and human analyses in other steps of the protocol and must likewise be contextualized and linked to related data.

As the complexity and volume of the data that are used to drive knowledge turns in science increase, the ability of the scientist to manage the logistics of executing these cycles can become a rate-limiting step. Current approaches to scientific data management have failed to keep pace with the needs of increasingly data-intensive science. Managing data is often done manually with "meaningful" file names and directory hierarchies, locally coded spreadsheets, and ad hoc laboratory notebooks. As noted by [29], "large amounts of data
Concerns over repeatability of scientic results are growing along with an increase in scientic retractions [140], and some reports indicate that there is as little as 10% reproducibility of scientic results which cite lack of data publication as one of the signicant factors [11]. As noted in [64], tools merely to support data capture are \just dreadful" and \we lack good tools for both data curation and data analysis." These sentiments echo similar observations made decades earlier by J. C. R. Licklider in his seminal work on the so called \man-computer symbiosis", as he states, \...my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability" [89]. Decades later, these same issues continue to obstruct data-centric discovery. While scientic data management has lagged behind the needs of data-centric discovery, this is not the case in other creative activities that also depend on managing large, com- plex data sets. For many years, the prevailing method of collecting digital pictures was to categorize images following user-dened, ad hoc naming conventions, resulting in a rigid, hierarchical, and often-confusing array of les and directories, much like scientic data is managed currently. Today, photographers use digital asset management (DAM) systems such as Apple Photos or Google Photos to automatically discover and catalog digital images on their cameras or hard disk drives; extract metadata from the imported media; add user annotations; organize pictures into virtual collections (i.e., photo albums); browse and search on metadata, annotations or features such as faces in the picture; export data for manipula- tion by external photo editing tools; and publish to cloud-based sharing or printing services. By comparison, scientic data management has failed to address the data wrangling tasks 41 incumbent on scientists today. Ecient creation of high-quality, reusable and sharable data would clearly benet from tools and processes that reduce the complexity and overhead of collecting, organizing and preparing data for data-driven scientic investigations. We argue that a data management approach based on scientic digital asset management [125] can signicantly streamline the process of managing complex and evolving data collections, and in doing so, accelerate \knowledge turns" for scientic discovery [60]. Digital asset management systems (DAMS) enable the management tasks and decisions surrounding the \ingestion, annotation, cata- loging, storage, retrieval and distribution of digital assets" [152]. Our hypothesis is that a DAMS platform tailored for science could have a similar impact as commercial DAMS prod- ucts have had in professional media and creative industries | transforming how scientists interact with their data on a daily basis, making them more ecient and eective. To test this hypothesis, we have developed Discovery Environment for Relational Informa- tion and Versioned Assets (Deriva), a platform which provides an end-to-end ecosystem for managing scientic data from acquisition through analysis and publication. Using Deriva, we have conducted a multi-year study on the impact of these techniques within the context of diverse data-centric scientic collaborations at scales spanning from small basic research investigations to international data sharing consortia. 
To evaluate the effectiveness of the scientific asset management based approach, we examine its use in detail in two representative use cases. This chapter makes the following contributions:

1. present the idea of asset management as a way of addressing scientific data management challenges,

2. enumerate the requirements for asset management systems based on analysis of science applications,

3. propose a general architecture for scientific asset management ecosystems,

4. describe the first platform implementation for scientific asset management,

5. show how a single platform for asset management can be configured and applied to distinct scientific discovery domains, and

6. describe how the daily workflows of diverse domain scientists have been improved by incorporation of scientific asset management systems.

In the following section, we will provide desirable characteristics and requirements for a scientific DAMS. We will then describe our proposed architecture for scientific asset management in Section 3.3 and, in Section 3.4, the implementation of Deriva, a platform for introducing DAMS functions into science applications. We then share an evaluation of the approach and its framework in the context of science applications that use it regularly in Section 3.5, followed by a summary of lessons learned in Section 3.6. Finally, we review related work in Section 3.7 and offer our conclusions in Section 3.8.

3.2 Characteristics and Requirements of Scientific Asset Management

The specifics of a data-centric discovery process will vary from domain to domain, and even from instance to instance. However, it is also the case that there are common needs and requirements that cut across all types of data-centric discovery. By considering these needs across a wide number of different use cases, we propose that a scientific DAM ecosystem should provide the following capabilities:

Acquisition and characterization of diverse scientific assets, including experimental data from instruments (e.g., microscopes, sequencers, flow cytometers), outputs from computational models, and results from analysis pipelines. Data must flow freely and automatically from the points of production into the management system, much like pictures flow from a smartphone into a management application.

Model-driven organization and discovery of assets. Successful DAM systems in the consumer space provide end users with intuitive and interactive ways of organizing and discovering assets that may be related via a complex underlying model. For example, in music DAM systems, one may discover based on artist, group, work, composer, instrumentation, genre, etc. Similarly, a scientific DAMS should provide model-based organization and discovery, in spite of the fact that the models may vary radically from domain to domain, may cross domains (multi-disciplinary collaborations), and vary over time as the discovery process unfolds.

Storage and retrieval of research data assets. These assets may be very large, and may be physically distributed in local, enterprise, and cloud-based storage systems.

Aggregation and exchange of data collections. A scientific DAMS should be viewed as the hub of a data management ecosystem and not create unnecessary data silos. Hence, it is critical that the DAMS allow users to assemble and export data sets for consumption by other tools and users.

Rights management/access control. A core function of DAMS is management of the IP associated with assets.
In the research environment, this means enforcing data use agreements, access to proprietary data, time-driven data embargoes, and different user roles within and across collaborations.

3.2.1 Acquisition and Characterization of Scientific Assets

We represent data-centric discovery in terms of an evolving set of scientific "digital assets" (i.e., research data or simply assets), which are described and related to one another via a domain model. The idea of an asset contextualized within a domain model is similar to semantic models for describing argument and evidence as "micropublications" [28], where the continuous acquisition and characterization of assets throughout the discovery process may be viewed as incremental micropublications, with each asset an embodiment of evidence on which scientific arguments may be grounded.

Diverse sources of assets. An asset may be generated by sensors, instruments, or as the result of a computation. Like photos on a smartphone, assets should be seamlessly integrated into the asset management system. The production of assets is itself fluid and cyclic throughout the discovery process as new data are generated, analyzed, shared with collaborators, and exported for publication. As new assets are acquired, they may be immediately consumed by other actors such as collaborators or computational workflows and analysis pipelines.

Diverse forms of assets. Scientific assets come in different forms and formats, and conceptually at different levels of granularity. For example, a video recording can also be represented more granularly as a time series of individual frames. In some cases, therefore, it may be useful to reference the parts versus the whole and conceptually to model the video as an aggregation of frames.

Incremental derivation of assets. The framework must allow incremental derivation of assets throughout the scientific discovery process. Data may be captured early in the discovery process. However, at the point of acquisition researchers may have only minimal contextual information to describe the asset, such as the date and time of capture, the instrument type and settings, and the identity of the investigator. It is not until later that more information regarding the asset will be known, such as measures of data quality. The acquired data may also become the input for downstream computational analyses, which will generate new insights and may result in derived data of their own. Data does not enter the discovery process fully formed but often goes from a minimally- to maximally-described state [28].

Automation and self-service curation. Manual data entry has been noted as one of the key barriers to successful adoption of data management services [109]. The framework must enable self-service curation of assets with simple user interfaces and automated services to reduce manual effort where possible. However, the diversity of scientific domains and data and their unique requirements means that simple one-size-fits-all asset management applications will not suffice. At the same time, developing new interfaces and applications for each use case is prohibitively expensive and time consuming. The user applications must adapt to the underlying domain model. They cannot assume a rigid structure but must be flexible both to the structure of the data and to the workflow of researchers using the system.

3.2.2 Models and Evolution for Scientific Asset Management

Domain models represent the key concepts, behaviors and relationships of the participants and elements in a real-world system.
Domain models are used to describe scientific assets in order to link, organize and contextualize them within the discovery process. Effectively modeling the information for such diverse and complex domains of science motivates the need for domain modeling approaches that support well-defined, structured information. Yet the process of scientific discovery is always unfolding and yielding new insights and understanding of the domain, and hence systems to support it must be able to evolve as well.

Complex data models. We argue that, given the complexity of scientific information and the importance of precise descriptions of research and results, a structured data model is necessary, ideally one based on a strong formalism such as relational or graph theory. For example, the entity-relationship model (ERM) can serve as the underlying meta-model for domain models in scientific asset management. ERMs are extremely expressive and can be used to describe data in tabular as well as graph structures through commonly used idioms. Another important factor in considering ERMs as an underlying meta-model is that the majority of scientific data is in practice represented in tabular structure [77] and is therefore a natural fit for use in scientific data modeling.

While it has become increasingly popular to abandon structured data models like ERMs in favor of so-called NoSQL, schema-less, key-value, or their decades-old predecessor, entity-attribute-value (EAV), databases in order to avoid data modeling, there is nothing inherently static or restrictive about ERMs; in fact, most modern database management systems support schema manipulation. Scientific query workloads have been shown to depend on complex query operations not possible with simpler query dialects and less structured models [78]. More recently, traditional database systems have added support for features that were thought of as the domain of "document-oriented" databases, such as special handling of JavaScript Object Notation (JSON) and text processing, thus offering the flexibility of relaxed schema with the advantage of having full capabilities for structured data management.

Evolution and introspection of models. Upfront data modeling is a significant bottleneck to the adoption and usage of complex data models [78]. For science applications, it may be impossible to define a data model upfront, and the model may change in unexpected ways throughout the discovery process. It is therefore essential that the ecosystem for managing scientific assets be based around generic tooling that introspects domain models and adapts to them. The systems must allow scientists to begin with minimally specified models and evolve the model and data throughout the discovery life cycle, while continually refining and increasing the richness and rigor of the structures used to describe their research. Likewise, the ecosystem of tools and services for asset management must be sensitive to changes in the model. Since data producers and consumers operate independently and asynchronously, they must adapt dynamically. Even in the middle of an interaction with the data, the model may change based on the interactions of another agent in the system. Tools should be model-agnostic so that they may be useful across diverse scientific applications and allow model evolution without having to upgrade software repeatedly for every model or data change.
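The following sketch illustrates what "introspect and adapt" can mean in practice. It is a hypothetical example using Python's built-in sqlite3 module: a generic client discovers the columns of whatever table the catalog currently defines and derives a data-entry form description from them, rather than hard-coding the model; all table and column names are illustrative assumptions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE specimen (
        id        INTEGER PRIMARY KEY,
        label     TEXT NOT NULL,
        species   TEXT,
        collected DATE
    )
    """)

    def form_fields(conn, table):
        """Derive a generic entry-form description by introspecting the current schema."""
        fields = []
        for cid, name, coltype, notnull, default, is_pk in conn.execute(
            f"PRAGMA table_info({table})"
        ):
            if is_pk:
                continue  # heuristic: skip input for system-generated keys
            fields.append({"name": name, "type": coltype, "required": bool(notnull)})
        return fields

    print(form_fields(conn, "specimen"))
    # [{'name': 'label', 'type': 'TEXT', 'required': True},
    #  {'name': 'species', 'type': 'TEXT', 'required': False},
    #  {'name': 'collected', 'type': 'DATE', 'required': False}]

If the model later evolves, for example when a new column is added, the same function produces an updated form description without any change to the client code.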
One key obstacle to supporting flexible yet structured data models is that many applications developed in the popular object-oriented programming (OOP) paradigm are often built using object-relational mapping (ORM) libraries that define their own proprietary query dialects and suffer from the well-known object-relational impedance mismatch problem, i.e., objects and classes map poorly to relations. Unlike ORMs, we take an approach that does not hide or obfuscate the underlying model from the client. Instead, the elements of a domain model are represented and exposed directly and transparently through protocols and interfaces appropriate to the technology platform in which asset management is implemented. This argument is similar to the "don't repeat yourself" or "DRY" principle: information should have a single unambiguous representation. By contrast, ORMs and popular Web frameworks introduce additional, obfuscated layers that must be updated whenever the data model changes.

Extending and enriching models with annotations. While a formal meta-model, like the ERM, is necessary to model scientific domains, it may be insufficient to describe the complete semantics of the model. In a relational model, one cannot describe anything more about a particular element in the model other than that it is a 'table' or a 'column' or a 'constraint.' Additional semantics on the model are needed as a way of extending and enriching the basic concepts of the meta-model. For this purpose, model annotations are a way of filling the gap beyond what the meta-model can provide. The annotations are applied to various levels of the domain model, including the individual schemas, tables, columns, and foreign key references. The annotations can be used to indicate additional semantics about an element of the model, provide presentation hints to the user interface for how to render a model element or its data in an interface, and serve other uses applicable to the domain or to applications and agents involved in mediating and manipulating the schema. Examples of annotations are shown in Table 3.1.

Table 3.1: Examples of annotations used to describe models.

  Annotation                          Description
  tag:misd.isi.edu,2015:default       Schema or table to be selected by default by user interface presentation.
  tag:misd.isi.edu,2015:url           Table or column should be rendered as a link.
  tag:misd.isi.edu,2015:vocabulary    Table should be interpreted as a controlled vocabulary.
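To make the annotation mechanism concrete, the following is a minimal sketch of how annotations such as those in Table 3.1 might be attached to model elements as machine-readable key-value documents and consulted generically by tools. The Python representation and the empty payloads are illustrative assumptions; only the annotation keys themselves are taken from Table 3.1.

# Illustrative only: annotations keyed by the tag URIs from Table 3.1,
# attached to model elements as simple JSON-compatible documents.
table_annotations = {
    # Present this table by default when a user opens the catalog.
    "tag:misd.isi.edu,2015:default": {},
    # Interpret this table as a controlled vocabulary of terms.
    "tag:misd.isi.edu,2015:vocabulary": {},
}

column_annotations = {
    # Render this column's value as a link in the user interface.
    "tag:misd.isi.edu,2015:url": {},
}

def is_vocabulary(annotations):
    # A model-driven application can test for hints generically, without
    # knowing anything else about the table it is rendering.
    return "tag:misd.isi.edu,2015:vocabulary" in annotations

print(is_vocabulary(table_annotations))  # True for the sketch above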
Heuristics for understanding models. While domain models will be extremely diverse, especially from one scientific domain to the next, we have found that a small number of heuristics can be very powerful in enabling presentation and interaction without statically coding behavior into the applications and tools that operate over structured data models. Our heuristic approach makes some simple assumptions about the semantics of the model, particularly as it concerns rendering, navigating, and updating data in a complex domain model. For example, when rendering a detailed view of an entity (i.e., a row) from a table, we can apply heuristics that denormalize the rendering of the entity in order to contextualize it with its "neighbors" in the linked data graph of relationships it forms with references to and from other entities in the ERM. This heuristic strategy can be applied generally across different models, domains, and with alternative meta-models as an approach to developing general tooling for working with asset management systems. Examples of heuristics are shown in Table 3.2.

Table 3.2: Examples of heuristics used to interpret models.

  Heuristic               Description
  Ignore auto-generated   Disable input for system-generated keys.
  Extended record         When displaying a record, extend the record with entities joined by many-to-one relationships.
  Nested entity           When displaying a record, show a preview of entities joined by one-to-many relationships.

3.2.3 Storing and Accessing Scientific Assets

Some of the closest approximations to asset management storage systems (a.k.a. data stores) come in the form of version control systems and object stores that operate on data as atomic units and permit changes only through explicit versioning semantics. In general, conventional storage systems do not necessarily suffice for the needs of scientific asset management. Here we discuss the requirements unique to storing and accessing scientific assets.

Stable naming of assets. Assets and their metadata are referenced by name, and data names can be shared between actors by external means or embedded within other data, which may exist in the same or different data stores. Therefore, references to data are not necessarily under the control of the data store holding the referenced data. Rather, references may be asynchronously communicated between distributed parties. Within the context of asset management, a data store must deliver certain guarantees: a reference to an asset should never become ambiguous, but rather should always denote one particular asset whether or not the data is currently available for retrieval; and a retrieval operation should be unambiguous, with a consumer being able to determine whether they have successfully retrieved the denoted asset or have encountered an access error. In short, the retrieval of named assets should be atomic, consistent, and stable.

Immutability of assets. Scientific processes depend on reliable acquisition and sharing of data in order to support reproducible results; thus once an asset has been acquired it must not be mutated. Consumers such as manual reviewers of imaging data or computational workflows that take complex data as input must be assured that the assets they consume are exactly the assets that were produced and shared by the data producing actors. A close corollary to stable names for assets (discussed above) is that assets must therefore be immutable, such that once generated no edits or other in-place changes may be allowed. It is more acceptable, though perhaps not ideal, for an asset to be retracted from the system, just as publications may be retracted from the scientific literature. In these exceptional cases, a marker sometimes called a "tombstone" should be left in the asset's place so that references to the asset may at least be resolved and consumers can understand why the asset is no longer available. In data preservation terms, immutability is often called data or file fixity, meaning the contents of a file are fixed and cannot be changed.

Versioning of assets. At times, a new asset will be generated based on incremental changes starting from an existing asset, and in such cases these assets enter the system as new versions of an earlier asset. Versioning, however, must be done explicitly as a new asset with an explicit version relationship to another existing asset and, as stated earlier, must not mutate an asset in place. To assist in this, a data store of assets should issue a name for an asset only once and may issue version-qualified names to add convenience to the consumers and references of assets.
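As a concrete illustration of the stable naming and fixity requirements above, the sketch below pairs an asset reference with a content digest so that a consumer can verify, at retrieval time, that it received exactly the named, immutable bytes. The naming convention (a version-qualified name plus a SHA-256 digest) is an illustrative assumption, not the scheme of any particular data store.

import hashlib

def make_reference(name, content: bytes):
    """Record the digest of the asset's bytes at the time it is stored."""
    return {"name": name, "sha256": hashlib.sha256(content).hexdigest()}

def verify_retrieval(reference, retrieved_bytes: bytes) -> bool:
    """Return True only if the retrieved bytes match the referenced asset."""
    return hashlib.sha256(retrieved_bytes).hexdigest() == reference["sha256"]

# Hypothetical version-qualified name for an immutable asset.
ref = make_reference("/assets/scan-000123:v1", b"...image bytes...")

assert verify_retrieval(ref, b"...image bytes...")      # same bytes: accepted
assert not verify_retrieval(ref, b"...other bytes...")  # mutation is detected

A consumer that cannot verify the digest should treat the retrieval as an access error rather than silently accepting different content, which is the unambiguous-retrieval guarantee described above.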
Linking and provenance of assets. Versioning and other operations that derive one asset from another further motivate the need for facilitating models of provenance. We see these as one natural application of the general requirement for linked data, where the relationship between data is semantically described in terms of the processes and actors that produced an asset, and existing models of provenance [103] should be employed for this purpose.

3.2.4 Aggregation and Exchange of Assets

Methods must exist to extract collections of assets under management for consumption by external human or computational agents. Several alternatives exist for exchanging complex information via various forms of data aggregation packages, for example Research Objects [10] or structured file formats such as HDF. These methods can be used to facilitate exchange of asset sets from a DAMS.

3.2.5 Policies for Managing Assets

Numerous forms of policy must be considered when managing scientific assets.

Protected data. Research involving sensitive data collected during studies involving human subjects is governed by policies administered by Institutional Review Boards (IRBs) or other bodies that set Data Use Agreements (DUAs). These data must be handled in compliance with rules governing privacy and strict data sharing limits. Under these circumstances, even when research concludes, the IRB or DUA policies may stipulate how storage systems must be purged of stored information.

Private use of data. In general, data are of great value to the researchers that produce them. Data sharing embargoes specify limitations to access of data while the investigators, those that generated the data, use them in analysis and until they have been published. The ecosystem of tools for asset management must be cognizant of the policies, whether governed by formal policies or defined by individual researchers, and the tools must limit use and sharing of data in compliance with the specified policies.

3.3 Scientific Asset Management Architecture

Based on the requirements and characteristics of asset management from our analysis in the previous section, we have defined a layered architecture for the system components of scientific asset management (Figure 3.1). Our proposed design extends from the standard distributed systems architecture of the Web according to the unique requirements of scientific asset management to enable an ecosystem of tools and services for data-centric discovery. The design adheres to the principles of data-oriented architecture presented in Section 2.4 and builds on and further refines the initial conceptual design presented in Section 2.5. Our goal is not to define an exhaustive collection of providers for various capabilities but rather to identify general classes of services and tools that are necessary for implementing asset management solutions in various configurations and for diverse scientific applications. We now describe the primary components of the architecture.

Application. Applications provide adaptive user interfaces for acquiring, curating, and consuming assets. They are model-agnostic and function by introspecting the data stores, in particular the structured data catalogs that hold the domain models and descriptive information. These applications support searching and browsing paradigms to help users locate assets of interest. They let users define custom subsets and slices of data and support entry and editing of new data and annotating of assets.
Applications should be model-driven and reflect the current state of the catalogs and data stores without hard-coded assumptions and semantics about domain models.

Figure 3.1: Scientific Asset Management Architecture. (Layers: Application, Catalog, Storage, Ingest, Export, Automation, and Policy, with control flows and data flows between them.)

Catalog. The Catalog layer represents specialized forms of the data store for recording, querying, and retrieving descriptive information (i.e., metadata) about scientific assets. As specified in the requirements, in order for the catalog to sufficiently handle domain models for scientific applications, it must support structured data models, model evolution, interfaces so that clients can introspect domain models, complex query operations and named queries, and model annotations. The catalog is primarily a passive service that applications and other tools (like ingest and export pipelines) interact with to store, query, and retrieve metadata about assets in the system.

Storage. The Storage layer represents the data store for bulk asset storage and retrieval. It must satisfy requirements for transactional update, permanent naming of assets, and non-destructive updates. It must ensure the immutability or fixity constraints and may support versioning. Like the catalog, the data store is primarily a passive service that provides interfaces for applications and other utilities (like ingest and export pipelines) to store and retrieve scientific assets.

Ingest. The role of ingest utilities and services is to identify new assets of interest, extract and harvest metadata from them, and record descriptive information so that assets may be contextualized with who, what, when, and how an asset was generated, which may only be possible to do at acquisition. Ingest must interact with catalog and storage services to record and store assets. The ingest layer may be configured offline or may be administered through applications.

Export. Export services or utilities, on the other hand, extract assets and metadata from catalog and storage services, serialize and package them in well-defined package formats, and may deposit them on recipient systems. As with the ingest layer, the services and utilities of the export layer may take an active (e.g., by polling) or passive (e.g., responding to events or other signals) role in identifying what and when to export assets. They may also be configured offline or administered online through applications.

Automation. A wide range of data management and manipulation services may be automated in an asset management system. The services and utilities of the automation layer represent a potentially broad array of capabilities that will be needed in scientific asset management. Automation services will primarily take an active role interacting with catalog and storage layer services to perform a variety of domain- and task-dependent operations, including but not limited to metadata extraction and harvesting, format conversion, image stacking, image tiling, down-sampling, applying compression to various content types, or indexing data.

Policy. The Policy layer represents a broad class of services, utilities, and other tools to help specify, evaluate, enforce, and otherwise make policy decisions. These tools tend to play a central and ubiquitous role in all manner of distributed systems. In asset management, the components of the policy layer may be used to express and evaluate policy rules for determining individual and role-based access to assets and metadata, for example.
The proposed architecture does not specify or require that policy be implemented centrally or decentralized by the individual components of the different layers of the architecture, which may make independent and local policy decisions in practice.

3.4 Deriva: Platform for Scientific Asset Management

Based on the above architecture, we now describe an implementation of scientific asset management called the Discovery Environment for Relational Information and Versioned Assets (Deriva). The Deriva platform consists of a multi-tenant relational data service for domain models (ERMrest), an object storage service for assets (Hatrac), a suite of adaptive user interface applications (Chaise), a suite of utilities for ingest and export of assets and metadata (IObox), an asset aggregation package format (BDBag), and a shared authentication layer (WebAuthN).

3.4.1 Chaise: Model-Driven Web Applications

Chaise is the user interface application suite for Deriva. Chaise is implemented as a suite of JavaScript programs that dynamically generate adaptive, model-driven user interface applications for discovery, analysis, visualization, data entry, annotation, sharing, and collaboration over scientific assets. Chaise applications introspect and dynamically render relational data resources based on a small set of baseline assumptions, combined with its rendering heuristics, which may be influenced, informed, or overridden by model annotations defined on the domain models that are cataloged in ERMrest; finally, user preferences further refine the interfaces.

Chaise is intended to support specific asset management interactions. As such, its presentation capabilities are narrowly scoped. Chaise makes a few assumptions about how users will interact with the underlying data. A few representative but non-exhaustive examples of these assumptions include: a) search, explore, and browse catalogs of assets; b) linked data navigation between records; c) add, edit, and remove records from the catalog; d) create, alter, or extend the domain model in the catalog; e) subset and export collections of assets and metadata; f) share collections with others; and g) annotate records with tags or controlled vocabulary terms.

Chaise is structured as a set of programs that generate interfaces to search and browse assets using the powerful "faceted search" paradigm (Figure 3.3); deep drill-down, visualization, and "linked data" navigation on individual records (Figure 3.4); and record entry and edit. Internally, the Chaise applications are built on AngularJS (https://angularjs.org), a front-end model-view-controller (MVC) framework developed by Google, and a common library of routines used across Chaise applications (Figure 3.2). Chaise uses the ERMrestJS library, which provides JavaScript language bindings for the ERMrest protocol and a comprehensive set of APIs for working with the ERMrest relational data service.

Figure 3.2: Chaise layered architecture. (The Chaise RecordSet, Record, and Record-Edit applications run in the browser on top of Chaise Common, ERMrestJS, and AngularJS, and communicate with the ERMrest and Hatrac services.)

Chaise makes no assumptions about the structure of the underlying data model, such as its tables, columns, keys, and foreign key references. It begins by introspecting the data model by getting the schema resource from ERMrest. It then uses rendering heuristics to decide, for instance, how to flatten a hierarchical structure into a simplified (or "denormalized") presentation for searching and viewing.
Chaise then interprets the model annotations to modify or override its rendering heuristics, for instance, to hide a column of a table or to present a related entity as an embedded Web resource in an inline frame (iframe). Chaise uses model annotations to determine when and how to integrate visualization tools into the display, including charts, graphs, 3D volume rendering, and 2D, pyramidal, tiled image rendering.

Figure 3.3: Faceted search application.

Figure 3.4: Record details application.

3.4.2 ERMrest: Model-Neutral Relational Metadata Store

ERMrest (Entity Relationship Model via Representational State Transfer) is a relational data service for the Web, and allows general entity-relationship modeling and manipulation of data resources by RESTful access methods. ERMrest serves as the metadata catalog of Deriva and enables the evolving and dynamic data models needed for describing, contextualizing, and linking scientific assets.

The design goals for ERMrest were to enable non-expert users to create and evolve data models that represent the semantic concepts in their domain without the typical round trips from user to developer to database administrator and back. Many use cases can map into simple models with just a handful of entities and relationships, and non-experts can easily think about their domain in terms of the main concepts that they want to manipulate [77]. By providing methods for incrementally creating these models, and by allowing users to express domain concepts directly in the catalog, it offers a platform where the data models can be created and maintained by the user community. While there will be situations in which a more formal up-front data modeling activity will be required and for which the full power of SQL may be needed, a significant number of important usage scenarios fall within the design space of ERMrest.

Our approach was to develop a service that supports resource manipulation idioms common to RESTful Web services. ERMrest maps Entity, Attribute, Schema, Table, Column, and other relational concepts to Web resources, which are referenced by Uniform Resource Identifiers (i.e., URIs). It supports the following interfaces:

- catalog: reference, retrieve, create, alter, and delete "catalog" resources, each of which is an independent relational data store;
- schema: reference, introspect, and alter entity-relationship models (i.e., schema name spaces, table and column definitions, key and foreign key constraints, etc.);
- entity: reference, query, and manipulate entity records (i.e., rows of a table);
- attribute: reference, query, and manipulate projected attribute records (i.e., subsets of attributes of relations);
- attributegroup: reference and query projected attribute group records (i.e., attributes grouped by a subset of attributes of relations);
- aggregate: reference and query projected aggregates with supported aggregate functions such as count, min, and max.
Figure 3.5: ERMrest query processing example, showing the query URI (top), model definition (middle right), conceptual processing stages (middle left), and generated SQL statement (bottom). The example query URI is https://.../catalog/1/entity/specimen/initials=res/slide/createdOn::gt::2015/scan; it passes through the URL Path Tokenizer, URL Path Parser, Model Validator, Query Constructor, and Query Executor stages; the referenced model consists of the specimen table (column initials), the slide table (column createdOn), and the scan table, connected by foreign key references; and the generated SQL is approximately: SELECT scan.* FROM scan JOIN slide ON (...) JOIN specimen ON (...) WHERE specimen.initials = 'res' AND slide.createdOn > 2015;

The core of the implementation is in query processing. A request consists of the Hypertext Transfer Protocol (HTTP) method (e.g., GET, POST, PUT, DELETE), URL path (e.g., catalog/1/entity/scan), and optional HTTP message body (e.g., a JSON or CSV resource representation). The query language is a formally-defined context-free grammar in Backus-Naur form (BNF) notation and is parsed using a generated Look Ahead LR (LALR) parser. Figure 3.5 depicts the query processing steps taken to satisfy each request to ERMrest. First, the request handler tokenizes and parses the URL path into an Abstract Syntax Tree (AST) representation of the query. The catalog resource (e.g., catalog/1/) indicates which relational data store to address with the query. Next, the interface part of the path (e.g., entity/) indicates which API will ultimately be used to answer the query. The remainder of the URL is what we call the data path, a pseudo-hierarchical expression that references, joins, and filters tables of entities and attributes. In the example, the first table in the path is the specimen table, and it is filtered by the binary predicate initials=res, which references the specimen.initials column in the left operand. The next URI component, which is delimited as usual by the '/' character, implicitly specifies a join with the slide table resource. The model validator will ensure that an unambiguous foreign key reference exists, in either direction, between the specimen and slide tables. Next, the filtered product of joined tables will again be filtered, this time by the binary predicate createdOn::gt::2015, which filters slides that were created on or after year 2015. Finally, the last URI component again specifies an implicit join, in this case with the scan table.

The entity interface returns whole entities (i.e., rows of a table resource) and therefore does not require an explicit projection of attributes. With the attribute interface, however, URL data paths terminate with a list of attributes to be returned. More complex data path expressions are supported, including projections involving multiple tables in the expression, joins between multiple tables at different depths of the expression hierarchy, table alias assignments and referencing, aggregation, and grouping.

Finally, ERMrest incorporates row-level access control policies from its underlying PostgreSQL database engine. Combined with user- and role-based authentication from WebAuthN (discussed later), ERMrest can enforce fine-grain access control over the contents of the catalog. For example, permissions may be granted on tables based on group membership; visibility or editing may be enforced on a row-by-row basis; and more sophisticated policies are also possible.
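The following sketch shows how a client might issue data path queries of the kind described above using a generic HTTP library. The host name is a placeholder, and the attribute path is constructed by analogy with the documented entity example rather than taken verbatim from the ERMrest specification, with hypothetical column names (id, filename); treat it as illustrative of the idiom rather than as definitive syntax.

import requests

# Placeholder host; catalog number 1 as in the example of Figure 3.5.
BASE = "https://example.org/ermrest/catalog/1"

# entity interface: returns whole rows of the final table in the data path.
# This is the example data path shown in Figure 3.5.
scans = requests.get(
    f"{BASE}/entity/specimen/initials=res/slide/createdOn::gt::2015/scan"
)
scans.raise_for_status()
for row in scans.json():
    print(row)

# attribute interface: same data path, terminated with a list of columns to
# project. The column names here are hypothetical for illustration only.
cols = requests.get(
    f"{BASE}/attribute/specimen/initials=res/slide/createdOn::gt::2015/scan/id,filename"
)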
3.4.3 Hatrac: Version-Tracking Object Store

Hatrac (pronounced "hat rack") is a Web service for storage and retrieval of scientific assets as Web resources in a RESTful Web service model. Hatrac treats scientific assets as generic, opaque byte sequences and may be viewed as an object store for asset management. It supports atomic operation semantics; that is, an asset is created and named, updated, or deleted in an atomic operation that either finishes and succeeds completely as expected or terminates without taking effect. Once an asset has been stored in Hatrac, it guarantees data fixity, first by enforcing immutability of stored objects, and second by maintaining checksum message digests which can be used to ensure data integrity. An "update" of an asset in Hatrac is non-destructive, such that the service preserves the current state of the asset while adding a new version of the asset in a version-qualified naming scheme. Similarly, when "deleting" an asset, the service marks the named asset as deleted but does not release the name for reuse, so that it prevents names from being reused and potentially violating the stable reference semantics required by asset management. Finally, Hatrac supports a hierarchical naming scheme and allows users to define access control policies on assets by name and by subtree names in the name space to simplify management of access controls. At present, Hatrac supports two configurations: it may be deployed as a standalone server with assets stored on a local or remote file system, or it may be deployed on Amazon AWS with assets stored in the Amazon S3 object store.

3.4.4 IObox: Import and Export Agent

IObox is a suite of modular utilities for ingest and export of assets to and from Deriva and external data sources. Asset management for science must support diverse sources and formats. The utilities of IObox may be combined in a variety of configurations to support the unique requirements of different scientific applications. The utilities in this component are generally categorized in terms of extract, transform, or load (ETL) operations and use the BDBag package format. In general, extract operations acquire assets from data sources (files or databases) and generate a BDBag package; transform operations alter the format or structure of the contents of a BDBag package; and load utilities take a BDBag package and upload the contents to catalogs and storage services. At present, the utilities in the IObox suite include:

- bag2dams: takes a bag and imports it into Deriva data stores;
- dams2bag: extracts metadata and assets from Deriva data stores, serializes the contents, and generates a bag;
- sql2bag: connects to a database management server (e.g., Microsoft Access, Microsoft SQLServer, MySQL, etc.), executes user-specified queries, serializes the results, and generates a bag;
- xls2bag: parses a Microsoft Excel spreadsheet and generates a bag;
- xml2bag: parses eXtensible Markup Language (XML) documents, serializes them in tabular format, and generates a bag;
- iobox-win32: uses Win32 APIs to "watch" a Microsoft Windows file system; as files are generated asynchronously by connected instruments (e.g., microscopes, sequencers, etc.), it executes regular expression rules to identify files and extract metadata, and ingests them into Deriva data stores.

In addition, IObox supports integrated parsers for many important data formats for scientific assets, including but not limited to HDF5, NetCDF, CSV, Excel, TIFF, OME-TIFF, NIfTI, DICOM, and VCF.

3.4.5 BDBag: Big Data Exchange Format

BDBag is a specification for asset aggregation packages. BDBag extends the BagIt File Packaging Format (V0.97) (https://tools.ietf.org/html/draft-kunze-bagit-13), incorporates the BagIt Profiles Specification (https://github.com/ruebot/bagit-profiles), and adopts the Research Objects [10] semantic model for describing packages and provenance.
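To make the package structure concrete, the following is a minimal sketch of a BagIt-style bag laid out with the Python standard library. It shows only the basic BagIt conventions (a bagit.txt declaration, payload files under data/, and a checksum manifest); it does not reproduce the additional BDBag profile or Research Object metadata, and the file names and contents shown are placeholders.

import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir: str) -> None:
    """Lay out a minimal BagIt-style bag: declaration, data/ payload, manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    # Placeholder payload file standing in for an asset.
    payload = data / "example.csv"
    payload.write_text("id,value\n1,42\n")

    # BagIt declaration file.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )

    # Checksum manifest listing each payload file relative to the bag root.
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    (bag / "manifest-sha256.txt").write_text(f"{digest}  data/example.csv\n")

make_minimal_bag("example_bag")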
The BDBag utilities are a collection of software programs for working with the enhanced specifications for bags. These utilities combine various other components, such as the BagIt creation utility and the BagIt profile validator utility, into a single, easy to use software package.

3.4.6 WebAuthN: Authentication Provider Framework

WebAuthN is a compact, modular authentication provider framework (illustrated in Figure 3.6) written to support Python-based, RESTful Web services and is used by ERMrest and Hatrac. It allows deployment-time configuration of several alternative identity and attribute provider modules to establish client security contexts for Web requests by talking to a local or remote provider. The client provider is used when establishing the client (or user) identity, while the attribute provider is used when establishing additional attributes, such as roles or groups associated with the client identity. At present, WebAuthN is distributed with three Identity Provider (IdP) connectors:

- OAuth2 OpenID Connect Provider: any standard OpenID Connect IdP, such as Google, for client authentication;
- Globus OAuth2 Provider: the Globus IdP, which also supports delegated group management and access to numerous campus IdPs;
- Standalone Database Provider: a standalone database IdP that can be deployed with the Deriva suite.

Figure 3.6: WebAuthN layered architecture. (ERMrest and Hatrac use WebAuthN, which hosts database, OpenID Connect, and Globus OAuth2 client providers and database and Globus OAuth2 attribute providers, backed respectively by an RDBMS, Google and other IdPs, and Globus.)

The interfaces are well-defined and alternative implementations may be developed, thus integrating different IdPs (e.g., LDAP, Active Directory, etc.) into Deriva easily.

3.5 Case Studies

To validate the utility of a DAMS-based approach to scientific data management as well as the applicability of the Deriva platform, Deriva has been applied to a range of different science use cases. In each situation, the resulting solution was provided to domain scientists who are using it as part of their ongoing scientific explorations. From these studies we can conclude that the underlying Deriva platform can be readily adapted to diverse domains, and that the domain scientists report that these systems are simplifying the process of getting their science accomplished. Deriva has been used in scientific collaborations spanning from small teams engaged in data collection and analysis for basic science to large, multi-national consortia producing data repositories.

Table 3.3 provides some basic statistics across a number of these deployments. Shown are the number of different types of key entities currently being used to represent the domain model, the number of records corresponding to those entities, the number of different asset types, and the total number and size of the assets currently being managed by the system. First, we describe qualitatively a few of the representative use cases for Deriva.

Table 3.3: Summary of representative Deriva deployments.
             Key Entities           Assets
  Name       Types    Count        Types    Count      TiB
  FaceBase   6        6,598        13       2,773      2.6
  RBK        10       501          4        763        0.58
  GUDMAP     4        23,186       4        35,334     0.6
  GPCR       7        180,481      17       232,078    0.06
  Synapse    4        1,675        9        1,906      1.7
  CIRM       4        28,740       1        5,429      17

Generation of atlases of kidney development (GUDMAP/RBK). The goal of this collaboration is to collectively annotate high-resolution microscope images so as to trace the development of anatomical structure in the developing human kidney. Deriva automatically ingests images taken from a microscope and puts them into the asset catalog, during which high-quality images are identified based on visual inspection. The domain model for images was extended to include annotation locations, an anatomical term obtained from a controlled vocabulary, and a running set of comments on the annotation. Facets are used to manage annotation status. Policy mechanisms are used to separate the ability to annotate, comment, and make the final decision on the annotation value.

Platform for Phenome-Wide Association Study (PheWAS) from Neuroimaging. For a given brain scan, it is possible to computationally produce a large number of phenotypes such as the size, density, and curvature of each region of the brain. By aggregating these phenotypes across multiple subjects it is possible to make connections to specific genetic conditions. In practice, domain scientists become rapidly overwhelmed by the tens of thousands of files that contain the phenotype values, by keeping track of image-based quality control, and by the association of statistical analyses with specific data sets. They are using Deriva to automatically ingest all of the analysis results and associate them with images, to track necessary manual review of some of the results, and to assemble data sets of phenotypes to input to statistical analysis tools to look for significant correlations.

Determination of three-dimensional structure of G-Protein Coupled Receptors (GPCR). Protein structure determination by X-ray diffraction requires many synthesis steps and analysis steps to crystallize and measure the protein. At each stage measurements are made and analyzed, and at the final step, large amounts of diffraction data must be collected and processed to reveal the protein structure. They are using Deriva to manage the data that is being generated by a multi-site consortium. The platform is automatically acquiring and integrating data spanning protein design, flow cytometry, chromatography, and gel electrophoresis. Policy mechanisms distinguish between academic and industrial affiliates.

Next, we describe in greater detail how Deriva has been used in practice in two specific use cases: the FaceBase Data Coordinating Center (FaceBase) and the Center for Regenerative Medicine and Stem Cell Research (CIRM). The FaceBase consortium represents a large-scale deployment involving a multi-site collaboration and community curation and data sharing endeavor. CIRM represents a laboratory core facility and early-phase exploratory research environment. RBK and GUDMAP are projects related to molecular anatomy and are similar to FaceBase, while Synapse and GPCR are basic science projects similar in nature to CIRM.

3.5.1 FaceBase Research Consortium

FaceBase (www.facebase.org) is a craniofacial research consortium that produces and shares comprehensive craniofacial data and resources for the international research community.
The consortium is organized around a "hub and spoke" model where 20 spokes create and contribute data while the Hub is responsible for integrating this data into a curated collection to be used by the broader craniofacial research community. Data in FaceBase are diverse, as the spokes engage in a wide range of investigations including imaging, confocal microscopy, laser capture microdissection, DNA microarray, high-throughput sequencing, and enhancer reporter studies involving mouse, zebrafish, and human subjects. Spokes are expected to release data immediately to the public upon their production without embargoing data for their own research. As such, timely release of data is among the critical goals of FaceBase.

Challenges

Prior to adopting Deriva, FaceBase took a manual, centralized approach to curation using Drupal (https://www.drupal.org/) as a Content Management System (CMS) for storing and displaying metadata, coupled with Apache Lucene (https://lucene.apache.org/) for keyword search of metadata. Data were submitted to the Hub zipped in bundles that typically contained 5-20 individual files each, which were stored on and served from a standard Web server. Hub curators entered free-form text descriptions in the CMS with a set of "tags" and a link to the zip file. Tags covered information about the species, age stage, anatomy, phenotype, and other descriptive properties about the data set. Users of the FaceBase site could perform keyword searches over the content and some limited filtering on tags.

The initial approach to curation and publication of data suffered significant limitations. The key problems and user complaints included:

- limited resources for centralized curation at the Hub resulted in long delays before data set release, and the curation process was susceptible to transcription errors leading to additional delays;
- lack of a detailed data model for describing data sets led to inconsistencies in the quality and level of detail of data set descriptions between different data producers; for example, the relationship between files and the bioassays that produced them or the biosamples to which they related could only be inferred from "meaningful" filenames with gene names, age stages, and other critical details embedded in them;
- difficulty finding and narrowing search results, because keyword search could return many "hits" but not allow users to narrow through precise filtering; hand-coded lists of dataset properties looked like, but were not, structured metadata tags in the CMS; and terms were inconsistent, non-standard, or misspelled.

FaceBase was restructured to use a scientific asset management based approach to achieve the following goals:

- streamline and accelerate the curation pipeline so that data sets are made available almost as fast as they are produced;
- simplify the sometimes cumbersome interactions between submitters and curators that often plague large repositories;
- reduce the effort and resource load on the Hub by distributing the curation responsibilities among the participating spoke projects rather than assuming all curation tasks at the Hub; and
- increase the ability of users to find data of interest via intuitive data models coupled with faceted search and linked data navigation.

Implementation

The application of Deriva to FaceBase was complicated by the need to transition the data and processes from the existing Drupal-based approach.
We took a staged approach:

- An initial data model that mirrored the original data representation was created, and a one-off ETL (extract, transform, and load) process was developed to move all of the existing data from the CMS to an ERMrest catalog and Hatrac object store. Data was left in its initial format as a set of zip files.
- An ensuing clean-up of terms was performed, aided by Chaise to display alphabetized term lists directly from the catalog, which revealed inconsistencies (e.g., "Msx1" vs "MSX1").
- A new, more detailed data model was developed that was more representative of the structure of the actual and new data being provided.
- As the transition was taking place, spokes used spreadsheet templates to describe metadata while the Hub took responsibility for updating the catalog using IObox.
- A curation protocol has been established to streamline the process, and going forward the spokes will use IObox and Chaise to directly upload and curate newly contributed data following the protocol.

Schema evolution. The Hub evolved the initial database schema to a structure that could represent the detailed information about the experiments, assays, samples, subjects, and assets submitted by the spokes. The new model was informed by established conventions found in Chado [105] and ISA (Investigation Study Assay) [118] while accommodating the constraints imposed by the legacy database. The conceptual data model for FaceBase is depicted in Figure 3.7.

Figure 3.7: FaceBase ERM. Metadata are organized broadly as investigation, biosample, bioassay, and asset entities with relationships indicated by arrows. (Entities shown include Investigation, Project, Dataset, Biosample, Sample, Bioassay, Imaging, Enhancer, Bioinformatic, Clinical, and asset types such as Track (.bw, ...), File (.bam, .fastq, .nii, .ome.tiff, .xls, .csv, ...), Thumbnail, and Mesh (.obj).)

Curation pipeline. To release data to the broader community as rapidly as possible, FaceBase has implemented a three-stage pipeline, where datasets are either pending, released, or curated. Fine-grain access control in ERMrest is used to control visibility, allowing spokes to upload data and edit metadata until data quality standards are met, at which point the data or metadata are made visible and publicly available to users. Curation stages are explicitly represented in the data model.

Figure 3.8: FaceBase Data Curation Pipeline. Shaded boxes indicate Spoke responsibilities versus clear boxes for the Hub's activities. (The spoke initiates a submission and the Hub creates a "pending" dataset; the spoke organizes and uploads data and the Hub reviews the dataset files and, if acceptable, releases the dataset; the spoke then enters metadata descriptions and the Hub reviews and, if acceptable, releases the metadata.)

Dataset accessioning. The curation pipeline (see Figure 3.8) begins with investigators notifying the Hub that they wish to submit data. The spoke's request must include basic metadata covering essential details, similar to what may be found in Dublin Core [20] or other data publication standards. The Hub then mints an accession number and creates a Dataset entity in the database which begins in a "pending" state. This pending dataset record serves the dual purpose of anchoring the curation process and all communication concerning it, and establishing a stable "landing page" for the data set.

Data acquisition. In the pending stage, data sets are submitted by the spokes using an IObox batch uploader or by transferring files to a hosted IObox operated by the Hub.
IObox automatically extracts key information, such as file names and checksums, and registers the data files in the relevant fields in the data model. Thumbnails and 3D mesh files are stored in separate Hatrac namespaces with relaxed access control policies compared to general assets, which FaceBase restricts to logged-in end-users. In addition, genome tracks are stored under a namespace that serves as a virtual track hub for the UCSC Genome Browser [86] and an embedded jBrowse Genome Browser [137]. Once ingested, the Hub reviews the data. If the consortium's data quality standards are met, the Hub updates the data set's status to "released" and the assets are made visible to the end-users.

Metadata curation. Metadata may be entered using Chaise or submitted in spreadsheets to the Hub for processing via the hosted IObox at any time in the process. In the interest of streamlining the release of data to the public, the process focuses on curating metadata after the data are released. The Hub currently uses Chaise for curation of the catalog and has begun training the spokes in its use so that they may directly enter metadata, with access control policies in ERMrest to control editing and visibility permissions. Once metadata are entered in the database, Hub staff review the entries to ensure that data quality standards have been satisfied and then upgrade the data set to "curated" status. At this point, the complete data set, both data and metadata, is fully available to the public.

Results

The implementation of Deriva has significantly impacted every aspect of how FaceBase curates and publishes data sets.

- Structured, detailed data models now allow every individual asset to be represented in the database along with critical information about biosamples and bioassays;
- Alignment with community ontologies (e.g., OBI, MP, HPO, ZFA, Theiler stages, etc.) and integration with the Monarch Initiative's Phenogrid [104] have transformed FaceBase from a data silo into an interoperable data resource where data sets can be integrated with data produced outside of the consortium;
- Powerful, dynamically generated search and browse interfaces now allow users to find and narrow searches using any attribute of any key entity in the database and then to navigate in a "linked data" style through the entire web of information in the database.

Quantifiable metrics indicate significant progress toward FaceBase goals:

- FaceBase spokes have been able to publish data with detailed descriptions on 1,155 biosamples and 4,736 bioassays, up from essentially 0 using the prior approach;
- FaceBase now measures its data curation cycle time in days rather than months, meaning that users get access to new data sets rapidly;
- Users now visit the new FaceBase data browser more than any other resource on the site, accounting for 32.5% of unique page views with an extremely low bounce rate of 22.2%; and
- most critically, in the past two years alone, FaceBase usage has grown from 4,699 unique users to 7,905 unique users, with a two-year total of 71,076 page views and 19,371 user sessions.

3.5.2 CIRM Microscopy Core

The Center for Regenerative Medicine and Stem Cell Research (CIRM) operates a microscopy core facility with multiple labs that share imaging and slide scanning instruments and networked storage servers.
Challenges

With the acquisition of new high-throughput slide scanners, CIRM was faced with the challenge of creating an institute-wide view of data from slides across labs, across different microscopes, and across collaborations that share data beyond the boundaries of the local institute. This environment was typical of other user experiences in that 1) there were no specialized data management tools available to the researchers to support their specific domain models and data formats, 2) the researchers were being inundated with data, 3) they had resorted to storing data on shared file systems with hierarchical naming schemes, 4) they were unable to record metadata linked to the data files, 5) their physical processes and assets (i.e., boxes, slides, lab notebooks) were disconnected from the data, and 6) researchers were highly sensitive to any overhead placed on them to document metadata. These factors led to operational overheads due to wasted effort on numerous basic data management tasks, including: inability to easily find and retrieve all of the scans for a particular physical slide, for a specimen, or for an experiment; inability to securely share individual researchers' data files since they belonged to a single shared file system; inability to easily locate valuable research data from postdocs, students, or staff after they leave their laboratory for other appointments; inability to search across the complete collection of scans; inability to quickly preview images over the network to find relevant scans of interest; and other similar obstacles. Next, we outline the basic approach to applying Deriva in the context of the CIRM user study and then present preliminary results.

Incrementally refine the domain model

Our approach is informed by the incremental refinement methods proposed earlier [56]. Researchers begin by mapping out the concepts that describe their domain and what questions they need to answer. In CIRM, this led to a model that captures the relationships between a slide, scans of that slide from different microscopes, and specific experiments that use those slides. The schema editing elements of Deriva allow the model to be extended with experiment-specific attributes as the experiments are refined by the researcher. This resulted in a simple initial model comprised of boxes, experiments, slides, and scans, each with distinct attributes. Boxes and Experiments may have 1 or more slides, and Slides may have 0 or many Scans. Slides are always associated with a Box but may not have been used in an Experiment. This model was in fact based on the metadata scientists often recorded by handwriting on slide labels. Beginning with a simple model allowed CIRM to get started quickly, without a long upfront design phase, and allowed researchers to start shaping the system through feedback based on concrete examples (i.e., rather than design mock-ups or data model diagrams). Because the catalog (i.e., ERMrest) and UI (i.e., Chaise) adapt to the schema, CIRM has been able to revise their domain model several times without changes to Deriva software.

Integrate and align with scientific workflows

Prior to the use of Deriva, the histology lab returned boxes of sectioned slides to researchers, who used the mounted samples in experiments, rejecting some and admitting others. Those that were selected were scanned, and the images were stored in a shared network file system. Boxes and slides were labeled with handwritten notes.
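For readers who want to picture the initial CIRM model described above, the following is a minimal relational sketch of it using SQLite from the Python standard library. The table and column names are hypothetical and do not reflect the actual CIRM schema; only the cardinalities (a slide always belongs to a box, optionally to an experiment, and may have many scans) follow the description given here.

import sqlite3

# Illustrative sketch of the box/experiment/slide/scan model; hypothetical
# column names, not the deployed CIRM catalog schema.
ddl = """
CREATE TABLE box        (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE experiment (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE slide (
    id            INTEGER PRIMARY KEY,
    box_id        INTEGER NOT NULL REFERENCES box(id),   -- always in a box
    experiment_id INTEGER REFERENCES experiment(id)      -- optional
);
CREATE TABLE scan (
    id       INTEGER PRIMARY KEY,
    slide_id INTEGER NOT NULL REFERENCES slide(id),      -- 0..many per slide
    filename TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

Starting from a handful of tables like these, and evolving them as questions change, is the incremental refinement strategy the CIRM deployment followed.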
We integrated Deriva into the CIRM research workflow (Figure 3.9), first by having them print box labels (instead of handwritten labels), which captured minimal information about the specimen and research activity, and next by having them print slide labels for admitted slides, which added basic information about the experiment. Instead of adding overheads to the researcher, we simply shifted from handwriting labels to printing them, which implicitly created linked data entities in the database with barcoded labels for asset and digital asset tracking. When the microscopy core scans the slides, the scanner reads the barcode and extracts the embedded slide identifier in Deriva. A user interface modeled after familiar photo management solutions (Figure 3.10) allows for search, retrieval, and annotation of boxes, experiments, scans, and slides.

Streamline manual data tasks

Manual data entry has been noted as one of the key barriers to successful adoption of data management services [109]. To address this factor, Deriva relies on its IObox (Section 3.3) to automate several tasks that would otherwise involve user interactions. IObox was configured and deployed on the acquisition system associated with the microscope and on the network storage system. On the acquisition system, IObox monitors a folder, and when new files are saved it 1) extracts the slide identifiers, 2) invokes the file transfer agent to move data files to the storage system, and 3) registers the new scans in the Deriva catalog, which automatically links the scans into data model entities associated with their respective slides. On the storage server, another IObox monitors the server for new data, and when it arrives it 1) extracts additional file metadata (e.g., resolution, color channels, etc.), 2) generates thumbnail images, 3) generates tiled images for web viewers, and 4) updates the Deriva catalog. The IObox agents eliminate several operations that would otherwise result in manual effort for research staff.

Figure 3.9: The CIRM microscopy and data management workflow, showing before (A) and after (B) integration with Deriva; with manual tasks (trapezoids), automated tasks (rectangles), and major out-of-band activities (hexagons). The dash-outlined boxes call out sequences of steps for both manual tasks and automated tasks respective to each workflow.

Results

To evaluate the overall utility of the asset management approach embodied in Deriva, we have tracked the adoption of the system within CIRM over the initial months of its implementation there. The microscopy core is primarily operated by its laboratory manager and microscopy specialist. Scientists throughout the institute send specimens to the histology laboratory to be sectioned and mounted on slides, conduct experiments on the mounted specimens, and submit selected slides to the microscopy core for scanning. The microscopy core scans the slides and returns scanned images to the researchers. Table 3.3 shows the state of Deriva usage in CIRM. In addition, the data model has undergone revisions throughout its deployment, as discussed earlier, thus highlighting the incremental schema definition approach.

Figure 3.10: Screenshot of an early Deriva user interface; with search (top), browse collections (left), results (middle), and details (right).

3.6 Lessons Learned

We summarize a set of general issues and remedies, experienced across several Deriva deployments, for the application of scientific asset management to aid in data-centric scientific inquiry.
3.6.1 Spreadsheets Are Poor Data Entry Tools

At the early stage of many of our deployments, new metadata were imported into Deriva via spreadsheets based on predetermined templates that referenced controlled vocabulary in some cases. Spreadsheets are commonly used by domain scientists, and we expected that this would be a streamlined path towards data ingest. In practice, however, we found the use of spreadsheets to be idiosyncratic and subject to frequent human error. And when we updated the data model or simply added new terms to the vocabulary, we had to redistribute the template to all data submitters, which was prone to communication failures. Moreover, entity linkages rely on users supplying the right (foreign key) references, and in general we found many users are not comfortable with creating proper references in multi-tab spreadsheets such as Microsoft Excel and will just type in the values directly or copy-and-paste whole rows. Based on this experience, we placed an increased focus on online tools for metadata entry that introspect the catalog schema, dynamically adapt to the evolving data models, and apply data integrity checking upon submission.

3.6.2 Detect Errors While You Have the User's Attention

For asset submission, we initially created an upload agent that monitored shared directories, automatically harvesting metadata and ingesting that metadata along with the asset. We had good success with this approach in earlier deployments in which we were able to include sufficient metadata as part of the asset to ensure accurate placement into the data model; e.g., our CIRM use case used bar codes on microscope slides to indicate what experiment the resulting data belonged to. However, in use cases such as GPCR, we have less control over the content of an asset and are forced to rely on file naming protocols to cross-link the asset with existing metadata. Furthermore, in many bioscience use cases, data generating instruments are shared and users tend not to log in to their own account prior to acquiring data. This resulted in additional complexity to associate data with specific users when adding assets. In these cases, our agent-based approach broke down, as the agent service running in the background reports errors asynchronously, after the user has turned their attention to something else. A minor violation of required naming conventions could delay data collection for an unnecessarily long period of time. Therefore, immediate feedback is required. These observations led us to structure our online tools to perform error checking synchronously when the user indicates the desire to ingest an asset, while executing transfers of assets asynchronously once the platform has determined that the contextualization of all the assets is correct.

3.6.3 Give Users Incentives to Change Their Ways

Most of our experiences to date have involved deployments of Deriva into an environment in which there was already an existing data management practice. While it is hard to find a user who disagrees with the concept of producing sharable and reusable data, and while there was broad agreement that current data management practices in our use cases were inadequate, it is also not surprising to observe that users were reluctant to change their daily practice. Fortunately, in many of our use cases we were able to get users to alter their current practice by providing time savings in somewhat unexpected areas, hence incentivizing use.
For example, in GPCR, we were able to use experiment design metadata and our data extraction routines to automatically generate configuration files for an instrument, eliminating a double manual data entry task that the researcher would have otherwise had to perform. Providing data and images as part of a paper preparation process has also been a strong motivator. In an image-based use case, the ability to easily find and download images for publication was motivation enough to get users to adopt Deriva. By identifying and exploiting these small wins for a user, we can increase the uptake of our data management tools, which will in turn result in increased efficiencies that may be less directly observable to the domain scientist.

3.6.4 Users Want a Bird's Eye View of Their Data

The domain models enabled by Deriva make it easier for users to find and access specific sets of data. However, we quickly learned that aggregate information about their assets was highly desired by users. Invariably, soon after users could see and explore their assets with Deriva, they would request various dashboards, summaries, and roll-ups. For example, program managers, lab managers, and principal investigators (PIs) often want to monitor data submission progress. In practice we see requirements for many different types of dashboards, including counts of assets and entities, assets in a current state, roll-ups over periods of time, etc. Fortunately, it is easy to create these dashboards in Deriva as database views over the underlying data model, which can be queried and displayed like any other normal model element. Support for creating views with web APIs in ERMrest is currently limited, and given the importance of dashboards and summaries, this is an area where the platform will be expanded.

3.6.5 Control Vocabulary But Not Too Much

We have experimented with different approaches to managing vocabulary. Initially, in CIRM our users wanted uncontrolled term lists to label specimens and experiments. This was soon plagued by numerous divergent terms and spelling errors. Fortunately, Deriva is extremely adept at helping users correct such issues. They eventually adopted a controlled terminology list that undergoes periodic review. Because they are a smaller, more homogeneous group, their term list changes slowly and this approach works for them. On the other hand, in GPCR and FaceBase we began with strictly controlled vocabulary. However, this presented a roadblock for users. In GPCR, users stopped entering data when they could not find a desired term for an annotation. In FaceBase, users entered their terms in non-standard locations of the metadata spreadsheet or added them to comments. This caused temporary data loss and lower quality of data annotations. We remedied these issues by taking a pragmatic approach to controlling the vocabulary. The data model still enforces the use of vocabulary terms but allows users to add new terms when appropriate. Deriva aims to make it easier to reuse terms than to add new ones, yet at the same time is flexible enough to allow new terms. For example, investigators with appropriate access control can define new experiment types on the fly by adding elements to term tables, which can then be used by other investigators. Other investigators see the complete list of possible experiment types before being given the option to add a new type.

3.6.6 Human Factors are Critical to Success

Human factors were a major driver in designing Deriva.
We view this broadly from the perspective of how the technical components of Deriva interface into the daily work ow of the users, and the social structure in which research takes place. For example, an initial assumption was that users would be able to explore the data model rapidly by observing what links were available, and internalizing the model as they used the system. Unfortunately, this is not typically the case. While a computer scientist may think in terms of links and graph models for representing data, researchers in other scientic domains do not necessarily think this way. Fortunately, we have found that this barrier can be easily overcome by making users aware that there is an underlying model and providing a simple roadmap to the major model elements. One important human factors metric is the number of manual steps required to complete common operations. For example, in biomedical investigations, experiments often consist of multiple samples that are prepared all at once (many instruments can handle 96 or more simultaneous data acquisitions). To help these users, we extendedChaise to support multi- record input mode, where the user can add multiple data-entry records to the same graphical form. Users also requested shortcuts to produce multiple records with only a few varying elds. As a result, we also added a mechanism to copy initial content from an existing or draft record into the newly created data-entry elds. 80 3.7 Related Work Digital asset management systems have been used widely by creative and business organi- zations, however, the closest comparisons supporting science may be imaging [94] and mi- croscopy management systems [143]. There is a lack of general-purpose asset management capabilities that can span a wide range of multi-domain, multi-modal scientic data and which support the dynamic, rapidly evolving, heterogeneous research activities. Electronic Notebooks such as IPython and Jupyter, are another approach to organizing and visualizing scientic data. However, these approaches do not provide facilities to capture data from instruments. and they are not intended to manage large volumes of data throughout their many transformations. Current systems and tools to support data-driven discovery, such as computational pipeline systems (e.g. [61]) tend to focus on computation and data analysis. While anal- ysis is clearly important, we assert that data is the currency around which discovery is made and we should take the perspective that discovery is largely data-centric, not process-centric. Digital repository systems, such as DSpace [138] and Globus Publish [24], may be used to develop institutional repositories which support preservation of digital works and enable open access to data. Digital repositories are primarily concerned with publication, as opposed to the discovery process itself where one's understanding of the domain model may evolve considerably. Metadata catalogs have been a topic of research within scientic data management [43]. Within scientic asset management, additional demands on the catalog require that it be capable of handling complex domain models and enable frequent model evolution. Dataspaces [56], and SQLShare [77] also advocate the need for incrementally expanding and evolving data workspaces for individual or collaborative usage. [65] also argue that introspection is a necessity for dynamic, uid databases, a concept that we further develop in our approach. 
While these approaches agree with our observations of rapidly evolving, heterogeneous scientic data, they only address query and analysis not capture of assets, 81 organization and annotation. 3.8 Conclusions The challenges of data management in science applications require a new data-centric ap- proach and we have identied digital asset management as being suitable. We have identied a common set of principles for applying asset management to scientic data and have shown through diverse use cases that a platform built on these principles can benet a broad range of science applications. In this chapter, we have considered three questions: 1) can scientic asset management systems provide value to science investigations, 2) will users adjust their daily practices to use these systems, and 3) are there reusable platforms that can be readily adapted to diverse science applications. We answered these questions through a detailed description of two representative use cases: FaceBase and CIRM. We were also able to show that by focusing on human factors issues including user interface and daily work ows, we can provide tangible value to domain scientists (i.e. non-computer scientists) which results in them adapting their data creation and analysis activities to use these tools, as indicated by the quantiable uptake in data curation and usage. These use cases represent just two of the many real-world deployments ofDeriva. Similar results were seen in our other deployments, further demonstrating that the asset management approach can improve the eciency of the research process, and the quality of data produced by and used in experiments. In this chapter, we focused on biomedical applications, as they are complex and the users tend to be less computationally sophisticated. However, Deriva can be readily applied more broadly with, we believe, the same positive results. 82 Chapter 4 User-oriented Framework for Database Evolution The content of this chapter is based on the paper: Robert Schuler and Carl Kesselman. \CHiSEL: A User-Oriented Framework for Simpling Database Evolution". In: Distrib. Parallel Databases 39.2 (June 2021), pp. 483{543. issn: 0926-8782. doi: 10.1007/s10619- 020-07314-x. url: https://doi.org/10.1007/s10619-020-07314-x. In this chapter, we present a high-level, user-oriented, schema evolution framework with an algebra of specialized schema modication operators. The design of the framework is informed by the analysis and experiences with scientic asset management presented in Chapters 2 and 3. We also propose a rigorous evaluation methodology for comparing the user eort of database evolution languages, and we introduce a benchmark for evaluating the execution eciency of schema evolution expressions. We present the framework and its implementation, and we demonstrate its utility in exemplar use cases and performance evaluation. 83 4.1 Introduction As scientic discovery becomes increasingly data-driven, the diculty of managing large and complex volumes of data dramatically slows scientic progress due to excessively high overheads on researchers' time and eort [85], while poorly managed datasets threaten the validity of scientic results [12] and reproducibility [69]. For scientists, databases are well suited to the problem of modeling concepts of a scientic domain and managing information about the discovery process and the data used by and generated from a scientic investiga- tion. 
Despite this utility, databases have remained largely under-utilized in science due to the diculty of using them eectively [77]. Ideally, scientists could rely on specialized databases to manage research data throughout the discovery process which follows data from initial acquisition and contextualization at the instrument or sensor; ingestion and metadata extraction; annotation and metadata entry; organization and subsetting of datasets; analysis via computational work ows that produce derived data; tracking of provenance; collaboration and sharing; and nally packaging for publication { a comprehensive approach to scientic data management that we described as scientic asset management [125] in Chapter 3. To do so, however, scientists would need to model their domain and keep that model up-to-date as the research activities unfold. While the \20 questions" method has been advocated for eliciting requirements to drive the design of databases for science [145], particularly for large investigations like the Na- tional Virtual Observatory [144], it is overly restrictive for dynamic ongoing investigations. Researchers may begin with a conceptual model that re ects their initial understanding of the domain, yet that model may evolve daily [78] as the course of a scientic investigation progresses. Rapid development with a process of iterative design and frequent refactoring has been termed \agile" in the software development eld and has become an objective of the database community as well [39]. Studies of database use in general [84] and use of databases for science in particular [78], however, identify numerous problems observed with schema denitions { such as, lack of foreign keys, redundant columns, overloaded columns, 84 and under-specied column types, among other issues. The diculty of evolving the model to keep pace with the changing requirements and conceptualizations is a major limiting factor in the usefulness of databases for science. The task of evolving the database requires many complex operations that must be care- fully coordinated and sequenced in order to evolve the database from its current state to the desired state. Though others have made databases more accessible for use by scientists through reducing the diculty of database administration [77] and simplied web-interfaces for databases [41], little has been done to support scientists with the dicult task of evolv- ing the database to align with changing research methodologies and trends. The process of evolving a database by scientists is complicated by a number of factors: 1. even seemingly simple modications can require numerous database operations (e.g. SQL commands) to achieve, thus increasing manual eort and human error; 2. data must be migrated simultaneously along with the schema modication else one suers data loss; and 3. often complex database evolution involves operations that fall outside the scope of native database operations (e.g., SQL commands), thus requiring context switches between database and general-purpose programming languages to achieve, leading to more eort and chance for errors and reduced eciency. What scientists need are: 1. fewer and less complex sequences of operations to evolve a database to the desired state; 2. streamlined operations that both transform data and modify schema; 3. access to general-purpose languages for specialized and custom data transformations in the context of database evolution; and 85 4. ecient expression evaluation that is common to database management systems. 
To address these challenges, we present the Compositional High-level Schema Evolu- tion Language (CHiSEL), an extensible framework for dening a database evolution lan- guage (DEL) consisting of schema modication operators (SMOs) { operators that transform schema and data. We begin by identifying a succinct set of primitive SMOs that are based on relational and extended relational operations (e.g., select, project, join, aggregate, etc.). Next, we dene an open set of composite SMOs that are dened via arbitrarily complex alge- braic expressions over the primitive SMOs. Because the operations are dened relationally, the framework is able to leverage established techniques from query evaluation, in order to rewrite expressions into more ecient yet semantically equivalent ones. Finally, an appli- cation programming interface provides the necessary language bindings and conveniences to make the language accessible to scientists. Following this approach, we can reduce the oper- ations and simplify the overall database evolution procedures, while preserving the ecient execution of the database evolution operations. We show that this approach can be implemented practically and yield a language of SMOs that can signicantly simplify the task of schema evolution for our target users of databases for scientic data management. In this chapter, we make the following contributions: formal denition of a database evolution language comprised of an extensible set of composite SMOs layered over a concise set of primitive SMOs; description of a method for ecient planning and execution of schema evolution ex- pressions; presentation of a programming model and a system implementation for the framework; formal rubric and rigorous evaluation methodology for comparing expressions between schema evolution languages; 86 evaluation of our approach based on three distinct use cases that illustrate the appli- cations and advantages of the framework; we present work toward a new benchmark for schema evolution with tools for generating synthetic datasets and a driver to execute test cases dened by the benchmark; and performance evaluations based on both real-world and benchmark workloads that demonstrate the eciency of the framework. The next section gives an overview of the motivation and scope of the proposed frame- work. Section 4.3 introduces the algebraic formulation of the primitive and composite SMOs, followed by a presentation of the programming model in Section 4.4. The method for plan- ning and execution is described in Section 4.5, and an overview of the implementation is discussed in Section 4.6. We present a methodology for evaluating the eectiveness of the approach in Section 4.7, followed by a description and evaluation of case studies in Sec- tion 4.8. We then dene a benchmark for schema evolution in scientic data management systems and test results in Section 4.9. Related work is reviewed in Section 4.10, and we conclude in Section 4.11. 4.2 Motivation To motivate our discussion of database evolution, we consider a representative use case of investigators studying developmental dysmorphology. They generate imaging and genetics data on specic organ systems from genetic mutants and controls in order to associate pheno- typic expression with certain mutations as they manifest through subsequent developmental stages. 
During the initial phase of the investigation, they devise a rudimentary but sufficient single-table database for microscopy and various assays that includes mutation, phenotype, and specimen details (see Table 4.1). They favor free text entry for the attributes (e.g., development stage, mutation, experiment type, gene names, etc.), since they are not sure a priori which controlled vocabulary terms will be needed. They intend to re-evaluate and improve the database over time.

Table 4.1: Database table of experiments.

ID   | Experiment | ... | Sample | Stage | Gene
0001 | SISH       | ... | S29    | E15.5 | Wnt4
0002 | IF         | ... | S07    | adult | CD4,Cxc13
0003 | FISH       | ... | S29    | E15.5 | WNT4
0004 | Rnaseq     | ... | S07    | adult |
0005 | H&E        | ... | S32    | 2P    | Wild type
0006 | SISH       | ... | S27    | 14.5  | Bleo
0007 | RNA-Seq    | ... | S07    | adult |
0008 | HE         | ... | S32    | 2P    | Wild type

As the investigation progresses, the research team's requirements change and their domain conceptualization matures. They quickly outgrow the initial simplistic view of their study and begin to need a more accurate representation of it. The tabular, denormalized representation of experiments and samples necessitates redundant data entry of specimen details since multiple experiments are performed with the same or similar subjects. Due to their use of locally coded terms and free-form data entry, the database contains errors from data entry or differences in terminology (e.g., H&E vs. HE). Entry of free-form lists of terms compounds these problems. Ultimately, the team begins to suffer from de facto data loss as the data are harder to find and reuse.

To improve their database design, they would like to separate experiment and specimen details, introduce controlled terminology, align their existing data, allow association of more than one gene, and fix what has been entered, among other things. Usually, evolving the database to accommodate these changes would require a complicated schema redefinition and data migration activity:

1. create new table definitions;
2. introduce relationships between tables;
3. query data from the database and apply transformations possible in SQL;
4. extract and perform additional data manipulation not possible in SQL;
5. reload data into the new table design; and
6. clean up unwanted parts of the schema.

The complicated nature of such a database evolution to update its schema and migrate its data would require considerable effort and be prone to errors. What the team needs are tools to simplify the database evolution task.

Scope of Database Evolution. A database is structured according to its schema, which typically encompasses definitions of tables, relationships between tables, constraints on data that must hold, policies for accessing or mutating data, and may include model mappings. Transforming a database schema from one state to the next has been referred to as database evolution. Different types of schema transformations have been carefully delineated into the following categories: definition, modification, evolution and versioning [115]. Schema definition addresses the initial creation of the schema (e.g., creating tables), while schema modification addresses the subsequent changes to the schema (e.g., adding columns, dropping tables) without regard to transformations on the data. These first two categories are generally addressed by the conventional Data Definition Language (DDL). Schema evolution, on the other hand, is defined as the transformation of a database schema in a manner that preserves data to the extent possible.
Finally, schema versioning goes beyond schema evo- lution to encompass partial (read-only) or full (read and update) access to data via present or past versions of the database schema. The primary benets of schema versioning are to ease the migration of legacy applications [36, 37, 38] in the context of enterprise information systems. Scientists, such as the ones from our example that need to evolve a database of experiments (Table 4.1), are not generally burdened with migration of legacy applications, but instead need simplied means of performing the transformations themselves to increase the quality of current and future experimental data. The framework presented here is scoped 89 to the issue of schema evolution as dened earlier [115], while leaving the issue of schema versioning to be addressed by the underlying database service. Need for Comprehensive Transformations. Ideally, approaches to database evolution should attempt to transform as much of the schema (i.e., constraints, policies, mappings, etc.) as possible. Currently, most approaches focus only on the basic denition of the relation, i.e., table name, column names, and data types. At present, only PRISM++ [40] addresses constraint evolution, though it is a goal of future work in BiDEL [72]. Ultimately, to better assist the database user, approaches would not only encompass evolution of basic description but also constraints, access control policies, model mappings, and other extended metadata. From the earlier example database of experiments (Table 4.1), when projecting a subset of attributes (e.g., the columns sample, stage, etc.) from the exiting experiments relation, any foreign key reference constraints could be preserved in the resultant relation (e.g., a new table for specimens) so long as all attributes of the foreign key reference are projected. Need for Structural and Semantic Transformations. Database evolution encompasses complex transformations that may alter the structure and/or semantics of information in the database. We may think of some complex transformations as exclusively structural in nature, where data values are essentially unaected but the \shape" of tables are altered. For example, decomposing the specimens from experiments of Table 4.1 changes the structure of the data without altering data values. Alternately, some transformations blur the line between structural modication and semantic transformation of the data. These types of transformations alter not only the shape of the data but also the data values in a non- trivial way. From the earlier example database of experiments (Table 4.1), replacing the values of the stage column with values that are semantically aligned to a formally dened nomenclature (a.k.a., controlled vocabulary) for developmental stages, and then applying a foreign key reference on the aligned column to ensure that its values remain consistent with the new domain moving forward, transforms both the structure and semantics of the 90 relation. Summary of Requirements. The example above provided an illustration of just some of the requirements that can be derived from recent evaluations of real-world database evolution [36, 59], surveys [84], and reported experiences of database use in science [22, 125, 78]. We summarize the elicited requirements as follows: 1. 
Dene and modify domain concepts: (a) create, (b) drop, (c) rename, or (d) copy a table to track changes in domain concepts; (e) add, (f) remove, (g) rename, or (h) copy a column to capture change in understanding of properties of a domain concept; 2. Alter relationships between domain concepts: (a) create references or (b) change the target of references to capture relationships between concepts; (c) change cardinal- ity of a relationship from 1:N to N:1, (d) 1:N to M:N, or (e) M:N to 1:N to rene the understanding of the relationships between concepts; 3. Combine or separate categories of domain concepts: (a) combine or (b) separate rows of tables in order to capture categories of domain concepts; 4. Merge or split domain concepts: (a) merge or (b) split attributes of a domain concept in order to (de)normalize the concepts; 5. Reify domain concepts: (a) reify attributes of a table into a new domain concept, (b) or a new sub-concept, or (c) reify from nested attributes to rene the domain concepts; 6. Integrate data from external sources: (a) integrate data from another database, (b) from spreadsheets, or (c) from graphs (e.g., ontologies); 7. Increase the semantic coherence of data: (a) create a custom domain or (b) a locally-coded \vocabulary" from the existing values of a column; (c) semantically align the values of a column with a controlled vocabulary; (d) turn unconstrained \tags" into semantically aligned terms from a controlled vocabulary. 91 4.3 Schema Evolution Algebra A schema modication operator (SMO) [36, 40] transforms a database schema and its data, and a set of SMOs forms a database evolution language (DEL) [72]. In this section, we describe an algebraic approach to dening a DEL that consists of an open set of arbitrarily complex yet ecient SMOs. The increasingly complex operators of the language reduce the eort on the user to evolve the database, while the algebraic formulation of the opera- tors provides transparency for a database management system to optimize the expressions for increased eciency of execution. We delineate SMOs into two classes: composite and primitive. Primitive SMOs are dened as computationally atomic operations and provide a compact and concise foundation on which to build more complex operations. They are dened based on the usual semantics of the conventional relational algebra with extended denitions for schema evolution. Composite SMOs are dened by algebraic expressions or formulas over other SMOs in the language. As such, we can dene arbitrarily complex oper- ations by composing various combinations of other SMOs in the language, and because they are dened algebraically, a system can rewrite expressions into semantically equivalent but potentially more ecient expressions. 4.3.1 Characteristics of a SMO Given that the objective of our approach is to dene arbitrarily complex operations to codify observed patterns of database evolution as discussed in Section 4.2, certain characteristics must be satised in order for potential operations to be admitted to the framework. 
In summary, the schema evolution algebra that we dene here is relational in that the operators of the algebra accept and produce relations, the operators themselves are dened as strictly functional so that algebraic expressions may be rewritten by a planner or optimizer, we can then leverage functional composition to dene higher-level composite operators over a concise set of primitive operators, and nally database state is mutated by assignment of algebraic 92 expressions to the database. Relational. In other formalisms, the operators of the database evolution language accept as input a relational schema and a database, and produce as output a new version of the schema and database [36]. While they have taken this coarse-grained approach, we take a ner-grained approach and adhere to relational operations; i.e., operators that accept as input relation(s) and produce as output new versions of the input relation(s). By accepting and producing relations (rather than schema and database), our framework is able to leverage the extensive body of work on ecient query execution in the context of schema evolution. It also gives us the opportunity for greater reuse between various formulas because the building blocks of the language are ner-grained and more focused than if they were dened as whole database operations. Functional. Functional composition is a mathematical formalism for dening new func- tions based on a combination of other existing functions. Our approach leverages this for- malism to dene increasingly higher level schema modication operators out of simpler ones. In our formalism we rst dene a base layer of functions that are dened using extensions to the usual relational formalism and then new functions over them using the technique of functional composition. The functions form the operators of our relational algebra. In order to allow a system (i.e., a planner or optimizer) to be able to rewrite these algebraic expressions safely without violating the semantics of the original expression, the operators (i.e., functions) must be \purely" functional. An SMO must: 1. produce the same output given the same input; 2. not have side eects; 3. accept other expressions as inputs. While this principle restricts the types of operators that may be admitted to the language, it supports the key objectives of the approach: that the framework may allow arbitrarily 93 complex operations simply by composing other operators of the framework, and that the framework may rewrite expressions safely in order to exploit opportunities to increase e- ciency and reuse shared subexpressions. Primitives and Composites. Operators of the language fall into one of two categories: primitives or composites. Primitive SMOs are dened as atomic operations that are not further divisible, and they are generally sourced from the conventional relational algebra or from extensions to the relational algebra. Composite SMOs, on the other hand, are dened as algebraic expressions over SMOs. They dene increasingly complex operations that are functionally composed over other operators. Thus, the primitive operations form a concise and cohesive foundational layer of operations that are then leveraged by the composite operations to address the motivating requirements. Mutation. Given that SMOs adhere to functional principles, they do not mutate (data- base) state directly. Instead, the result of an SMO is to produce a new relation that may be assigned to the database. 
During assignment, the resultant relation is bound to a name in the database (i.e., conceptually similar to the CREATE TABLE ... AS ... expression in the conventional SQL language). It is only during the evaluation of the assignment operation that the database state is altered per the results of the expression.

Notation. The general formula for a composite SMO is denoted:

[lop] Op_{parameters} rop ↦ algebraic expression

where Op is the name of the operator, rop is the right operand, lop is the left operand only if the operator is binary, parameters are operator-specific (e.g., like a list of attribute names on a relational project operator), and the operator maps to (↦) an algebraic expression over other SMOs of the algebra. We again use infix notation for specifying the algebraic expressions. For example, "l Op_1 (Op_2 r)" denotes a formula over a hypothetical binary operator Op_1 taking left operand l and right operand as the unary operator Op_2 applied to its input operand r.

Example. CopyCol is a SMO that copies column(s) from one relation to another. The input relations, r and s, are left joined on condition φ, and then the attributes of r (denoted R) and the designated attributes to be copied a_1, ..., a_n are projected.

r CopyCol_{φ, a_1, ..., a_n} s ↦ π_{R, a_1, ..., a_n}(r ⟕_φ s)

In the above equation, we see that the operator, CopyCol, is a binary function over inputs, denoted r and s, and it accepts a formula φ and a list of attribute names a_1, ..., a_n as additional parameters. It maps the input to the expression as shown in the operator definition, where ⟕ and π respectively denote the Left Join and Project operators from the conventional relational algebra.

4.3.2 Preliminaries

Before describing the operators of the algebra in detail, we briefly review the background relational concepts and notation. Let N denote the set of natural numbers. A relation schema R is a set of attributes {a_1, ..., a_n}. Corresponding to each attribute a_i, i ∈ N, there is a d_i called the domain of a_i. A mapping from the set of attributes in a schema to their corresponding domain values is a tuple t. A set of tuples is a relation r. The relation r(R) is said to be a relation r with relation schema R, while the relation r[A] is a relation of the subset of attributes A ⊆ R. A subset of attributes of a relation that uniquely identifies each tuple is called a superkey, while the minimum superkey is called its candidate key. In general, multiple overlapping candidate keys may exist in a relation schema. An attribute of a candidate key is called a prime attribute while an attribute that does not participate in any candidate key is called a non-prime attribute. Beyond merely the definition of its attributes, a relation schema may also consist of constraints, expressed as assertions that must hold; policies, governing access to a subset of tuples of the relation; mappings, alternative renderings or representations of tuples in different contexts (import, export, user interfaces, etc.); and annotations, either semantic or presentation hints about the schema (e.g., that a column represents a URL to a file object that can be uploaded or downloaded). A database schema S is a set of relation schemas and is said to have a database instance D which belongs to the set 𝒟 of all possible database instances of S. Schema evolution is therefore a transformation of S that preserves D for all D to the extent possible.
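To make the CopyCol definition above concrete, the following minimal Python sketch mimics its semantics over toy in-memory relations represented as lists of dictionaries. This is not CHiSEL code; the helper names (left_join, project, copy_col) and the example data are illustrative assumptions only.

# Minimal sketch of the CopyCol semantics over toy in-memory "relations"
# (lists of dicts). Not CHiSEL's implementation; names are illustrative only.

def left_join(r, s, phi):
    """Left join r with s on predicate phi(tr, ts); unmatched r-tuples get no s attributes."""
    out = []
    for tr in r:
        matches = [ts for ts in s if phi(tr, ts)]
        if matches:
            for ts in matches:
                out.append({**ts, **tr})  # r's attributes win on name conflicts
        else:
            out.append(dict(tr))
    return out

def project(rel, attributes):
    """Project the named attributes, tolerating attributes missing after the outer join."""
    return [{a: t.get(a) for a in attributes} for t in rel]

def copy_col(r, s, phi, r_attrs, copied_attrs):
    """CopyCol: left-join r with s on phi, then project r's attributes plus the copied ones."""
    return project(left_join(r, s, phi), list(r_attrs) + list(copied_attrs))

# Example: copy a 'stage_name' column from s into r by matching on 'stage'.
r = [{'id': 1, 'stage': 'E15.5'}, {'id': 2, 'stage': 'adult'}]
s = [{'stage': 'E15.5', 'stage_name': 'embryonic day 15.5'}]
print(copy_col(r, s, lambda tr, ts: tr['stage'] == ts['stage'],
               ['id', 'stage'], ['stage_name']))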
4.3.3 Primitive SMOs Several primitive SMOs share the conventional denitions from the relational algebra [2]; e.g., select (), project (), join (o n), rename (), union ([) and distinct (). The assign ( ) operator binds a relation to a name. In addition, the framework species similarity join (o n ), deduplicate ( ), nest ( ), unnest ( f ), and shred ( ). Although not included within the conventional relational algebra, these operations can be found in the literation in some form [cf. 25, 50, 121]. Several of the primitive SMOs directly satisfy requirements enumerated in Section 4.2, while the remaining are required in support of composite SMOs. See Table 4.2 for a summary of primitive SMOs and the requirements that they address (note that `{' indicates that the SMO is required by a composite SMO). Next, we describe our formulation of the unconventional relational operators in greater detail. Similarity Join (o n ) joins two relations based on satisfaction of an inexact similarity measure () over the elements of two input sets (e.g., x2 X and y2 Y ), where x and y may be a subset of tuples from each input set. Deduplicate ( a i :::an ) conceptually compares all elements in a set (e.g., x2X and y2X) and eliminates tuples that represent the same entity according to the similarity measure (). Nest ( g 1 :::gm a 1 :::an ) aggregates the tuples of a relation according to a similarity measure 96 Table 4.2: Summary of primitive SMOs. Operator (Symbol) Brief Description Req. Assign ( ) Binds a relation to a name in the database. 1a, 1b Select ( ' ) Filters a subset of tuples from a relation based on condition '. Preserves schema (attributes, con- straints, policies, annotations, etc.) of the relation. 1d, 3b, 6a, 6b Project ( a i :::an ) Filters a subset of attributes of a relation based on a given list of attribute names (also allows user- dened functions). Preserves relation schema for all attributes in the set of ltered attributes. 1e, 1f, 4b Join (o n ' ) Combines attributes of two relations that satisfy condition '. Preserves schema for combined rela- tions except where con icting attributes exist. 4a Rename ( a=b ) Rename attributes of a relation. Transforms schema of the relation according to attribute renaming. 1c, 1g Union ([) Combines tuples from relations so long as relation schemas match. Preserves schema of the relations. 3a Distinct () Forces strict set semantics on a relation based on comparison of all or a subset of attributes of the relation. Preserves schema of the relation. { SimilarityJoin (o n ) Binary operator that joins two relations based on similarity comparison. Preserves schema for com- bined relations except where con icting attributes exist. { Deduplicate ( ) Eliminates duplicate tuples based on inexact match- ing using similarity comparison . Preserves schema of the relation. { Nest ( ) Aggregate-like operator that nests a relation using similarity comparison. Preserves schema for the grouping key attributes but for the aggregated at- tributes it drops constraints. { Unnest ( f ) Unnests non-atomic attribute values according to collection unnesting expressionf. Preserves schema for the unaltered attributes except for uniqueness constraints. { Shred ( q ) Shreds graph data using a graphical query language (e.g., SPARQL, XQL, etc.) expression. Relation schema is limited to the attribute names and data types generated by the expression q. 6c (). 
Like conventional aggregate operators, nest accepts a set of attribute names (g_1, ..., g_m) as the grouping key and a set of attribute names (a_1, ..., a_n) to be nested in the output relation. Unlike conventional aggregate operators, attributes g_1, ..., g_m may be grouped on a similarity measure (θ). Deduplicate is closely related, as it can reduce to nest by simply deduplicating on the grouping keys and omitting the nested attributes.

Unnest takes a user-defined function (f) for collection unnesting to map each input tuple to one or more output tuples. Unlike conventional nested algebra, the operator does not require that the input adhere to well-formed nested relational semantics, but instead allows for informally defined values (e.g., delimiter-separated text).

Shred takes graph data as input and a query expression parameter (q) specified in a graph query language (e.g., SPARQL^1). Scientific databases often must interoperate with information from ontologies and knowledge bases which are typically represented using a graph structure, such as Resource Description Framework (RDF)^2. The process of extracting relations from graph data is commonly referred to as "shredding."

^1 See https://www.w3.org/TR/sparql11-overview/
^2 See https://www.w3.org/TR/rdf11-primer/

4.3.4 Composite SMOs

Here, we describe the composite SMOs that have been identified based on the motivating requirements. We regard the class of composite SMOs as an open set, however, as our approach lays out a framework in which to continually expand the set of composite SMOs as new requirements motivate their introduction. See Table 4.3 for a summary of composite SMOs.

Table 4.3: Summary of composite SMOs.

Operator | Definition | Brief Description | Req.
CopyCol | r CopyCol_{φ,a_1,...,a_n} s ↦ π_{R,a_1,...,a_n}(r ⟕_φ s) | Copy column(s) from one relation to another. | 1h, 2e
Reify | Reify_{A,B} r ↦ Distinct_{a_1,...,a_n}(π_{a_1,...,a_n,b_1,...,b_m} r) | Produces a new relation from a subset of attributes of the input expression. | 5a
Reify_sub | Reify_sub_{a_1,...,a_n} r ↦ π_{key(r),a_1,...,a_n} r | Reifies as a subconcept of the input expression by preserving a reference to the source relation. | 2d, 5b
Atomize | Atomize_{f,a_i} r ↦ Unnest_f(Reify_sub_{a_i} r) | Reifies and normalizes values from an attribute of the input expression. | 5c
Domainify | Domainify_{θ,a_i} r ↦ ρ_{a_i/t}(Dedup_θ(π_{a_i} r)) | Produces a domain of values from an attribute of the input expression. | 7a
Canonicalize | Canonicalize_{θ,a_i} r ↦ Nest^t_{θ,s}(ρ_{a_i/t, a_i/s}(π_{a_i,a_i} r)) | Produces a canonicalized term set (i.e., a "controlled vocabulary") from an attribute of the input expression. | 7b
Align | d Align_{θ,a_i} r ↦ ρ_{t/a_i}(π_{−a_i,−s}(r ⋈_θ d)) | Align the values of an attribute in one relation with the attribute(s) of another relation. | 7c
Tagify | d Tagify_{θ,f,a_i} r ↦ d Align_{θ,a_i}(Atomize_{f,a_i} r) | Transform unconstrained, nested values of an attribute into a reified, aligned sub-concept. | 7d

Reify produces a new relation from two subsets of attributes of the input expression. It extends from the notion of "decompose" in other DELs [cf. 40, 72] with a formula suitable for decomposing functionally dependent subsets of attributes (i.e., FD(A → B)). The operator projects all designated attributes a_i ∈ A and b_i ∈ B from the input expression r and then forces set semantics based on the distinct values of a_i ∈ A as the output relation's unique key. The input sets should be non-overlapping (A ∩ B = ∅). If desired, removing the offending (functionally dependent) attributes of r is simply a function of r' = π_{R − {b_1,...,b_m}}(r). The decomposition satisfies the lossless join property as r = r' ⋈ (Reify_{A,B} r).
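In the same spirit as the earlier CopyCol sketch, the Reify formula can be illustrated over toy in-memory relations: project the attribute sets A and B, then keep one tuple per distinct value of A. This is a sketch of the operator's semantics, not CHiSEL's implementation; the function name and example data are assumptions.

# Illustrative sketch of Reify over a toy relation (list of dicts): project A and B,
# then enforce set semantics with A as the new relation's key. Not CHiSEL code.

def reify(rel, A, B):
    seen, out = set(), []
    for t in rel:
        key = tuple(t.get(a) for a in A)
        if key not in seen:
            seen.add(key)
            out.append({a: t.get(a) for a in list(A) + list(B)})
    return out

experiments = [
    {'ID': '0001', 'Sample': 'S29', 'Stage': 'E15.5'},
    {'ID': '0003', 'Sample': 'S29', 'Stage': 'E15.5'},
]
# Reify a 'specimen' concept keyed on Sample, carrying Stage along.
print(reify(experiments, A=['Sample'], B=['Stage']))  # one tuple for S29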
Reify sub reies a subset of attributes as a subconcept from the input relation by preserving a reference to the source relation. It projects an introspected or inferred key, key(r), and the specied attributesa 1 ;:::;a n from the input expressionr. The resulting relationr 0 may be joined with r on key(r) as a foreign key reference. Atomize reies and normalizes values from an attribute of the input expression such that the output relation may satisfy 1st Normal Form (1NF). It rst reies the specied 99 attribute a i as a subconcept of the input expression and then unnests the values of a i using the user-dened function f. Domainify produces a custom domain (i.e., a \domain table") from an attribute of the input expression. It projects a column a i from the input relation r, deduplicates on a i according to the similarity comparison (again denoted), and then renames a i to term (denoted t). Canonicalize produces a canonicalized term set (i.e., a \controlled vocabulary") from an attribute of the input expression. The operator projects a column a i twice from the input relation r, renames them as t (for term) and s (for synonyms) respectively, then nests on t using the similarity comparison () and aggregates attribute s into a nested value (i.e., array). Align performs a semantic alignment of the values of an attribute in one relation with the attribute(s) of another relation acting as the authoritative term set. The operator rst does a similarity join between input relations r and d (where relation d plays the role of a domain or vocabulary) on similarity condition (), projects the joined relation removing the target columna i 2r and the synonymss2d, and renames the (canonical) term attributet as a i . Tagify transforms unconstrained, nested values of an attribute into a relation that repre- sents a reied and aligned subconcept. The operator begins by atomizing attribute a i with user-dened function f for unnesting, then aligns the unnested values of a i with relation d (playing the role of a domain or vocabulary) using a similarity comparison expression (). 4.4 Programming Model CHiSEL is realized in the form of an embedded domain specic language (DSL) [151] using Python as the base language. Users express schema evolution expressions with programming constructs similar to the \data frame" interface in R or the pandas 3 library in Python. By 3 See https://pandas.pydata.org/ 100 providing CHiSEL in the form of an embedded DSL rather than a standalone language, users can avoid the steep learning curve of an entirely new programming dialect. In addition, it has been noted that data scientists favor DSLs based on a familiar language like Python or R over SQL-like dialects [8]. Although CHiSEL is embedded in a general-purpose, multi-paradigm, programming language, it still retains a declarative programming model { expressions are translated into the algebra, planned, rewritten, and may be executed in a completely dierent sequence, perhaps with dierent but equivalent operations. Figure 4.1: Flow of operations in the framework. Figure 4.1 illustrates the overall ow of operations in the framework. User programs are written in a simple domain specic language (DSL) that the framework validates and translates into an intermediate expression of SMOs. Next, the framework rewrites composite SMOs until there are only primitive SMOs in the expression. Then the framework may rewrite the expression into a more ecient but semantically equivalent one by applying well known query rewrite rules. 
Finally, the framework rewrites the expression of SMOs into an execution plan of physical operators that actually perform the database transformation. Planning and execution are discussed in greater detail in Section 4.5. 4.4.1 Key Abstractions The key abstractions of the programming model are relations, expressions, catalogs, and user-dened functions. 101 Relations All operations in the programming model are performed on and result in relations. An extant relation is one that has been materialized in a data source, while a computed relation is one that has been produced by an operation. A computed relation may become an extant relation only when it is materialized. Computed relations may serve as temporary variables during the course of a schema evolution program in order to reuse intermediate results in other expressions. Per the usual denition, relations consist of a name, set of named attributes, mappings from attributes to their domains, and a set of constraints that must be satised by the tuples of the relation. Depending on the underlying data source, they may also include model mappings, annotations, and policy statements [41]. In addition, computed relations document their provenance by way of recording the schema evolution expressions used to produce them. Expressions Programs compose schema evolution expressions through functions on relations and over- loaded operator syntax (e.g., =, <, >, ==, j, &). The expressions produce computed relations, which encapsulate the underlying logical plan and they also eagerly precompute the relation schema (i.e., column denitions, etc.) in order to validate the expression and give immediate feedback to interactive users. At each step, the user can get feedback that the expression they are forming is valid per the data source's schema. Operations can be chained to create increasingly complex operations; i.e., computed relations support the same schema evolution operations as extant relations referenced from the catalog. Assignment is the only operation that persistently modies the database schema, by materializing a computed relation to the underlying data source. 102 Catalogs Data sources are represented via the catalog abstraction. Through the catalog, programs can query extant relations and materialize computed relations. The storage model of the underlying data source is opaque to the user program, as all data are represented through the common abstraction of the relation, described above. The catalog is organized into namespaces (a.k.a., schemas). Within each namespace, extant relations (a.k.a., tables) are bound to names in the catalog and described by their respective metadata (i.e., attributes, constraints, and other metadata depending on the capabilities of the underlying data source). In order for schema evolution expressions to be evaluated, they must either be assigned to the catalog or used by another expression that is assigned to the catalog, and then committed. User-Dened Functions Custom logic and third-party libraries may be seamlessly integrated via user-dened func- tions (UDFs) that are accepted as parameters by some of its operators. For convenience, default UDFs are provided: the similarity measure UDF used in similarity join and dedupli- cate is based on the edit distance algorithm of the Natural Language Toolkit [19], and the entity extraction UDF used by the unnest operator is based on a simple, standard, string splitting function. 
Like set-returning functions (SRFs) in other systems [110], UDFs may return zero-to-many tuples for each input tuple. (Unlike SRFs that batch up results in memory before returning, UDFs can and should instead pipeline results according to the iterator pattern [62] for efficiency.) Conversely, as an embedded DSL, schema evolution expressions may be fully integrated into a larger program in the base language (i.e., Python or R).

4.4.2 Program Flow

A typical program will begin by connecting to a data source, referencing one or more existing tables (extant relations), using high level operations to specify transformations on the extant relations, which creates evolution expressions encapsulated as computed relations, assignment of computed relations to new names in namespaces of the catalog, and finally committing the evolution operations on the catalog.

Example program

1  # Connect to the data service
2  catalog = chisel.connect('https://example.org/...')
3  # Get handle to an extant table of biosamples (optional)
4  biosample = catalog.schemas['public'].tables['biosample']
5  # The evolve block acts like a BEGIN...COMMIT block
6  with catalog.evolve():
7      # Compute the unique domain for a column of anatomy terms
8      rel = biosample.columns['anatomy'].to_domain()
9      # Assign the computed relation to the catalog
10     catalog.schemas['vocab'].tables['anatomy'] = rel

The statements shown in Listing 4.4.2 illustrate the schema evolution expressions necessary to create a new domain of terms from (previously) unconstrained attribute values. In the example, first a connection is established to a database (catalog) as identified by its URL (line 2). Next, a table 'biosample' from namespace 'public' is referenced (line 4). The user starts an evolve block (line 6) that acts like a BEGIN...COMMIT block, where pending relations are only computed and materialized at the exit of the block (after line 10). (The execution of the statements in the evolve block may not be performed in a single transactional unit, as it depends on the capabilities of the underlying database management system.) A computed relation is defined by referencing column 'anatomy' and calling its to_domain() method, which creates a logical plan for the schema evolution expression to domainify the column, encapsulated in a computed relation, and returns it (line 8). At this point, the computed relation is only specified and bound to a local variable rel but has not been evaluated other than to pre-compute and validate its relation schema. The computed relation rel is then assigned to a persistent name in the catalog, 'anatomy' in the 'vocab' namespace (line 10). Finally, the pending expression is evaluated at the exit of the evolve block (lines 6-10), and only then will it be planned, optimized, executed, and ultimately materialized in the data catalog. In this manner, computed relations are evaluated lazily, allowing a batch of operations to be optimized and materialized together.

4.5 Planning and Optimization

The overall design of our planning and execution method is inspired by the long-standing techniques for query evaluation [62], including the separation of "logical" and "physical" algebras. Initially a schema evolution expression is represented in a purely symbolic logical algebra that is amenable to rewrite rules and algorithms. The logical representation is eventually rewritten into an executable physical algebra that embodies the algorithms for executing the plan.
As is the case with query planning, there may not be a one-to-one mapping between logical and physical operators. Figure 4.2: Overview of the planning and execution stages. Planning takes several stages (see Figure 4.2) including: translating scripts from the programming language syntax into symbolic expression trees, rule-based decomposition of composite SMOs, rule-based logical optimization, subexpression consolidation, and rule- based physical planning. As with query modication, the objective of rewriting a schema evolution expression is to improve (not necessarily minimize) the time- or space-eciency of the original expression [92]. 105 4.5.1 Expression Trees and Rewrite Rules Before we describe the planning and optimization procedures, we describe here the internal constructs of the planner. Logical operators Purely symbolic artifacts that consist of the operator name, its children, and operator-specic parameters. We use the namedtuple data structure { essentially a very lean tuple subclass with symbolic names for its otherwise purely positional values. The advantage of using namedtuples is that they can be compared and hashed eciently which we rely on in the rule matching and consolidation algorithms. As an example, Project (Section 4.3.3) is represented symbolically as Project(child, attributes), where child can be any other symbolic expression, and attributes can be a tuple of attribute names found in child's schema. Expression Trees A schema evolution expression is represented as a tree of symbolic operators (a.k.a., a logical plan). This is the primary data structure used in the planner. The initial stage of the planner accepts a set of these trees (i.e., a forest). In the simplest case, just one expression may be submitted at a time by the program, but operating on a larger set gives the planner more likelihood to nd work sharing opportunities between the expressions. For example, a symbolic expression tree derived from the Align operator over extant relations r and d would be dened as: Rename( Project( SimilarityJoin( Extant(r), Extant(d), similarity_udf), (projections...)), (renames...)) 106 where the details of the similarity UDF, projections, and renames are omitted for readability. Rewrite Rules We took an approach comparable to that used by the Catalyst optimizer [8] in Spark SQL to declare our optimization rules using functional pattern matching, such that rules are expressed in the form (`pattern', handler) where the pattern is used to match symbolic expressions and handler is specied as a function or a lambda expression that is red when the pattern is satised. For example, a simple \null propagation" rule could be specied as: (`Project(None, _)', lambda: None) The antecedent is a pattern that matches any Project operator that has a null (`None') child expression. The attributes argument can be ignored (` ') as it is irrelevant. The consequent is specied by a lambda expression (i.e., anonymous function) that propagates the null (`None'). We divide the rules into dierent categories for each of the steps used in the planning process. Decomposition Rules. These rules dene the decompositions of the composite SMOs according to their algebraic formulations. 
For example, theAtomize operator is decomposed per the rule: (`Atomize(child, unnest_fn, attribute)', lambda child, unnest_fn, attribute: Unnest( ReifySub(child, (attribute,)), unnest_fn, attribute)) Expression trees containing the Atomize operator would be rewritten according to its de- nition, which includes another composite operator, ReifySub. The resulting expression tree would be rewritten again according to the decomposition rule forReifySub and so on until the expression tree consists only of primitive SMOs. The composite SMO denitions (Sec- tion 4.3.4) are codied directly as decomposition rules. These rules are relatively easy to 107 write in order to introduce new composite SMOs into the framework. Logical Optimization Rules. The framework leverages rules from established query op- timization techniques { null propagation, operator fusion, operator pushdown, etc. For ex- ample, a rule to rewrite a costly Deduplicate operator into a less computationally expensive Distinct operator is specied as: (`Deduplicate(child, attributes, None)', lambda child, attrs: Distinct(child, attributes)) Here the pattern will match on a Deduplicate expression with any child subexpression and attributes (on which to perform deduplication), but the rule explicitly matches on expressions with a value of `None' in the place of the similarity function. The reason is that when no similarity function is given, the comparison will require an exact match in order to be satised, and in this case, a less expensive algorithm can be used to ensure uniqueness of all input tuples. The consequent is given as a function (lambda expression) that accepts the child and attributes of the antecedent and returns a Distinct operator. At present, CHiSEL implements only a minimal set of optimization rules to validate the approach, while we anticipate greater implementation of optimization rules as driven by user requirements in the future. Physical Planning Rules. These rules transform the symbolic logical operators into the physical operator implementations (discussed next). For example, a rule to rewrite a symbolic logical operator for Distinct into its physical operator equivalent is specied as: (`Distinct(child:PhysicalOperator, attributes)', lambda child, attributes: HashDistinct(child, attributes)) Note that the rule matches on a child of type PhysicalOperator, the base class for all physical operator implementations, and then res the lambda expression that returns the operator that implements the hash-based algorithm for ltering distinct tuples. 108 Rewrite Algorithm. The rewrite algorithm recursively descends the expression trees and evaluates each subexpression and rewrites the expression when a rule res. Rules are executed repeatedly until a xed point is reached. In each iteration, the algorithm evaluates whether the plan has changed from the past iteration. When a xed point is reached, the planner exits. Physical Operators Unlike the logical operators that are merely symbolic, the physical operators are executable. They are implemented according to the iterator pattern commonly used in query process- ing [62], so that intermediate results need not be materialized between operations but instead may be pipelined from operator to operator. Whenever possible, the physical operation is pushed down to the underlying database service rather than implemented natively in the framework. This way, most operations should be performed more eciently at the source. 
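To make the iterator pattern concrete, the sketch below shows how a natively implemented operator, in the spirit of the HashDistinct operator named in the physical planning rule above, might pipeline tuples pulled from its child operator. The class structure and names are an illustrative sketch, not the framework's actual code.

# Illustrative sketch of iterator-style physical operators (not the framework's code).
# Each operator pulls tuples from its child and yields results one at a time, so that
# intermediate results are pipelined rather than materialized.

class PhysicalOperator:
    """Base class: a physical operator is an iterator over tuples (dicts)."""
    def __iter__(self):
        raise NotImplementedError

class TableScan(PhysicalOperator):
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class HashDistinct(PhysicalOperator):
    def __init__(self, child, attributes):
        self.child, self.attributes = child, attributes
    def __iter__(self):
        seen = set()
        for t in self.child:               # pull from the child operator
            key = tuple(t.get(a) for a in self.attributes)
            if key not in seen:            # hash-based duplicate elimination
                seen.add(key)
                yield t                    # pipeline the tuple downstream

plan = HashDistinct(TableScan([{'Stage': 'E15.5'}, {'Stage': 'E15.5'}, {'Stage': 'adult'}]),
                    ['Stage'])
print(list(plan))  # two distinct tuples

Because each operator is itself iterable, operators compose into pipelines without materializing intermediate results between stages.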
There are however situations where the operations cannot be pushed down, such as when non-standard relational operations are required (i.e., nest, unnest, similarity join, etc.) and in these circumstances native implementations are provided by the framework. To support efficient reuse of partial results, the planner may also inject operators to buffer partial results that can be reused by other parts of the plan.

4.5.2 Logical Planning and Optimization

Logical planning and optimization begins with translation of the user program into symbolic expression trees, followed by decomposing composite operators into primitives, rewriting using optimization rules, and finally consolidating shared subexpressions. First, the schema evolution statements in the user program are translated into an algebraic expression tree (i.e., "desugared") consisting of logical operator representations of the primitive and composite SMOs. Upfront validation is performed to compare the expression tree to the known schema of the catalog and reject expressions that violate it. Next, the planner rewrites composite SMOs into primitives, and then the logical optimization rules are applied.

4.5.3 Subexpression Consolidation

Work sharing is an established approach to improving query plan efficiency. Schema evolution can benefit from this optimization approach and due to our formulation of complex operators as functional compositions, we can take particular advantage of this technique. As noted, one or more expressions may be evaluated at a time (i.e., within evolve blocks). After the logical expressions are rewritten, a consolidation algorithm searches for and eliminates any repeated subexpressions within the expression trees.

Algorithm 1: Consolidate algorithm
Data: A set rels of computed relations
Result: Consolidation of the computed relations
1   map ← initially empty map of plans ↦ tempvars;
2   while rels ≠ ∅ do
3       counts ← CountPlans(rels);
4       vars ← ∅;
5       foreach rel ∈ rels do
6           rewrittenPlan, newVars ← ConsolidatePlan(rel.plan, counts, map);
7           rel.plan ← rewrittenPlan;
8           vars ← vars ∪ newVars;
9       end
10      rels ← vars;  /* new unconsolidated relations */
11  end

Consolidate. Algorithm 1 accepts a set of computed relations, rels, each of which encapsulates its respective expression tree (a.k.a., logical plan), rel.plan. First, it initializes a map (line 1) that its subroutines use to record shared subexpressions as newly generated temporary variables. These temporary variables will not be materialized as they are only used for computing and reusing the results of shared subexpressions. The algorithm then calls subroutine CountPlans to compute the occurrence count of each subexpression (line 3) and then iterates over each relation (lines 5-9) calling the ConsolidatePlan subroutine to rewrite the current relation's logical plan (line 6). The subroutine ConsolidatePlan returns temporary variables when repeated subexpressions are consolidated. The algorithm will repeat (lines 2-11) over the accumulated set of these temporary variables until no more are produced.

Algorithm 2: CountPlans algorithm
Data: a set rels of computed relations
Result: a map counts from plan ↦ count, initialized to 0 for each plan
1   foreach rel ∈ rels do
2       plans ← {rel.plan};
3       while plans ≠ ∅ do
4           plan ← pop(plans);
5           counts[plan] ← counts[plan] + 1;
6           if counts[plan] == 1 then  /* plan has not been encountered before */
7               plans ← plans ∪ children(plan);
8           end
9       end
10  end
11  return counts;

CountPlans.
Algorithm 2 accepts a set of computed relations, rels, and returns the count of occurrences for each subexpression in the set of computed relations, counts. The algorithm iterates over relations (lines 1{10) and traverses each computed relation's expression tree (lines 3{9) in breadth-rst order to count every occurrence of each subexpression (line 5). When it encounters a unique subexpression (line 6), it appends the child subexpressions onto the queue (line 7). After visiting and counting all unique subexpressions, it returns the counts (line 11). ConsolidatePlan. Algorithm 3 accepts an expression tree, plan; the counts of occurrences for all subexpressions, counts; and a map of the accumulated set of temporary variables, map. The results of the algorithm are the logical plan, possibly rewritten, and any new temporary variables that were produced by this round of consolidation. First, the algorithm checks the occurrence count of the expression (line 1), and if it is a repeated expression, then it 111 Algorithm 3: ConsolidatePlan algorithm Data: a logical plan, a map counts from plan7!count, a map of plan7!tempvar Result: a rewritten logical plan and a set of new tempvars 1 if counts[plan]> 1 then 2 if plan2map then 3 tempvars ;; 4 else 5 map[plan] ComputedRelation(plan); 6 tempvars fmap[plan]g; 7 end 8 plan TempVar(map[plan]); 9 else 10 tempvars ;; 11 children plan:children; 12 plan:children ;; 13 foreach child2children do 14 rewrittenPlan;newVars ConsolidatePlan(child;counts;map); 15 plan:children plan:children[frewrittenPlang; 16 tempvars tempvars[newVars; 17 end 18 end 19 return (plan;tempvars); rewrites the expression as a temporary variable reference to the repeated expression (line 8). In addition, it will check for the expression in the map of all temporary variables (line 2). If it has not yet recorded this expression, it will create a new (temporary) computed relation and add it to the map (line 5) and to a singleton set of temporary variables (line 6). If however the expression is unique (line 9), then the algorithm iterates over the child subexpressions (lines 13{17). It recurses to rewrite the child subexpressions (lines 14{15) and accumulates any new temporary variables (line 16). Finally, the algorithm will return the current expression, possibly rewritten, and any newly generated temporary variables for further consolidation (line 19). Time Complexity. Given a set of relations each containing an expression tree (i.e., a plan), the overall set of plans forms a forest of expressions F . Algorithm 2 iterates over F . Because of the symbolic representation of expression trees (Section 4.5.1), operations 112 to lookup and set the counts of each expression (Algorithm 2, lines 5 & 6) require only constant time, O(1). When there are no shared subexpressions in F , all subexpressions in F will be counted (once) by the algorithm. When the algorithm encounters a shared subexpression, however, it stops traversing its children. The worst-case time complexity of Algorithm 2 is therefore O(jFj). Algorithm 3 iterates over a single expression tree T . At most, it will visit each subexpression in T once and therefore its time complexity is O(jTj). Finally, Algorithm 1 evaluates and consolidates each computed relationR. First, it calls the CountPlans routine (Algorithm 2) for a cost ofO(jFj). Then, for each relation inR, it calls the ConsolidatePlan routine (Algorithm 3). 
While each iteration carries a complexity of O(jPj) for P2R, the overall cost of the complete iteration is O(jFj) given that F contains all P2R by denition. Therefore the time complexity of Algorithm 1 is O(jFj). 4.5.4 Physical Planning and Execution Finally, the consolidated logical plan is evaluated by the physical planning stage which takes as input the logical plans (i.e., symbolic expression trees) and produces a physical operator expression graph. Where possible, operations are pushed down to the underlying data source with source-specic operators. For example, an expression Project(Select(Extant(...), ...), ...) would be rewritten with a fusion of the operators as ERMSelectProject(...), thereby pushing down the projection and selection to the underlying database service which will execute the operations most eciently. Currently, the framework includes implementations of several key physical operator al- gorithms: select with single-column equality comparison, project of specied attributes or user-dened functions, nested-loops similarity aggregation (used for both nesting and de- duplication), hash distinct, nested-loops cross-join, nested-loops similarity join, table scan (for JSON and CSV \tabular" le format scanning), push-down database select, fused data- base select and project, and shred (for SPARQL query expression evaluation over RDF data les). 113 Physical planning concludes when all logical operator symbols have been rewritten as physical operators. The output is an expression graph with shared temporary variables. Unlike Cascades [63], here the logical and physical operators use distinct internal repre- sentations (i.e., logical operators as namedtuples and physical operators as iterators). The logical plan is merely symbolic and aids in both the functional pattern matching used in rule- based rewritting and the ecient hashing and comparisons for consolidation. The physical plan, however, is made up of executable operators implemented using the iterator pattern for pipelining during expression evaluation. 4.6 System Implementation The framework is implemented as a Python library and its interfaces are available both in Python and R via the reticulate library 6 . The architecture is shown in Figure 4.3. The main components are the interface, catalog, planner, rules, physical operators, and data source adapters. The interface provides a generous helping of syntactic sugar through operator overloading as described in Section 4.4. Since the interface layer is embedded in Python, no additional parser or interpreter is required. The execution of statements in the interface directly generates the logical expression trees (Section 4.5.1) that are fed to the planner. As described in Section 4.5, rules are implemented using functional pattern matching. They are grouped into functional categories for logical optimization, composite operator decom- position, and translation into physical operators. Since functional pattern matching is not natively supported in Python, we use the open source PYthon Functional Pattern Matching (pyfpm) library 7 . Whenever possible operations are pushed down to the data sources through source-specic physical operators, such as the fused select+project operator discussed in Sec- tion 4.5.4. This way, the system shifts its workload to the underlying data source where it will be executed most eciently. 
Otherwise, physical operators are implemented natively 6 https://rstudio.github.io/reticulate/ 7 See https://pypi.org/project/pyfpm/ 114 using high-performance collection implementations for best performance (i.e., data struc- tures and algorithms implemented in C rather than Python). The operators pipeline their results from child to parent in memory to avoid costly materialization of partial results. The data source adapters provide source-specic capabilities for introspecting an underlying data source's schema and materializing relations. The adapter normalizes the interactions with the data source so that a consistent interface to the data source can be provided via the catalog which represents the schema and related metadata about the data source's contents. Planner Catalog Physical Operators Interface Database Data Source Adapters Files Rules Introspection & Materialization Data source metadata Generic & source-specific operators Decomposition, Optimization & Transformation Figure 4.3: High-level architecture. 4.6.1 Data Sources The catalog standardizes interactions with the underlying data sources. CHiSEL supports both a relational and a semi-structure data source back-end. Relational Data Source CHiSEL supports the Deriva scientic data management system [125], which includes ERMrest [41], a RESTful interface to a relational data model implemented in the Post- greSQL database service. ERMrest supports simplied joins in the form of \linking" from one entity set to the next. Since ERMrest is not designed to support a relationally com- 115 plete query language, CHiSEL can push down some (select, project, scan, distinct) but not all operations. ERMrest also provides a schema interface that describes the relations, constraints, and extended metadata from its catalog. Semistructured Data Source The implementation also supports semistructured data sources, which must be organized as a directory of CSV or JSON formatted les. CHiSEL samples the tabular data and deter- mines the table denitions, column data types, and infers possible integrity constraints (i.e., uniqueness and candidate keys). For graph data, CHiSEL provides a shred function, and may either do a shallow introspection by parsing the SPARQL query expression parameter, or if desired, a deep introspection by executing the query and inspecting the variables re- turned in the results set. CHiSEL retains the inspected schema as part of its internal catalog, for use in later stages of expression evaluation. Extending the Data Sources CHiSEL may be extended with support for new types of data sources. Minimally, one must: extend a few abstract base classes with logical and physical scan operators, add a rule to rewrite the logical scan into a physical scan, add a data source introspection function, and optionally add a materialize function if writing to the data source is desired. Extending CHiSEL with support for semistructured data sources only required a little over 250 lines of code, including comments and whitespace. 4.6.2 Data Source Federation Multiple data sources may be involved in complex transformations of data. CHiSEL oper- ations over multiple catalogs can be combined even if the data sources for the operations dier. For example, one may rst attempt to canonicalize the domain of an attribute to cre- ate a spreadsheet that can be shared with domain experts. The domain experts can review 116 the canonical term list and make additional corrections that similarity algorithms might fail to correct adequately. 
Then, using the corrected canonical term list, they may go back to the source relation and align its terms with the curated term list, at which point both the aligned relation and the canonical term list may be assigned and persisted in the catalog. Federated data sources example 1 remote = chisel.connect(`https://...') 2 local = chisel.connect(`file:///...') 3 # get reference to remote table vocabulary terms 4 anatomy = remote[`vocab'][`anatomy'] 5 # get reference to local tabular data 6 biosamples = local[`.'][`biosamples'] 7 # align local table's column to vocabulary 8 fixed_biosamples = biosamples[`anatomical_source'].align(anatomy) 9 # assign and commit the aligned relation 10 with local.evolve(): 11 local[`.'][`fixed_biosamples'] = fixed_biosamples In the example above, a local semistructured data sourcebiosamples is aligned with a set of anatomy terms from a remote data source. This can also be a useful feature for evaluating the output of the operations o-line before executing them against the target database. One can compute several expressions, commit them to a local semistructured data catalog, review the results, then if satised with the transformations, rerun the operation against the target database. 4.6.3 Integrating Third-Party Libraries Our approach also allows for easy integration with third-party libraries through user-dened functions as parameters to certain operations. For example, the default implementations of the similarity functions used by the similarity join, deduplicate, and nest operations use the edit distance algorithm of the Natural Language Toolkit. This opens up the possibility for 117 integrating machine learning and other libraries into the schema evolution process (clustering for similarity, etc). Python's support for closures makes it possible to pass in parameters that are not currently envisioned by our implementation for a user-dened function such as a model generated by a machine learning algorithm to be used downstream (e.g., by a classier). 4.6.4 Interactive Environment The implementation further benets from being embedded in an interpreted language with full introspection of its objects available at runtime. As soon as the user has connected to a catalog, listing the schemas (i.e., namespaces) in the catalog is as simple as calling the catalog.describe() method; similarly, the table and column denitions can be described at runtime. In addition, schema descriptions can be visualized as a graph by calling graph() instead of describe() where applicable. To enrich the interface, we have implemented hooks used by the popular IPython [108] console and the related Web-based Jupyter Notebook [87]. Since computed relations pre-compute their schema, the user can describe and graph the expected schema of their work-in-progress, even before materializing the changes to the cat- alog. The prototype also implements the hooks for auto-completion (a.k.a., tab-completion), a common productivity-aid provided by editors and IDEs. For instance, foo.tables[<tab> will present the user a drop-down listing of the available tables in the foo schema. By lever- aging the well established executable notebook environment, the overall interface for using CHiSEL may be thought of as a de facto \workbench" for schema evolution. 
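For illustration, a typical interactive session might resemble the following minimal sketch. The catalog URL and the schema and table names are placeholders, and the exact set of methods available depends on the connected data source.

    import chisel

    # Connect to a catalog (placeholder URL) and explore it interactively.
    catalog = chisel.connect('https://example.org/ermrest/catalog/1')

    catalog.describe()           # list the schemas (namespaces) in the catalog
    catalog['isa'].describe()    # describe the tables of a schema
    catalog['isa'].graph()       # or visualize the schema as a graph, where applicable

    # Computed relations pre-compute their schema, so a work-in-progress
    # transformation can be described before it is ever materialized.
    enhancer = catalog['isa']['enhancer']
    wip = enhancer['list_of_closest_genes'].to_atoms()   # illustrative transformation
    wip.describe()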
4.7 Evaluation Methodology In order to evaluate CHiSEL in terms of its eectiveness to address complex transformations observed in scientic applications of databases, we dene here a methodology for quantita- tively comparing the eort required to evolve a database using the schema modication 118 Figure 4.4: Screenshot of CHiSEL usage in a Jupyter Notebook. operators of CHiSEL versus conventional SQL dialects. Our goal is to determine the eort as quantied by the number of explicit or implicit operations required to perform a given task. For example, the SQL \SELECT ... WHERE ..." statement translates explicitly to the relational Select operator. Whereas SQL does not have an explicit syntax for the re- lational Project operation and instead it is implied by the projection list within the SQL syntax for the SELECT statement. For the purpose of this comparison, our methodology is to 119 count both the explicitly identied operators (i.e., Select) as well as the implied operators (i.e., Project) as operations that impact the user cost of forming expressions. Our ratio- nale for using this in our metrics is that whether these are explicit or implicit they aect the cognitive load on the user and therefore represent real eort on his or her part. We do not, however, consider the SQL syntax for the often used \CREATE TABLE ... AS (query)" pattern in our metrics. This is eectively a mere assignment and the syntax is nearly boilerplate rather than representing a signicant cognitive load. Also, the cost of the table denition operation does not signicantly in uence the comparison since it is universal in all transformation statements (i.e., one either explicitly assigns as in CHiSEL, or explicitly CREATEs as in SQL). We also consider usages of functions { user-dened (UDF), aggregate, or other built-in functions { as additional operator overhead though they are not dened within the relational algebra itself. But to closely approximate an SMO, in a SQL approach one would need features provided only through additional functions (if at all possible). We therefore count required functions as additional costs in terms of eort required of the user. This is a conservative approach as the usage of UDFs requires additional coding eort beyond merely using a built-in operation. Next, we dene a formal rubric for how to score the eort or cost of a schema evolution expression. Then using the rubric, we establish the baseline costs of both CHiSEL and SQL expressions for each schema modication operator (SMO) identied in Section 4.3. 4.7.1 Rubric for Measuring the Cost of Schema Evolution Expres- sions We dene the following rubric for calculating the cost of schema evolution expressions in terms of the overall cognitive load on the user. Table 4.4 summarizes the rubric used for measuring the cost of expressions and gives examples to illustrate its application in both CHiSEL and SQL approximations of the SMO operators. 120 Table 4.4: Summary of scoring rubric for measuring cost of schema evolution expressions. Criteria CHiSEL example SQL example Cost R1 Denition of new tables in the database Assignment of a schema evolution expression to a table name in the data- base Use of \CREATE TABLE ... AS ..." expres- sions to dene a new ta- ble 0 R2 Operations supported explicitly in the language syntax Operations dened explicitly in CHiSEL such as to domain(), to vocabulary(), etc. Operations expressed in SQL such as SELECT, GROUP BY, DISTINCT, etc. 
1 R3 Operations implied by parameters or other constructs of the syntax Operations such as pro- jections that are invoked implicitly as lists of col- umn parameters in an API call Operations such as pro- jections that are invoked implicitly as lists of col- umn names in a SQL statement 1 R4 Usage of aggregate and other built-in functions Passing a standard string splitting function as a parameter for unnesting denormalized values Use of ARRAY(...) in a projection to aggregate values from tuples into an array of values for an attribute of a single tuple 1 R5 Development and usage of user-dened functions Development and pass- ing of custom functions written in the host lan- guage (i.e., Python), for example to do a custom similarity measure Development and execu- tion of custom functions (i.e., \CREATE FUNCTION ...") for data or schema manipulation not sup- ported directly in SQL 1 R1 Creating or dening a new table does not incur a usage cost. We assert that the syntax for dening a new table is a rote task and can be considered boilerplate. Therefore we assign a cost of 0 for the use of syntax for dening or assigning tables. Creating a table in CHiSEL is performed by assigning a SMO expression to a named element of a schema, such as \catalog[`schema name'][`table name'] = ...". The SQL equivalent of creating a table is the DDL syntax of \CREATE table name AS (...)." In either case, these constructs are repetitive and do not vary other than the name of the table and/or schema to be dened. R2 Usage of operations that are directly and explicitly supported by the language constructs 121 carry a unit cost. In the CHiSEL dialect as an embedded domain specic language, expressions are formed typically by method calls on objects that represent various elements of the schema. For example, a Column object has a method to domain() for generating the Domainify expression. In SQL dialects, clearly the syntax supports \SELECT ... FROM ... WHERE ..." for generating Select expressions or syntax such as \... GROUP BY ..." for generating Aggregate expressions, and so on. R3 Usage of operations that are indirectly or implicitly formed through parameters or argu- ments to explicit operations carry their own unit cost. Some operations are not directly re ected in the syntax of SQL or the method calls or other constructs of CHiSEL. For example, theProjection operation in both SQL and CHiSEL is implied by a set of ar- guments to another call. Clearly, in SQL the projection is specied within aSelect op- eration as a sequence of attributes, such as \SELECT attribute1, attribute2, ... attributeN FROM ...". Similar implied projections can be identied in CHiSEL such as the list of arguments to the Reify operation such as, \table.reify(attribute1, attribute2, ..., attributeN)". R4 Usage of aggregate or other built-in functions add to the complexity of operations and carry an additional unit cost. These function increase the complexity of the expression that they are used within. Because CHiSEL expressions generally hide this complexity there are fewer examples for when users would need to resort to additional functions to extend its operators. One such example is with the Unnest operator (and the composite operators that are formed from it) that accepts an unnesting function which couple take standard library routines such as the Python string splitting function that is already the default paramter. 
In SQL, some dialects support aggregate functions such as ARRAY and UNNEST that can pack or unpack array type data, respectively 8 . R5 Development and usage of user-dened functions again add to the complexity of oper- 8 The exact function names vary by database management system implementation. 122 ations and carry an additional unit cost. User-dened functions (UDFs) clearly con- tribute to the complexity of forming schema evolution expressions whether in CHiSEL or SQL dialects. Within CHiSEL, users could supply their own UDFs as parameters to Unnest (in place of standard string splitting functions as discussed above) or to Deduplicate and other operations that accept a similarity measure function. While CHiSEL provides its own similarity measure function based on the well known edit dis- tance algorithm, there is a wide range of possible implementations that users may wish to supply in its place. In SQL, it will be necessary for the user to develop UDFs even to approximate CHiSEL operations such as identifying the primary key from querying the database metadata for a relation in order to approximate the CHiSEL form of the Reify sub operator. While the development of UDFs clearly comes at a cost, for these measures we assume the cost is amortized across many uses and contributes to a minimal overhead on each invocation of the function, and therefore we do not measure the development cost in the schema evolution metrics. 4.7.2 Comparative Costs of Schema Evolution Operations We then apply the rubrics (Section 4.7.1) to form a set of basic measures for expressing an SMO in CHiSEL and for the cost of approximating an SMO in SQL. The metrics establish a baseline \cost" of the expression. Some CHiSEL operations may have a close equivalent in SQL while others can only be loosely approximated. For example, theDomainify operator in CHiSEL may perform strict or fuzzy matching in order to deduplicate the entities of a relation. In conventional SQL, this would require development of a custom function (i.e., an UDF), or usage of a built-in function provided by a specic relational database management system, that implements a fuzzy similarity matching function and can only be roughly ap- proximated. We summarize these costs for CHiSEL and the SMO approximations in SQL expressions in Table 4.5. First the SMOs that take the form of conventional relational operators have clear equiv- 123 alence between CHiSEL and SQL. For Select, the CHiSEL expression takes the form of t.select().where(...) where `t' is a table name in the database and `...' a place- holder for the conditional formula to restrict the tuples in an optional where clause. SQL clearly takes the form SELECT * FROM t WHERE .... In both languages, the cost of the operation per Rubric R2 is 1. Project also takes a similar form of t.select(a, b, c, ...) and SELECT a, b, c, ... FROM t respectively, where \a, b, c, ..." are the col- umn names of t to be projected. Per Rubric R3, these implicit operations incur a cost of 1. Rename is just slightly dierent as usual t.select(b=a) and SELECT a AS b FROM t respectively, and again per Rubric R3 incur a cost of 1. Join in CHiSEL is expressed as t.join(s).where(...) using `s' as another named table in the database, and similarly in SQL it is expressed as SELECT * FROM t, s WHERE ... or SELECT * FROM t JOIN s ON ... and per Rubric R2 these incur cost of 1 in either language. 
Union is expressed as expression1 + expression2 in CHiSEL or expression1 UNION expression2 in SQL, and again applying Rubric R2 both forms incur a cost of 1. Finally, Distinct is expressed within CHiSEL composite formulas and the user does not need to directly invoke the opera- tions, while certain expressions formed in SQL will require the use of the DISTINCT keyword at a cost of 1 according to Rubric R2. The SimilarityJoin, Deduplicate, Nest, and Unnest operations are not directly exposed in the CHiSEL language as they are only required through the use of composite operators in the language. In order to approximate these operations in SQL expressions, the rst three of these would depend on a hypothetical but plausible SIMILAR(...) func- tion that performs non-trivial similarity measures beyond the simple pattern matching LIKE comparison or common regular expression operators available in many database management systems. While this function does not exist in standard SQL, most commercial and open source database management systems provide functions for performing string or word simi- larity comparisons based on edit distance or other fuzzy matching algorithms. Per Rubric R4, the usage of such functions incurs a cost of 1. ThusSimilarityJoin could be expressed as: 124 SELECT * FROM t JOIN s ON SIMILAR(...) with a cost of R2 + R4 for a total cost of 2. Deduplicate and Nest, however, require inherently more complex expressions to approximate in SQL.Deduplicate could be approx- imated in SQL using a similarity self-join, aggregating the resulting relation, and returning the rst unique value, as in the following expression: SELECT p.a FROM ( SELECT g.a, ARRAY(g.b) AS arr FROM ( SELECT DISTINCT ON (t.a, s.a) t.a AS a, s.a AS b FROM t JOIN t AS s ON SIMILAR(t.a, s.a) ) AS g GROUP BY g.a ) AS p WHERE p.a = p.arr[1]; where the innermost expression costs 3 R2 + R3 + R4 for a subtotal of 5, plus the next most inner expression costs 2 R2 + R3 + R4 for a subtotal of 4, plus the outer expression consists of R2 + R3 for a subtotal of 2, and therefore the whole expression comes at an overall cost of 11. Likewise, Nest is essentially the same expression but with additional projections and aggregations of the nested elements of the inner expressions, such as: SELECT p.a, p.arr_c FROM ( SELECT g.a, ARRAY(g.b) AS arr, ARRAY(g.c) AS arr_c FROM ( SELECT DISTINCT ON (t.a, s.a) t.a AS a, s.a AS b, s.c AS c FROM t JOIN t AS s ON SIMILAR(t.a, s.a) ) AS g GROUP BY g.a ) AS p WHERE p.a = p.arr[1]; which adds an additional aggregate function and per R4 raises the total cost for the expression to 12. The Unnest operation would require a function to unpack an array into separate tuples (e.g., unnest function in PostgreSQL) and a built-in or user-dened function to take unstructured input and convert it into an array (e.g., string to array also from PostgreSQL 125 could convert strings to arrays but UDFs would be required to unpack messier values into discrete units) in an expression such as: SELECT UNNEST(STRING_TO_ARRAY(a)), b, ... FROM t and by applying Rubric R2, R3, R4, and R5 comes at a cost of 4. These are of course only approximations, because there may be many ways to achieve the same operations in dierent but equivalent expressions. We have attempted here to provide the relatively straightforward implementation one might use to achieve the respective SMOs. Next, we turn our attention to the SMOs that in the CHiSEL formulation are \com- posites" of other operations. 
CopyCol is expressed by joining the target table t with the source table s and then projecting all of the columns t.* along with the addition of the desired column s.a from the source table s. We can use the following expression for this: SELECT t.*, s.a FROM t JOIN s ON ... and it has a cost of 2 R2 + R3 for a total of 3. Reify is expressed by forcing set semantics on a table t based on the distinct columns a1, ..., aN to be projected as the key columns of the new relation along with a subset of desired columns b1, ..., bM from a table t. SELECT DISTINCT ON (a1, ..., aN) a1, ..., aN, b1, ..., bM FROM t and by 2 R2 + R3 has a total cost of 3. Reify Sub is expressed by projecting the key of a table t along with a subset of desired columns a1, ..., aN. To approximate the respective SMO, the expression would depend on a user-dened function KEY() to infer the primary key columns from the given table. The following example: SELECT KEY(t), a1, ..., aN FROM t consists of R2 + R3 + R5 for a total cost of 3. To approximateAlign, we begin by assuming a table t with a column aN that we wish to replace with a \canonical" term in column b of table s based on a fuzzy similarity match between the target and canonical columns. To do 126 so, we would perform a similarity join on t.aN and s.b and then project all of the columns in t without aN (denoting the column just before the Nth term as aN minus 1, for illustration but of course the formula need not be limited to such a sequence of columns) and instead project out s.b. We get the following example expression: SELECT t.a_1,... , t.aN_minus_1, s.b AS aN FROM t JOIN s ON SIMILAR(aN, s.b) and it requires 2 R2 + 2 R3 + R4 for a total cost of 5. The nal set of SMOs to be discussed can be formulated mostly from combinations of the expressions described above. Atomize is a composite of Unnest and Reify Sub and by combining their respective SQL expressions comes at a combined cost of 7. Domainify builds onDeduplicate by deduplicating a single column and renaming it, adding cost per R3, therefore raising the total cost of the expression to 12. Canonicalize begins with a sub- expression to project a target column twice, such as SELECT a AS term, a AS synonyms FROM t and then nests synonyms. Thus the cost is that of Nest plus 1 per R3 for the renaming to bring the total cost to 12. Tagify is a composite of Align andAtomize and by combining their respective SQL expressions yields a total cost of 12. Finally, we return to the cost evaluation for the usage of CHiSEL expressions beyond the basic operations evaluated above. In general, the CHiSEL syntax exposes methods cor- responding to SMOs in idioms of its host language. For example, to Atomize a column, the CHiSEL expression ist.columns[ `a' ].to atoms(). Similarly,Domainify,Canoni- calize,Align, andTagify are expressed through methodsto domain(),to vocabulary(), align( domain ) and tagify( domain ), respectively. These operations per R2 each come at a cost of 1. Reify and Reify sub are exposed via table-level methods. Reify as in t.reify(f ... key columns ...g, f ... non-key columns ... g) andReify sub as in t.reify sub( ... non-key columns ... ). The expressions include an additional expense for the projections per R3 and therefore have total cost of 2 each. The costs of the CHiSEL and corresponding SQL expressions are summarized in Table 4.5. 127 Table 4.5: Cost comparison matrix for CHiSEL and SQL expressions for each SMO. 
CHiSEL Corresponding SQL Cost SMO Cost R1 R2 R3 R4 R5 Total Assign 0 0 0 Select 1 1 1 Project 1 1 1 Join 1 1 1 Rename 1 1 1 Union 1 1 1 Distinct { 1 1 SimilarityJoin { 1 1 2 Deduplicate { 6 3 2 11 Nest { 6 3 3 12 Unnest { 1 1 1 1 4 CopyCol 1 2 1 3 Reify 2 2 1 3 Reify sub 2 1 1 1 3 Atomize 1 2 2 1 2 7 Domainify 1 6 4 2 12 Canonicalize 1 6 4 3 13 Align 1 2 2 1 5 Tagify 1 4 4 2 2 12 4.8 Case Studies We applied the methodology of Section 4.7 to evaluate CHiSEL in terms of its eectiveness to address complex transformations observed in scientic applications of databases. To do so, we demonstrate the eectiveness of the framework in the context of three distinct, exemplar use cases. Throughout the case studies, we apply the evaluation methodology to compare the eort of performing the transformations in CHiSEL versus their alternatives in conventional SQL. 4.8.1 Data Commons Consortium In this scenario, on behalf of a NIH Common Fund \Data Commons" pilot consortium (Commons), we initially designed a database based on a direct rendering of data that had 128 been extracted from the Genotype-Tissue Expression (GTEx) project and an extract from the European Bioinformatics Institute 9 . These data extracts are considered \scientic meta- data" meaning they are metadata describing the data les produced at various stages of a biomedical investigation. In this case, the data described by the metadata are transcriptomic sequencing (fastq), alignment mappings (cram and crai), and variant call (vcf) les. In the course of this Commons project, we were instructed to evolve the database to re ect an emerging \common core metadata model" based on the DAta Tag Suite (DATS) [119] being developed by a working group of the consortium. The rst steps were to establish a connection to the remote database (line 2) and to a local collection of additional le metadata (line 3) to be integrated into the updated database. A temporary variable (rel in line 8) was computed based on selecting only the `GTEx' data, since the EBI data were not in scope for the purpose of the consortium. Also, a column (SAMPID) was renamed, since it had erroneous characters in it from the initial creation of the database. Connect to data sources and compute tempvar. 1 # Connect to the data service 2 catalog = chisel.connect(`https://...') 3 gtex_data = chisel.connect(`file:///...') 4 # Convenient handles 5 public = catalog.schemas[`public'] 6 kc7 = public.tables[`KC7 Search Demo'] 7 # Compute an temporary relation from kc7 table 8 rel = kc7.select( 9 kc7[`\ufeffSAMPID'].alias(`SAMPID'), *kc7.columns)\ 10 .filter(kc7[`DATASET'] == `GTEx') 9 The database for this use case was our own based on these real data extracts, but does not re ect in any way the internal data management of the GTEx or EBI projects. The database was initially set up for a demonstration project in a separate but aliated project. For more information on the GTEx project, see https://gtexportal.org/home/. For more information on the EBI data, see https://www.ebi.ac. uk/ena/data/view/PRJEB2784. 129 Next, as seen in the above listing, we evolved the initial relation (rel) into three core tables drawn from the scientic domain. The distinct domain values for the DATASET col- umn were made into their own table. The Subjects were reied (an observable functional dependency existed on the SUBJID column) from the source table. Finally, the Samples were projected with only the SUBJID to reference the appropriate rows of the Subjects table. 
The transformations in CHiSEL (Domainify, Reify, Select, Project) come at a cost of 5 per the rubrics (Section 4.7.1) and translate into a cost of 17 (12 + 3 + 1 + 1) for the comparable SQL expressions per the comparisons of Table 4.5. Listing 4.1: Reify and project core tables. 11 # Datasets 12 public.tables[`Datasets'] = rel[`DATASET'].to_domain() 13 # Reify the Subjects from the full kc7 table 14 public.tables[`Subjects'] = rel.reify( 15 {rel[`SUBJID']}, # new table's key column 16 {rel[`DATASET'], rel[`SEX'], rel[`AGE']}) 17 # Project the Sample columns from the kc7 table 18 public.tables[`Samples'] = \ 19 rel.select(*[...all columns not in subject...]) The source table also contained a non-atomic eld used to record the facility that had collected the biosamples. In Listing 4.2, we rst extracted the atomic values which eectively dened an association table from Samples to a reied Collection Center table based on the distinct domain of the SMCENTER column. The transformations in CHiSEL (Atomize, Domainify) come at a cost of 2 and translate into a cost of 19 (7 + 12) for comparable SQL expressions. Listing 4.2: Atomize and associate supplementary table. 21 # Collection center association from Samples 22 scc = rel[`SMCENTER'].to_atoms() 23 public.tables[`Sample_Collection_Center'] = scc 130 24 # Collection center table 25 public.tables[`Collection_Center'] = \ 26 scc[`SMCENTER'].to_domain() We then dened 7 \vocabulary" term tables from the domains of columns in the newly dened Samples and Subjects tables (Listing 4.3). The transformations in CHiSEL (7 Canonicalize) come at a cost of 7 and translate into a cost of 91 (7 13) for comparable SQL expressions. Listing 4.3: Create vocabulary term tables. 27 # Vocabulary term tables 28 public.tables[`Tissue_Terms'] = \ 29 public.tables[`Samples'][`SMTS'].to_vocabulary() 30 public.tables[`Cell_Terms'] = \ 31 public.tables[`Samples'][`SMTSD'].to_vocabulary() 32 public.tables[`Isolation_Protocol'] = \ 33 public.tables[`Samples'][`SMNABTCHT'].to_vocabulary() 34 public.tables[`Expression_Terms'] = \ 35 public.tables[`Samples'][`SMGEBTCHT'].to_vocabulary() 36 public.tables[`Library_Terms'] = \ 37 public.tables[`Samples'][`LIBRARY_TYPE'].to_vocabulary() 38 public.tables[`Sex'] = \ 39 public.tables[`Subjects'][`SEX'].to_vocabulary() 40 public.tables[`Age_Stage_Terms'] = \ 41 public.tables[`Subjects'][`AGE'].to_vocabulary() In Listing 4.4, the 3 external tables of le metadata in the local catalog gtex data were integrated into the database. Some of the columns of the le metadata tables were renamed during the select (not shown for space). Finally, the evolve block exits (not shown) and the assigned relations were materialized. The transformations in CHiSEL (3 Select, 3 Project) come at a cost of 6 and translate into a cost of 9 (3 3) for comparable SQL expressions as they require an additional non-standard expression to copy data from a 131 structured le into the database (see note below). Listing 4.4: Integrate semistructured tables. 
42 # File metadata 43 rnaseq = gtex_data.schemas[`.'].tables[`rnaseq.txt'] 44 public.tables[`RNA_Seq_Files'] = \ 45 rnaseq.select(*[...list of renamed columns...]) 46 47 wgs = gtex_data.schemas[`.'].tables[`gtex-wgs.txt'] 48 public.tables[`WGS_Files'] = \ 49 wgs.select(*[...list of renamed columns...]) 50 51 wgs_vcfs = gtex_data.schemas[`.'].tables[`gtex-wgs-vcfs.txt'] 52 public.tables[`WGS_VCF_Files'] = \ 53 wgs_vcfs.select(*[...list of renamed columns...]) The initial single table database, of 15,757 rows and 83 columns, was evolved into a 15 table database using a compact script of 20 CHiSEL operations compared to 136 operations in SQL (see Table 4.6). In addition, Listings 4.2{4.4 cannot be fully satised in purely SQL operations: Listing 4.2 requires a set-returning function (i.e., one that returns 1+ tuples for each input tuple); Listing 4.3 requires a similarity comparison based grouping; and Listing 4.4 requires extended operators that can load data from an external source 10 . Table 4.6: Comparison of operation cost required per listing. Listing 1 2 3 4 Total CHiSEL 5 2 7 6 20 SQL 17 19 91 9 136 10 For example, COPY in PostgreSQL or BULK INSERT in Microsoft SQL Server. 132 4.8.2 Genomic Enhancers In a use case based on a recent scenario in the FaceBase Consortium [21] Data Hub, we used CHiSEL to transform part of the database schema used to describe genomic enhancer assays. For historical reasons, the enhancers were modeled in a relatively at table (i.e., denormalized), with free text lists of gene names and anatomical sites in uenced by the enhancers. Also, multiple categories of genomic loci were embedded in the enhancers table. A CHiSEL script was used to reify the genomic loci into independent classes of entities and rst to extract the atomic values from the embedded gene names and anatomical sites, then semantically align those values with existing nomenclature in canonical term tables. These transformations required 12 operations of CHiSEL SMOs as compared with 41 operations in SQL to perform the equivalent transformations. The rst steps were to establish a connection and get a handle to the enhancer table that had the denormalized structure to be evolved. In addition, handles to vocabulary tables for anatomy and gene names were also assigned. Connect to database and get handles to tables for future use. 1 # Connect to the database 2 catalog = chisel.connect(`https://...') 3 4 # For convenience, assign these tables to variables in the script 5 enhancer = catalog[`isa'][`enhancer'] # table we want to evolve 6 anatomy = catalog[`vocab'][`anatomy'] # a vocabulary table 7 gene_names = catalog[`vocab'][`gene'] # a vocabulary table The next step was to \reify" three embedded concepts from within the enhancer table. Each of these represented genomic loci { locations within a genome identied by their genome assembly, chromosome, and the start and end position on the chromosome. The \original" denotes the original species from which the sequence was identied, the \other" denotes the model organism in which the sequence was tested, and the \visualization" denotes the 133 genome assembly used to visualize the sequence. In this listing, the three concepts were projected as new child records of the enhancer records. The transformations in CHiSEL (3 Reify sub ) come at a cost of 6 per the rubrics (Section 4.7.1) and translate into a cost of 9 (3 3) for the comparable SQL expressions per the comparisons of Table 4.5. 
Listing 4.5: Reify three genomic loci sub-concepts from the original table. 8 # Reify a sub-concept, the `original species' loci, from the 9 # enhancer table into its own table structure. 10 original_species_assembly = enhancer.reify_sub( 11 enhancer[`original_species_assembly'], 12 enhancer[`original_species_chromosome'], 13 enhancer[`original_species_start'], 14 enhancer[`original_species_end'] 15 ) 16 17 # Reify the `visualization' loci sub-concept. 18 visualization_assembly = enhancer.reify_sub( 19 enhancer[`visualization_assembly'], 20 enhancer[`visualization_assembly_chromosome'], 21 enhancer[`visualization_assembly_start'], 22 enhancer[`visualization_assembly_end'] 23 ) 24 25 # Reify the `other' organism loci sub-concept. 26 other_assembly = enhancer.reify_sub( 27 enhancer[`other_assembly'], 28 enhancer[`other_assembly_chromosome'], 29 enhancer[`other_assembly_start'], 30 enhancer[`other_assembly_end'] 31 ) The next issue to be resolved was that of the embedded, unstructured, lists of anatomical sites and gene names associated with the enhancer assays. The list of closest genes 134 and list of anatomical structures were converted \to tags" aligned with the existing controlled vocabulary terminology from tables gene names and anatomy in lines 35 and 47, respectively. These steps created new relations and the \list of " prex was dropped from the corresponding column names in lines 41 and 55. The transformations in CHiSEL (2 Tagify, 2 Select, 2 Rename) come at a cost of 6 and translate into a cost of 32 (2 12 + 2 2 + 2 2) for the comparable SQL expressions. Finally, as usual, the new relations were assigned to a name in the database catalog during an evolve block. Listing 4.6: Convert denormalized lists of \tags" for anatomy and gene names into normal- ized, aligned, associated relations. 32 # Convert a nested, non-normal form, list of gene names into a 33 # normalized structure and align them to the gene name vocabulary. 34 enhancer_closest_genes = \ 35 enhancer[`list_of_closest_genes'].to_tags(gene_names) 36 37 # Rename the column `list_of_...' to `closest_genes' 38 enhancer_closest_genes = enhancer_closest_genes.select( 39 enhancer_closest_genes[`id'].alias(`enhancer_id'), 40 enhancer_closest_genes[`list_of_closest_genes'].alias( 41 `closest_genes') 42 ) 43 44 # Do the same for the nested, non-normal form, list of anatomical 45 # structures. 46 enhancer_anotomical_structure = \ 47 enhancer[`list_of_anatomical_structures'].to_tags(anatomy) 48 49 # Again rename the now normalized column. 50 enhancer_anotomical_structure = \ 51 enhancer_anotomical_structure.select( 52 enhancer_anotomical_structure[`id'].alias(`enhancer_id'), 53 enhancer_anotomical_structure[ 135 54 `list_of_anatomical_structures'].alias( 55 `anatomical_structures') 56 ) The initial enhancer table, of 77 rows and 27 columns, was evolved into a network for 5 new tables and 1 revised original table using 12 CHiSEL operations compared to 41 operations in SQL (see Table 4.7). Table 4.7: Comparison of operation cost required per listing. Listing 5 6 Total CHiSEL 6 6 12 SQL 9 32 41 4.8.3 Microscopy Core In yet another use case, we recreated the formative stages of development of the database used to managed data for the microscopy core facility for the USC Stem Cell center (CIRM) [127]. The database has collected experimental metadata describing over 8,192 whole slide images. The database began with a relatively at structure modeled around common concepts in the lab. 
Over time, the original domain concepts needed renement to re ect their maturing understanding of their experimental concepts. Initially, the terminology (e.g., for experiment type, tissue type, etc.) were uncontrolled, which led to great diculty in searching for data. To remedy these issues, the database underwent multiple phases of schema evolution. In this use case, we recreated these formative phases of their database by essentially reverting a snapshot of their data to the earliest form. Then using a CHiSEL script, we recreated the transformations that were needed in order to evolve it to its present state. As usual, the rst steps were to establish a connection to the database and get a handle to the Scans table that had been created by our denormalization routine to simulate the original state of the use case. No additional handles were assigned since this case study 136 begins as a single-table database. Connect to data source and get handle to the single table of its schema. 1 # Connect to the denormalized scans database. 2 catalog = chisel.connect(`https://...') 3 4 # Get handle to the `Scans' table. 5 Scans = catalog[`public'][`Scans'] In Listing 4.7, the script begins by \reifying" certain core concepts from the Scans table: Slides for metadata on the glass slides produced by the histology lab as requested by the microscopy core (lines 7{13), Specimens for details of the biological tissue samples sent to histology (lines 16{22), and Experiments for descriptions of the experimental methods and settings used by the scientists, relative to a set of Scans (lines 25{31). Note that, for simplicity of the demonstration, the columns for this scenario were coded with prexes for the class of information \Slide:...", \Specimen:...", \Experiment:...," respectively. This was merely for convenience for creating the examples used here and for readability. We leveraged this fact by using set-comprehensions for creating the projections of columns to be reied (for example, see lines 9{12 and their like). The transformations in CHiSEL (3 Reify) come at a cost of 6 per the rubrics (Section 4.7.1) and translate into a cost of 9 (3 3) for the comparable SQL expressions per the comparisons of Table 4.5. Listing 4.7: Reify core concepts of Slides, Specimens, and Experiments. 6 # Reify the Slides 7 Slides = Scans.reify( 8 {Scans[`Slide:ID']}, 9 {column for column in Scans.columns.values() 10 if column.name.startswith(`Slide:') and \ 11 column.name != `Slide:ID' 12 } 13 ) 14 137 15 # Reify the Specimens 16 Specimens = Scans.reify( 17 {Scans[`Specimen:ID']}, 18 {column for column in Scans.columns.values() 19 if column.name.startswith(`Specimen:') and \ 20 column.name != `Specimen:ID' 21 } 22 ) 23 24 # Reify the Experiments 25 Experiments = Scans.reify( 26 {Scans[`Experiment:ID']}, 27 {column for column in Scans.columns.values() 28 if column.name.startswith(`Experiment:') and \ 29 column.name != `Experiment:ID' 30 } 31 ) As seen in Listing 4.8, the script then generates a RevisedScans computed relation from Scans by projecting only those columns that represent detailed metadata about the whole slide images produced by the microscopy core. (The projection is again formed dynami- cally by a list-comprehension in lines 35{37.) The transformations in CHiSEL (Select, Project, Rename) come at a cost of 3 and translate into a cost of 3 for the comparable SQL expressions. Listing 4.8: Alter original Scans table to remove newly reied relations. 
32 # Alter the Scans by dropping the Slides, Speciments and 33 # Experiments columns 34 RevisedScans = Scans.select( 35 *[column for column in Scans.columns.values() 36 if column.name.startswith(`Scan:') or 37 column.name in (`RID', `RCB', `RMB', `RCT', `RMT')] 138 38 ) Next in Listing 4.9, the script generates three new sets of terminology using increas- ingly complex operations. The Experiment Type was generated as a new simple \do- main" from the \Experiment:Experiment Type" column (lines 40{41). A more struc- tured \vocabulary" for Tissue was produced from the \Specimen:Tissue" column (line 44). The \Specimen:Genes," however, contained denormalized lists of gene names with a non-standard encoding of the values that required a user-dened function to customize how the values were parsed (lines 49{56). The custom split genes function was then passed to the to atoms method of the \Specimen:Genes" column object to yield a normalized as- sociative relation Specimen Genes. With those values normalized into a new relation, the to vocabulary was used to produce a canonical term set Genes of gene names (line 63). The transformations in CHiSEL (Domainify, 2Canonicalize,Atomize with a UDF) come at a cost of 5 and translate into a cost of 46 (12 + 2 13 + 7 + 1) for the comparable SQL expressions. Listing 4.9: Create new domains and vocabularies for certain columns. 39 # Create a custom domain based on the Experiment Type 40 Experiment_Type = Experiments[`Experiment:Experiment Type'] \ 41 .to_domain() 42 43 # Create a custom vocabulary based on the Specimen Tissue 44 Tissue = Specimens[`Specimen:Tissue'].to_vocabulary() 45 46 # Normalize the currently nested Specimen Genes 47 import json 48 49 def split_genes(s): 50 # a custom function is needed to parse and split the values 51 if s: 52 # encoding is not quite right, so fix the quotes then parse 139 53 values = json.loads(s.replace("'", `"')) 54 for value in values: 55 # then yield each value 56 yield value 57 58 # Unnest the Specimen Genes and create a vocabulary 59 Specimen_Genes = Scans[`Specimen:Genes'].to_atoms( 60 unnest_fn=split_genes) 61 62 # Create a vocabulary from the gene names 63 Genes = Specimen_Genes[`Specimen:Genes'].to_vocabulary() Beyond the steps described above for this use case, clearly the vocabulary would require some tuning and human review. The default edit distance algorithm could be replaced with alternative matching algorithms. In our review, the default algorithm did correctly nd synonyms such as seen in this tuple queried from the resultant relationf`name': `Swiss Webster', `synonyms': [`Swiss Webster', `Swiss (Webster)']`g, but it also incor- rectly identied synonyms such asf`name': `Six1', `synonyms': [`Six1', `Six2']g. Gene nomenclature would clearly benet from specialized similarity search algorithms. For this reason, CHiSEL allows alternative custom functions to be plugged in and used by the physical-level operators. At this stage, the vocabulary could be reviewed by the domain user, who can correct the vocabulary and then load it into the database using CHiSEL. Once - nalized, the computed relations described above would be materialized in the database per the usual evolve block. The initial single-table of Scans, consisting of 8,192 rows and 80 columns, was evolved into a normalized database of 7 new tables and 1 revised original table using 14 CHiSEL operations compared to 58 operations in SQL (see Table 4.8). 140 Table 4.8: Comparison of operation cost required per listing. 
Listing     7    8    9    Total
CHiSEL      6    3    5    14
SQL         9    3    46   58

4.8.4 Summary of Results

The use cases demonstrate that the compact statements of the CHiSEL framework can perform complex operations that would otherwise require more verbose scripting in conventional SQL. Across the use cases, a variety of SMOs were employed in each scenario, showing both distinct distributions of usage and an overlap that indicates the reusability of SMOs across scenarios (see Table 4.9).

Table 4.9: SMO operator usage per use case.

SMO            Commons   Enhancers   Microscopy
Select         4         2           1
Project        4         –           1
Rename         1         2           1
Atomize        1         –           1
Canonicalize   7         –           2
Domainify      2         –           1
Reify          1         –           3
Reify sub      –         3           –
Tagify         –         2           –

When comparing the number of required operations using CHiSEL SMOs versus standard SQL, performing the transformations described in the use cases required over 4 times as many operations in SQL (see Figure 4.5). By reducing the complexity of the operations exposed to the database user, CHiSEL can therefore reduce the effort required to perform database evolution tasks. Code for the latter use cases is available at https://github.com/robes/chisel-ssdbm19.

Figure 4.5: Comparison of the required user effort (cost) of CHiSEL vs SQL schema evolution operations per use case.

4.9 Performance Evaluation

We evaluated CHiSEL in terms of its efficiency, that is, how much the performance of its high-level operations improves by rewriting and reusing subexpressions. To evaluate the efficiency of the approach, we compared the performance of CHiSEL with and without the subexpression consolidation algorithm (Section 4.5.3). With subexpression consolidation disabled, all SMOs essentially behave as atomic operations, in contrast to the optimized form, which rewrites expressions to reuse common subexpressions and thereby reduces the system resources required to execute the expression graph.

4.9.1 Real-World Experiments

The first performance evaluation experiments are based on the dataset and workload from the Commons use case (Section 4.8.1).

Setup

The data for the experiments were based on the public data from the GTEx Analysis V7 release, which were available from the GTEx portal. As described in Section 4.8.1, we
Experiment 2 expands experiment 1 with the atomize and domainify statements of Listing 4.2. Experiment 3 expands experiment 1 with the canonicalization steps illustrated in List- ing 4.3. Experiment 4 isolates only the integration steps of Listing 4.4. In order to factor out the overhead of materializing to the remote database, the experi- ments instead materialize to a semistructured catalog on the le system local to the client. 143 Automatic garbage collection was also disabled on the client side. Each experiment was ex- ecuted 10 times. Between each, the local catalog was deleted, garbage collection was forced, and the experiment script slept for 5 seconds before performing the next run. Figure 4.6: Results by experiment and atomicity. Execution time of optimized expressions (Composite) shown in light gray and non-optimized expression execution (Atomic) shown in dark gray with error bars in red. Results The minimum and maximum measures for each experiment were dropped, and the mean and standard deviation of the remaining measures were plotted in Figure 4.6. Standard deviation was very low as shown in the gure. Across all scenarios, the results indicate that the composite SMOs can be eciently evaluated and improve performance of operator execution by nearly 42% over the unoptimized (\atomic") approach. Experiment 4 shows that the overhead of the consolidation algorithm is negligible for cases where there are no reusable subexpressions. 144 4.9.2 Schema Evolution Benchmark Experiments Earlier benchmarks for schema evolution focus on enterprise or web information systems workloads, and they seek to measure the success rate of supporting backward compati- bility [37]. Our focus, instead, is to develop a new benchmark for schema evolution for scientic databases to evaluate the eort and eciency of performing the database evolu- tion operations themselves. In other words, what we seek is not a benchmark to measure the performance of database operations after the database has undergone schema evolution to determine whether legacy queries can succeed, but rather this benchmark seeks to test whether a database evolution language (DEL) can express the necessary transformations to evolve the database to its desired conguration, at what eort to the user of the language, and nally how well it performs in doing so. To that end, we present here an early version of a benchmark with utilities to support generation of synthetic benchmark data and drivers to execute a suite of test cases modeled on the requirements for schema evolution of scientic databases (Section 4.2). The utilities are available as open source software and may be found at https://github.com/robes/chisel-benchmark. Synthetic Dataset Generation Here we present the benchmark data generator that we developed and used for producing the test datasets. The data generation utility produces synthetic data intended to simulate the properties of real-world data such as the GTEx dataset described above in Section 4.9.1. It generates a dataset of a synthetic core relation with columns for properties of the core concept with additional columns that can be thought of as representing subconcepts to the core concept described by the relation. This separation of \core concept" and \subconcepts" models the datasets described in the motivation (Section 4.2) and case studies (Section 4.8), with core concepts like scans or enhancers and subconcepts such as assays, specimens, or genomic loci embedded within a common relation. 
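Before describing the generation procedure in detail, the following minimal sketch conveys the intended core/subconcept layout. The parameter names loosely follow the utility's parameters listed below in Table 4.11, but the body is illustrative rather than the benchmark's actual code, and it omits the term-column and error-injection details described in the following paragraphs.

    import random

    def generate_denormalized_core(num, num_sub_concepts, num_sub_concept_rows):
        # Generate each subconcept relation with its own randomly valued rows.
        sub_concepts = [
            [{'id': i, 'value': random.random()} for i in range(num_sub_concept_rows)]
            for _ in range(num_sub_concepts)
        ]
        core = []
        for i in range(num):
            row = {'id': i, 'measure': random.random()}
            # Embed a uniformly sampled row from each subconcept into the core row.
            for k, sub in enumerate(sub_concepts):
                sampled = random.choice(sub)
                row.update({'sub%d_%s' % (k, col): val for col, val in sampled.items()})
            core.append(row)
        return core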
The benchmark data generator rst generates K subconcept relations of M rows each. It 145 then generates the core relation and for each of the N rows of the core relation, it samples a random row of the corresponding subconcept relation (based on a uniform distribution). Each relation, whether for the core concept table or its embedded subconcepts, may have simple columns of type integer, oating point, or text. These columns are populated with random values generated for each row. In addition, each relation may have 0, 1, or more text columns that simulate an uncon- trolled terminology. For these term columns, the generator randomly chooses a value from a source term list and probabilistically introduces errors { e.g., change value to all lower case, change to title case, append with white space, insert random character at a random index of the original value, or delete a random character from the original value { with a 1 per cent chance of one of those errors to be introduced on each row generated for the relation. Finally, relations may include a termlist column which expands on the method used to generate the term columns to generate a comma-delimited list of terms generated through the same process. Table 4.11: Parameters of the benchmark data generation utility. Parameter Description num Number of rows to be generated name Name of the core relation ctypes Column types to includefint, float, textg terms Source le for master term list terms sample size Number of terms to sample per term-based column num term columns Number of term columns per relation num term list columns Number of termlist columns per relation max term list choices Maximum number of terms for each termlist value num sub concepts Number of embedded subconcepts in the core relation num sub concept rows Number of rows generated for each subconcept relation The benchmark data generator accepts several arguments in order to generate synthetic data to simulate a wide range of scenarios. The complete parameters are described in Table 4.11. 146 Test Cases We designed several test cases based on a synthesis of the scenarios covered in Section 4.8 and from previously encountered situations discussed in [127] and [22]. These test cases share properties of the experiments of Sections 4.9.1 and expand on them with several new variations. In addition, the test cases each have multiple variants based on a parameter N used for example to determine how many new relations of a test case-dependent type are generated by the individual test case. Next, we enumerate and describe the test cases from the benchmark that were used to evaluate CHiSEL: Reify N Concepts: Generate N new parent relations (key, other columns...) from a distinct subset of columns of the core relation. Alter the core relation to leave only the new relation's key columns for reference from the core relation to the new relation. Reify N Subconcepts: Generate N new child relations (foreign key, other columns...) from a distinct subset of columns of the core relation. Alter the core relation to remove all columns of the subconcept relations. Reify Concept And N Subconcepts: Generate 1 parent relation (key, other columns...) and N new child relations (foreign key, other columns...), each from a distinct subset of columns of the core relation. Alter the core relation to remove all columns of the new relations except for the key columns of the new parent relation. Reify N Subconcepts And Merge: Generate 1 new child relation (foreign key, other columns...) 
from N distinct subsets of columns of the core relation. Alter the core relation to remove all columns of the subconcept relations.

Create N Relations From Nested Values: Generate N normalized, child relations (foreign key, value column) from N columns that contain non-atomic values representing denormalized relations embedded in the core relation. Alter the core relation to remove the non-atomic column.

Create N Domains From N Columns: Generate N new deduplicated "domains" (i.e., simple, unique terminology lists normalized as a relation) from N distinct columns of the core relation.

Create N Vocabularies From N Columns: Generate N new deduplicated "controlled vocabularies" (i.e., terminology including preferred term and synonyms, normalized as a relation) from N distinct columns of the core relation.

Reify N Subconcepts And Create Domain From Columns: First, generate N child relations (foreign key, other columns...), then generate a new deduplicated "domain" relation merged from N unconstrained term columns (1 each from the N new child relations). Alter the core relation to remove all columns of the subconcept relations.

Setup

The data for the experiments were based on synthetic datasets generated by the benchmark utility (see Section 4.9.2). Four datasets were independently generated from the same set of parameters with 10^3, 10^4, 10^5, and 10^6 rows respectively. Each dataset consisted of a set of columns representing a synthetic core relation along with the sets of columns of an additional 3 embedded subconcept relations. Each relation had randomly generated primary key, integer, floating point, and text columns. Each had 2 unconstrained term columns and 1 termlist column (i.e., a denormalized list of terms). For the source term list that was sampled to form the term-based columns, we extracted the noun dataset from WordNet [100] to form a list of 82,115 nouns. Together the original relation consisted of 28 columns. The datasets were generated with the parameter settings shown in Table 4.12. The tests were performed on a single-node environment using a CHiSEL local file system data catalog. For specifications of the host environment, see Table 4.13.

Table 4.12: Parameter settings for dataset generation.

Parameter                  Setting
num                        10^3, 10^4, 10^5, 10^6
name                       core
ctypes                     int, float, text
terms sample size          100
num term columns           2
num term list columns      1
max term list choices      5
num sub concepts           3
num sub concept rows       num / 3

Table 4.13: Test environment for benchmark-based experiments.

           Host
CPU        8 x 2.3 GHz Core i9
RAM        32 GB
Storage    1 TB SSD
OS         macOS 10.15

Results

Each condition (control versus optimized), for each parameter, of each benchmark test case was executed for 10 rounds. After each round, the test driver utility forced garbage collection (which was otherwise disabled so as not to interfere with the execution of the tests), tore down the output relations to restore the test environment to its initial state, and then slept for 1 second to let the system cool down. As with the results of Section 4.9.1, the minimum and maximum measures for each round were dropped, and the mean and standard deviation of the remaining measures were plotted in Figures 4.7-4.14. Again, standard deviation was very low but may be seen in the red error bar caps shown in the figures. We label as the "control" condition execution of the schema evolution expressions with the consolidation algorithm (Section 4.5.3) disabled. The condition labeled "optimized" indicates execution with the consolidation algorithm enabled.
The control and optimized conditions correspond to the "atomic" and "composite" labels of Figure 4.6, respectively.

Figure 4.7: Reify N concepts
Figure 4.8: Reify N subconcepts
Figure 4.9: Reify concept and N subconcepts
Figure 4.10: Reify N subconcepts and merge
Figure 4.11: Create N domains from N columns
Figure 4.12: Create N vocabularies from N columns
Figure 4.13: Create N relations from nested values
Figure 4.14: Reify N subconcepts and create domain from columns

The results again indicate that CHiSEL's execution strategy and subexpression consolidation algorithm are capable of reducing the execution time of schema evolution expressions. In general, as the number of relations that are "reified" (i.e., created as new top-level parent or child relations from a source relation) increases, the execution time of the control condition (without subexpression reuse) grows much more rapidly than that of the optimized condition. This can be seen clearly in Figures 4.7, 4.8, 4.9, and 4.10. Similarly, when creating new normalized relations from previously nested, non-atomic column values, as the number of relations to be normalized increases, the execution time of the control condition grows much more rapidly than that of the optimized condition, as seen in Figure 4.13.

On the other hand, when creating a deduplicated domain or controlled vocabulary from previously uncontrolled column values, the performance improvement for the optimized condition is only modest, as seen in Figures 4.11 and 4.12. This is not surprising, as the computational overhead of deduplication is very high and dominates other benefits such as the IO savings from subexpression reuse. It does, however, point to the possibility of extending the benchmarks to include scenarios where deduplicated relations are reused; such scenarios would clearly benefit from subexpression reuse. Finally, in Figure 4.14 we again see significant improvements in execution efficiency from CHiSEL in the optimized condition when first reifying N subconcept relations and then creating a deduplicated domain from the merged values of one column each from the N subconcept relations. Overall, the results from the benchmark indicate that CHiSEL's approach can improve execution efficiency, especially for increasingly complex schema evolution scenarios. In some scenarios, at the upper end of the scale, the efficiency afforded by our approach can reduce execution time to roughly half of that required by the unoptimized approach.

4.10 Related Work

Here we outline the most relevant work in schema evolution, then we specifically discuss database evolution languages, and finally we make a detailed comparison of our approach with two alternative approaches.

Schema Evolution

Schema evolution has been an area of active research with a long history. See [117] for a taxonomy of schema evolution, [116] for an early proposal for a schema evolution language, [115] for a survey of schema evolution issues that remain relevant, and [67] for a more recent review of schema evolution research. Recent studies have examined the history of schema evolution in online database applications [36] and characterized the life cycles of tables [154] within evolving schemas. These studies have informed the design of our framework.

Model management [17, 14] proposed to make models of various forms (from database schemas to software design models) and mappings between those models into "first-class objects" with well-defined model-management operators (MMOs) to manipulate them [96, 97].
Utilities based on this approach [148] assist users with schema versioning by automat- ing the task of generating backward-compatible SQL views and object-relational mappings (ORMs) through examining subsequent versions of a schema. Database evolution languages like ours, however, work by taking a source schema and SMO expressions and produce a revised database schema. Database schema quality analysis has been explored in [88] and [101]. Data cleaning is often done within the context of database evolution and oers complementary approaches that support schema evolution. Wrangler [85] explores an interactive approach to data clean- ing, while CleanM [57] presents a query language for data cleaning. While these approaches address concerns for data repair, they do not encompass the evolution of database schema itself. DAHLIA [98] helps users visualize schema evolution changes, while DB-MAIN [75] as- 155 sists the user with database application re-engineering based on a Computer-Aided Software Engineering (CASE) approach. Wrapper generators have been proposed [30] to insulate a program from changes to a database schema. These approaches generally address issues of analysis, visualization and development tools for application updates rather than directly aid in the task of evolving the database schema. Database Evolution Languages To ease the burden of evolving the database and to aid in an agile approach to database development, database evolution languages (DELs) have been proposed which consist of schema modication operators (SMOs) that unify schema and data evolution operations. Among the most recent and advanced examples of DELs include PRISM++ [40] which introduced an empirically comprehensive suite of SMOs and BiDEL [72] which followed with a relationally complete DEL. Initially, PRISM [36] proposed a language of SMOs for database evolution that was followed by PRISM++ [40] that added integrity constraint modication operators for con- straint evolution and invertible operations for rewriting updates. PRISM and PRISM++ were motivated by analysis of schema evolution histories of several database applications [37] from which they dened empirically complete languages by identifying the most fundamental operations required to support the observed schema evolution operations. Later, CoDEL [71] proposed a relationally complete formulation of SMOs that was fol- lowed by BiDEL [72] as part of the InVerDa database that proposed a bi-directional DEL to enable full schema versioning. The primary contribution of these DELs is to enable backward compatibility to support information systems upgrades under schema evolution, however, the need for scientists is to simplify the schema evolution task itself as outlined in the motivation (Section 4.2). Simpler, high-level operations for schema evolution have been envisioned previously [124] but only at a conceptual level. To address this need, our framework directly simplies the task of schema evolution by formalizing an extensible DEL with increasingly sophisticated SMOs to reduce manual eort. 156 Though PRISM and BiDEL dier in terms of their empirical versus theoretical com- pleteness, they both oer advantages over conventional DDL and DML. Their respective DELs unify the otherwise disjoint steps to evolve schema and data in cohesive operations, thus reducing the eort and chance for error on the part of the database user who must otherwise write scripts involving a carefully coordinated combination of DDL and DML statements. 
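To illustrate the kind of carefully coordinated DDL and DML script referred to above, the following sketch extracts an uncontrolled text column into a deduplicated domain table by hand. It is illustrative only: SQLite is used simply to keep the sketch runnable, the table and column names are hypothetical, and a composite operator in the spirit of CHiSEL's Domainify and Canonicalize SMOs could subsume these steps, including the data alignment, in a single statement.

# Coordinated DDL + DML that a user must script by hand when the DEL lacks a
# higher-level operator: extracting an uncontrolled text column into a
# deduplicated domain table and re-pointing the source table at it.
# (ALTER TABLE ... DROP COLUMN requires SQLite 3.35 or later.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment (id INTEGER PRIMARY KEY, gene TEXT);
    INSERT INTO experiment VALUES (1, 'Shh'), (2, 'Pax9'), (3, 'Shh');
""")

conn.executescript("""
    -- DDL: create the domain table.
    CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- DML: populate it with the deduplicated terms.
    INSERT INTO gene (name) SELECT DISTINCT gene FROM experiment;
    -- DDL: add the reference column.
    ALTER TABLE experiment ADD COLUMN gene_id INTEGER REFERENCES gene(id);
    -- DML: align existing values to the new domain.
    UPDATE experiment
    SET gene_id = (SELECT id FROM gene WHERE gene.name = experiment.gene);
    -- DDL: drop the old uncontrolled column.
    ALTER TABLE experiment DROP COLUMN gene;
""")
print(conn.execute("SELECT * FROM experiment").fetchall())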
In addition, the \invertible" or \bidirectional" SMOs proposed by PRISM++ and BiDEL, respectively, enable schema versioning { the ability for a database to answer queries (and potentially updates) over historical versions of the database { thus enabling backward compatibility to database-driven enterprise applications and rapid trail-and-error of schema changes by the database administrator. While these approaches make signicant strides towards the goals of reducing the eort of database evolution and enabling more agile methods for database design and refactoring, gaps remain between what these DELs provide and what the end user may need. Since these DELs strive for (empirical or theoretical) completeness and generality, they dene operations at a relatively low level of abstraction, with operations to add a table, drop a column, partition a table into two tables, merge two tables into one table, etc. Such a low-level abstraction leaves the user with still many operations to coordinate in order to achieve complex transformations of a database. For example, neither proposal suggests SMOs for moving a column from one table to another, changing the cardinality of the relationship between two tables, normalizing non-rst-normal-form relations, semantically aligning column values then applying integrity constraints, and many other tasks that a domain user ultimately may be attempting to accomplish in an overall database evolution. While it may be possible to dene an empirically or theoretically complete set of primitives as a common foundation for dening a DEL, there is more likely a need for an open set of more complex schema modication operators that are as diverse as individual users and usages of databases. Without a framework in which to perform complex database evolution tasks, users will again be left to write ad hoc scripts to coordinate multiple SMOs. 157 4.11 Conclusions Schema evolution remains one of the most dicult tasks in maintaining databases. Here, we introduced CHiSEL, which oers high-level operations to simplify complex schema evolution tasks. Its design has been driven by recent surveys and reports of experiences with data- intensive scientic applications. We dened the desired properties of SMOs, delineated composite from primitive operators, and dened an extensible language of SMOs. Based on this formalism, we described a method for leveraging established techniques from query evaluation for ecient execution of schema evolution operations. We presented the system design of CHiSEL, which supports a scientic data management platform that is in everyday use by several large-scale science applications. We introduced a rigorous methodology for evaluating database evolution languages in terms of the user eort required to compose schema evolution expressions. We reported on three case studies that illustrated the utility of the framework for common scenarios as scientic databases evolve through the course of active investigation. We evaluated the eectiveness of CHiSEL compared to conventional SQL in each of these use cases per our evaluation methodology to show that our approach reduces the user eort by a factor of 4 in these common scenarios. Finally, we also demonstrated that the performance of schema evolution operations using our approach can yield an ecient framework open to optimization strategies, in some cases reducing execution time by roughly half. 
As part of the performance evaluation, we introduced the beginnings of a new schema evolution benchmark along with supporting utilities. We extended our experimental results based on real-world workloads with additional results from the benchmark workload that further reinforce the benets of CHiSEL to enable ecient execution of complex schema evolution expressions. CHiSEL is under active development and available online 11 . 11 https://github.com/informatics-isi-edu/chisel 158 Chapter 5 Co-Evolution of Data-Centric Ecosystems The content of this chapter is based on the papers: Robert Schuler et al. \Towards Co- Evolution of Data-Centric Ecosystems". In: 32nd International Conference on Scientic and Statistical Database Management. SSDBM 2020. Vienna, Austria: Association for Computing Machinery, 2020. isbn: 9781450388146. doi: 10.1145/3400903.3400908. url: https://doi.org/10.1145/3400903.3400908; and Robert E. Schuler and Carl Kesselman. \Managing Database-Application Co-Evolution in a Scientic Data Ecosystem". In: 2022 IEEE 18th International Conference on e-Science (e-Science). Salt Lake City, Utah, USA: IEEE, Oct. 2022. In this chapter, we extend the schema evolution framework introduced in Chapter 4 to address the problem of schema evolution in the context of data-centric ecosystems. First, we characterize the problem of coupled-evolution of a data-centric ecosystem and we propose design patterns for components of an ecosystem to adapt to changes in the database schema. Then, we dene a novel set of model management operators (MMOs) and describe how they may be integrated with the schema modication operators (SMOs) dened in Chapter 4. Finally, we provide case studies and an analysis of model mapping usage. 159 5.1 Introduction As science has become increasingly dependent on data management as a critical resource in enabling scientic discovery, database-driven applications are becoming an integral part of scientic infrastructure [64]. Databases, therefore, do not exist in isolation but at the heart of an ecosystem of database-dependent applications, such as data browsers, data en- try forms, computational pipelines, and numerous others (see Figure 5.1). Unfortunately, database designs can quickly become outdated and subject to ongoing transformations [40] due to changing requirements that force the \evolution" of the database and correspond- ing upgrades to database-dependent applications. As the database evolves to respond to the inevitable changes that occur during a scientic investigation, a new pressure exists on the infrastructure to evolve along with it { known as the application-database co-evolution problem [112, 142]. Typical approaches to developing databases and dependent applications are to rst design a semantically coherent database schema then develop applications that query and manipu- late data according its schema. When new requirements necessitate changes to the database, the schema is altered and the dependent applications are upgraded to use it. Unfortunately, this ideal is not often followed as database administrators often prefer to minimize appli- cation maintenance even at the expense of sacricing semantically clean database schema as requirements drive changes to the system, leading to what has been termed a de facto \database decay" [141]. Often the database schema (a.k.a., database model) is mapped to an application model in a form more convenient to the software developer. 
For example, Object-Relational Mapping (ORM) [160] frameworks map a relational schema to objects of an object-oriented programming language. In addition, database-to-application mappings have been explored in Model-Based User Interface (MBUI) development, where database schema (tables, columns, keys, etc.) are mapped directly to user interface (UI) templates that generate the UI automatically [146].

Figure 5.1: A typical data-centric ecosystem based on Deriva including (clockwise from top-right) web clients, command-line clients, export bundles, and visualization.

In general, models specify the representation of data used in a particular system or application, and mappings specify how to translate data from one model to another [49]. Typically the database's model is simply referred to as its schema, while the representation of the schema used by the application is referred to as the application or conceptual model. Model Management approaches [14] seek to reduce the burden of updating mappings by inferring a new version of the mappings when a model changes. There are, however, limitations to the established Model Management approach. In practice, the schema evolution procedures (i.e., scripts, etc.) may be unknown because they are performed by an out-of-band process, leaving the Model Management approach reliant on a schema matching algorithm of varying efficacy and with its own set of limitations. Second, although Model Management has been applied to the application model scenario, the typical formulation of the model management problem is to preserve a data integration system, i.e., the ability to query against a global mediated schema when data sources change.

To simplify database evolution for scientific data, we describe an architecture pattern for building components that interact over well-defined interfaces that allow adaptive or decoupled operations centered on data objects. Such an architecture can be realized in the Deriva [22] platform for scientific data management, which was designed to handle data throughout the research life cycle, from its early phases where domain conceptualizations are not well understood through the later phases where well-defined, interoperable data are published for reuse by others in the research community. Depicted in Figure 5.1, the approach enables an ecosystem of data-driven applications that can co-evolve as the models change. The database applications that comprise this ecosystem are designed to adapt to the database schema at all points in the system life cycle, by enhancing the conventional entity-relationship (ER) model [26] used to describe a database schema with schema "annotations" that specify model mappings from the base ER model to alternative renderings for various specific application contexts. Unlike conventional development methods, where applications are built to work with a specific version of a schema and may break whenever the database schema is transformed, these applications are built either to adjust on-the-fly in real time to schema changes or to revise their configurations on next startup. Here, we call these applications either model-adaptive in the former case or model-neutral in the latter.
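To make the notion of a schema annotation concrete, the following simplified sketch, modeled loosely on the visible-columns example shown in Figure 5.2, expresses per-context mappings as a Python data structure. The table, constraint, and context names are illustrative rather than an exact rendering of the deployed annotation.

# A simplified "visible-columns" style mapping, keyed by application context.
# Modeled loosely on the annotation shown in Figure 5.2; names are illustrative.
visible_columns = {
    "compact": [
        ["isa", "experiment_pkey"],            # show the key column
        {"entity": False,                      # show a scalar value, not a row link
         "source": [{"outbound": ["isa", "experiment_type_fkey"]}, "name"]},
        {"entity": False,                      # follow an FK path to the species name
         "source": [{"inbound": ["isa", "replicate_experiment_fkey"]},
                    {"outbound": ["isa", "replicate_biosample_fkey"]},
                    {"outbound": ["isa", "biosample_species_fkey"]},
                    "name"]},
    ],
    "detail": [
        # the detailed context would list all columns plus related entity sets
    ],
}

def columns_for(context, annotation=visible_columns):
    """Return the mapping fragments a client should render for a given context."""
    return annotation.get(context, annotation.get("compact", []))

A model-adaptive client would treat such a mapping as a hint layered over the schema it introspects, while a model-neutral client would consume only the narrow annotation it needs.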
For clients that do require a specic xed understanding of the model, \boundary objects" (Bag, in the gure) allow for mediated access to the database over stable representations of the data. Sociologists have identied the concept of boundary objects as exible shared artifacts that have dierent meaning to dierent communities but have enough common interpretation so as to provide a bridge between communities as being essential in supporting multi-disciplinary collaboration [139]. In most modern scientic collaborations, data are the critical shared artifacts and hence can be interpreted as boundary objects. These boundary objects provide a means of decoupling those clients that are not built to adapt automatically to changing models, and yet still support their interactions by providing the stable interface 162 of the boundary object. Finally, we take an approach to update application models by integrating model man- agement and schema evolution in a unied framework for database evolution. When a given schema is evolved, the model mappings are co-evolved to produce a new application model that is isomorphic to the updates in the source schema; i.e., if a new table is added to the database schema then a new concept is added to the application model to represent the new table. By integrating schema modication operations with model management, the database and application models may be co-evolved. Instead of inferring the new mappings from an observed revision of schema or conceptual model, we seek to identify and correlate transformations on the mappings given a set of evolutionary operations on the schema. In this chapter, we make the following contributions: we identify the design criteria that must be met in order to create co-evolving data- centric ecosystems of data services and data-driven clients; we describe an architecture pattern for co-evolving data-centric ecosystems comprised of models, mappings, boundary objects, and database-client interaction models; we then describe how we have extended a database evolution language to integrate additional capabilities for model management, including additional formalisms, rules denitions, and new language syntax; we present use cases that demonstrate the utility of the architecture, the boundary objects, and an analysis of the enhanced schema evolution framework. In the next section, we discuss related work. In Section 5.3, we enumerate the require- ments for co-evolving data-centric ecosystems. In Section 5.4, we specify architecture pat- terns for such ecosystems. In Section 5.5, we show how we extended a database evolution language with integrated capabilities for model management. In Section 5.6, we present an analysis of model mappings in real-world scientic data management systems and a compar- 163 ison of our approach against a popular database migration utility. In Section 5.7, we present case studies that demonstrate the utility of our approach, and we conclude in Section 5.8. 5.2 Related Work The most conventional approach to co-evolving a database and the database-driven appli- cations that depend on it is to dene a new database schema and migrate the data while upgrading the applications to the new database. Others have noted that this situation is either avoided [141] or met with \tears" [35] because of the diculty. 
One way to ease the migration of database-driven applications is to develop backward-compatible views (i.e., SQL VIEW denitions) that provide legacy applications with a stable interface to the database, however, the development and maintenance of views becomes its own burden over time. In addition, updateable views are generally supported only on \simple" views with very limited query expressions (cf. [111]). To ease this burden, database researchers have explored automatic generation of backward- compatible views [40, 72]. Others have proposed a message bus layer to mediate access to the database [142] and provide a stable interface for database applications in the event of schema changes. Model management systems [96] coupled with schema matching [16] at- tempt to ease the burden of migrating application mappings for model-driven applications. These approaches only achieve decoupled-evolution not co-evolution { i.e., applications do not evolve in sync with the databases in these approaches, rather they allow the applica- tions to continue to function even as they are hard-wired to a legacy version of the database schema. Model Management [14] leverages schema matching to repair a schema mapping and propagate the changes to a source or target schema to the source-to-target schema mappings. Several systems have explored this approach [15, 97, 14, 96, 17]. While these approaches may repair a schema mapping between the source and target, we introduce an approach to 164 both repair and produce new isomorphic mappings. For example, when a column is added to a source schema it should be propagated to the mapping and the target schema. Database Migration [161] tools, such as SQL Alchemy Alembic 1 , Entity Framework Mi- grations 2 , and Active Record Migrations 3 , are used to execute schema change scripts that are organized into forward and backward revision procedures. These allow software developers to upgrade or rollback changes to a database in a coordinated manner with software changes to database-dependent applications. The actual code blocks (e.g., up and down procedures) are typically edited by the software engineer, while some utilities provide automated code generation limited to propagating changes to the application model back to database schema change scripts. Database Migration utilities in themselves could be used to complement an approach like ours by organizing the forward and backward propagation scripts but have little overlap otherwise. MoDEF [148] extends from model management with an application-driven approach, where based on incremental changes to the application model it automatically generates update scripts to evolve the database schema and propagates corresponding changes to the schema mappings. The MoDEF approach can be used to generate the forward and backward change methods that t into the Database Migration framework thus partially automating a key step in the overall process. The scope of their approach, however, is limited to schema evolution without regard to data evolution. Unlike the typical model management approach, rather than inferring schema changes from incremental versions of the application model, in our approach schema evolution operations dictate how the schema is evolved and integrated model management operations are coupled with the schema evolution operations to encompass schema, data, and model mapping changes. Our approach demonstrates how to evolve the database-dependent applications along with the database schema. 
In addition, our approach diers from these by arguing for a 1 https://alembic.sqlalchemy.org/en/latest/ 2 https://docs.microsoft.com/en-us/ef/core/managing-schemas/migrations/ 3 https://guides.rubyonrails.org/active_record_migrations.html 165 holistic approach to developing co-evolving data-driven systems where components are built at the minimum to be model-neutral (i.e., not hard-wired to a schema version) or better yet adaptive to the current version of the database schema, coupled with a utility that integrates schema modication with model management operations. These techniques in combination can produce seamlessly co-evolving data-centric ecosystems. 5.3 Design Criteria In order to support co-evolution of data-centric ecosystems, we identied the following key design criteria that need to be met. 1. Database must serve as the repository of record for all system models: In many systems, the database is the repository of record for only the database's \physical design" (i.e., its relational schema as realized in a vendor-specic format). Application models and mappings, such as Object-Relation Mapping (ORM) [160] or Model-Based UI Development (MBUID) [51] models are recorded elsewhere, typically in a software conguration management (SCM) system such as Git 4 , Subversion 5 , or CVS 6 . In addition, security policies are also specied in other systems, at the more formal end in a manner negotiable by actors using interoperable standard security protocols. In other cases, there are no formal models of the application or policy, and these are therefore encoded in an ad hoc manner such as application programming logic. So in a conventional system, the data model, application model (if any), and policy model (if any), are specied disjointly in ways that may not relate explicitly to the data and can easily drift out-of-sync with one another. 2. Data services, such as database servers, metadata catalogs, and object stores, must provide access to models and support comprehensive data 4 https://git-scm.com 5 https://subversion.apache.org 6 https://cvs.nongnu.org 166 access interfaces: In other words, the data service should allow clients to express queries over all elements of its data, not just typical Web service interfaces that are very limited to predened, narrow access patterns. Clearly, the data service as the repository of record must also allow access to the models under its control to other interacting components. Access may be role-based and limited to just the parts of the model and data permitted to according to the client's identity. 3. Database-oriented applications must adapt to database schema changes or be decoupled from the database: In current approaches, applications are built to a particular version of the database schema. They formulate and issue queries with xed expectations of the schema against which they are fetching or updating records. In order for systems to co-evolve, applications must not be xed to a specic version of the database schema. Instead, they must be adaptive (or at least neutral) to the schema version. The application must have the ability to inspect and have logic to interact with an arbitrary model. Applications that are not able to adapt must be decoupled from the database. 4. Decoupled database applications must be aorded reliable interface con- tracts: Interface contracts can be maintained by a boundary object, which allows actors in a data-centric ecosystem to communicate over discrete units of well-dened data. 
When applications are not adaptive, they rely on boundary objects that are stable in their schema. The boundary objects therefore serve as a form of interface contract between the components of an ecosystem. Model-bound applications may therefore use the boundary object as a means to communicate with a database. 5. Users must have tools to support the synchronized transformation of data, application, and policy models: The database as the repository of record is only useful if the user can transform all system models in a coordinated fashion, else ap- plication components that rely on the models will be given inconsistent views of the 167 system and interactions will fail. Users need simplied tools for integrated change management, i.e., the ability to transform the schema, data, application models, and policies in a coordinated fashion. 5.4 Architecture Pattern for Co-Evolving Data-Centric Ecosystems Here, we introduce an architecture that addresses the design criteria for co-evolving data- centric ecosystems. Software ecosystems have been dened as \the set of businesses func- tioning as a unit and interacting with a shared market for software and services, together with the relationships among them. These relationships are frequently underpinned by a common technological platform or market and operate through the exchange of information, resources and artefacts"[79]. We are specically concerned with ecosystems where the rela- tionships of the software and services are dened in terms of the exchange of various units of data. Such an ecosystem can be described as \data-oriented" where data management services are passive actors in the ecosystem while other clients coordinate their activities by collaboratively mutating the data store [130]. 5.4.1 Role of Models, Mappings, and Boundary Objects We begin by examining the role of models, mappings, and boundary objects in a co-evolving data-centric ecosystem. We illustrate a typical scenario for how an ER model is mapped into multiple application contexts for use by clients as shown in Figure 5.2 taken from an actual deployment that supports biomedical research. The example depicts an ER model of a realistic representation for a bioscience experiment conducted on biological samples. In our example, we assume a relational model though our architecture does not exclude alternative structures for describing data, such as logic-based, ontological, or other. A model as per the usual denition should describe the concepts (entities, classes, etc.) and their properties 168 (attributes, elds, etc.) along with their relationships (references, predicate paths, etc.). In addition, in order for the database to serve as the repository of record, the model is complemented with additional details about how to interpret the model and how to map it for distinct usages or applications that consume it. RID (PK) type (FK) dataset (FK) local_identifier control (FK) … RID (PK) experiment (FK) biosample (FK) bio_rep_num tech_rep_num RID (PK) species (FK) stage (FK) anatomy (FK) … Experiment Replicate Biosample { “RID”: “1-3SZA” , “type”: “OBI:1271” , … } Experiment Record { “asset_mappings”: [ { “column_map”: […], “metadata_query_templates”: [ “/attribute/D:=isa:dataset/accession={accession} /E:=isa:experiment/local_identifier={experiment} /R:=isa:replicate /S:=isa:biosample/local_identifier={biosample} /dataset_rid:=D:RID,replicate_rid:=R:RID” … } 3. “tag:isrd.isi.edu,2017:bulk-upload” annotation 2. “tag:isrd.isi.edu,2016:visible-columns” annotation 1. 
ER Model {“compact”: [ [“isa”, “experiment_pkey”], { “entity”: false, “source”: [ { “outbound”: [“isa”, “experiment_type_fkey”]}, “name” ], { “entity”: false, “source”: [ { “inbound”: [“isa”, “replicate_experiment_fkey”]}, { “outbound”: [“isa”, “replicate_biosample_fkey”]}, { “outbound”: [“isa”, “biosample_species_fkey”]}, “name”] …} [{ “RID”: “1-3SZA” , “type”: “RNA-seq assay” , “species”: “Mus musculus” , “stage”: “E16.5” ,…}, …] 7. Contextualized Experiment Record 4. Get Schema 5. Apply mappings, plan queries 6. Query 8. Presentation 9. Get Annotation 11. Query 12. Update 10. Make queries from templates Figure 5.2: Example ER model (left) with schema annotations (middle) specifying model mappings for a model-adaptive interactive application (top-right) and a model-neutral command-line client (bottom-right). The example ER model is mapped to multiple contexts and then used either in interactive web applications or command-line clients. A mapping, sometimes called a morphism [96], is typically a declarative specication for generating alternative renderings of a data model often involving transforms between meta-models (e.g., object-relational mapping, relational to JSON serializations, data harmonization between disparate data sources, etc.). A con- text is a labelled environment in which a data model may be realized for a class of usage. For example, in the development of interactive displays there may be detailed or compact contextualizations of a network of entities in an ER model. In the compact context, one may only wish to display a short summary of the attributes of an entity that are enough for 169 an interactive user to quickly browse a list of entities to pick one for further investigation. While a detailed context of the entity may render all of the attributes of an entity along with summaries (compact contextualizations) of related entities. For example, one may want to display the complete details of an experiment along with only summaries of the materials and methods used in the experiment. For non-interactive scenarios, one may wish to establish bulk export or import contexts for generating or consuming more machine-readable extracts of the database for use with batch processing systems. In these contexts, potentially large subsets of the database may be extracted in similar fashion as an extract-transform-load (ETL) process. By dening these in terms of contextualized mappings of the model, these processes can be evolved along with the core model should it be modied. The extracts of data that are produced or consumed by clients of the database may be thought of as boundary objects. They may be used as well-dened, self-contained extracts that serve as the stable interface to the database. The database itself may evolve over time, but as needed the boundary objects may evolve on a separate time scale and may in fact remain stable over time. New versions of the database may be mapped to previous formats of boundary objects in order to preserve the interface contract with clients of the database. Inevitably, a database will be used by third-party systems that may not evolve on the same time frame as the database and thus need to operate in a decoupled manner where stable boundaries to the database will enable them to continue to function and interoperate in the data-centric ecosystem. 
The base ER model in Figure 5.2 is (1) mapped into a \compact" display via Deriva's visible-columns schema annotation (2) while in (3) a bulk-update schema annotation species query templates to be used by a data upload client. Steps (4){(8) depict the interactions of a model-adaptive client that understands how to navigate the model, map it per the schema annotation, and generate contextualized data records. Steps (9){(12) depict the interactions of a model-neutral client that simply retrieves a specication of query templates that it uses to perform data upload operations. 170 Mappings are a key aspect of how applications upgrade themselves in response to schema evolution events. In the example, bioscience experiments are represented by experiment, biosample, and replicate relations. Natively, if an experiment entity in a third-normal form (3NF) were accessed with only a direct serialization of its properties (\Experiment Record" in the gure), given that most of its elds are foreign key (FK) attributes, the record would not contain human-readable information. For the information to be useful, most applications will need to join theexperiment entity with other entities (such as \vocabulary" tables not depicted in the gure for brevity). The denormalization of theexperiment relation is specied through mappings which enable the application to query data for a particular context of use (\Contextualized Experiment Record" in the gure). The records for these mapped contexts then contain more comprehensive information (e.g., the term's human- readable name). Mappings as Rooted, Entity-Resolution Expressions While model mappings could be arbitrarily complex expressions, we have found that a tractable subset of expressible map- pings are sucient for enabling database-client interactions in practice. Thus we assume that model mappings take a form that we refer to here as rooted, entity-resolution expressions. Conceptually, an expression of this form is as follows: it starts at the root node (an arbi- trarily selected relation in the schema); optionally, other nodes (relations) are joined based on an explicit path of primary key-foreign key relationships; optionally, simple lters (i.e., comparisons between attributes and literals only) may be applied; nally, the expression terminates at a target node (relation). See Figure 5.3 for a specication of the syntax for such an expression in Backus{Naur form (BNF). The expressions are commonly occurring patterns, such as SQL-based queries with a FROM ... JOIN ON ... pattern that project attributes from a single relation. The mappings must include a path through the relational model followed optionally by a projection. The most simple path could be a single TABLENAME or could be a sequence of 171 mapping := path [SEPARATOR projection] path := pathcomp [SEPARATOR pathcomp]* pathcomp := TABLENAME | joincond | filter joincond := CONSTRAINTNAME | COLUMNNAME [JOINSEPARATOR COLUMNNAME]+ == COLUMNNAME [JOINSEPARATOR COLUMNNAME]+ filter := COLUMNNAME OP LITERAL projection := COLUMNAME [SEPARATOR COLUMNAME] Figure 5.3: Syntactic structure of expressions used for mappings. join conditions (joincond) and/or conditional lter expressions (filter). Join conditions are specied either by a constraint name (CONSTRAINTNAME) or explicit primary key to foreign key comparison. The constraint name may be decorated with the \directionality" of its usage (i.e., inbound or outbound) from the perspective of the prior component of the path. 
Filter expressions are assumed to be non-join comparisons, i.e., they may only be a comparison (OP) between a named property of the model (COLUMNNAME) and a literal value (LITERAL). Finally, the optional projection is specied by a delimited sequence of named properties (COLUMNNAME) relative to the path. It should also be noted that the exact syntax of the mapping expressions may vary. In Figure 5.2, two examples of these expressions can be seen in syntactically dierent but semantically similar forms. A JSON object for the rooted entity-resolution expression can be seen at top-middle (2), while a URL serialization can be seen at bottom-middle (3). Anatomy of a Boundary Object As specied in Requirement 4, unlike adaptive appli- cations, decoupled applications must interact by passing data, which we may call \boundary objects". But what makes a good boundary object? In order for a boundary object to meet the requirements outlined, it will need to be stable { that is, not necessarily changing with changes to the schema. Since it will not necessarily re ect the current and continually changing schema of the database, it should also be self-describing so that it is eectively a fully, self-contained extract of data. If a client has to go back to another reference source to 172 understand the contents of the boundary object, then the object is susceptible to becoming uninterpreted. If a client has access to an object for which the description of its elements and record of its source are lost, then most likely the object itself is of no use. Provenance Schema Entities Assets Anatomy of a Data Bundle Bibliographic details, attribution, identification of sources, etc. Descriptions of the entities, attributes, data types, relationships, etc. Structured data conforming to a well-known meta-model Unstructured data conforming to a standard or de factor standard file format Figure 5.4: Anatomy of a boundary object. We illustrate the anatomy of a boundary object in Figure 5.4. The object consists of course of the data, which may be delineated into so-called structured data, i.e., entities, and into bundled elements that conform to standard or de facto standard le formats, i.e., digital assets. The object also contains a description of the structured data, i.e., schema, so that all of the entities, their attributes, types, relationships, etc. may be understood by any client that encounters the object. Finally, the object includes basic provenance so that the sources of the data contained in it can be traced. Examples of such objects include Research Objects [10] and Big Data Bags [23] (bags). 5.4.2 Database-Client Interaction In this section, we take a closer look at the interactions between database and client (see Figure 5.5). We also explore the categories of model dependencies or neutrality among database clients. As introduced earlier, we classify clients of the database into categories of model-adaptive, model-neutral, and model-bound. In this architecture, the data service 173 must provide access not only to query and manipulate the entities but also to inspect its schema which should include declarative mappings to direct clients how to apply the model in dierent usage contexts. Deriva's specialized database service (ERMrest [41]), for instance, extends the usual ER model with access policies (e.g., ACLs) and \annotations" which allow for arbitrary key-value pairs to be associated with model element denitions (database, schema/namespace, table, column, foreign key). 
The model can therefore be an- notated with contextualized mappings to give hints or directives to clients for how to interact with the data. Boundary objects dened by these mappings can be used to mediate the in- teractions with the model-bound clients that are unable to directly consume and understand the schema itself. Boundary Object Model-Bound Client Schema Mappings Entities Model-Neutral Client Model-Adaptive Client Data Service Figure 5.5: Overview of the three primary patterns of database-client interaction. Model-Adaptive Interaction We categorize a model-adaptive client as one that inspects a data service to understand the database schema in order to perform operations on data. An adaptive client is one that not only avoids hard-wired model dependencies but can apply its own logic (rules, heuristics, 174 etc.) to determine how to interact with a data service based on its schema. In this case, the mappings provide hints to in uence the way in which the client interacts with the model. For example the model-adaptive, interactive client depicted in Figure 5.2 (upper-right) could read the ER model and determine based on its own heuristics how to display the Experiment entity, but it can also use the \visible columns" annotations (a form of mapping to a display context) to override its default display logic to show or hide specic attributes of the model. Figure 5.6 illustrates an outline of high-level steps that a model-adaptive client performs when interacting with a data service. Conceptually, the client follows a process of retrieving the current state of the database schema and mappings, interprets the model with respect to the various application contexts specied by the mappings, generates potentially complex query statements based on the denormalizations specied by the mappings, fetches query results, and then (in the case of a visual, interactive application) displays the results for the user. Within the current Deriva ecosystem, we provide model-adaptive clients for search, display, data entry, and data curation. Model-Neutral Interaction Not all clients need be fully model-adaptive in order to support co-evolution in a data- centric ecosystem. We further categorize an application as model-neutral, such that it does not hard-wire database query expressions but need not comprehend the full schema. These clients begin by retrieving a specic model mapping (or \annotation" on the schema) from the data service. There may be annotations for describing mappings from the base ER model to bulk export and import renderings, auxiliary services for exporting (meta)data, generating specialized metadata extracts for genome browser integration, and any others per the ecosystem's requirements. These mappings may specify explicit query expression templates that the application may execute in order to fetch query results. ERMrest [41], the relational data service of the Deriva platform, allows clients to pose arbitrary query expressions, in a roughly Select-Project-Join-Aggregation (SPJA) [2] equivalent language, as 175 Get Model Apply Mappings Compose Queries Fetch Query Results Apply Data Presentation Display Data Data Service Model-Adaptive Client Figure 5.6: Conceptual steps enacted by model-adaptive database clients. named Web resources; eectively an on-the- y transient view denitions compatible with any Web client. These lightweight views of data are eortless to maintain as they are executed on- demand. 
In order to co-evolve a model-neutral client that depends on a particular mapping, the data administrator needs only to update the mapping's query expression templates so that the client can use it to generate mapped data in the output model it supports. On each invocation of the model-neutral client, it will generally retrieve its associated annotation from the database and thus update its query templates. Figure 5.7 illustrates the conceptual high-level steps that a model-neutral client enacts in order to process data. Unlike the model-adaptive client, this type of client does not necessarily come endowed with internal logic to understand the native ER model nor be able to adapt the model to its usage scenario. Instead, it begins by retrieving specic model mappings that should encode explicit query expression templates (these templates can be 176 Get Model Mappings Populate Query Templates Fetch Query Results Process Data Output Results Data Service Model-Neutral Client Figure 5.7: Conceptual steps enacted by model-neutral database clients. described by the structure dened in Section 5.4.1). The client lls in the parameters of the query templates (i.e., like the parameters of a prepared statement), executes them to fetch data, then performs its processes and outputs the data per its own application-specic logic. When the ER model of the data service is evolved, only the mappings need to be updated in order for a model-neutral client to continue to interoperate with the data service. Model neutral clients within the current Deriva ecosystem include clients for bulk-upload and -download of data assets and associated metadata, and domain specic data analysis tools. Model-Bound Interaction Inevitably, a data-centric ecosystem will consist of clients that are bound to a specic model due to conventional development methodologies. Such clients must be decoupled from the data service itself, and instead must interact with a stable boundary object as the dened 177 interface contract. A decoupled, model-bound client may be the producer as well as the consumer of a boundary object or \bag" as long as the semantics and structure are agreed upon by both parties of the interaction. In Figure 5.5, the arc between the boundary object and the data service may require a mediating layer to transform the inbound or outbound bags to and from the data service and the decoupled clients. Model-Neutral Export Service Model-Bound Consumer Data Service Model-Neutral Import Service Model-Bound Producer Mediation Services Provenance Schema Entities Assets Boundary Object Model-Bound Clients Figure 5.8: Decoupling and mediation of model-bound clients from the data service via a model-neutral service layer for producing and consuming boundary objects. This mediating layer could therefore become another brittle point in the architecture of a data-centric ecosystem, ideally however, the mediating layer itself could be implemented by model-neutral clients as depicted in Figure 5.8. In fact, the relatively xed transformations required for mediation are ideal for model-neutral clients to perform, as the transformations can be described by a series of query expression templates (as discussed in Sections 5.4.1 and 5.4.2). By pushing the mediation into model-neutral import and export services dictated by model mappings, further ensures that as the ER model is transformed that the bags used as the boundary objects may remain consistent. 
Since the data service is the source of record for both schema and model mappings, they can be evolved together in a synchronized fashion. Within the current Deriva ecosystem, we provide an export conguration for Big Data Bags that is driven by mediating export specications to support interactions with model-bound clients. 178 5.5 Schema Evolution With Model Management The architecture presented above depends on well-dened data models and mappings. To evolve the ecosystem requires that one transform the data model and the model mappings in a consistent way that indicates how to render the model in dierent application contexts. 5.5.1 Model Mappings in the Deriva Platform Here we take a more detailed examination of how model mappings may be realized in the Deriva platform. ERMrest provides interfaces to introspect the database schema and an open set of schema annotations that describe how to map the schema to application concepts. Figure 5.9 presents a more detailed example of a small database schema mapped to multiple application concepts. Schema Annotations Deriva provides Web service interfaces for clients to inspect the underlying database schema. The schema is further embellished with an open set of schema annotations that instruct the client application in how to interpret parts of the schema. For example, a column type may be text type but in fact contain an Uniform Resource Locator (URL) to a data le (i.e., an \asset"). There is a simple column-level annotation tag:isrd.isi.edu,2017:asset to instruct client applications to treat the text value as an URL for purposes of linking to or up- dating the le. More sophisticated annotations, however, exist for the purpose of instructing client applications to map the database schema to a user interface. In many ways, this form of mapping closely resembles the mappings one might encounter in an Object-Relational Mapping (ORM) framework with some notable extensions which will be discussed. 179 Figure 5.9: Example of a table schema (left-hand side) mapped to multiple views or con- ceptual entities (right-hand side). Arrows indicate schema mappings or correspondences between schema elements. Models and Mappings To understand the forms of models and mappings in Deriva, we begin with a comparison to ORM frameworks. An ORM framework provides a mapping from a relational model to an object model in an object oriented programming paradigm. For instance, a table in a relational model is often represented as a class of objects in an object model and each row of the table as an object of the corresponding class. The columns of the table map to the properties of the class denition. Often the foreign key relationships dened in the relational model are translated into object references in the many-to-one direction and into object collections in the one-to-many direction or in cases of many-to-many relationships. 180 With some nuances, this is essentially the extent of conceptual application mappings found in an ORM framework (e.g., Java Persistence API, Rails Active Records, .NET Core Entity Framework). Schema Mapping InDeriva, mappings encompass and extend beyond the expressiveness of the typical ORM framework. For a given table in the relational schema,Deriva mappings may dene several mappings for dierent application \contexts" as noted earlier. For example, Deriva's data entry application relies on the \edit" context which denes a mapping for a table into a rep- resentation suitable for generating online data entry forms. 
Mappings in the "entry" context are nearly consistent with those found in ORM frameworks: the table's columns and outbound foreign key relationships are mapped to a set of columns and entity references that can be used as the basis of user data entry and update of the record. A more embellished context serves an application that displays a record (i.e., row) from a table along with collections of related tables and navigation controls for user actions that can be taken on the primary table's data or that of its related tables' data. In this context, labelled "detail" in the Deriva nomenclature, the columns and both inbound and outbound foreign key relationships are mapped to define a particular presentation order, and they also support arbitrary-length traversal of foreign key relationships going deeper into the schema to expose related data that may require multiple joins of tables to produce. Unlike ORM frameworks, which generally allow definition of relatively simple table-to-object mappings with at most traversals of an association table (a.k.a. join table), Deriva allows any connected relation to be mapped into the conceptual model.

Mapping Fragment

Each contextualized mapping consists of a set of correspondences [53] between elements of the schema, or mapping fragments (cf. [149]). Each mapping fragment specifies a scalar column or a related entity set to be included in the mapping. The expressiveness of these mapping fragments is roughly equivalent to a conjunctive query (CQ) expression, with some restrictions. Following the pattern of the rooted, entity-resolution expression described in Section 5.4.1, the mapping fragment begins "rooted" on the annotated relation, may include an arbitrary sequence of inbound or outbound foreign key relation traversals, and terminates on a column (i.e., a single free variable in CQ terminology). The mapping fragment allows options to instruct the client to project only the terminating column, to project all attributes of the related entity set, or to apply an aggregate function and project the resulting values of the aggregation. The latter is similar to a conjunctive aggregate query (CAQ) expression. In addition, the mapping fragment syntax allows filters on the attributes of the related entity set. These mapping fragments are composed into a cohesive mapping designated for a particular application context.

5.5.2 Motivating Example

In this example, a research team decides to manage their data on biological experiments and results using Deriva, which provides a Web service backed by a relational database [41] and includes model-based user interfaces [146] specified by "schema annotations" that describe how to map the schema to the user interface. The schema annotations include separate mappings for each usage "context", including "detail" for displaying complete information on an entity and its related entities, "compact" for a short-form display of an entity useful in search and browse interfaces, "entry" for web form specification, and several other variants. The research team decides to get up and running quickly with a single table of experiments with columns for the description (desc), gene under investigation (gene), and list of file names for the results (files). Through Deriva, users may enter or edit data using web forms, search and browse over experiments by composing queries over the database with user-friendly Web-based search interfaces, and view the details of experiments they find interesting.
For each usage context, the team defines mappings for their desired representation of the database schema in that context. For brevity of the example, we limit the illustration (Figure 5.10) to the table schemas and the schema mappings for only the experiment's "detail" context. In reality, there would be mappings for each of the contexts of the experiment relation and mappings for the other relations defined in the database schema.

Figure 5.10: Simple example of a relational schema for experiments, mappings from schema to conceptual model (dashed lines with arrowhead), and the conceptual model of experiments. Three separate scripts (1, 2, 3) show operations performed to co-evolve the schema and its mappings to the conceptual model.

The team quickly realizes that the experiment description (desc) should be renamed overview and that the table should include an explicit protocol column for capturing information on the specimen preparation, data acquisition method, data analysis steps, and other parameters of the experiment. They run a script (step 1) that renames the column and swaps the mappings from the old column name to the revised name. It then creates the new column for protocol and introduces a mapping for it. They also decide to fix the ordering (not depicted in the script) so that protocol follows immediately after overview in the mapping.

As researchers in the lab record more of their data, they realize that the nomenclature of genes reflects personal biases between preferred names and at times suffers from typos and other data entry errors. They decide that it is time to adopt a controlled vocabulary so that the team can all use the same terms to identify genes. They create a table for gene nomenclature (gene) that includes both a preferred name and a collection of synonyms (not depicted) by importing the nomenclature from a spreadsheet acquired from an authoritative source (gene csv). They run a script (step 2) that automatically matches the terms used in the gene column to the new gene table and replaces the values with foreign key columns that reference the primary key (id) for each gene term. The script also updates the mappings for the experiment to replace the properties with references to the gene table's name column.

Eventually, the team finds that the free-text listing of file names in the files column is becoming unwieldy as the amount of data in the system grows. They would also like to be able to edit, add, or remove individual files more easily, which would be possible if they were represented as a distinct set of entities. They run a script (step 3) that creates a new table from the individual values (i.e., atoms) parsed from the free-text files column. It automatically detects the primary key in the parent table (experiment) and creates a foreign key column (exp id) that references the primary key (id) of the parent table. The script also automatically maps the related files table into the experiment's "detail" context. They drop the original files column, which automatically prunes mappings that were based on it.
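To make the scope of these three scripts concrete, the runnable sketch below (using SQLite) shows what each step must accomplish by hand when no co-evolution framework is available: every step changes the schema, migrates data, and patches a stand-in for the "detail" mapping. CHiSEL and Deriva automate this coupling; the SQL statements, the in-memory database, and the detail_mapping list here are purely illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute('CREATE TABLE experiment (id INTEGER PRIMARY KEY, "desc" TEXT, gene TEXT, files TEXT)')
    cur.execute("INSERT INTO experiment VALUES (1, 'pilot run', 'Shh', 'a.fastq b.fastq')")
    detail_mapping = ["desc", "gene", "files"]      # stand-in for the annotation

    # Script 1: rename desc -> overview, add protocol, patch the mapping.
    cur.execute('ALTER TABLE experiment RENAME COLUMN "desc" TO overview')
    cur.execute("ALTER TABLE experiment ADD COLUMN protocol TEXT")
    detail_mapping = ["overview", "protocol", "gene", "files"]

    # Script 2: introduce a gene vocabulary and replace terms with foreign keys.
    cur.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    cur.execute("INSERT INTO gene (name) SELECT DISTINCT gene FROM experiment")
    cur.execute("ALTER TABLE experiment ADD COLUMN gene_id INTEGER REFERENCES gene(id)")
    cur.execute("UPDATE experiment SET gene_id = "
                "(SELECT id FROM gene WHERE gene.name = experiment.gene)")
    detail_mapping = ["overview", "protocol", ("gene_id", "gene.name"), "files"]

    # Script 3: atomize the free-text files column into a related file table.
    cur.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, "
                "exp_id INTEGER REFERENCES experiment(id), name TEXT)")
    for exp_id, files in cur.execute("SELECT id, files FROM experiment").fetchall():
        cur.executemany("INSERT INTO file (exp_id, name) VALUES (?, ?)",
                        [(exp_id, name) for name in files.split()])
    cur.execute("ALTER TABLE experiment DROP COLUMN files")   # requires SQLite 3.35+
    detail_mapping = ["overview", "protocol", ("gene_id", "gene.name"),
                      ("file.exp_id", "file.name")]           # related-entity fragment
    conn.commit()
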
The above example illustrates several aspects necessary for coupled evolution of model mappings instigated by the evolution of the database schema. Table 5.1 summarizes the requirements needed to support the coupled evolution of models and mappings. In the next sections, we will introduce a set of model management operators to support these requirements.

Table 5.1: Summary of requirements for conceptual model evolution instigated by an evolution of the database schema.

Schema Evolution | Schema Mapping Adaptation
Add column or constraint | Introduce the column or constraint into the conceptual entity that represents the relation.
Rename column or constraint | Replace the column or constraint names in the mappings with their new names.
Drop column or constraint | Remove any mapping fragments that rely on the removed column or constraint from any conceptual entity in the conceptual model mappings.
Create table | Introduce new mappings for conceptual entities to represent the relation in the conceptual model.
Rename table | Rename the conceptual entity that represents the table in the model mappings.
Drop table | Remove the set of mappings for the conceptual entities based on the relation, and remove any mapping fragments that rely on any column or constraint in the relation.
Create table as ... | Introduce a set of mappings to new conceptual entities that are isomorphic to the transformations on the source relation.

5.5.3 Model Management Operators

Model Management [14] has been defined around the problem of data exchange and operations suitable for repairing, updating, or propagating mapping changes when a source or target schema changes. The usual model management operators (MMOs) assume that evolution of the source or target schema happens out-of-band. MMOs often rely on schema matching as a key part of automating the migration of changes. While the formal definitions of MMOs do not explicitly describe how to introduce new constructs, closely related work does apply model management to conceptual model mappings [149]. Here we ask the question: what model management operations are needed to facilitate mapping transformations during schema evolution in order to produce an isomorphic evolution of the conceptual model?

We define a mapping system here to be a set of mappings from a source database schema to a conceptual application model. Each mapping in the mapping system is rooted on a table in the database schema and may traverse an arbitrary number of related tables (i.e., related by foreign key references) and terminate on a column or an aggregate function applied to a column (e.g., a count, sum, array or other function on the target column). Let S be a schema for database D and M be a schema mapping from S to a conceptual model T. We define M as a set of mappings v from table schema to conceptual entity, where each mapping v is in turn defined as a set of mapping fragments q that encode a query expression for a single attribute value or set of related entities. For simplicity, a query is represented as a set of symbols s, where a symbol may represent a column or constraint. A column symbol defines the free variable (or projection) of the mapping fragment while a constraint defines a relationship.

Next, we define what might be viewed as auxiliary routines with respect to the established set of MMOs described earlier. These routines facilitate changes to the model along with schema changes. Consider the conventional assortment of operations found in a Data Definition Language (DDL), such as create, drop, rename, and alter table; create, drop, and alter column; and Integrity Constraint Modification Operators (ICMOs) such as create, drop, and rename key or foreign key. Given that a mapping system from relational schema to conceptual application model relies first and foremost on these constructs, we need a set of MMOs that can support evolution under these operations.
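The following minimal encoding of the mapping system just defined is used to make the operator semantics in the next subsection concrete. A model M is represented as a dictionary of mappings keyed by the conceptual entity and context they produce; each mapping v is a set of fragments q; each fragment is the frozenset of symbols s (column or constraint names) that its query expression mentions. Real Deriva mappings carry more structure (ordering, aggregation, filters); this is only a sketch with illustrative symbol names.

    Fragment = frozenset

    M = {
        ("experiment", "detail"): {
            Fragment({"experiment.overview"}),
            Fragment({"experiment.protocol"}),
            # traverses the outbound foreign key to gene and projects its name
            Fragment({"experiment_gene_fkey", "gene.name"}),
            # traverses the inbound foreign key from file to list related files
            Fragment({"file_experiment_fkey", "file.name"}),
        },
    }
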
Operator Definitions and Semantics

We summarize the primitive model management operators in Table 5.2. Here we define the semantics of the operators using set operators and set-builder notation. The first two operators, Union and Difference, may be described semantically in terms of the set operations of the same name. When adding new relations to a schema, they may each have corresponding mappings to one or more contexts (detail, compact, etc.). The existing model mappings must be augmented with the mappings for the newly created relations: Union(M1, M2) takes two sets of mappings and returns the combined set of mappings in both M1 and M2. Conversely, when dropping a relation from a schema, the mappings rooted on that relation must also be removed from the set of mappings sourced on that schema: Diff(M1, M2) takes two sets of mappings and returns the reduced mappings from set M1 less those of M2.

Table 5.2: Summary of model management operators for coupled evolution of model mappings.

Operator | Description
Union | Combines mappings from two models.
Difference | Returns the mappings that result from removing a subset.
Graft | Introduces mapping fragments into an existing mapping in a model.
Prune | Eliminates all mapping fragments that contain a given symbol (e.g., the representation of a column or constraint).
Replace | Removes a symbol from any mapping fragment and in its place introduces an alternative symbol.

Schema evolution involves not only creating and dropping whole relations but also altering relations by adding and dropping columns or constraints. A column or constraint is sufficient to represent a new mapping fragment. For example, adding a new column to a relation should be reflected in the conceptual model with the introduction of a new property mapped from the column. The semantics here may be expressed simply in set-builder notation. Graft_{v,q}(M) takes a mapping fragment q and "grafts" it into the mapping v in M. Its semantics can be described as $\{v' \mid v' \in M \wedge \neg id(v, v')\} \cup \{v'' \cup \{q\} \mid v'' \in M \wedge id(v, v'')\}$, where id(a, b) is a binary predicate that tests the unique identity of two mappings. Conversely, when dropping a column or a constraint from a relation r in schema S, potentially many mappings may be invalidated by the removal of the column or constraint. The mapping set M can be repaired by removing any offending mapping fragments that would violate the updated schema S'. Prune_s(M) removes any mapping fragments from any mapping in M that include a specified symbol s. Its semantics may be described as $\{\{q \mid q \in v,\, s \notin q\} \mid v \in M\}$.

The last operator to be discussed supports schema changes where column or constraint names are changed, which can potentially affect mappings throughout the entire set of model mappings. For example, when a column is renamed, it may invalidate any number of mapping fragments that use the column in a projection or selection. In order for the model mappings to return to a valid state, the symbolic representation of the column must be updated to the new name. Replace_{s,t}(M) finds and substitutes a symbol s with symbol t in any mapping fragment q in any mapping v in the set of model mappings M. Its semantics can be described as $\{\{\{s' \mid s' \in q \wedge s' \neq s\} \cup \{t \mid s'' \in q \wedge s'' = s\} \mid q \in v\} \mid v \in M\}$. When an entire relation is removed from the database schema, the Diff and Prune operators can be used together to satisfy the required model changes.
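The sketch below is a direct transcription of these operator semantics over the simplified representation introduced earlier (a model as a dictionary of mappings, a mapping as a set of fragments, a fragment as a frozenset of symbols). It illustrates the definitions only; it is not the Deriva/CHiSEL implementation, and the identity test is simplified to dictionary-key equality.

    def union(m1, m2):
        """Union: combine the mappings of two models."""
        return {**m1, **m2}

    def difference(m1, m2):
        """Difference: drop the mappings of m2 from m1."""
        return {key: v for key, v in m1.items() if key not in m2}

    def graft(m, key, q):
        """Graft_{v,q}: add fragment q to the mapping identified by key."""
        return {k: (v | {q} if k == key else v) for k, v in m.items()}

    def prune(m, s):
        """Prune_s: remove every fragment that mentions symbol s."""
        return {k: {q for q in v if s not in q} for k, v in m.items()}

    def replace(m, s, t):
        """Replace_{s,t}: substitute symbol s with t in every fragment."""
        return {k: {frozenset(t if sym == s else sym for sym in q) for q in v}
                for k, v in m.items()}

    # Example usage, continuing the earlier sketch of M:
    #   M2 = prune(M, "experiment.files")                          # drop column
    #   M3 = replace(M2, "experiment.desc", "experiment.overview") # rename column
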
Algorithm for the Prune Operator

For brevity, we describe only the algorithm of the Prune operator in detail (see Algorithm 4). It serves well as a representative example of the steps needed by the operators to adapt the model mappings. In Deriva, there are partial mapping fragments called "source definitions" that can be reused in other mapping fragments. These may be referenced by any number of mapping fragments in the same relation; in some sense, they are mapping templates to be realized within a given (contextualized) mapping of the relation. Thus, the Prune algorithm must search for symbol matches within these templates too and, if found, must prune any mapping that depends on that template. This search for dependent mappings is less computationally expensive as it is restricted to only those mappings of the same relation.

Algorithm 4: Algorithm for the Prune operator
  Data: a model M, and a symbol s
  Result: model M that does not contain the symbol s
  foreach match in find-symbol(M, s) do
      match.container.remove(match.mapping);
      if match is a mapping template then
          foreach dep in find-deps(match) do
              dep.container.remove(dep.mapping);
          end
      end
  end
  return M;

The function find-symbol searches all mapping fragments in M, not just those rooted on a particular relation. The rationale is that mappings that allow arbitrary join expressions can potentially lead to sprawling networks traversed to map elements to the conceptual model. For added efficiency, we could limit the search algorithm to just the connected components for the root of each mapping; however, in practice most of the usage scenarios we encounter are fully connected networks or nearly so. The function find-deps, however, searches only those mappings in M that are rooted on the relation identified by the match's anchor property. The rationale here is that the syntax of mappings in our implementation only allows reuse of mapping templates within the realm of a single relation. This is by definition, as a mapping is always rooted on a relation and therefore a reusable mapping template must also be rooted on the same relation. The function is defined recursively: it searches all reusable mapping templates to determine the transitive closure over the dependency graph of reusable mapping templates, and then, with the full set of invalidated mapping templates, it searches all regular mapping fragments for those that are invalidated by the removal of any of the now-invalidated mapping templates and returns this full set.

With the above definitions of MMOs, model mappings can be co-evolved along with a schema under evolution. This re-interprets MMOs not as retroactive repair of model mappings that hinges on the effectiveness of schema matching algorithms, but as proactive transformations that are coordinated with schema evolution.

5.5.4 Extending the Schema Modification Operators

As described in Chapter 4, CHiSEL introduces a relational algebra of Schema Modification Operators (SMOs) designed to reduce the effort of performing complicated schema evolution tasks.
The algebra is divided between "primitive" (Section 4.3.3) and "composite" (Section 4.3.4) operators, where the semantics of the primitive operations are extended from the relational algebra [2]; building on these primitive SMOs, an open set of composite SMOs is defined by algebraic expressions over other (primitive or composite) operators of the algebra. CHiSEL's SMO expressions are analogous to the SQL CREATE TABLE <name> AS <expression>, where the expression may be arbitrarily complex.

Here we extended the SMO definitions with both integrity constraint and model mapping transformations. Thus not only do the revised SMOs evolve the schema and data, but they also co-evolve the integrity constraints and model mappings. To the best of our knowledge, this is the first attempt to extend SMO definitions to support comprehensive co-evolution of database-application ecosystems. We began by integrating MMOs into CHiSEL's primitive SMOs; see Table 5.3 for a description of the changes to the primitive operations.

Extending the Primitive SMOs

To extend the primitive SMOs (Section 4.3.3) of the algebra uniformly, we introduced a general procedure (Morph) for morphological transformation of the relation schema. The input to the Morph procedure is a relation schema and a restriction and/or renaming of attributes (i.e., columns) in the schema. The procedure transforms the structure of the output relation schema with respect to its column definitions, key constraints, foreign key reference constraints, model mappings, additional schema annotations (Section 5.5.1), and access control policies. The Morph procedure is described in Algorithm 5.

Table 5.3: Revised definitions of primitive SMOs to address changes to Integrity Constraints (ICs) and Model Mappings (MMs).

SMO | Required changes to Integrity Constraints (ICs) and Model Mappings (MMs)
Select | All ICs and MMs belonging to the relation are preserved.
Union | All ICs and MMs belonging to the relation are preserved.
Distinct | All ICs and MMs belonging to the relation are preserved.
Deduplicate | All ICs and MMs belonging to the relation are preserved.
Project | ICs that cover a subset of columns in the projection are preserved, and MMs that involve the projected attributes or preserved ICs are preserved.
Rename | ICs are altered to include the renamed variant of their constituent attributes, and symbol names for columns and ICs are swapped in mapping fragments in MMs of the relation.
Join | Overlapping attributes from the relations are renamed to avoid conflicts, and changes to ICs and MMs are performed following the semantics of the rename operator.
Nest | ICs that cover a subset of columns in the grouping attributes are preserved, and MMs that involve the grouping attributes or preserved ICs are preserved.
Unnest | ICs that cover a subset of the columns minus the attribute to be unnested are preserved, and MMs that involve the attributes not being unnested or preserved ICs are also preserved.

The algorithm first generates a list of column definitions by copying any column definitions in the base relation that match a restricted or renamed attribute. Each column definition includes its name, domain, nullability, default, comment, access control lists (ACLs), and schema annotations. Next, the algorithm generates a list of key definitions by copying them from the base relation where all unique columns are preserved in the morphism. The columns of the key are defined based on renamings or the original column names.
A new and temporary name is generated for the key (following the common pattern of "table name column names... key") and the new name is mapped to the old names (plural, as constraints may have more than one name in some database implementations).

Algorithm 5: Algorithm for the Morph operation
  Data: relation r, attributes A
  Q := an empty set of new-to-old names;
  D := an empty set of dropped names;
  C := an empty list of columns to be defined;
  foreach column c in r do
      if c in A then
          if c is renamed in A then
              Q[new name for c in A] := c.name;
              c := copy and rename c;
          end
          C.append(c);
      else
          D := D ∪ {c.name};
      end
  end
  F := an empty list of foreign keys to be defined;
  foreach fkey f in r do
      if all columns of f are in A then
          f := copy of f;
          f.cols := renamed f.cols if renamed in A;
          name := make_name('placeholder', f.cols);
          Q[name] := f.name;
          f.name := name;
      else
          D := D ∪ {f.name};
      end
  end
  Let K be the revised set of key definitions and update Q and D by analogous procedures as above...;
  mappings := copy of r.mappings;
  M_stub := create-model-stub(mappings, F);
  foreach name in Q do
      Replace(M_stub, Q[name], name);
  end
  foreach name in D do
      Prune(M_stub, name);
  end
  return (C, K, F, mappings);

Keys whose set of unique columns is not preserved in the revised column definitions are skipped and their respective constraint names are added to a list of dropped constraints. Nearly identical steps are taken for foreign key constraints. Next, the mappings rooted on the new relation are morphed relative to the revised columns and constraints. A deep copy of the original relation's annotations is taken, and a "stub" model is created that simulates a database schema while only knowing the specification of a single relation schema. The key feature of the stubbed model (M_stub) is to allow introspection of the foreign keys defined on a table relation without the referenced primary key tables actually being defined. For all renamed columns and constraints, the Replace operator is executed, and for all dropped columns and constraints, the Prune operator is executed. Recall that the MMO operations take models as input and output, hence the utility of the model stub to simulate a full database schema without having to actually instantiate an entire schema. Finally, the transformed relation schema is returned based on the placeholder name, the revised column, key, and foreign key definitions and annotations, and the unaltered comment and ACLs of the input relation.

The procedure described above is performed essentially verbatim by the Project SMO, since it allows both a restriction of attributes and a renaming of them simultaneously. The Rename SMO is simply a Project that renames but does not drop any attributes of the input relation. The Join and SimilarityJoin SMOs first identify name collisions between attributes and rename them "{left|right} original name". They then run the Morph routine on the left-hand relation, drop the keys, then run the Morph routine on the right-hand relation and merge the resulting column definitions and foreign key definitions. They drop all keys since, by definition, joins may produce multiple rows for either input relation, thus potentially violating uniqueness constraints. The Union SMO by definition preserves all attributes, so the only change to the relation schema is the renaming of constraints and the subsequent changes to mappings.
The Distinct, Deduplicate, and Nest SMOs morph the relation similarly to Project, where only the grouping or distinct-on attributes are initially preserved along with their related constraints and mappings. Then, in the case of the Nest SMO, the nested attribute is added back to the column definitions while changing its type to an array type of the same base type as the domain of the attribute (e.g., int[] from int). Finally, the Unnest SMO begins by morphing the relation given all attributes except the unnested attribute. Then the unnested attribute is added to the revised relation's column definitions. All key definitions are dropped since, by definition, unnesting a non-atomic attribute results in 0-to-many output tuples for each input tuple.

Since CHiSEL SMOs are algebraic and CHiSEL expressions may be composed over an arbitrary combination of the SMOs, the Morph routine may be executed many times in the evaluation of each statement in a script. Note, however, that because the model is "stubbed" for the MMO operations, only the input relation's mappings are considered rather than the entire database schema's. This way the Prune and Replace MMOs only search and manipulate mappings within a narrow scope limited to the current statement of CHiSEL expressions. Lastly, the temporary placeholder names for constraints that are defined in the Morph routine must be finalized. The Finalize routine (not illustrated) takes the input relation and the (assigned) table name, renames the table definition, and for each constraint (key or foreign key) renames the constraint and executes the Replace MMO over a stub model based on the relation schema.

Extending the Composite SMO Formulas

We now summarize how we integrated MMO operations into the composite operators (Section 4.3.4). Ultimately, the composite operators are defined by a set of functional pattern matching rules (described in Section 4.5) where the antecedent is the name and parameters of the composite operator and the consequent is the formula defining the composition of SMOs that determines its behavior. We extended the original CHiSEL formulas by first introducing a set of rules for ICMOs and MMOs (see Table 5.4) and then revising the SMOs by introducing these operations into the consequent of the corresponding functional pattern matching rules.

Table 5.4: Summary of integrity constraint modification and model management primitives for SMOs.

Operator | Description
AddKey_C | Adds a key based on a set of unique columns (C), and adds a key-to-property correspondence to mappings.
DropKey_k | Drops a key constraint (k).
AddForeignKey_{C,p,K} | Adds a foreign key based on a set of columns (C) to reference a set of key columns (K) of a table (p), and adds a fkey-to-property correspondence to mappings.
DropForeignKey_f | Drops a foreign key constraint (f).

As an example, the Reify operator takes a subset of attributes of a relation, projects those attributes, and forces set semantics (i.e., distinct tuples) based on a subset of the projected attributes. This new relation logically consists of unique attributes by virtue of forcing set semantics on it, but it previously lacked the ICMO to add a key constraint. The functional pattern matching rule of Reify was revised as follows:

    ('Reify(child, keys, attributes)',
     lambda child, keys, attributes:
         AddKey(
             Distinct(
                 Project(child, keys + attributes),
                 keys),
             keys))

It introduces the new AddKey operation into the definition of Reify so that a new key constraint is created based on the columns in the keys set.

Relational Formalism for DDL

For completeness, we further extended the algebraic formulas for the schema modification operators to include Data Definition Language (DDL) operations to create, alter, drop, and rename tables. Table 5.5 summarizes the DDL formalisms. To create a new relation in the database, a specification of the relation schema (i.e., table definition) and extended attributes (i.e., schema annotations, mappings, constraints, policies, etc.) is assigned to a relation name in the catalog. To alter a relation (e.g., change a column name, add or drop a column, etc.), a transformed relation is assigned to an existing relation in the catalog. In order for the semantics of the operation to be understood as an alter rather than a replacement of an existing relation with a new one, the source relation of the expression must match the assigned destination relation (in the next section, we will see the rules that accomplish this). To drop a relation, a NULL relation (denoted Nil) is assigned to an existing relation in the catalog. Finally, to rename a relation, a rename expression must be assigned to a relation in the catalog.

Table 5.5: Formal definitions for DDL operations.

Operation | Formula | Comments
Create | r := R | Assignment of relation definition
Alter | r := Project(r, ...) | Reassignment of projected relation
Drop | r := Nil | Assignment of NULL relation
Rename | s := Rename(r, ...) | Assignment of renamed relation

Transformation Rules for DDL Operations

To support these new DDL formalisms, we defined pattern matching rules to identify mutations to existing relations (see Listing 5.1). The expression planner in CHiSEL (Section 4.5) is based on functional pattern matching rule definitions. New rules are introduced in the form of a tuple (expression, function), where expression is a string that specifies the pattern to be matched and function is a handler function to be executed and returned when the expression matches. These rules are executed against symbolic expression trees that CHiSEL produces from parsing the user-submitted schema evolution statements. Multiple rule sets are executed by the planner, some for logical optimization for example, and lastly for transforming the symbolic expression trees into physical operator plans. It is these final rules that were modified in order to support the extension for DDL operations. This approach makes extending CHiSEL relatively easy, as new operations are added to the logical optimization or physical translation rules.

The rule for translating a symbolic assignment expression into a physical create table operation can be seen in Listing 5.1, lines 1-7. The matching expression must be a symbolic Assign with a child expression of type str (i.e., a string). This will match on a child that consists of a string serialization of a table definition, which is specified in a JSON document format. The handler (lambda) function instantiates a Create physical operator with a child Metadata operator based on the deserialized JSON document string (json.loads(child)). The rule for transforming an assignment expression into an alter table operation can be seen in lines 8-19. Here the matching expression must be an Assign (line 9) of a Project (line 10) of an ERMrestExtant (lines 11-14), where the source and destination schema and table names are the same (lines 15-16). ERMrestExtant is the symbol that represents an extant (i.e., existing) table in the ERMrest database service.
The handler (lines 17-18) is abbreviated to reduce unnecessary detail. It instantiates an Alter operation with a child ERMrestProjectSelect operation. The latter is a fused select and project operator for the ERMrest service that is used here primarily for the transformation of the table metadata (i.e., updating constraint definitions, schema annotations, etc.). The rule for transforming an assignment expression into a rename table operation is similar to the alter operation and can be seen in lines 20-29. It matches on an Assign (line 21) of a Rename (line 22) of an ERMrestExtant (lines 23-26). Its handler also instantiates an Alter operator with a child ERMrestProjectSelect, because the execution of the alter operation is designed to handle table renames as well as other table definition changes. Finally, the rule for transforming an assignment expression into a drop table operation can be seen in lines 30-34. Here the expression matches on an Assign of a special Nil symbol to a named schema and table. Its handler instantiates a Drop operation that takes a Metadata operator as input.

Listing 5.1: Rule definitions to translate (logical) assignment expressions into Create, Alter (and Rename), and Drop (physical) operators.

     1  # Create relation rule definition
     2  ('Assign(child:str, schema, table)',
     3   lambda child, schema, table:
     4       Create(
     5           Metadata(json.loads(child)),
     6           schema, table)
     7   ),
     8  # Alter relation rule definition
     9  ('Assign('
    10   '  Project('
    11   '    ERMrestExtant('
    12   '      catalog, src_schema, src_table),'
    13   '    attributes'
    14   '  ), dst_schema, dst_table)'
    15   'if (src_schema, src_table) == '
    16   '   (dst_schema, dst_table)',
    17   lambda ...:
    18       Alter(ERMrestProjectSelect(...), ...)
    19   ),
    20  # Rename relation rule definition
    21  ('Assign('
    22   '  Rename('
    23   '    ERMrestExtant('
    24   '      catalog, src_schema, src_table),'
    25   '    attributes'
    26   '  ), dst_schema, dst_table)',
    27   lambda ...:
    28       Alter(ERMrestProjectSelect(...), ...)
    29   ),
    30  # Drop relation rule definition
    31  ('Assign(Nil(), schema, table)',
    32   lambda schema, table:
    33       Drop(Metadata(...), ...)
    34   ),
        ...

5.5.5 Implementation

Previously, CHiSEL's SMOs evolved the relation schema only in terms of the attributes, domains, nullability, comments, defaults, and ACLs, as well as the keys and foreign keys that could be satisfied by the revised attributes produced by the operator. The morphological transformations, however, were loosely implemented by each SMO, did not consider constraint naming, and did not attempt to produce an isomorphic transformation of the model mappings in the relation's schema annotations. In the revised implementation, the Morph and Finalize routines are now uniformly defined and shared among the SMO implementations for consistency, and they carefully preserve mappings found in each relation's schema annotations.

Also, CHiSEL previously provided limited functionality beyond its novel set of SMOs. In order to perform more basic DDL operations (e.g., create table, drop table, etc.), one needed to use Deriva's lower-level Python API. For completeness, CHiSEL now provides all common DDL operations in addition to its richer set of SMO operations. For example, it now provides interfaces for adding, dropping and altering integrity constraints explicitly; adding, dropping, and altering columns; creating, dropping, renaming and altering tables; moving tables to different schemas; and creating or dropping schemas (a.k.a. namespaces). In addition, CHiSEL now integrates its MMO operators into these DDL operations.
For example, renaming a column will invoke the Replace MMO to rename the column as it may appear within mappings throughout the overall model. CHiSEL is available as open source software and is released on the Python Package Index (PyPI) software repository as deriva-chisel (https://pypi.org/project/deriva-chisel/).

5.6 Evaluation

To evaluate the utility of our approach, we first analyzed the schema and model mappings in several Deriva data management systems supporting active research projects or consortia. Next, we present a qualitative comparison of CHiSEL and a database migration utility on several key tasks based on the earlier motivating example.

5.6.1 Schema Mappings in Scientific Databases

We analyzed the usage of model mappings in several real-world deployments of the Deriva data management platform. Each deployment has been in operation to support a scientific application, typically as the primary data resource for their research.

Table 5.6: Summary of deployments evaluated.

Deployment | Description
CIRM | A microscopy Core Facility supporting regenerative medicine, particularly with respect to kidney disease.
PDB-dev | A prototype archiving system for structural models collected by the Protein DataBase (PDB).
PBCC | A research consortium centered on pancreatic beta-cell modeling.
Synapse | A collaborative investigation seeking to map the whole synaptome of a model organism.
FaceBase | The data resource for international craniofacial research.
GUDMAP | The data resource for the Genito-Urinary Molecular Atlas Project (GUDMAP) and Re-Building a Kidney (RBK) consortia.

Overall, we found extensive usage of mappings from table schemas to the contextualized representations supported by Deriva's model-based user interface application (see Figure 5.11). In most deployments, the vast majority of tables have mappings associated with them. Not surprisingly, the majority of mapping fragments (i.e., "sources" in Deriva parlance) are simple columns, with the next largest category being direct (i.e., foreign key) references, then relationships via associations (a.k.a. join tables), and finally the arbitrary join mappings (labelled "N-Join") unique to Deriva (see Figure 5.12).

Figure 5.11: Overview of mappings used by deployments.
Figure 5.12: Types of mapping fragments in use.

Again, in most deployments we found that the vast majority of tables are mapped to multiple representation "contexts", which could not be supported by typical ORM-style mapping systems. Additionally, in most deployments nearly 20% to 30%, and in one case over 50%, of tables relied on mappings that included either arbitrary joins or aggregates in their mapping fragments (see Figure 5.13). Figure 5.15 shows that, on average, 2 to 4 contexts are specified per table (note that 1 is the minimum for any mapping definition in Deriva and that Deriva supports a total of 5 contexts for various search, browse, and edit modes). Figure 5.14 shows that when arbitrary joins are used in mappings, they average between 3 and 4 but can range up to 7 traversals of foreign key references. Maintenance of these non-trivial mappings further reinforces the need for frameworks that can evolve mappings along with schema and data evolution.

Figure 5.13: Non-trivial mappings.

Since Deriva gracefully handles errors in the mappings by simply ignoring a mapping fragment that has an error and omitting it from the overall contextualized mapping definition, we examined the model mapping errors present among the deployments (see Figure 5.16).
These errors took the form of an erroneous column name or foreign key name within a mapping fragment. Although the oldest and most stable deployment (CIRM) did not have any model mapping errors, all other deployments had errors in their usage of model names.

Figure 5.14: Usage of arbitrary joins in mappings.
Figure 5.15: Usage of multiple "contexts" in mappings.
Figure 5.16: Errors found in model mappings.

5.6.2 Comparison Between CHiSEL and EF Core Migrate

We now present a qualitative comparison between CHiSEL and the database migration feature in Entity Framework Core (EF Core, https://docs.microsoft.com/en-us/ef/), the ORM framework for the .NET platform. Using the EF Core migrate utility, a developer changes the C# class definitions representing "model" objects mapped to database tables. They run migrate, which tracks the history of the database schema in a special database table created in the user database. It evaluates the changes between the last version of the table schema and its corresponding class representation in an attempt to infer the changes required to the table schema, such as adding, renaming or dropping a column or constraint. After determining the changes, it produces a migration script to evolve the database "Up" or "Down" so that developers can migrate or roll back changes to the database schema. The capabilities of migrate are limited in scope to model evolution (manually performed by the user) and corresponding schema evolution (scripts automatically generated by the migrate utility). In contrast, as described earlier, users of CHiSEL write the schema evolution script in a user-friendly scripting language with high-level operations, and CHiSEL's operations evolve the schema, data, and now schema mappings.

We illustrate the comparison by continuing the narrative from Section 5.5.2 and then summarize the results in Table 5.7. The research team decides to track researchers' contributions for the described experiments using the following steps:

1. Create Table: create a table PI with Id, Name, Role, Affiliations, Phone, Street, City, and State columns.

2. Rename Table: rename PI to Person so that the table includes all members of the research team and students.

3. Rename Column: change column Name to FullName for added clarity.

4. Add Column: add column ORCID to include global, resolvable identifiers for each person.

5. Reify: create a table Address with columns Street, City, State, and a reference to the Person it belongs with in a one-to-many (1:N) relationship, since the address fields represent a distinct concept and a person can have more than one address (e.g., office, home, lab, etc.). Also, drop the address columns from Person.

6. Change Cardinality: create an association table (a.k.a. join table) with foreign key references to Person and Address to allow many-to-many relationships, since people can share and reuse addresses (e.g., shared lab spaces or offices).

7. Normalize: create a table Affiliation with a reference to the Person table and split the values of the original Affiliations column into separate rows in the new table, since the free-text values were difficult to enter and fix and led to messy entries. Also, drop Person's Affiliations column.

8. Create Vocabulary: create a table Role with columns for name and synonyms. Populate it with the terms found in Person's Role column, since in practice the team found a multitude of variants on the same concepts (e.g., "PI", "pi", etc.) that made it hard to find data consistently.
9. Align to Vocabulary: last, they match the terms in the Person table's Role column to the appropriate row of the new Role vocabulary table and create an association table to enforce consistency and allow for multiple roles per person.

Table 5.7: Evaluation of Schema (S), Data (D), and Model (M) evolution between CHiSEL and EF Core migrate.

Type of Operation | CHiSEL: S D M | EF Core: S D M
1. Create Table | x n/a x | x n/a x
2. Rename Table | x x x | * * x
3. Rename Column | x x x | x x x
4. Add Column | x n/a x | x n/a x
5. Reify | x x x | * - x
6. Change Cardinality | x x x | * - x
7. Normalize | x x x | * - x
8. Create Vocabulary | x x x | x - x
9. Align to Vocabulary | x x x | * - x

We implemented the above hypothetical schema evolution tasks in both CHiSEL and EF Core. The code and data for each of the evaluations are available online (https://drive.google.com/file/d/1BIK5-lz02Z21QLaKNZRzh71uU65PtcnB). We scored the outcome of each step as successful (x), limited (*), unsuccessful (-), or not applicable (n/a). Unsuccessful operations indicate data loss (without additional intervention). Limited operations could not be semantically expressed or required additional manual intervention. The comparison demonstrates that CHiSEL can support schema and mapping evolution to a similar degree as EF Core, while also supporting the evolution of data, which is out of scope for EF Core migrations (see Table 5.7).

5.7 Case Studies

Several deployments that use the described architecture exist today, and all have undergone continual evolution of their database schemas with corresponding evolution of the database-dependent applications built around them [22]. We illustrate the utility of the approach in case studies that each highlight specific aspects of the approach. The first case study provides an examination of the major schema evolution events in one of our deployments, the second looks at the role of boundary objects in enabling decoupled actors to interact, and the third and fourth illustrate the utility of integrating schema evolution with model management.

5.7.1 Bioscience Data Hub

FaceBase is a comprehensive online resource for craniofacial and dental research [21]. The FaceBase Data Hub is built on the Deriva platform and exhibits the architecture described above. Over the past 5 years, the database schema has undergone continual evolution, with changes occurring several times each year. In particular, the database schema went through three major transformations. In Figure 5.17, we illustrate the core elements of the three major schema versions (left of the dashed line) and their mappings (right of the dashed line). Due to space limitations, we can only illustrate a small fraction of the actual model and mappings. The mappings shown are a conceptual illustration of what is termed the "detail" contextualization of the schema, used by interactive applications that display an entity and its most closely related entities.

The original database (not depicted) was migrated from a legacy content management system into the Deriva platform, shown as the initial schema (S0). The core of the model consisted of the dataset and file tables, which were associated with "controlled vocabulary" terms for species, anatomy, gene, etc. (not depicted).
The schema was "annotated" with mappings to user-friendly display names and data presentation templates, managed within the data service (Requirements 1 and 2). Deriva's model-adaptive web interface (Chaise) was able to generate an on-the-fly mapping of the ER model to a user-friendly denormalized rendering (conceptually illustrated at the right of Figure 5.17) for search and browse (Requirement 3).

Figure 5.17: Major schema evolution events in the FaceBase Data Hub: initial schema (S0), revised for greater experimental details (S1), and evolved for better reproducibility (S2).

After about a year and a half, a more detailed schema (S1) was developed to improve the interoperability of the data; as such, it was informed by exchange formats such as the Investigation, Study, Assay (ISA) format [118] and the design of other bioscience repositories. The dataset and file tables were conserved, while new tables for biological specimen (sample), genomics assay, and imaging assay details were added and made to reference the file table to identify assay-specific data. The schema was annotated to hide, show, or reorder the display of certain attributes, and "bulk update" mappings were defined in order to specify how data upload should be processed (Requirement 1). The model-adaptive web client was able to update its on-the-fly mappings to a denormalized model for presentation (Requirement 3). Deriva's model-neutral upload client reads the bulk-upload annotation (Requirement 3) to translate user-generated data bundles (Requirement 4) to the ER model.

The final major schema evolution discussed here occurred after about another year, when FaceBase reorganized its genomics data representation to support an automated bioinformatics pipeline. Doing so required that the model be able to identify explicitly which biological or technical replicates were the origin of sequence data. The revised schema (S2) was transformed by adding a replicate table and generalizing the assay table into a synonymous experiment table that unified experimental details from both the assay and imaging tables. Thus the simple relationships from assay and imaging to sample were transformed into a qualified association via the replicate table (that is, the replicate table has additional parameters rather than serving purely as a binary association table). Details of the data files were transformed into additional categories by introducing new file type-specific tables, e.g., raw sequence data, processed data, genomic tracks, imaging, mesh files, etc. (not depicted).
While this use case predated CHiSEL, an early prototypical library of scripts was used to evolve the schema and mappings to reflect the new and more complex relationships of the model (Requirements 5 and 1). The updated mappings guided the interactive Web and bulk data clients such that they were able to update their representations of the model and continue functioning without alteration (Requirement 3). In addition, Deriva's bag export service was driven by the revised mappings, which specified how to extract data into bag format for downstream usage by model-bound clients such as the pipeline and third-party tools used by researchers (Requirement 4).

Throughout these major schema evolution events, and the many smaller evolution events before, after, and between them, the core platform that embodies the architecture remained unchanged, except for ongoing feature enhancements. Since the FaceBase Data Hub is used by researchers throughout the worldwide craniofacial and dental research field, downtime must be minimized and care must be taken with each evolution event, whether large or small. All changes to the database are tested and confirmed on a cloned "staging" server. When the database administrator is confident that the change scripts are implemented correctly, they are executed against the production server. Generally, the execution of these scripts takes seconds to minutes, and once complete, the entire deployment of database, web services, desktop apps, and command-line interface clients is simultaneously upgraded in the process.

5.7.2 Boundary Objects for Bioinformatics

We used the "boundary objects" implemented as Big Data Bags (bags) to support a third-party bioinformatics pipeline and other agents that interact with the FaceBase Data Hub. Figure 5.18 illustrates this usage scenario. Data are initially produced by collaborating geneticists using either bulk or single-cell RNA sequencing or chromatin immunoprecipitation sequencing. The specimens are initially sequenced (1) at their institutes or a vendor, followed by data processing through an on-premise compute cluster (2). The raw sequence data and processed data are bundled (Requirement 4) and transmitted to the Data Hub using Deriva's model-neutral upload client, which pulls its configuration (a schema annotation) from the data service on each invocation (Requirements 1 and 3). The upload client's configuration instructs it in how to interpret the data layout, extract metadata from file names, execute queries against the data service, and finally upload data and record new entities in the database.

A bioinformatics pipeline was operated on a cloud infrastructure that was developed and loosely integrated with the Data Hub [132].

Figure 5.18: Illustration of a bioinformatics pipeline supported by "bag" (boundary object) collections in the decoupled interaction model.

Using the Deriva download client, the pipeline
A separate automation routine uses the Deriva download client to extract a subset of processed data, known as genome annotation tracks (track data), from the database and push them to a genome browser track hub (4) accessible to the UCSC Genome Browser visualization service [86]. At any time, a researcher (5) may retrieve select data (bulk data) from the data services in the bag format (Require- ment 4), as generated by theDeriva export service which generates the export specication from schema annotations (Requirement 3). The bags contain provenance about the sources of metadata in the form of named queries (Requirement 2), metadata records in CSV or JSON format that describe the assays and biosamples, and references with checksums to all of the required data les accessible via a Deriva object store. This allows for the bags to be materialized on a cluster with tooling that supports retry to overcome transient network failures. Bags for the pipeline include metadata that are based on a minimal information 211 model sucient for bioinformatics pipelines that was designed after careful evaluation of widely adopted metadata models such as ENCODE [32] along with bundling of le ref- erences for sequence reads (FastQ), sequence alignment (BAM), quality control (FastQC), feature counts (count, tpm, fpkm, rpkm), and annotation tracks (bigBed, bigWig). To enable collaboration over data in this community, bags have serve as well-dened, self-describing artifacts for exchanging collections of genomic and imaging data between collaborators to en- able processing, discovery, and dissemination of results. Through the decoupled interaction model enabled by the bags acting as boundary objects, and Deriva upload and download clients that utilize congurations to keep in sync with the current database schema, we have been able to evolve the schema and mappings of the data services without disruption to the decoupled clients that interact with the data services. 5.7.3 Epidemic-Type Aftershock Sequence Models After updating CHiSEL with support for DDL operations and ICMOs, we worked closely with the Southern California Earthquake Center (SCEC) to trial the use of CHiSEL to dene a data management system for their Epidemic-Type Aftershock Sequence (ETAS) models. CHiSEL was used to dene a small schema to represent ETAS models, forecasts, and evaluations in a database implemented in DERIVA. Given a small example script, a member of the SCEC sta was able to develop a CHiSEL script and run it to develop a proof-of-concept. In combination, Deriva and CHiSEL promote an iterative or agile method of development. The SCEC sta member tweaked his CHiSEL script numerous times to produce the nal proof-of-concept of the ETAS management system. This exercise provided initial validation that CHiSEL could be used by researchers to develop a database from a use case motivated by the day-to-day needs of an important research application. It also helped to inform and motivate the further development that resulted in the integration of MMOs into the DDL operations and ultimately revise the SMOs with integrated MMO 212 capabilities. The starter script for this case study is preserved as a GitHub Gist 11 . 5.7.4 Pancreatic -Cell Consortium Here we discuss a deployment of Deriva for a pancreatic -cell consortium (PBCC) 12 that used CHiSEL to streamline the evolution of an underlying data model and application. 
PBCC's mission is "to understand β-cell biology and diabetes through a cross-disciplinary approach for the assembly of spatiotemporal multi-scale whole cell models of human pancreatic β-cells." An initial data model was developed for the consortium, similar to the structure depicted in Figure 5.2. New requirements, however, motivated a reexamination of the model in order to extract and submit mass spectrometry data to the PRoteomics IDEntifications (PRIDE) database [155]. Three new tables were added: a mass spec data table linked to the biosample and dataset tables, a pride project table to represent the submission event, and an association table to link pride project and mass spec data. Four existing tables were evolved with new attributes to support the PRIDE submission, and another 7 new tables were added for managing new controlled vocabulary terms. A biologist worked independently to develop the schema evolution scripts using CHiSEL and Deriva APIs to evolve the deployment and support the new requirements. Modifications through CHiSEL automatically updated not only the schema but also the model mappings (Requirement 5). The transformations were all performed in the Jupyter Notebook environment popular with scientists and informaticists (the notebooks and a detailed description of the changes may be found at https://github.com/informatics-isi-edu/pbc-pride-2020).

5.8 Conclusions

In this chapter, we have presented an architecture for co-evolving data-centric ecosystems and examined the utility of the architecture through case studies involving deployments with actual scientific collaborations. In that context, we have also presented an approach to co-evolution of schema, data, integrity constraints, and model mappings in support of data-centric ecosystems. Our approach integrates a concise set of model management operations with a database evolution language consisting of both conventional DDL operators and novel schema modification operators (SMOs).

We evaluated the schema mapping characteristics of several active scientific applications that have built data management systems based on Deriva. The analysis demonstrates extensive usage of schema mappings that specify contextualized representations of the schema for key application usage scenarios (e.g., edit, view, search). It also demonstrates that simple mappings of direct element-to-element correspondences, from table column to object property, are insufficient to express the mappings from a database schema to a database application. Most of the deployments in our study make use of schema mappings involving an arbitrary number of joins to project attributes from distantly related entities.

We also presented a qualitative comparison of our approach with an enterprise database migration utility that takes an application-driven approach to propagating incremental changes from the application model down to automatically generated schema change scripts. We have shown that our approach not only matches the efficacy of their migration approach for evolving schema and mappings, but also goes one step further by co-evolving the data as well.

Chapter 6

Database Evolution, by Scientists, for Scientists: A Case Study

The content of this chapter is based on the paper: Robert Schuler et al. "Database Evolution, by Scientists, for Scientists: A Case Study". In preparation.
5.8 Conclusions

In this section, we have presented an architecture for co-evolving data-centric ecosystems and examined the utility of the architecture through case studies involving deployments with actual scientific collaborations. In that context, we have also presented an approach to co-evolution of schema, data, integrity constraints, and model mappings in support of data-centric ecosystems. Our approach integrates a concise set of model management operations with a database evolution language consisting of both conventional DDL operators and novel schema modification operators (SMOs). We evaluated the schema mapping characteristics of several active scientific applications that have built data management systems based on Deriva. The analysis demonstrates extensive usage of schema mappings that specify contextualized representations of the schema for key application usage scenarios (e.g., edit, view, search). It also demonstrates that simple mappings of direct element-to-element correspondences, from table column to object property, are insufficient to express the mappings from a database schema to a database application. Most of the deployments in our study make use of schema mappings involving an arbitrary number of joins to project attributes from distantly related entities.

We also presented a qualitative comparison of our approach with an enterprise database migration utility that takes an application-driven approach to propagating incremental changes from the application model down to automatically generated schema change scripts. We have shown that our approach not only matches the efficacy of their migration approach for evolving schema and mappings, but also goes one step further by co-evolving the data as well.

Chapter 6

Database Evolution, by Scientists, for Scientists: A Case Study

The content of this chapter is based on the paper: Robert Schuler et al. "Database Evolution, by Scientists, for Scientists: A Case Study". In preparation.

In this chapter, we present a simplified software development methodology for database evolution for scientists and a case study of database evolution by a scientist in the context of a scientific asset management system for cell modeling, using the extended schema evolution framework presented in Chapters 4 and 5. We include a detailed analysis of the activities and processes employed by the scientist during the schema evolution. The results demonstrate that a scientist can successfully evolve a complex information system driven by new research requirements using the approaches we have presented.

6.1 Introduction

Relational Database Management Systems (RDBMS) play a critical role throughout enterprise data centers. In science, however, relational databases have yet to see the same level of adoption, even as science has become increasingly data-intensive. The research lifecycle, from data acquisition, to data analysis and interpretation, and finally publication and dissemination of results, relies on rigorous handling of digital assets that are key to the research. Some have suggested that the database itself should be viewed as a scientific instrument in its own right [76]. Others have shown that scientists can conduct data analyses for their research using the database's native Structured Query Language (SQL) [77]. Though progress has been made, a key obstacle remains: the difficult task of developing, maintaining, and evolving the database's schema (i.e., its internal structure for organizing data following principles of the relational model). In order for scientists to fully utilize the RDBMS, they will need to be able to go beyond querying the database and actually define, build, and evolve the database schema.

Though the need for science to adopt better practices for data management, as promoted by the "FAIR data" guidelines [163], may be uncontroversial, developing complex information systems is a challenge even for a database administrator (DBA). In science, the "20 questions" method [145] has been proposed as a means of bootstrapping the design of information systems for science applications. Yet, it is well established that information systems undergo significant pressure to adapt to new requirements, often within a small number of months [141], thus obsoleting the original schema if it is not maintained vigorously. A long-running study involving scientists showed that they could, with some training, use SQL for data analysis and create new "views" to incrementally evolve a simple database schema bootstrapped from importing spreadsheets as new table definitions [78]. Such iterative development approaches are consistent with Agile methodologies [47], and the database community has aspired to enable more agile approaches to evolving relational databases as well [39]. There remains, however, a need for simplified tools and methods for scientists to create and maintain their own information systems.

Unlike most prior studies of Deriva usage, where knowledgeable data engineers supported domain scientists by developing and evolving the relational schema for real-world scientific applications, here we investigate the capacity for domain scientists to evolve their application's relational schema by and for themselves. The case study was conducted in the context of the Pancreatic β-Cell Consortium (PBCC), which uses Deriva to organize, archive, and share data for a consortium of researchers studying pancreatic beta cells and diabetes.
This case study follows the evolution of this database (shown in Figure 6.1) to support a new endeavor of the consortium focused on whole cell modeling. As part of this update, several tables of the existing schema had to be restructured, new foreign key relationships were established between tables, data had to be migrated, and tables were refined with altered columns and foreign keys to fit the new purpose.

Figure 6.1: High level view of the initial database schema showing tables (boxes) and relationships (arrows).

In this chapter, we report on our evaluation of the schema evolution performed by a domain scientist. We make the following contributions: we outline a schema evolution methodology for enabling agile database evolution that can be performed by scientists using a simplified schema evolution framework; we characterize the overall set of recorded actions (e.g., creating a table, renaming a column, adding a reference, querying data, manipulating data, etc.); we document the overarching process taken by the scientist and distill it down to a macro process of distinct phases; we drill down into the phases of active schema evolution to describe the detailed inner loops observed within them; and finally, the case study demonstrates that with simplified tools and processes a domain scientist can evolve a complex relational schema for a real-world scientific application. Though Deriva is a specialized data management system for data-centric science, it fully embraces and exposes relational database concepts; thus the observations and lessons learned here may be broadly generalizable to the database research community.

In the next section, we begin with a review of related work. In Section 6.3, we describe our concise methodology for database evolution by scientists. Then we present the details of the case study in Section 6.4 by describing the PBCC database and the schema evolution as performed by the scientist. In Section 6.5, we present our analysis of the case study, and we conclude in Section 6.6.

6.2 Related Work

Schema evolution has been studied extensively from a variety of perspectives. The schema evolution history of MediaWiki, the open source software underlying Wikipedia, was examined [37].
The authors proposed a succinct set of schema modification operations that could support the observed evolution of MediaWiki and automatically create backward compatible views to ease database migration. The schema evolution of several open source software projects was studied [154], and from that analysis the life cycle of a table was characterized. A study of the effect that foreign key relationships have on a table's involvement in schema evolution showed a direct correlation between topological complexity and increased evolution of the table [46]. An extensive profile of schema evolution in 195 open source software projects was conducted in order to classify the behavior of schema evolution and to identify overall profiles of how schemas are evolved in different software projects [153]. A case study [44] concerning relational database schema evolution and dependencies among advanced features such as stored procedures and views identified limitations in current tools for understanding and handling schema change management with complex dependencies. The utility of database evolution history, as a complement to the common information sources of schema, data, and applications, was explored as a way to understand and assist the process of database reverse engineering [31].

Wrangler [85] presented an interactive and iterative approach to data cleaning. There are similarities to our approach in terms of methodology but a difference in scope, as Wrangler is limited to data alone, outside the context of a relational database, and hence involves no evolution of schema. Dahlia [98] provides utilities for users to visualize schema evolution changes but does not address the need for utilities to enact the schema change. DB-MAIN [75] assists engineers with database migration and program modification by examining the propagation of requirements changes through database design and application dependencies as the database evolves. DaSIAn [93] offers an approach that helps to estimate the impact of schema change on dependent database applications.

SQLShare [77] proposes a DataBase-as-a-Service (DBaaS) model to reduce the burden on scientists of administering database applications by offering a multi-tenant, hosted SQL database. In a report [78] on four years of SQLShare usage by 591 users, the authors showed that scientists used SQLShare in a pattern of importing data (and implicitly creating tables), evolving by creating SQL views over the original imported data tables, and querying to perform analyses. These studies demonstrated the ability of scientists to use SQL databases for queries and relatively simple view-definition workloads. With the exception of the SQLShare studies [77, 78], most other studies have examined schema evolution in the context of typical industry scenarios employing professional database administrators or software engineers. Here we present a case study involving scientists developing and evolving an information system as a critical resource for their research.

6.3 Schema Evolution Methodology

The Deriva platform is complemented by a database evolution language intended both to simplify user tasks and to allow them to be performed in a development environment commonly used by scientists. Specifically, the Compositional High-level Schema Evolution Language (CHiSEL) offers a database evolution language (DEL) that encompasses both data definition language (DDL) operations and schema modification operators (SMOs) that perform more complex schema changes [129], as described in Chapters 4 and 5.
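In practice, the distinction is between making one localized change and capturing a whole restructuring pattern. The fragment below sketches that contrast; the table and column names are invented, and the create_column and to_vocabulary calls are approximations of the CHiSEL API used purely for illustration.

from deriva.core.ermrest_model import Column, builtin_types
# `catalog` is a CHiSEL catalog handle, obtained as in the earlier sketch.

# Fine-grained, DDL-style change: add a single column.
specimen = catalog.schemas['Beta_Cell'].tables['Specimen']
specimen.create_column(Column.define('Collection_Date', builtin_types.date))

# Higher-level SMO: factor the distinct values of an existing free-text column
# out into a vocabulary table and replace the column with a reference to it,
# in one expression instead of several DDL statements plus manual data migration.
with catalog.evolve():
    specimen.columns['Preservation_Method'].to_vocabulary()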
Since CHiSEL is delivered as an embedded domain-specific language (DSL) within Python, scientists can use executable notebooks (e.g., Jupyter Lab [87]) as a novel "schema evolution workbench" for database interactions. Normally, a SQL console or an enterprise database administrative console (e.g., pgAdmin) might be used to interact with the relational database. Being able to employ an executable notebook means that all of the exploration and evolution of the database can be executed and documented for others to review, whether to learn from it, as a historical record, or to validate changes before rerunning them on another (production) system. CHiSEL integrates with Jupyter Lab such that visualizations of the database schema and query results may be displayed directly in the notebook. Although the languages are programmatic in nature, when combined with a Read-Evaluate-Print Loop (REPL) they become an interactive interface for exploring and evolving the database.

Since most scientists lack experience in database administration and software engineering, they are typically unfamiliar with software development methodologies such as Agile [52]. We sought to distill a general set of best practices in software engineering, based on our own experiences developing database-driven applications for science [22], into a methodology to help guide them in the evolution of their database.

6.3.1 Policies

We begin by recommending a concise set of policies for the database and scripting environment:

- The "production database" is the database associated with the production system used on a daily basis to search, curate, archive, and retrieve data. Limit updates to this database to tested, repeatable (idempotent) procedures.
- The production database serves as the "system of record" for the database schema. That is, do not consider scripts (SQL, Python, etc.) in source code repositories authoritative. Ground truth is always the production database.
- When developing and testing schema evolution scripts, use a recent "clone" of the production database. Keep clones private to the individual doing exploratory work or development, whose ultimate purpose is to learn and then apply updates to the production database. Such databases can be created and dropped at will.
- Schema evolution scripts should be written for idempotent execution (that is, producing the same result even if repeatedly executed) in order to guard against partial error conditions when performing operations that span atomic transaction blocks.
- Schema evolution scripts should be preserved in a source code repository for repeatability and to keep track of the schema evolution provenance.

6.3.2 Processes

We then recommend a coarse outline for the development process.

1. Conduct exploratory work on a database cloned from production. Experiment, drop, and repeat until done exploring.
2. When developing new schema evolution scripts, test the scripts against cloned database instances, commit the schema evolution scripts to a source code repository, drop the (cloned) database, and repeat until done scripting (a minimal skeleton of such a script is sketched after this list).
3. Stage changes for the production database by again creating a fresh clone of the production database and executing the schema evolution script against it. Check the results and drop the (cloned) database.
4. Finally, when updating the production database, begin by making a backup of the database, check out the schema evolution script from the source code repository, and execute the script against the production database.
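A minimal skeleton of such a script, following the policies above, might look like the following. It is a sketch only: the catalog URL is a placeholder, the rename helper and the connect call approximate the CHiSEL API, and the cloning step is assumed to happen outside the script, as it did in the case study.

# Skeleton of an idempotent schema evolution script (illustrative only).
import chisel

CLONE_URL = 'https://example.derivacloud.org/ermrest/catalog/42'   # working copy

def evolve_schema(catalog):
    """Apply the planned changes; each step is guarded so re-running is harmless."""
    schema = catalog.schemas['Beta_Cell']
    if 'Project' not in schema.tables:                 # idempotence guard
        schema.tables['Dataset'].rename('Project')     # assumed rename helper
    # ... further steps, each written to be safely repeatable ...

if __name__ == '__main__':
    # During development this runs against a clone of production; only after
    # testing and review is the same function executed against production.
    catalog = chisel.connect(CLONE_URL)
    evolve_schema(catalog)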
6.3.3 Practices

Finally, we enumerate a set of iterative design and development practices.

1. Start with an informal (conceptual) design. At this point, do not be encumbered by precise technical details. Consider what the key concepts are (e.g., specimens, protocols, etc.) in the terms you would naturally use. Once you have a grasp of the concepts, consider how they are related. For example, a specimen is taken from an animal subject and prepared using a particular step from a protocol. Then think about the cardinality of these relationships. For example, do you conduct longitudinal studies on the subjects in several experiments carried out at different time points or developmental stages? Illustrate the above concepts and relationships in a simple non-technical diagram, review it with your stakeholders (i.e., collaborators, PIs), and refine until you arrive at a consensus.

2. Refine the high level (conceptual) design. Consider the characteristics (i.e., attributes) of each concept. For example, a specimen may be characterized by its species, genotype, age, etc. What are the characteristics that could be used to uniquely identify an "instance" of each of the concepts? This becomes its natural key. Then, consider which attributes should be constrained to "controlled vocabulary" terms. For example, "species" can be sourced from the NCBI Taxonomy, or "gene" from established nomenclatures (e.g., NCBI Gene, MGI gene names, etc.). Update your design and review with stakeholders.

3. Develop the database schema incrementally. First, consider the user interactions with the database: searching, browsing, viewing, and entering data. As you develop, test these interactions and adjust your database schema to improve usability. Next, build out iteratively: create a core set of tables and test the resulting system. Then expand outward from the core, defining another related set of tables. For example, you might start with tables for experiments and specimens, and next build out the model for protocols and protocol steps. Finally, keep refining, evaluating, and reviewing: work with "friendly users" until you reach a minimal viable system; i.e., it does not have to encompass all possible usage scenarios you envision for the database. Repeat these steps for new usage scenarios for your database.

In the following sections, we describe how this approach was used to evolve a real-world scientific database.

6.4 Case Study: Evolution of PBCC

The Pancreatic β-Cell Consortium (PBCC) is a community of scientists, clinicians, engineers, and digital artists working together to develop new methods for understanding and treating diabetes [136]. The consortium's primary goal is to understand β-cell biology and diabetes through a cross-disciplinary approach and to assemble a spatiotemporal multi-scale whole-cell model of the entire human pancreatic β-cell. In the initial phase of PBCC, there was a requirement to archive and share the experimental data collected by various labs with other consortium members, especially with scientists involved in the computational processing of the data. For this, a PBCC database was developed and implemented using Deriva. Over time, the requirements of PBCC changed. The consortium expanded from only collecting experimental data to generating processed data, integrating various data types, and creating whole-cell models.
Instead of only sharing experimental data among consortium members, there was also a need to archive processed data, models, and processing protocols for carrying out iterative whole cell modeling, and to support data dissemination to the entire scientific community.

6.4.1 PBCC Database

The original PBCC database archived raw and derived data. The database has a schema consisting of 11 namespaces, 57 tables, 654 columns, 129 keys,1 and 118 foreign keys. Figure 6.1 illustrates the entire database schema for the Pancreatic Beta Cell Consortium (PBCC). A Dataset table recorded the dataset type and description of a set of experiments conducted together. A related Experiment table stored the metadata of an experiment, including experiment type and protocol. The Experiment table further connected to the Biosample table, which contained metadata on each data point collected in an experiment. The Biosample table could contain data points for any data type, for example, Proteomics, Cryo-Electron tomography, Soft X-ray tomography, etc. The Biosample table is further connected to the Image Data and Derived Image Data tables, which store the raw and derived image data. The Image Data and Derived Image Data tables are only for imaging data. Separate tables for other data types were created later per new requirements.

1 In Deriva, any set of columns that are both UNIQUE and NOT NULL is considered a "key" and should not be confused with PRIMARY KEYs as defined by SQL.

Figure 6.2: Evolution of the PBCC database schema. A subset of the original schema is shown on the left and the revised schema after evolution on the right.

6.4.2 Evolving the PBCC Database

One major problem with the original schema was that it attempted to represent all types of imaging data in a single Image Data table. This did not allow different feature columns for different imaging types. In fact, the columns in the Biosample table were biased towards Soft X-ray tomography (SXT), as SXT was one of the first data types available for archiving in the database [158, 90]. Another major challenge with the existing database was structuring the processed data. The Derived Image Data table only stored data computed from the raw data available in Image Data. When tables were added for non-imaging data types, for example Mass Spec Data, each new data type required its own separate table, which created redundancy in the table definitions over time. Also, as mentioned above, the consortium moved towards generating whole cell models [114] instead of just sharing experimental data, so there was a need to archive computed models as well. Therefore, a change in the schema was required so that experimental data from multiple data types could be archived more efficiently and a single processed data table would be available for any type of computed, processed, or derived data. The major evolution steps performed include:

1. The Biosample table was separated into tables containing information on specific data types. For example, the existing Biosample table contained SXT and fluorescent microscopy data [157, 158, 90], and it was divided into two separate tables, SXTDataset and FIDataset. Biosample columns were biased towards SXT metadata; after creating the two tables out of Biosample, the SXT-specific columns were removed from the FIDataset table.
2. The original Dataset table was renamed to Project.
3. A new Dataset table was created to unify and contain the set of all data points across the different data-specific tables (SXTDataset and FIDataset). The Dataset table includes columns that are common across all data types, while a new set of child tables contains the specialized attributes for each distinct data type.
4. The Image Data table in the original schema contained the metadata and the actual file for the entries in the data-specific tables, so it was renamed to File.
5. Derived Image Data was renamed to ProcessedDataset. A new "join table" named Processed Dataset JointTable was also created between the ProcessedDataset and Dataset tables. This helped in referring to the raw data that was used to create entries in the ProcessedDataset table.
6. Several new relationships and multiple foreign keys were created based on new requirements, as illustrated in Figure 6.2.
7. Data was transformed and migrated during the schema evolution to fit the revised schema.

6.4.3 Characteristics of the Case Study

The schema evolution was performed by one biologist from PBCC with 10 years of programming experience, 7 of them in Python; he had taken one course on databases in his past training but had no practical experience with relational database management. He had not been involved in the design or development of the original PBCC database. Initially, we met online for two 1-hour meetings to provide an overview of the Deriva programming interfaces and a review of the schema evolution methodology in Section 6.3. He was provided online documentation for Deriva and an online document describing the methodology. Each week, he met with another biologist from PBCC to review the schema design, get timely feedback, and ensure that the schema changes aligned with the requirements of the consortium. This second biologist's role was purely oversight, and she did not perform any of the schema evolution herself. One co-author responsible for the design and execution of the case study joined the two biologists in their weekly meeting to answer questions related to Deriva features and relational schema concepts, but not to help with the schema evolution task itself. The biologists then met bi-weekly with the entire PBCC team to review the database schema so that all stakeholders could offer feedback on the revised design and implementation of the system. The schema evolution task was decided by the pair of biologists that met weekly, based on requirements they elicited from discussions with the consortium as a whole. The consortium determined the overall goal, the pair of biologists iterated over a design to meet that goal, and the lone biologist decided how to script the schema evolution operations to achieve the design. The time frame for the case study spanned 5 weeks, with the biologist working on the schema evolution script for a few hours each week in between his other teaching and research responsibilities. The schema evolution was scripted in a Jupyter notebook using CHiSEL, supplemented with additional Python code for data manipulation. Meetings were conducted using online video conferencing.

6.5 Analysis of the Evolution

Jupyter notebook [87] is an interactive development environment popular among data scientists. A notebook is divided into an arbitrary number of "cells" (i.e., blocks) of code and comments, followed by the output of each cell's execution. The schema evolution for this case study was scripted in a Jupyter notebook containing roughly 195 notebook cells and 677 lines of code and comments overall.
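To make the evolution steps above concrete, the fragment below sketches a few of them (the renames, the new join table, and a data migration) in the style of notebook cells. It is a hypothetical reconstruction rather than an excerpt from the actual notebook: the rename and create_table helpers and the getPathBuilder interface approximate the CHiSEL and Deriva APIs, the catalog URL is a placeholder, and the Source_Dataset column used in the migration is invented for illustration.

import chisel
from deriva.core.ermrest_model import Column, builtin_types

catalog = chisel.connect('https://pbcc.example.org/ermrest/catalog/1')  # placeholder
beta_cell = catalog.schemas['Beta_Cell']

# Cell: renames described in steps 2, 4, and 5.
beta_cell.tables['Dataset'].rename('Project')                     # assumed helper
beta_cell.tables['Image_Data'].rename('File')
beta_cell.tables['Derived_Image_Data'].rename('ProcessedDataset')

# Cell: the new join table relating ProcessedDataset rows to their source data.
beta_cell.create_table('Processed_Dataset_JointTable', [
    Column.define('ProcessedDataset', builtin_types.text, nullok=False),
    Column.define('Dataset', builtin_types.text, nullok=False),
])

# Cell: migrate existing linkage into the join table (step 7) using the
# data manipulation ("DataPath") interface.
pb = catalog.getPathBuilder()
rows = pb.Beta_Cell.ProcessedDataset.entities().fetch()
links = [{'ProcessedDataset': r['RID'], 'Dataset': r['Source_Dataset']} for r in rows]
pb.Beta_Cell.Processed_Dataset_JointTable.insert(links)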
First, we analyzed the fine-grained actions taken, to characterize the discrete units of code in each cell. Next, we assessed the overall process that emerged from the sequence of actions. Finally, we quantified the amount of activity spent on each action and step of the process.

6.5.1 Recorded Actions

The notebook's cells were categorized as shown in Table 6.1.

Table 6.1: Types of actions recorded in the executable notebook.

Action  Description
1       Initialization
2       Connection
3       Clone database
4       Define helper function (code)
5       Comment
6       Review (API) documentation
7       Graph schema (Visualization)
8       Describe schema/table
9       Print data
10      Code (other)
11      Execute data query (DQL)
12      Execute data manipulation (DML)
13      Execute schema modification (DDL)
14      Execute schema modification (SMO)

Action 1 covers the initial application programming interface (API) import statements. Action 2 set up the database connection, while Action 3 cloned the original database. The scientist defined a set of his own utility functions for commonly repeated operations of querying and displaying schema or data (Action 4). Comments (Action 5) were used to document a plan of action to be performed over a sequence of subsequent cells. Reviewing the API documentation (Action 6) when uncertain about a method signature or functionality tended to be more critical during schema modification steps than during data manipulation. Describing (Action 8) and visualizing (Action 7) the schema were used frequently throughout the notebook, but especially up front during initial exploration and then around schema modification activities, often as a way of viewing the "before" and "after" of a change.

Querying (Action 11) and printing data (Action 9) were commonly associated with both schema modification and data migration. Even when making changes to the schema that had no impact on data, displaying data was often performed, perhaps to reinforce the description and visualization of the schema with example data. Transformation of data was performed in general-purpose programming code blocks (Action 10) and was closely correlated with data manipulation statements (Action 12). Data transformation activities were performed with DataPath APIs that resemble DQL (Action 11) and DML (Action 12). Data queries (Action 11) were either in support of displaying data (Action 9) or used to populate local variables for data transformation actions (Action 10 and/or 12). Schema evolution was performed in CHiSEL operations resembling DDL (Action 13) and in CHiSEL's SMO expression language (Action 14).

6.5.2 Observed Process

We observed several phases of the overall activities in an emergent process, characterized as a macro process (Figure 6.3) composed of more detailed internal processes of the evolution (Figure 6.4). We also identified the recorded actions that fall into each phase of the process in Table 6.2.

Figure 6.3: Observed macro process. An overall loop can be seen beginning with Conceptualization and Design, proceeding through several Schema Evolution phases (A–E), pausing for Review, then repeating until complete.

The macro process (Figure 6.3) began with a team-wide discussion about the overall conceptualization and design of the PBCC database. Moving into active development, the scientist proceeded to define several custom functions (A) that wrapped Deriva APIs to reflect personal preferences and avoid repetition. For example, one function performed a query (DQL) operation to fetch all data for a table, load it into a Python Pandas object, and display the output.
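Such a helper takes only a few lines. The sketch below is an illustrative reconstruction (not the scientist's actual function) written against the deriva-py DataPath interface, with a placeholder host and catalog identifier.

import pandas as pd
from deriva.core import ErmrestCatalog, get_credential

host, catalog_id = 'pbcc.example.org', '1'        # placeholders
catalog = ErmrestCatalog('https', host, catalog_id, credentials=get_credential(host))
pb = catalog.getPathBuilder()                      # DataPath (query) interface

def show_table(schema_name, table_name, limit=20):
    """Fetch all rows of a table and return the first `limit` as a DataFrame."""
    rows = pb.schemas[schema_name].tables[table_name].entities().fetch()
    return pd.DataFrame(rows).head(limit)

# Rendered inline when evaluated as the last expression of a notebook cell.
show_table('Beta_Cell', 'Biosample')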
Next, the current production database was cloned (B) into a working copy. Then the notebook documented a phase of exploration (C) with multiple actions to describe and visualize the database schema. Once a plan of action was determined, the evolution began and proceeded through a few observable phases (D.1–D.5): first, coarse-grained actions were taken to reshape core tables (restructuring some tables, deleting rows no longer needed, creating new tables, creating "vocabulary" tables, and populating new tables or new columns in old tables); next, new foreign key relationships were defined (new one-to-many relationships, reversing one-to-many relationships, converting one-to-many into many-to-many, and querying, transforming, and updating data to represent the new relationships); and finally, after the larger changes were complete, fine tuning was performed (renaming tables, renaming columns, dropping unwanted columns). Once these were completed, the scientist confirmed (E) that the changes were per his design, and the development phases concluded. The team met to review changes, leading back to further refinement of the conceptualization and design.

Table 6.2: Relationship between actions and process.

Phase   Related Activities
A       1, 4, 5
B       2, 3
C       7, 8, 9, 11
D       5–14
E       8

Figure 6.4: Observed iterative process within evolutions. An outer loop began with Review and Plan Next Changes; based on the type of change needed, an inner loop (Schema Modification Loop or Data Transformation Loop) was executed, until the current evolution was Done.

A deeper dive into the evolution phases reveals a process of an outer iterative loop with inner loops to modify schema or data (see Figure 6.4). The outer loop ("Evolution Block") began with review and planning steps that consisted of actions to describe and visualize parts of the schema at different levels of granularity, querying and displaying data of interest, reviewing schema evolution goals, and deciding next steps. From there, an inner "Schema Modification Loop" or "Data Transformation Loop" was enacted. To modify the schema, the scientist executed DEL statements followed by a review of the schema, and often the data too. The scientist frequently reviewed API documentation before executing DEL statements, particularly in earlier stages. To prepare data, the database was often queried first using DQL, and general-purpose programming statements (i.e., Python) in a code block were used to transform the data into a shape suitable for updates. Next, the scientist executed DML statements. Then he queried and displayed data to review the changes. These inner loops were repeated until the schema or data changes were complete, and then he returned to the review and planning activity. A common overarching pattern was the "before" and "after" review for nearly every change.

6.5.3 Quantifying Actions and Processes

We analyzed some of the quantifiable aspects of the recorded actions and observed processes employed during the schema evolution.

Occurrences of Each Action. We illustrate the occurrences of each recorded action of Table 6.1 in Figure 6.5. The most common was describing the schema (Action 8), with 69 occurrences. The exploratory review phase was extensive, and throughout each evolution block the state of the schema was examined before and after a change. Similarly, the next most common actions were to query (DQL) and print data via a user-defined function (Actions 11 and 9, respectively), which occurred 34 times. Visualization of the schema (Action 7) was performed 20 times. These operations taken together were among the most dominant, highlighting the need for good exploratory tools integrated with schema evolution operations.

Figure 6.5: Totals of actions per category.

Reviewing API documentation (Action 6) happened 7 times; note that another characteristic of using CHiSEL in an executable notebook is that API documentation can be displayed inline within a cell, so the user does not have to switch contexts in order to look up a function signature and its description. Code blocks (Action 10) of general-purpose programming used to transform and prepare data were counted in 20 notebook cells. DML statements (Action 12) followed with 16 instances. The actual database evolution actions took place in 65 CHiSEL DDL operations (Action 13) and another 5 CHiSEL SMO operations (Action 14). A more detailed breakdown may be seen in Figure 6.6. There were 16 user-defined functions (Action 4), while initialization was performed in one cell (Action 1), as was the database connection (Action 2). We did not record the database clone operations (Action 3) because they occurred in a separate notebook. Finally, comments (Action 5) accounted for 16 recorded actions.

Figure 6.6: Breakdown of database (DQL, DML, DDL, SMO) actions.

Activities per Phase. The actions recorded, ratios of actions, and lines of code per phase (of the phases outlined in Section 6.5.2) are shown in Figures 6.7, 6.8 and 6.9, respectively. The first phase (A) of the process began with initialization and function definitions, followed by connecting to the database in phase (B). Next, the exploratory phase (C) can be seen to involve several actions taken to visualize and describe the schema, as well as queries and printing of data. Since the scientist who was evolving the database was not its original developer, it is not surprising that a learning phase preceded further actions.

Figure 6.7: Actions taken per phase of process.

Figure 6.8: Ratio of actions taken per phase of process.

The first evolution phase (D.1) involved major refactoring of the original core table structures to repurpose them for future usage. Unique to this phase was the presence of API documentation review, typically right before CHiSEL DDL statements. Visualization tapered off after D.1, presumably as the scientist became more familiar with the database schema. It can be seen, however, that describing the schema remained a consistent activity throughout the remainder of the notebook. During the evolution phases (D.1–D.5), reading and displaying data were also regularly performed in order to see the effect of evolution not only on the schema definition but also on the data itself. As the evolution phases progressed, the proportion of exploratory actions generally decreased while the overall proportion of CHiSEL DDL and SMO operations increased. This can be attributed to growing confidence in the actions taken to change the schema and familiarity with the schema design. The final step of the notebook, noted as the confirmation phase (E), was to describe the schema before concluding that the planned changes were complete.

Figure 6.9: Lines of Code (LoC) per phase of process.

6.6 Conclusions

In this chapter, we have presented a real-world scenario where a scientist was tasked with schema evolution for a scientific data management system. The scientist adopted our incremental, iterative database evolution approach inspired by Agile methods. The case study demonstrated the utility of providing interfaces to relational databases tailored to the skills and development environments familiar to scientists, and it showed that a simple, agile-like methodology could be used to iteratively work through a non-trivial schema evolution task. The analysis revealed the importance of exploratory utilities. The integrated visualization utility was particularly relied upon to get a high-level overview of the schema and drill-downs into specific table definitions. Similarly, throughout the database evolution inner loops, the ability to generate "before" and "after" views of the schema and data was crucial. The integration of these, and of the API documentation, into the interactive environment of the executable notebook eliminated the need for context switching to other database administration utilities. The case study also showed that a database evolution language (DEL) delivered in the form of an embedded DSL offers the advantage of allowing users to lean on their knowledge of a familiar general-purpose programming language and to avoid "context switching" between general-purpose programming and native database query languages. Users can leverage existing programming skills rather than getting stuck on unfamiliar operations in the database's native language. The integration of CHiSEL into the notebook environment affords the scientist a de facto schema evolution workbench tailored to the common skill set of many scientists.

The scientist evolved a database that had been developed for the archival, curation, and sharing of research data on the study of pancreatic beta cells. Several evolutionary steps were taken to repurpose the database for a new endeavor toward iterative whole cell modeling. The Deriva scientific data management system served as the platform for the research consortium and offered a simplified but transparent rendering of the relational data model. The schema evolution script for the case study was developed within an executable notebook, making it both a novel environment for database administration and a documented record of the changes to the database. Our analysis categorized the activities recorded in the notebook into several distinct types of actions. We identified an overarching process taken by the researcher and then decomposed the process into multiple inner loops executed repeatedly to achieve each step of the overall evolution plan. We also quantified the actions over the whole notebook and within specific phases of the evolution, which, for example, demonstrated a reliance on exploratory tools early in the process that tapered off and transitioned into a pattern of before-and-after reviews of schema change operations.

Though the researchers were able to perform a relatively complex evolution of the database, we did identify a few remaining pain points, primarily in the cumbersome aspects involving integrity constraints, the need for continuous monitoring of before/after validation of schema changes, and limits to schema visualization at multiple levels of abstraction to facilitate communication with collaborators.

Chapter 7

Conclusion and Future Work

In this thesis, we studied schema evolution for scientific asset management in the context of data-centric discovery. We have addressed two interrelated problems in database evolution:

1. How to provide a user-friendly method for scientists to perform difficult schema evolution tasks; and
2. How to evolve both the database schema and the model mappings needed by database-dependent applications.

The first part of this thesis focused on establishing the principles and vision for handling scientific data effectively throughout the research life cycle. We then refined the principles into an architecture and implementation for scientific asset management. The experiences and analysis of deployments based on this platform further revealed and motivated the need for evolution of schemas for scientific data. In the next part of this thesis, we presented a user-oriented framework for schema evolution that simplifies some of the difficult aspects of evolving a complex information system for scientific data. We then extended the framework, beginning with a set of architecture patterns for building data-centric ecosystems for scientific data, followed by a definition of new model management operations and an integration of them into the user-oriented schema evolution framework. Finally, we presented a user study involving a real-world scientific collaboration using the extended framework to evolve a database schema for scientific asset management.

7.1 Summary of Contributions

We showed that our database evolution language (DEL), defined on a foundation of an algebra of schema modification operators (SMOs) integrated with a novel set of model management operators (MMOs) and embodied in a familiar programming environment, enables scientists to evolve schemas for scientific asset management more effectively and efficiently. We made the following specific contributions:

1. a characterization of the high-level schema evolution requirements based on an evaluation of scientific applications;
2. a database evolution language based on an algebra of high-level schema modification operators that raise the level of abstraction for schema evolution over conventional languages;
3. an extension of the algebra that encompasses transformation of model mappings to facilitate co-evolution of application models;
4. an embodiment of the algebra as an embedded domain-specific language, usable in an executable notebook environment as a novel form of schema evolution workbench; and
5. a case study and experimental evaluation based on the implementation of our approach in Deriva for scientific asset management.

Our results demonstrated that with our user-oriented framework for schema evolution (1) user effort can be reduced for complex schema evolution operations and (2) the schema evolution operations can be executed efficiently through planning and rewriting.

7.2 Future Work

Several areas of interest may build on the work of this thesis.

1. While the work has shown that efficiency can be improved by the algebraic formulation of "complex" schema modification operations, there remain open areas to explore in terms of optimization approaches to improve execution of the operators.
2. The schema evolution benchmark could be further developed and used to establish a standard for testing new database evolution languages.
3. The "primitive" schema modification operators have been defined in terms of the conventional relational operators to which they are most closely related, and an interesting area of future work is to define their own formal semantics.
4. Many exciting avenues exist to explore in the schema evolution workbench, such as applying artificial intelligence to "guide" the user in the schema evolution task and thereby further reduce the effort of maintaining database schemas.

The intersection of schema evolution and scientific data curation provides a rich space for future research directions with abundant real-world, use-inspired applications.
Bibliography

[1] Mandhri Abeysooriya et al. "Gene Name Errors: Lessons Not Learned". In: PLOS Computational Biology 17.7 (July 2021), e1008984.
[2] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases: The Logical Level. 1st. USA: Addison-Wesley Longman Publishing Co., Inc., 1995. isbn: 0-201-53771-0.
[3] Pierre A Akiki, Arosha K Bandara, and Yijun Yu. "Adaptive Model-Driven User Interface Development Systems". In: ACM Computing Surveys 47.1 (2014), pp. 1–33.
[4] Bryce Allen et al. "Software as a Service for Data Scientists". In: Communications of the ACM 55.2 (Feb. 2012), pp. 81–88.
[5] Rachana Ananthakrishnan et al. "Globus Nexus: An identity, profile, and group management platform for science gateways and other collaborative science applications". In: 2013 IEEE International Conference on Cluster Computing (CLUSTER). 2013, pp. 1–3. doi: 10.1109/CLUSTER.2013.6702693.
[6] Ross Andersen. "How Big Data Is Changing Astronomy (Again)". In: The Atlantic (Apr. 2012).
[7] Mario Antonioletti et al. "The design and implementation of Grid database services in OGSA-DAI". In: Concurrency and Computation: Practice and Experience 17.2-4 (2005), pp. 357–376.
[8] Michael Armbrust et al. "Spark SQL: Relational Data Processing in Spark". In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD '15. New York, NY, USA: ACM, 2015, pp. 1383–1394.
[9] Chaitanya Baru et al. "The SDSC storage resource broker". In: CASCON. Nov. 1998, p. 5.
[10] Sean Bechhofer et al. "Research Objects: Towards Exchange and Reuse of Digital Knowledge". In: Nature Precedings (2010).
[11] C Glenn Begley. "Six red flags for suspect work." In: Nature 497.7450 (May 2013), pp. 433–4.
[12] C Glenn Begley and Lee M Ellis. "Drug development: Raise standards for preclinical cancer research." In: Nature 483.7391 (Mar. 2012), pp. 531–3.
[13] Gordon Bell. "Foreword". In: The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Oct. 2009, pp. xi–xv.
[14] Philip A. Bernstein. "Applying Model Management to Classical Meta Data Problems." In: Proceedings of the 2003 CIDR Conference. Asilomar, CA, USA, 2003, pp. 209–220.
[15] Philip A. Bernstein and Sergey Melnik. "Model management 2.0: manipulating richer mappings". In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data - SIGMOD '07. ACM Press, 2007.
[16] Philip A Bernstein, Jayant Madhavan, and Erhard Rahm. "Generic Schema Matching, Ten Years Later". In: Proceedings of the VLDB Endowment 4.11 (2011).
[17] Phillip A Bernstein, Alon Y Halevy, and Rachel A Pottinger. "A Vision for Management of Complex Models". In: SIGMOD Rec. 29.4 (Dec. 2000), pp. 55–63.
[18] Kamal Bhattacharya, Richard Hull, and Jianwen Su. "A Data-Centric Design Methodology for Business Processes". In: Handbook of Research on Business Process Modeling. Ed. by Jorge Cardoso and Wil van der Aalst. Hershey, PA, USA: IGI Global, 2009, pp. 503–531.
[19] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
[20] DCMI Usage Board. DCMI Metadata Terms. Jan. 2020. url: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/.
[21] James F Brinkley et al. "The FaceBase Consortium: a comprehensive resource for craniofacial researchers." In: Development (Cambridge, England) 143.14 (2016), pp. 2677–88.
[22] Alejandro Bugacov et al. "Experiences with DERIVA: An Asset Management Platform for Accelerating eScience". In: 2017 IEEE 13th International Conference on e-Science (e-Science). 2017, pp. 79–88. isbn: 978-1-5386-2686-3. doi: 10.1109/eScience.2017.20.
[23] K. Chard et al. "I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets". In: 2016 IEEE International Conference on Big Data (Big Data). Dec. 2016, pp. 319–328.
[24] Kyle Chard et al. "Globus data publication as a service: Lowering barriers to reproducible science". In: e-Science (e-Science), 2015 IEEE 11th International Conference on. IEEE. 2015, pp. 401–410.
[25] S. Chaudhuri, V. Ganti, and R. Kaushik. "A Primitive Operator for Similarity Joins in Data Cleaning". In: 22nd International Conference on Data Engineering (ICDE'06). Apr. 2006, pp. 5–5.
[26] Peter Pin-Shan Chen. "The Entity-Relationship Model – Toward a Unified View of Data". In: ACM Trans. Database Syst. 1.1 (Mar. 1976), pp. 9–36.
[27] Ann L. Chervenak et al. "The Globus Replica Location Service: Design and Experience". In: IEEE Trans. Parallel Distrib. Syst. 20.9 (Sept. 2009), pp. 1260–1272.
[28] Tim Clark, Paolo N Ciccarese, and Carole A Goble. "Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications". In: Journal of Biomedical Semantics 5.1 (2014), p. 28.
[29] Brian L. Claus and Dennis J. Underwood. "Discovery informatics: its evolving role in drug discovery". In: Drug Discovery Today 7.18 (Sept. 2002), pp. 957–966.
[30] Anthony Cleve and Jean-Luc Hainaut. "Co-transformations in Database Applications Evolution". In: Generative and Transformational Techniques in Software Engineering: International Summer School, GTTSE 2005, Braga, Portugal, July 4-8, 2005. Revised Papers. Ed. by Ralf Lämmel, João Saraiva, and Joost Visser. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 409–421. isbn: 978-3-540-46235-4. doi: 10.1007/11877028_17. url: https://doi.org/10.1007/11877028_17.
[31] Anthony Cleve et al. "Understanding database schema evolution: A case study". In: Science of Computer Programming 97 (2015), pp. 113–121.
[32] The ENCODE Project Consortium, Ian Dunham, Anshul Kundaje, et al. "An integrated encyclopedia of DNA elements in the human genome". In: Nature 489 (Sept. 2012), 57 EP.
[33] John Corwin et al. "Dynamic tables: An architecture for managing evolving, heterogeneous biomedical data in relational database management systems". In: Journal of the American Medical Informatics Association 14.1 (2007), pp. 86–93.
[34] Data Challenges Are Halting AI Projects, IBM Executive Says (2019).
[35] C. A. Curino et al. "The PRISM Workwench: Database Schema Evolution without Tears". In: 2009 IEEE 25th International Conference on Data Engineering. Mar. 2009, pp. 1523–1526.
[36] Carlo A. Curino, Hyun J. Moon, and Carlo Zaniolo. "Graceful Database Schema Evolution: The PRISM Workbench". In: Proc. VLDB Endow. 1.1 (Aug. 2008), pp. 761–772.
[37] Carlo A. Curino et al. "Schema evolution in Wikipedia: toward a web information system benchmark". In: International Conference on Enterprise Information Systems (ICEIS). 2008.
[38] Carlo A Curino et al. "Update rewriting and integrity constraint maintenance in a schema evolution support system: PRISM++". In: Proceedings of the VLDB Endowment 4.2 (2010), pp. 117–128.
[39] Carlo Curino, Hyun J. Moon, and Carlo Zaniolo. "Automating Database Schema Evolution in Information System Upgrades". In: Proceedings of the 2nd International Workshop on Hot Topics in Software Upgrades. HotSWUp '09. New York, NY, USA: ACM, 2009, 5:1–5:5.
[40] Carlo Curino et al. "Automating the Database Schema Evolution Process". In: The VLDB Journal 22.1 (Feb. 2013), pp. 73–98.
[41] Karl Czajkowski et al. "ERMrest: A Web Service for Collaborative Data Management". In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management. SSDBM '18. New York, NY, USA: ACM, 2018, 13:1–13:12.
[42] P.M. Davis and M.J.L. Connolly. "Institutional Repositories: Evaluating the Reasons for Non-use of Cornell University's Installation of DSpace." In: D-Lib Magazine 13 (Mar. 2007).
[43] E. Deelman et al. "Grid-based metadata services". In: Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004. June 2004, pp. 393–402.
[44] Julien Delplanque et al. "Relational Database Schema Evolution: An Industrial Case Study". In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). 2018, pp. 635–644.
[45] William P. Dempsey et al. "Regional synapse gain and loss accompany memory formation in larval zebrafish". In: Proceedings of the National Academy of Sciences 119.3 (2022), e2107661119.
[46] Konstantinos Dimolikas, Apostolos V. Zarras, and Panos Vassiliadis. "A Study on the Effect of a Table's Involvement in Foreign Keys to its Schema Evolution". In: Conceptual Modeling. Ed. by Gillian Dobbie et al. Cham: Springer International Publishing, 2020, pp. 456–470.
[47] Torgeir Dingsøyr et al. "A decade of agile methodologies: Towards explaining agile software development". In: Journal of Systems and Software 85.6 (2012), pp. 1213–1221.
[48] Ivo Dinov et al. "Efficient, Distributed and Interactive Neuroimaging Data Analysis Using the LONI Pipeline". In: Frontiers in Neuroinformatics 3 (2009).
[49] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of Data Integration. Elsevier, 2012.
[50] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. "Duplicate Record Detection: A Survey". In: IEEE Transactions on Knowledge and Data Engineering 19.1 (Jan. 2007), pp. 1–16.
[51] Jürgen Engel, Christian Herdin, and Christian Märtin. "Evaluation of model-based user interface development approaches". In: International Conference on Human-Computer Interaction. Springer. 2014, pp. 295–307.
[52] John Erickson, Kalle Lyytinen, and Keng Siau. "Agile modeling, agile software development, and extreme programming: the state of research". In: Journal of Database Management (JDM) 16.4 (2005), pp. 88–100.
[53] Ronald Fagin et al. "Clio: Schema Mapping Creation and Data Exchange". In: Ed. by Alexander T. Borgida et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 198–236.
[54] Roy T. Fielding and Richard N. Taylor. "Principled design of the modern Web architecture". In: ACM Transactions on Internet Technology 2.2 (May 2002), pp. 115–150.
[55] Peter Fox and James Hendler. "The Science of Data Science". In: Big Data 2.2 (June 2014), pp. 68–70. issn: 2167-6461. doi: 10.1089/big.2014.0011.
[56] Michael Franklin, Alon Halevy, and David Maier. "From Databases to Dataspaces: A New Abstraction for Information Management". In: SIGMOD Rec. 34.4 (Dec. 2005), pp. 27–33.
[57] Stella Giannakopoulou et al. "CleanM: An Optimizable Query Language for Unified Scale-out Data Cleaning". In: Proc. VLDB Endow. 10.11 (Aug. 2017), pp. 1466–1477.
[58] Yolanda Gil, Varun Ratnakar, and Ewa Deelman. "Metadata Catalogs with Semantic Representations". In: Provenance and Annotation of Data. Ed. by Luc Moreau and Ian Foster. Vol. 4145. Springer Berlin Heidelberg, 2006, pp. 90–100.
[59] M. Gobert et al. "Understanding Schema Evolution as a Basis for Database Reengineering". In: 2013 IEEE International Conference on Software Maintenance. Sept. 2013, pp. 472–475.
[60] Carole Goble, David De Roure, and Sean Bechhofer. "Accelerating Scientists' Knowledge Turns". In: Communications in Computer and Information Science. 2013.
[61] Jeremy Goecks, Anton Nekrutenko, and James Taylor. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." In: Genome Biology 11.8 (2010), R86.
[62] Goetz Graefe. "Query Evaluation Techniques for Large Databases". In: ACM Comput. Surv. 25.2 (June 1993), pp. 73–169.
[63] Goetz Graefe. "The Cascades Framework for Query Optimization". In: Data Engineering Bulletin 18 (1995).
[64] Jim Gray et al. "Scientific Data Management in the Coming Decade". In: SIGMOD Rec. 34.4 (Dec. 2005), pp. 34–41.
[65] Alon Halevy, Michael Franklin, and David Maier. "Principles of Dataspace Systems". In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS '06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 1–9.
[66] Michael Hammer. "What is Business Process Management?" In: Handbook on Business Process Management 1: Introduction, Methods, and Information Systems. Ed. by Jan vom Brocke and Michael Rosemann. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 3–16. isbn: 978-3-642-45100-3. doi: 10.1007/978-3-642-45100-3_1. url: https://doi.org/10.1007/978-3-642-45100-3_1.
[67] Michael Hartung, James Terwilliger, and Erhard Rahm. "Recent Advances in Schema and Ontology Evolution". In: Schema Matching and Mapping. Ed. by Zohra Bellahsene, Angela Bonifati, and Erhard Rahm. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 149–190.
[68] Mark Hedges, Tobias Blanke, and Adil Hasan. "Rule-Based Curation and Preservation of Data: A Data Grid Approach Using iRODS". In: Future Generation Computer Systems 25.4 (2009), pp. 446–452.
[69] P Bryan Heidorn. "Shedding Light on the Dark Data in the Long Tail of Science". In: Library Trends 57.2 (2008), pp. 280–299.
[70] Karl G Helmer et al. "Enabling Collaborative Research Using the Biomedical Informatics Research Network (BIRN)." In: Journal of the American Medical Informatics Association: JAMIA 18.4 (July 2011), pp. 416–422.
[71] Kai Herrmann et al. "CoDEL – A Relationally Complete Language for Database Evolution". In: Advances in Databases and Information Systems. Ed. by Morzy Tadeusz, Patrick Valduriez, and Ladjel Bellatreche. Cham: Springer International Publishing, 2015, pp. 63–76.
[72] Kai Herrmann et al. "Living in Parallel Realities – Co-Existing Schema Versions with a Bidirectional Database Evolution Language". In: SIGMOD '17, Proceedings of the 2017 International Conference on Management of Data, Chicago, IL, USA, May 14-19, 2017. ACM, May 2017.
[73] Tony Hey, Stewart Tansley, and Kristin M Tolle. "Jim Gray on eScience: a transformed scientific method." In: The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Oct. 2009.
[74] Tony Hey et al. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Oct. 2009.
[75] Jean-Marc Hick and Jean-Luc Hainaut. "Database application evolution: A transformational approach". In: Data & Knowledge Engineering 59.3 (2006), pp. 534–558.
[76] Christine Hine. "Databases as Scientific Instruments and Their Role in the Ordering of Scientific Work". In: Social Studies of Science 36.2 (2006), pp. 269–298.
[77] Bill Howe et al. "Database-as-a-Service for Long-Tail Science". In: Scientific and Statistical Database Management. Ed. by Judith Bayard Cushing, James French, and Shawn Bowers. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 480–489.
[78] Shrainik Jain et al. "SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment". In: Proceedings of the 2016 International Conference on Management of Data. SIGMOD '16. New York, NY, USA: ACM, 2016, pp. 281–293.
[79] S. Jansen, A. Finkelstein, and S. Brinkkemper. "A sense of community: A research agenda for software ecosystems". In: 2009 31st International Conference on Software Engineering - Companion Volume. May 2009, pp. 187–190.
[80] S Jensen and B Plale. "Schema-Independent and Schema-Friendly Scientific Metadata Management". In: eScience, 2008. eScience '08. IEEE Fourth International Conference on. Dec. 2008, pp. 428–429.
[81] Scott Jensen et al. "A hybrid XML-relational grid metadata catalog". In: Parallel Processing Workshops, 2006. ICPP 2006 Workshops. 2006 International Conference on. IEEE. Columbus, OH, USA, 2006, pp. 8–24.
[82] M B Jones et al. "Managing scientific metadata". In: IEEE Internet Computing 5.5 (Sept. 2001), pp. 59–68.
[83] Robert Kahn and Robert Wilensky. "A Framework for Distributed Digital Object Services". In: International Journal on Digital Libraries 6.2 (2006), pp. 115–123.
[84] Sean Kandel. "Enterprise data analysis and visualization: An interview study". In: Visualization and Computer Graphics, IEEE Transactions on 18 (2012), pp. 2917–2926.
[85] Sean Kandel et al. "Wrangler: Interactive Visual Specification of Data Transformation Scripts". In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems - CHI '11. New York, New York, USA: ACM Press, May 2011, pp. 3363–3372.
[86] W James Kent et al. "The human genome browser at UCSC". In: Genome Research 12.6 (June 2002), pp. 996–1006.
[87] Thomas Kluyver et al. "Jupyter Notebooks - a publishing format for reproducible computational workflows". In: Positioning and Power in Academic Publishing: Players, Agents and Agendas (2016), pp. 87–90.
[88] Benjamin Krogh, Andreas Weisberg, and Morten Bested. DBLint: A Tool for Automated Analysis of Database Design. 2011.
[89] Joseph Carl Robnett Licklider. In Memoriam, JCR Licklider, 1915-1990. Vol. 61. Digital Systems Research Center, 1990.
[90] Valentina Loconte et al. "Soft X-ray tomography to map and quantify organelle interactions at the mesoscale." In: Structure (Feb. 2022).
[91] Ravi Madduri et al. "Reproducible big data science: A case study in continuous FAIRness". In: PLOS ONE 14.4 (Apr. 2019), pp. 1–22.
[92] David Maier. Theory of Relational Databases. Computer Science Press, 1983.
[93] N. Malevris and S. Gardikiotis. "DaSIAn: A Tool for Estimating the Impact of Database Schema Modifications on Web Applications". In: 2006 IEEE International Conference on Computer Systems and Applications. Los Alamitos, CA, USA: IEEE Computer Society, Mar. 2006, pp. 188–195.
[94] Daniel S Marcus et al. "The Extensible Neuroimaging Archive Toolkit: an informatics platform for managing, exploring, and sharing neuroimaging data." In: Neuroinformatics 5.1 (2007), pp. 11–34.
[95] Mike Marin, Richard Hull, and Roman Vaculín. "Data Centric BPM and the Emerging Case Management Standard: A Short Survey". In: Business Process Management Workshops. Ed. by Marcello La Rosa and Pnina Soffer. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 24–30. isbn: 978-3-642-36285-9.
[96] Sergey Melnik, Erhard Rahm, and Philip A. Bernstein. "Rondo: A Programming Platform for Generic Model Management". In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. SIGMOD '03. New York, NY, USA: ACM, 2003, pp. 193–204.
[97] Sergey Melnik et al. "Supporting Executable Mappings in Model Management". In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. SIGMOD '05. New York, NY, USA: ACM, 2005, pp. 167–178.
[98] L. Meurice and A. Cleve. "DAHLIA: A visual analyzer of database schema evolution". In: 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). Feb. 2014, pp. 464–468.
[99] Eileen Meyer. "Big Data is Transforming How Astronomers Make Discoveries". In: Smithsonian Magazine (May 2018).
[100] George A. Miller. "WordNet: A Lexical Database for English". In: Communications of the ACM 38.11 (1995), pp. 39–41.
[101] Daniel L Moody. "Metrics for Evaluating the Quality of Entity Relationship Models". In: Proceedings of the 17th International Conference on Conceptual Modeling. ER '98. London, UK: Springer-Verlag, 1998, pp. 211–225.
[102] Luc Moreau. "The Foundations for Provenance on the Web". In: Foundations and Trends in Web Science 2.2–3 (2010), pp. 99–241.
[103] Luc Moreau and Paulo Missier. PROV-DM: The PROV Data Model. Apr. 2013.
[104] Christopher J. Mungall et al. "The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species". In: Nucleic Acids Research 45.D1 (Nov. 2016), pp. D712–D722.
[105] Christopher J Mungall and David B Emmert. "A Chado case study: an ontology-based modular schema for representing genome-associated biological information". In: Bioinformatics 23.13 (2007), pp. i337–i346.
[106] Albert van Niekerk. "Strategic management of media assets for optimizing market communication strategies, obtaining a sustainable competitive advantage and maximizing return on investment: An empirical study". In: Journal of Digital Asset Management 3.2 (2007), pp. 89–98.
[107] Yasunori Park et al. "Identification of Human Gene Research Articles with Wrongly Identified Nucleotide Sequences". In: Life Science Alliance 5.4 (Apr. 2022), e202101203.
[108] F. Perez and B. E. Granger. "IPython: A System for Interactive Scientific Computing". In: Computing in Science & Engineering 9.3 (May 2007), pp. 21–29.
[109] Beth Plale et al. "SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science". In: International Journal of Digital Curation 8.2 (Nov. 2013), pp. 172–180.
[110] PostgreSQL contributors. PostgreSQL 10.5 Documentation. The PostgreSQL Global Development Group. 2018.
[111] PostgreSQL contributors. PostgreSQL 12 Documentation: CREATE VIEW. [Online; accessed 06-March-2020]. 2020. url: https://www.postgresql.org/docs/12/sql-createview.html.
[112] Dong Qiu, Bixin Li, and Zhendong Su. "An Empirical Analysis of the Co-Evolution of Schema and Code in Database Applications". In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ESEC/FSE 2013. New York, NY, USA: Association for Computing Machinery, 2013, pp. 125–135.
[113] Arcot Rajasekar et al. "iRODS Primer: Integrated Rule-Oriented Data System". In: Synthesis Lectures on Information Concepts, Retrieval, and Services 2.1 (Jan. 2010), pp. 1–143.
[114] Barak Raveh et al. "Bayesian metamodeling of complex biological systems across varying representations." In: Proc Natl Acad Sci USA 118.35 (Aug. 2021).
[115] John F Roddick. "A survey of schema versioning issues for database systems". In: Information and Software Technology 37.7 (Jan. 1995), pp. 383–393.
[116] John F Roddick. "SQL/SE: a query language extension for databases supporting schema evolution". In: SIGMOD Record 21.3 (1992), pp. 1079–1080.
[117] John F. Roddick, Noel G. Craske, and Thomas J. Richards. "A Taxonomy for Schema Versioning Based on the Relational and Entity Relationship Models". In: Proc. Twelfth International Conference on Entity-Relationship Approach. Dallas, Texas: Springer-Verlag, 1993, pp. 143–154.
[118] Susanna-Assunta Sansone et al. "The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?"" In: OMICS: A Journal of Integrative Biology 12.2 (2008), pp. 143–149.
[119] Susanna-Assunta Sansone et al. "DATS, the data tag suite to enable discoverability of datasets". In: Scientific Data 4 (June 2017), 170059.
[120] Nuno Santos and Birger Koblitz. "Distributed Metadata with the AMGA Metadata Catalog". In: arXiv preprint cs/0604071 (Apr. 2006).
[121] H.-J. Schek and M.H. Scholl. "The relational model with relation-valued attributes". In: Information Systems 11.2 (1986), pp. 137–147.
[122] Guus Schreiber et al. RDF 1.1 Primer: W3C Working Group Note 24 June 2014. Tech. rep. World Wide Web Consortium, June 2014.
[123] Robert E. Schuler and Carl Kesselman. "Managing Database-Application Co-Evolution in a Scientific Data Ecosystem". In: 2022 IEEE 18th International Conference on e-Science (e-Science). Salt Lake City, Utah, USA: IEEE, Oct. 2022.
[124] Robert E. Schuler and Carl Kesselman. "Towards an Efficient and Effective Framework for the Evolution of Scientific Databases". In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management. SSDBM '18. Bozen-Bolzano, Italy: Association for Computing Machinery, 2018. isbn: 9781450365055. doi: 10.1145/3221269.3221300. url: https://doi.org/10.1145/3221269.3221300.
[125] Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Accelerating Data-Driven Discovery With Scientific Asset Management". In: 2016 IEEE 12th International Conference on e-Science (e-Science). Baltimore, MD, USA, 2016, pp. 1–10. doi: 10.1109/eScience.2016.7870883.
[126] Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "An Asset Management Approach to Continuous Integration of Heterogeneous Biomedical Data". In: Data Integration in the Life Sciences. Ed. by Helena Galhardas and Erhard Rahm. Cham: Springer International Publishing, 2014, pp. 1–15. isbn: 978-3-319-08590-6. doi: 10.1007/978-3-319-08590-6_1. url: https://doi.org/10.1007/978-3-319-08590-6_1.
[127] Robert E. Schuler, Carl Kesselman, and Karl Czajkowski. "Digital asset management for heterogeneous biomedical data in an era of data-intensive science". In: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on. Nov. 2014, pp. 588–592. doi: 10.1109/BIBM.2014.6999226.
[128] Robert E. Schuler and Carl Kesselman. "A High-Level User-Oriented Framework for Database Evolution". In: Proceedings of the 31st International Conference on Scientific and Statistical Database Management. SSDBM '19. Santa Cruz, CA, USA: Association for Computing Machinery, 2019, pp.
157{168.isbn: 9781450362160.doi: 10.1145/3335783.3335787. url: https://doi.org/10.1145/3335783.3335787. [129] Robert Schuler and Carl Kesselman. \CHiSEL: A User-Oriented Framework for Sim- pling Database Evolution". In: Distrib. Parallel Databases 39.2 (June 2021), pp. 483{ 543. issn: 0926-8782. doi: 10.1007/s10619-020-07314-x. url: https://doi.org/ 10.1007/s10619-020-07314-x. [130] Robert Schuler, Carl Kesselman, and Karl Czajkowski. \Data Centric Discovery with a Data-Oriented Architecture". In: Proceedings of the 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models. SCREAM '15. Portland, Oregon, USA: Association for Computing Machinery, 2015, pp. 37{44.isbn: 9781450335669.doi: 10.1145/2753524.2753532.url: https://doi.org/10.1145/ 2753524.2753532. [131] Robert Schuler et al. \Database Evolution, by Scientists, for Scientists: A Case Study". In preparation. [132] Robert Schuler et al. \Toward FAIR Knowledge Turns in Bioinformatics". In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2019). San Diego, CA: IEEE, Nov. 2019. [133] Robert Schuler et al. \Towards Co-Evolution of Data-Centric Ecosystems". In: 32nd International Conference on Scientic and Statistical Database Management. SSDBM 2020. Vienna, Austria: Association for Computing Machinery, 2020.isbn: 9781450388146. doi: 10.1145/3400903.3400908. url: https://doi.org/10.1145/3400903. 3400908. [134] David Shaywitz. \Pharma's Desperate Struggle To Teach Old Data New Tricks". In: Forbes (May 2019). [135] Gurmeet Singh et al. \A Metadata Catalog Service for Data Intensive Applications". In: SuperComputing 2003 (SC'03). Phoenix, Arizona: ACM, 2003. [136] Jitin Singla et al. \Opportunities and Challenges in Building a Spatiotemporal Multi- scale Model of the Human Pancreatic Cell". In: Cell 173.1 (2018), pp. 11{19. [137] Mitchell E Skinner et al. \JBrowse: a next-generation genome browser". In: Genome research 19.9 (2009), pp. 1630{1638. [138] MacKenzie Smith et al. \DSpace: An Open Source Dynamic Digital Repository". In: D-Lib Magazine 9.1 (2003). 252 [139] Susan Leigh Star and James R. Griesemer. \Institutional Ecology, `Translations' and Boundary Objects: Amateurs and Professionals in Berkeley's Museum of Vertebrate Zoology, 1907-39". In: Social Studies of Science 19.3 (1989), pp. 387{420. [140] R. Grant Steen, Arturo Casadevall, and Ferric C. Fang. \Why Has the Number of Scientic Retractions Increased?" In: PLoS ONE 8.7 (2013). [141] M. Stonebraker, D. Deng, and M. L. Brodie. \Database decay and how to avoid it". In: 2016 IEEE International Conference on Big Data (Big Data). Dec. 2016, pp. 7{16. [142] Michael Stonebraker, Dong Deng, and Michael L Brodie. \Application-database co- evolution: A new design and development paradigm". In: New England Database Day (2017), pp. 1{3. [143] Jason R Swedlow, Ilya G Goldberg, and Kevin W Eliceiri. \Bioimage informatics for experimental biology". In: Annual review of biophysics 38 (Jan. 2009), pp. 327{46. [144] Alexander S. Szalay. \The National Virtual Observatory". In: Astronomical Data Analysis Software and Systems X. Vol. 238. ASP Conference Series. 2001, p. 3. [145] Alexander S. Szalay et al. \Designing and Mining Multi-terabyte Astronomy Archives: The Sloan Digital Sky Survey". In: Proceedings of the 2000 ACM SIGMOD Interna- tional Conference on Management of Data. SIGMOD '00. New York, NY, USA: ACM, 2000, pp. 451{462. [146] Hongsuda Tangmunarunkit et al. Model-Adaptive Interface Generation for Data- Driven Discovery. 
2021. [147] Ignacio G Terrizzano et al. \Data Wrangling: The Challenging Journey from the Wild to the Lake." In: CIDR. 2015. [148] James F. Terwilliger, Philip A. Bernstein, and Adi Unnithan. \Automated Co-evolution of Conceptual Models, Physical Databases, and Mappings". In: Conceptual Modeling { ER 2010. Ed. by Jerey Parsons et al. Berlin, Heidelberg: Springer Berlin Heidel- berg, 2010, pp. 146{159. [149] James F. Terwilliger, Philip A. Bernstein, and Adi Unnithan. \Worry-Free Database Upgrades: Automated Model-Driven Evolution of Schemas and Complex Mappings". In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 1191{1194. [150] Rattapoom Tuchinda et al. \Artemis: Integrating Scientic Data on the Grid". In: Proceedings of the 16th Conference on Innovative Applications of Artical Intelligence. IAAI'04. AAAI Press, 2004, pp. 892{899. 253 [151] Arie Van Deursen, Paul Klint, and Joost Visser. \Domain-specic languages: An annotated bibliography". In: ACM Sigplan Notices 35.6 (2000), pp. 26{36. [152] A.J. van Niekerk. \The Strategic Management of Media Assets: A Methodological Approach". In: New Orleans Conference Proceedings: International Academy for Case Studies. Allied Academies. 2006. [153] Panos Vassiliadis. \Proles of Schema Evolution in Free Open Source Software Projects". In: 2021 IEEE 37th International Conference on Data Engineering (ICDE). 2021, pp. 1{12. [154] Panos Vassiliadis, Apostolos V. Zarras, and Ioannis Skoulis. \How is Life for a Table in an Evolving Relational Schema? Birth, Death and Everything in Between". In: Conceptual Modeling. Ed. by Paul Johannesson et al. Cham: Springer International Publishing, 2015, pp. 453{466. [155] Juan Antonio Vizca no et al. \2016 update of the PRIDE database and its related tools". In: Nucleic acids research 44.D1 (2016), pp. D447{D456. [156] Xinqi Wang et al. \Semantic enabled metadata management in PetaShare". In: In- ternational Journal of Grid and Utility Computing 1.4 (2009), pp. 275{286. [157] Zhongying Wang et al. \Live-cell imaging of glucose-induced metabolic coupling of and cell metabolism in health and type 2 diabetes." In: Commun Biol 4.1 (May 2021), p. 594. [158] Kate L White et al. \Visualizing subcellular rearrangements in intactcells using soft x-ray tomography." In: Sci Adv 6.50 (Dec. 2020). [159] Tom White. Hadoop: The Denitive Guide. O'Reilly Media, Inc., 2012. [160] Wikipedia contributors. Object-relational mapping. [Online; accessed 06-March-2020]. 2020. url: https://en.wikipedia.org/wiki/Object-relational_mapping. [161] Wikipedia contributors. Schema migration. [Online; accessed 16-October-2022]. 2021. url: https://en.wikipedia.org/wiki/Schema_migration. [162] Wikipedia contributors. Service-Oriented Architecture. [Online; accessed 16-October- 2022]. 2022.url:https://en.wikipedia.org/wiki/Service-oriented_architecture. [163] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, et al. \The FAIR Guiding Principles for scientic data management and stewardship". In: Scientic Data 3 (Mar. 2016), 160018 EP -. 254 [164] Mark D. Wilkinson et al. \The FAIR Guiding Principles for scientic data manage- ment and stewardship". In: Scientic Data 3 (Mar. 2016), 160018 EP -. [165] D. N. Williams et al. \The Earth System Grid: Enabling Access to Multimodel Cli- mate Simulation Data". In: Bulletin of the American Meteorological Society 90.2 (2009), pp. 195{206. [166] T.B. Winans and J.S. Brown. 
Web Services 2.0: Policy-driven Service Oriented Ar- chitectures. Tech. rep. Deloitte, 2008. [167] Katherine Wolstencroft et al. \The Taverna Work ow Suite: Designing and Executing Work ows of Web Services on the Desktop, Web or in the Cloud". In: Nucleic Acids Research 41.W1 (July 2013), W557{W561. [168] Mengfei Yu et al. \Cranial Suture Regeneration Mitigates Skull and Neurocognitive Defects in Craniosynostosis". In: Cell 184.1 (2021), 243{256.e18. 255
Abstract
In order to drive discovery from data, scientists rely on accurate, up-to-date descriptions of that data; without explicit schemas and descriptive metadata, scientific data would be lost. Yet scientific data is among the most challenging to manage because of schema evolution brought on by changing experiments, new technologies and instruments, changing methods, emerging standards, and differing conceptualizations among investigators.
In this thesis, we study the problem of evolving schemas for scientific data. First, we introduce a user-oriented framework for schema evolution based on an algebra of schema modification operators that simplifies the tasks needed to evolve scientific database schemas. We implement the framework as an embedded domain-specific language usable in the executable notebook environments already familiar to scientists, reducing the effort required for schema evolution, and we describe algorithms for efficient planning and execution of the resulting algebraic expressions. We then extend the framework with model management operators to support the coupled evolution of the model mappings used by database-dependent applications.
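The operator-algebra approach is easiest to picture with a small example. The sketch below is purely illustrative and is not the framework's actual API: all names (Relation, Operator, Plan, project, rename) are hypothetical stand-ins. It shows only the general idea that schema modifications can be composed as a deferred algebraic expression in ordinary Python, for example in a notebook cell, and then executed as a simple left-to-right plan.

# Hypothetical sketch of an embedded operator algebra for schema evolution.
# None of these names come from the thesis; they only illustrate the idea.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Relation:
    """In-memory stand-in for a table: a name, its columns, and its rows."""
    name: str
    columns: list
    rows: list = field(default_factory=list)

@dataclass
class Operator:
    """A deferred schema/data transformation from one Relation to another."""
    description: str
    apply: Callable

def rename(old, new):
    """Rename a column in both the schema and the data."""
    def _apply(r):
        cols = [new if c == old else c for c in r.columns]
        rows = [{(new if c == old else c): v for c, v in row.items()}
                for row in r.rows]
        return Relation(r.name, cols, rows)
    return Operator("rename(%s->%s)" % (old, new), _apply)

def project(keep):
    """Keep only the listed columns, dropping the rest."""
    return Operator(
        "project(%s)" % keep,
        lambda r: Relation(r.name, list(keep),
                           [{c: row[c] for c in keep} for row in r.rows]))

@dataclass
class Plan:
    """A lazily built expression of operators, executed on demand."""
    source: Relation
    steps: list = field(default_factory=list)

    def __or__(self, op):                 # plan | operator -> longer plan
        return Plan(self.source, self.steps + [op])

    def execute(self):                    # naive left-to-right evaluation
        rel = self.source
        for op in self.steps:
            rel = op.apply(rel)
        return rel

# Usage, e.g. in a notebook cell: compose the expression, then execute it.
experiment = Relation(
    "experiment",
    ["id", "speciman", "notes"],          # note the misspelled column
    [{"id": 1, "speciman": "mouse", "notes": "pilot run"}])

evolved = (Plan(experiment)
           | rename("speciman", "specimen")   # correct the column name
           | project(["id", "specimen"])      # drop the free-text column
           ).execute()

print(evolved.columns)                    # ['id', 'specimen']

In the framework described in the body of the thesis, such expressions operate on real relational schemas and their data, and are planned and optimized before execution rather than evaluated naively as above.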
We present an analysis and experimental evaluation of the framework's effectiveness and efficiency in reducing the time and effort needed to evolve schemas for scientific data, and we describe use cases that illustrate its utility in real-world applications. We show that these contributions enable scientists to evolve database schemas with less effort and greater efficiency.
Asset Metadata
Creator: Schuler, Robert Edward (author)
Core Title: Schema evolution for scientific asset management
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-12
Publication Date: 12/14/2022
Defense Date: 12/08/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: database, OAI-PMH Harvest, schema evolution, scientific asset management
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kesselman, Carl (committee chair), Meeker, Daniella (committee member), Nakano, Aiichiro (committee member)
Creator Email: rschuler@usc.edu, schuler@isi.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112620683
Unique Identifier: UC112620683
Identifier: etd-SchulerRob-11368.pdf (filename)
Legacy Identifier: etd-SchulerRob-11368
Document Type: Dissertation
Rights: Schuler, Robert Edward
Internet Media Type: application/pdf
Type: texts
Source: 20221214-usctheses-batch-996 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu