Optimizing Execution of In Situ Workflows

by

Tu Mai Anh Do

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2022

Dedication

To my parents, my sister and my girlfriend

Acknowledgements

This Ph.D. journey could not have happened without the hearty support of those who have walked alongside me and those who have supported me silently from behind. First and foremost, I would like to express my deepest gratitude to Professor Ewa Deelman, my dissertation advisor, for her valuable guidance and endless support throughout the years of my Ph.D. I am greatly thankful to her for having confidence in my abilities and giving me precious opportunities to freely explore various challenging research topics. I am very grateful for her patience and understanding in advising me through every aspect of doing research, and for the valuable advice she gave me on growing into an independent researcher. Without her outstanding leadership and persistent guidance, I would not have been able to complete this Ph.D. journey, especially during the two difficult years of the COVID-19 pandemic.

When I think about the things I have been grateful for over these past years, my colleagues from the Science Automation Technologies (SciTech) research group are always on my list. More specifically, I would like to express great appreciation for Dr. Loïc Pottier, my dear colleague and my best friend at the same time. I learned many research skills directly from him and, more than that, we shared many great ideas and joyful experiences in both life and research. I would also like to thank Dr. Rafael Ferreira da Silva, even though he has since left the group. I had the chance to work with him on several projects and learned a lot from him during my first three years in the research group. I am fortunate to have had them accompany me on this long journey. Moreover, I am forever grateful to have shared the Ph.D. experience with a group of bright and talented people: Karan Vahi, Mats Rynge, Rajiv Mayani, George Papadimitriou, Patrycja Krawczuk, Ryan Tanaka, Wendy Whitcup, Tainã Coleman, Nicole Virdone, Ciji Davis, and Zaiyan Alam. Thank you all for always stepping in and taking the time to help me whenever I needed it. I could not ask for better colleagues and truly enjoyed being part of the team. I am very grateful to them for making SciTech an awesome place to work every day.

I would also like to express my gratitude to the collaborators I had the chance to work with during my Ph.D. In particular, I would like to thank all the researchers involved in the Analytics4MD project, especially Dr. Michela Taufer and Dr. Silvina Caíno-Lores. I want to express my gratitude for everything that you helped me achieve in this project, which in the end contributed greatly to this dissertation.

I would like to express my sincere gratitude to the members of my qualifying exam, proposal, and dissertation defense committees: Professor Aiichiro Nakano, Professor Viktor Prasanna, Professor Michela Taufer, Professor John Heidemann, and Professor Ramesh Govindan. Thank you for sharing your honest and constructive feedback about the dissertation, from the very initial stage of the Qualifying Exam to the final stage of the Dissertation Defense. I value and respect your opinions on making the dissertation better, and I appreciate the time and effort you devoted to serving on my committee.
I cannot thank my father, my mother, and my sister enough for their unconditional love and constant encouragement, no matter what. They have consistently granted me the most precious freedom while having my back in every venture, so that I had the strength to pursue my dreams. Last but not least, I am very grateful to my lovely girlfriend (my wife to be). You have always stood by me and been there for me on the good days and the bad days. Words cannot express how much your love and support throughout the many years of my Ph.D. mean to me.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Context and Contributions
    1.1 Context
        1.1.1 Molecular Dynamics Simulations
        1.1.2 High Performance Computing
        1.1.3 In Situ Analytics
        1.1.4 Workflows and Workflow Ensembles
    1.2 Problems and Contributions
        1.2.1 Understanding Execution of In situ Workflows
        1.2.2 Performance Evaluation of In situ Workflows
        1.2.3 From In situ Workflows to Workflow Ensembles
        1.2.4 Co-scheduling and Resource Management for In situ Workflows
Chapter 2: Modeling Framework for Characterizing In situ Workflows
    2.1 Related Work
    2.2 In situ Workflow Architecture
        2.2.1 General Architecture
        2.2.2 Data Management Framework
        2.2.3 In situ Metrics for Characterization
    2.3 Characterizing In situ Execution
        2.3.1 MD Workflow: Example of Producer-consumer Patterns
        2.3.2 Results
        2.3.3 Execution Patterns
    2.4 Conclusion
Chapter 3: Computational Efficiency Model of In situ Execution
    3.1 Related Work
    3.2 In Situ Execution Model
        3.2.1 Framework
        3.2.2 Execution Constraints
        3.2.3 Buffer Constraints
        3.2.4 Idle Stages
        3.2.5 Consistency Across Steps
    3.3 Efficiency Model
        3.3.1 In Situ Step
        3.3.2 Makespan Estimation
        3.3.3 Computational Efficiency
    3.4 Molecular Dynamics Synthetic Workflow
        3.4.1 Workflow Description
        3.4.2 Experimental Setups
        3.4.3 Results
    3.5 Molecular Dynamics Realistic Workflow
        3.5.1 Workflow Description
        3.5.2 Experimental Setups
        3.5.3 Results
    3.6 Conclusion
Chapter 4: Performance Indicators to Evaluate Performance of In situ Workflow Ensembles
    4.1 Related Work
    4.2 Workflow Ensemble
        4.2.1 Experimental Setup
        4.2.2 Analysis of Workflow Ensemble Co-location
    4.3 Discussion
        4.3.1 Choice of Settings
        4.3.2 Impact of Co-location
    4.4 Performance Indicators
        4.4.1 Framework Definition
        4.4.2 Notations
        4.4.3 Member Resource Usage (U)
        4.4.4 Member Resource Allocation (A)
        4.4.5 Ensemble Resource Provisioning (P)
        4.4.6 Objective Function
    4.5 Experimental Evaluation
        4.5.1 Configuration Exploration
        4.5.2 Increased Number of Ensemble Members
    4.6 Conclusion
Chapter 5: Co-scheduling Ensembles of In situ Workflows
    5.1 Related Work
    5.2 Model
        5.2.1 Couplings
        5.2.2 Co-scheduling
        5.2.3 Application model
        5.2.4 Makespan
    5.3 Scheduling Problems
        5.3.1 Problems
    5.4 Resource Allocation
        5.4.1 Rational Allocation
        5.4.2 Integer Allocation
    5.5 Evaluation
        5.5.1 Experimental Setup
        5.5.2 Results
    5.6 Conclusion
Summary
Bibliography
List of Publications
Appendix A
    Proofs for Chapter 5
    A.1 Proof of Theorem 1
    A.2 Proof of Theorem 2
    A.3 Proof of Theorem 3
    A.4 Proof of Theorem 4

List of Tables

2.1 Selected metrics for in situ workflow characterization
2.2 State-of-the-art tools used for profiling in situ applications
2.3 Characteristics of our molecular system
3.1 Parameters used in the experiments
3.2 Parameters used in the experiments
4.1 Set of metrics (LLC stands for last-level cache)
4.2 Experimental scenario configuration settings
4.3 Experimental configurations with two ensemble members, each ensemble member having two analyses per simulation
4.4 Notations
4.5 Experimental configurations with two ensemble members, each ensemble member having two analyses per simulation
4.6 Experimental configurations for the first 4 ensemble members allocated on 2 compute nodes, each ensemble member having one simulation and one analysis. To increase the number of ensemble members, these settings can be replicated with a higher number of nodes (e.g., 8 ensemble members on 4 compute nodes, 16 ensemble members on 8 compute nodes)
5.1 Notations
5.2 Bandwidth models used to calibrate the simulator, where N is the number of nodes assigned to the analysis in P NC
5.3 Experimental scenarios, where A(x%) is the x% largest analyses' sequential time of A

List of Figures

1.1 Architectural overview of HPC systems at multiple levels: system, compute node, and central processing unit
1.2 Processing paradigms for the simulation-analysis pipeline
1.3 Scheduling placements of in situ processing
1.4 Three levels of a workflow ensemble: Ensemble Component, Ensemble Member, Ensemble Workflow
2.1 A general in situ workflow software architecture
2.2 Architectural overview of the proposed runtime system for managing workflow ensemble executions. The Data Transport Layer (DTL) represents the in-memory staging area, and the DTL plugins provide the interface between the ensemble components and the underlying DTL
2.3 Our workflow integrating in situ analyses
2.4 Number of distance matrices and their size as a function of the length of the non-overlapping segments into which a molecular system with N = 1266 can be cut
2.5 The observed analysis idle times and simulation idle time with the in situ configurations and in transit configurations
2.6 Execution patterns of in situ workflows
3.1 Classic workflow design featuring two components and four stages
3.2 Dependency constraints within and across in situ steps with n = m = 3
3.3 Two different execution scenarios for in situ workflow execution
3.4 Example of fine-grained execution steps for a member of one ensemble (idle simulation and analyzer represent coupled simulation-analysis scenarios)
3.5 Synthetic Workflow: the Extractor (1) sleeps during the emulated delay, then (2) extracts a snapshot of atomic states from existing trajectories and (3) stores it into a synthetic frame. The Ingestor (4) serializes the frame as a chunk and stages it in memory, then the Retriever (5) gets the chunk from the DTL and deserializes it into a frame. Eventually, the MD Analytics performs a certain analysis algorithm on that frame
3.6 MD benchmarking results from the literature obtained using 512 NVIDIA K20X GPUs. The results are interpolated to obtain the (a) estimated performance, which is then combined with the stride to synthesize the (b) emulated simulation delay
3.7 Execution time of LEBM on 16 cores, using a segment length of 16. The fraction of alpha-amino acids in the entire system is equal to 0.00469
3.8 Execution time per step for each component. The Synthetic Simulation stages are on the left and the Analyzer stages are on the right (lower is better)
3.9 Left y-axis: total idle time I using a helper-core placement at stride 16000 (the lower the better). Estimated I is estimated from Equations (3.8) and (3.9), and Measured I is the measured idle time in one in situ step. Right y-axis: ratio Estimated I / Measured I (the closer to 1 the better)
3.10 Detailed idle time I for three component placements at different strides when varying the number of atoms (lower is better)
3.11 Makespan estimated at stride 16000; the yellow region represents the error. The ratio of Estimated Makespan to Measured Makespan uses the second y-axis on the right (close to 1 is better)
3.12 Computational efficiency (higher is better)
3.13 Practical Workflow: GROMACS (1) simulates the motion of the atomic system in steps, where Plumed (2) intervenes at every stride to update and gather new coordinates and store them into a frame. The Ingestor (3) serializes the frame as a chunk and stages it in memory, then the Retriever (5) gets the chunk from the DTL and deserializes it into a frame. Eventually, the MD Analytics performs the same CV-calculation analysis on that frame as in the Synthetic Workflow
3.14 Execution time per in situ step for each component with the helper-core and in transit placement. The Practical Simulation stages are on the left and the Analyzer stages are on the right (lower is better)
3.15 Makespan estimated from Equation (3.7) for the helper-core and in transit component placements; the yellow region represents the error of the Estimated Makespan from the Measured Makespan
3.16 Computational efficiency of the practical workflow over a variety of strides (higher is better)
4.1 Metrics at ensemble component level
4.2 Ensemble member makespan
4.3 Workflow ensemble makespan
4.4 Execution time of the in situ step and computational efficiency when varying the number of cores assigned to the analysis with a fixed simulation setting
4.5 Computational efficiency when varying the number of analyses per ensemble member. Note that the missing values in the C_dedicated configuration when running with 4 analyses per ensemble member are due to out-of-memory errors on the node
4.6 F(P_i) for different P_i orders over configurations that have one analysis per simulation (the higher the better)
4.7 F(P_i) for different P_i orders over configurations that have two analyses per simulation (the higher the better)
4.8 F(P_i) for different P_i orders over configurations that have one analysis per simulation (the higher the better). Each configuration is measured over 5 trials
5.1 Illustration of the co-scheduling and coupling notions
5.2 The simulation setup used for experiments
5.3 Makespan estimated by the model for the Increasing-50% scenario with different values of bandwidth per node (see Table 5.2)
5.4 Makespan of the workflow ensemble when varying various configurations
5.5 Comparison of the normalized makespan of Co-Alloc to Ev-Alloc, where resources are evenly distributed. (n: Co-Alloc; c: Ev-Alloc) indicates that Co-Alloc is applied at node level while Ev-Alloc is applied at core level

Abstract

In the past decades, computing has played a significant role in numerous scientific breakthroughs, as computers have been used to identify solutions to increasingly complex problems. In computational science, computer simulation has been used as a complementary approach to empirical and theoretical studies for a greater understanding of many elements of the real world [27]. The discoveries obtained through scientific simulations are driven by the computing power they can harness [85]. To meet the necessary complexity of real systems, enormous computing capacity is required to fully simulate the state of billions of elements, which makes such simulations typically compute-intensive (i.e., the time spent performing operations on data is orders of magnitude higher than the time spent storing data). Thanks to the use of high-performance computing (HPC) infrastructures, research using simulations has made important contributions in many scientific domains, such as understanding regulation phenomena in proteins, molecular docking for drug design, and refining structure prediction in structural bioinformatics [53]. In this thesis, we target one of the most common simulations executed on HPC machines: molecular dynamics (MD) simulations, which emulate the physical movements of atoms and molecules in a molecular system at atomic resolution.
MD simulations form a significant fraction of the workloads on HPC machines: for instance, molecular biosciences accounted for a substantial share (25%) of XSEDE resource utilization, as measured by allocation units consumed between 2011-07 and 2017-09 [93], and the top 3 most heavily used applications are widely used software packages for biomolecular simulation, namely NAMD [83], GROMACS [1], and CHARMM [17].

HPC has undergone significant and rapid development in the past decades, with ever-increasing computing capability defined by the number of floating point operations (FLOP) computed in one second. After the petascale (10^15 floating point operations per second) milestone was reached in 2008 [38], exascale (10^18 floating point operations per second) is now achievable with Frontier [88], the first known exascale system in production according to the TOP500 list of the world's fastest computers. Frontier has dramatically larger (5x) compute performance than Oak Ridge National Laboratory's previous petaflop supercomputer, Summit. The transition from petascale to exascale computing not only brings unprecedented compute capability to HPC systems, but also presents new opportunities for scientific discovery. The increase in computing capability directly translates into the ability to execute more and longer simulations, which consequently generate more data that needs to be stored and processed, and insight into the simulated systems primarily depends on the ability to analyze that generated data. Conventionally, the simulation outputs the entire simulated data set to the file system for later post-processing [63]. Unfortunately, there is a growing performance gap between current processors and today's storage systems, as they do not advance at the same pace. In the past 20 years, processor performance increased by 55% each year, while the performance of storage access gained only 10% per year [65]. The slow growth of I/O technologies compared to the computing capability of present-day processors turns post-processing into an I/O bottleneck, because saving data to storage is not as fast as the rate at which data is generated.

Several processing paradigm shifts have emerged to tackle the challenge of the large amount of produced data that needs to be analyzed. New high-performance memory systems that reside closer to the computation units have been developed (e.g., burst buffers [45], non-volatile memory [44], high-bandwidth memory [103]). Moreover, since node communication bandwidth has not increased proportionally to computing power, the cost of data movement is starting to exceed the cost of floating-point computations [39]. This trend leads to the transition from compute-centric models, in which data are moved to where the computation takes place, to data-centric or processing-near-memory models [59], which locate computing components close to where data resides and are expected to reduce the overhead associated with data movement. Following data-centric models, a new processing paradigm has recently emerged, called in situ, in which simulation data is analyzed on the fly to reduce the expensive I/O cost of saving massive data for post-processing. When coupled with in situ analyses, instead of storing the data produced by the simulation on disk, the main simulation periodically sends data through faster memories (e.g., DRAM, SSD, NVRAM) to analysis kernels running simultaneously with the simulation component as the run progresses.
By interleaving their execution with the simulation, running the analyses in situ helps to reduce the time to solution of the simulation-analysis pipeline, thereby maximizing the scientific discoveries obtained under time constraints. Performing analyses in situ not only makes it possible to analyze near real-time data produced by the simulation and obtain on-the-fly insights into the molecular system at runtime, but also allows users to considerably lessen the final amount of simulation data kept for post-processing, thereby alleviating bottlenecks in the storage system.

Traditionally, an execution of multiple computational tasks, together with the data movements and control flow dependencies between those tasks, is represented as a workflow [98]. In this work, an in situ workflow [12] describes a scientific workflow with multiple concurrent components (simulations and analyses) that couple data in situ [101, 21] and are potentially co-scheduled together, i.e., the simulations and analyses coordinate their executions on the same allocated resources [81] to minimize the cost of data movement. These concurrent components periodically exchange data as the run progresses: data iteratively produced by the main simulation are processed, analyzed, and visualized in situ by the analyses at runtime rather than post-processed, so their executions are synchronized, or tightly coupled, based on their data relationships. The ability to efficiently execute an in situ workflow, from a performance point of view, depends on an in-depth understanding of the in situ design space, which includes component placements, data coupling schemes, and synchronization methods. However, the greater the number of components in the workflow, the larger the multi-parametric space that needs to be explored, and optimizing the execution of in situ workflows in terms of maximizing resource utilization and minimizing execution time therefore becomes problematic. This thesis aims to answer the question of how to improve the performance of executing this emerging type of in situ workflow on modern HPC systems.

To first understand the behavior of in situ workflows, efficient and accurate characterization practices suited to in situ workflows are required to capture the individual behavior of multiple coupled workflow components (i.e., concurrent executions of overlapped steps with inter-component data dependencies). This information is necessary for predicting the performance, resource usage, and overall workflow behavior. We focus on the characterization aspect of in situ workflows to understand their execution patterns in Chapter 2. With the advent of in situ workflows, new monitoring and profiling approaches [4, 105, 109] targeting tightly-coupled workflows have been studied. Thanks to these advanced frameworks, extracting profiling data from in situ workflows is feasible. However, in situ evaluation is still challenging due to the lack of thorough guidelines for using these data to extract meaningful insights on in situ workflows. Only a few studies [21, 56] have addressed this challenging problem. In Chapter 3, we investigate how to use profiling data to quantify the efficiency of in situ workflows.

Moreover, the aforementioned advancements in computing power not only allow longer simulations to run but also offer opportunities to run multiple simulations in parallel, a.k.a. simulation ensembles. The key challenge in enabling ensembles of simulations is that evaluating each component independently does not guarantee a thorough understanding of the overall ensemble performance. This is due to the concurrent executions and tightly-coupled behaviors of in situ components, which may affect each other's performance. The incorporation of numerous in situ components in ensembles requires effective management of co-scheduling the simulations and the analyses to deal with performance interference when they are mapped onto the same resources. To address the increasing need to efficiently execute simultaneous simulations as an ensemble, we further extend the notion of
This is due to concurrent executions and tightly-coupled behaviors of in situ components, which may impact each others’ performance. The incorporation of numerous in situ components in ensembles requires eective management of co-scheduling the simulations and the analyses to deal with performance interference when they are mapped onto same resources. To address the increasing necessity of eciently executing simultaneous simulations as an ensemble, we further extend the notion of xv eciency to be capable of taking into account how resources are distributed and allocated among ensemble’s components for performance evaluation, in Chapter 4. Even though we can characterize the performance of an in situ execution, to schedule an execution of an ensemble comprised of multiple simulations and in situ analyses on HPC machines, we still need to answer two following questions: (i) how many resources are provisioned to in situ tasks and (ii) how in situ tasks are co-scheduled together on given resources. Designing scheduling algorithms for in situ workows such that available resources are eciently utilized is challenging as optimizing individual components does not ensure that the end-to-end performance of the workow is optimal. Specically, in the context of ensembles that consist of many simulations and in situ analyses running concurrently, the combination of their data coupling behaviors signicantly increases the complexity of the scheduling decision. We propose answers for the two aforementioned questions in Chapter 5. The thesis is organized as follows. In Chapter 1, we comprehensively review related work of this thesis from molecular dynamics simulations to in situ analyses, and then declare the problems we are addressing and describe our contributions. In Chapter 2, we start with the characterization aspects of in situ workows to understand their executions. We introduce a novel metric to reect computational eciency of an in situ execution in Chapter 3, and further extend the metric to be applicable to ensembles in Chapter 4. Finally, in Chapter 5, we determine an approach to schedule in situ tasks in an ensemble under constraints of computing resources. The main contributions of each chapter are summarized below. Chapter1: ContextandContributions In this preliminary chapter, we introduce the global context of this thesis. Present-day high performance computing systems are not only valuable tools to accommodate molecular dynamics simulations, they also allow to incorporate analyses of simulation data into a pipeline to understand the dynamics of the simulated molecular system while it is being executed. In preparation for next-generation computers and to overcome xvi I/O bottlenecks in gradual-growth storage systems, we review the emergence of in situ processing, which is executed under an in situ workow. The idea behind in situ is to analyze data as soon as it is generated. We then describe challenges facing in the use of in situ workows and detail our contributions to resolve those challenges. Chapter2: ModelingFrameworkforCharacterizingInsituWorkows In this chapter, we intend to understand execution of in situ workows. Through reviewing essential components of an in situ architecture, we model a framework that allows us to decouple in situ analyses from the simulation. The modeled framework helps to resolve complexities in enabling an in situ execution in terms of data coupling, data compatibility. 
Furthermore, by analyzing in situ systems, we derive a set of key metrics to characterize the behavior of in situ workflows and verify that they can be collected by state-of-the-art monitoring tools. Thanks to the modeled framework and the derived metrics, we empirically classify executions into two pertinent in situ patterns.

Chapter 3: Computational Efficiency Model of In situ Execution
This chapter aims to provide guidelines for the evaluation of in situ workflows. Based on the iterative patterns characterized in Chapter 2, we introduce a theoretical framework that formalizes in situ workflow executions. We develop a lightweight metric that is beneficial when comparing the efficiency of configuration variations in an in situ system. The aim is to provide an approach that can run concurrently with the in situ workflow and possibly enable its adaptation at runtime. Since the proposed efficiency metric is lightweight, it is helpful for designing an in situ workflow and an ensemble of workflows in production.

Chapter 4: Performance Indicators to Evaluate Performance of In situ Workflow Ensembles
Given the emergence of ensemble-based methods, this chapter aims to characterize the performance of in situ ensembles on HPC systems. To prepare for efficiently co-scheduling numerous jobs in an ensemble, we first determine how to evaluate the performance of an ensemble's execution. Specifically, we extend and consolidate the notion of computational efficiency introduced in Chapter 3 into novel performance indicators that allow us to assess the expected efficiency of a workflow ensemble. The proposed performance indicators take into account multiple resource perspectives, such as resource usage, resource allocation, and resource provisioning, to provide accurate placement-dependent performance evaluation.

Chapter 5: Co-scheduling Ensembles of In situ Workflows
Resource management is essential for ensembles, as co-scheduled concurrent components may slow each other down. In order to design effective scheduling of in situ workflow ensembles, we answer two questions: (i) how are concurrent jobs in the ensemble allocated or co-scheduled on available resources to maximize performance and prevent performance degradation? and (ii) how are resources provisioned to the components of the workflow to maximize the efficiency of resource utilization? Our last contribution proposes an approach to compute efficient co-scheduling strategies and optimal resource assignments for workflow ensembles of simulations and in situ analyses.

Chapter 1: Context and Contributions

1.1 Context
Computational science is a field of study that uses computers to identify solutions to increasingly complex problems. In the past decades, computer simulation has been used as a complementary tool in computational science, alongside empirical and theoretical studies, for a greater understanding of many aspects of the real world. Thanks to the computing power of high-performance computing (HPC) infrastructures, the use of simulation in HPC has made important contributions in many scientific domains, such as physics, chemistry, biology, and materials science. However, there is always a need for more computing capacity to fully simulate realistic systems at their necessary complexity. The transition from petascale (10^15 floating point operations per second) to exascale (10^18 floating point operations per second) computing brings unprecedented compute capability to HPC systems, thereby presenting new opportunities for scientific discovery.
Consider, for instance, the Frontier supercomputer, the No. 1 HPC system on the TOP500 list, which has dramatically larger compute performance (5x) than Oak Ridge National Laboratory's previous supercomputer, Summit. The increase in computing capability directly translates into the ability to execute more and longer simulations, which consequently generate more data that needs to be stored and processed.

1.1.1 Molecular Dynamics Simulations
Experiments to study the dynamics of complex structures are usually sophisticated and time-consuming to conduct in the lab. Computer simulations provide a method to validate theoretical models that are not experimentally attainable in reality. They help to bridge the gap between theory and experiment and support a better understanding of real-life systems. Molecular dynamics (MD) simulation is a popular model that computes the atomic states of a molecular system evolving over time by tracking the microscopic interactions between atoms. MD simulation serves as a productive method to observe important processes at atomic resolution and better understand the behavior of the system. Rather than dealing with the complexity of setting up an experimental environment, MD simulation offers a handy way to control the configuration of the molecular system, such as temperature and pressure. Due to these advantages, MD simulations have been extensively applied in various scientific domains, including chemistry, materials science, molecular biology, and drug design. Specifically, MD simulations computationally replicate the motions of a physical molecular system by iterating a two-step algorithm. First, the interactions between atoms are calculated using physical-chemical knowledge called a force field, which consists of a mathematical function of the potential energy and a set of pre-calibrated parameters. Second, based on the calculated forces, the positions of the atoms are advanced by solving Newton's equations of motion over a small time step, i.e., the time interval for sampling the system.

Timescale Challenges. Generally, the efficiency of MD simulations mainly depends on the computational complexity of the algorithm, which is determined by two factors: (1) the number of time steps, which defines the time scale over which the system is simulated, and (2) the size of the system (number of atoms/particles). Empirical MD simulation requires a time step on the order of femtoseconds (normally in the range of 1 to 2 fs), which corresponds to the shortest characteristic time needed to ensure numerical stability and capture the fastest changes of the system. However, most interesting molecular events, such as conformational changes, phase transitions, or binding events, are observable only at timescales of nanoseconds, microseconds, or longer. Hence, there is a need to simulate long trajectories of millions or billions of time steps. Increasing the number of time steps linearly scales the simulated time span and the algorithmic complexity, and thereby the computational resources: the total time of the simulation scales linearly as O(t), where t is the number of time steps. On the other hand, the size of the molecular system determines the computational cost of the simulation.
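To make these two cost drivers concrete, the following sketch (not code from the dissertation; a toy Lennard-Jones system with assumed parameters and arbitrary units) iterates the two-step algorithm: an O(N^2) pairwise force evaluation followed by an update of positions and velocities from Newton's equations, repeated for a chosen number of time steps.

```python
import numpy as np

def lennard_jones_forces(pos, eps=1.0, sigma=1.0):
    """Naive O(N^2) pairwise force evaluation (step 1 of the MD loop)."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = np.dot(r, r)
            s2 = sigma**2 / d2
            # Lennard-Jones force vector on atom i due to atom j
            f = 24 * eps * (2 * s2**6 - s2**3) / d2 * r
            forces[i] += f
            forces[j] -= f          # Newton's third law
    return forces

def run_md(pos, vel, mass=1.0, dt=2e-15, n_steps=1000):
    """Iterate the two-step algorithm for n_steps (velocity Verlet integration)."""
    forces = lennard_jones_forces(pos)
    for _ in range(n_steps):
        # Step 2: advance velocities and positions from Newton's equations
        vel += 0.5 * dt * forces / mass
        pos += dt * vel
        forces = lennard_jones_forces(pos)   # Step 1: recompute interactions
        vel += 0.5 * dt * forces / mass
    return pos, vel

# Toy system: 64 atoms at random coordinates, simulated for 10 steps
rng = np.random.default_rng(0)
positions, velocities = rng.random((64, 3)) * 10.0, np.zeros((64, 3))
run_md(positions, velocities, n_steps=10)
```

Doubling the number of steps doubles the wall-clock cost of the outer loop, while doubling the number of atoms roughly quadruples the cost of each call to the force routine.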
Calculating long-range forces requires evaluating all pairwise interactions between atoms as a function of their coordinates; for a system of hundreds of thousands of atoms this is by far the most computationally demanding part, making even a single simulation time step time-consuming to obtain. For example, in a system composed of N particles, the force computations impose O(N^2) operations at each time step [102]. Since typical MD simulations commonly comprise thousands or even millions of interacting atoms, achieving a long timescale is challenging due to the computationally intensive nature of MD simulations. Therefore, most present-day MD simulations are deployed on high performance computing (HPC) systems to sustain long timescales and massive system sizes.

1.1.2 High Performance Computing
In the past, an MD simulation on a timescale of nanoseconds could take hours or days to complete on ordinary computers due to the large number of calculations required to emulate atomic interactions. To speed up these computations, considerable effort was made on the hardware side to make processors faster. Unfortunately, chip manufacturers could no longer maintain the exponential improvement of clock frequencies (Dennard scaling) because voltage reduction could not be kept within energy constraints [101]. To keep computing capability rising despite the stagnation of single-processor performance, the alternative approach exploits the fact that computation can be parallelized and executed simultaneously on multiple processors.

Figure 1.1: Architectural overview of HPC systems at multiple levels: system, compute node, and central processing unit.

Parallelism requires hardware that supports multiprocessing with enough computing capacity to exploit it. The ability to integrate more processors into present-day computers enables parallel computing, which uses multiprocessing to realize the desired hardware parallelism. The first attempts at parallel computing were conducted on machines comprised of multiple Central Processing Units (CPUs). The evolution of multi-core architectures later allowed a CPU to accommodate multiple cores (processing units). In parallel computing, each portion of the computation, i.e., a thread, is usually handled by one core of a multi-core CPU. To support a greater level of parallelism, multiple multi-core/multi-processor machines can be connected via a high-speed interconnect network to form a cluster, in which each machine is considered a compute node. Consequently, the past decade witnessed the development of cluster computers into many HPC systems serving the high computing demands of the sciences. MD simulations are now commonly run on HPC platforms that provide the computational power needed for high-impact science. However, enhancing computing capability by simply increasing the number of cores per compute node does not by itself guarantee an increase in simulation performance. MD simulations must be parallelized, so that the simulated system is partitioned into smaller spatial regions that are solved in parallel threads, each thread assigned to a different portion of the hardware. By computing multiple parts of the simulation concurrently, the time for simulating a time step is effectively shortened, thereby allowing the simulation to reach a longer timescale for better scientific discovery.
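As an illustration of this kind of partitioning, the sketch below distributes the per-step force evaluation over MPI ranks using mpi4py, with each rank owning a contiguous block of atoms. This is a simplified replicated-data decomposition written for exposition only, not the neighbor-list and domain-decomposition machinery of production engines such as GROMACS or NAMD; mpi4py and NumPy are assumed to be available, and the rank count is assumed to divide the atom count evenly.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Every rank holds the full (replicated) coordinate array; production codes
# exchange only neighboring subdomains, which scales far better.
n_atoms = 4096                      # assumed divisible by the number of ranks
rng = np.random.default_rng(42)
pos = rng.random((n_atoms, 3)) * 50.0
comm.Bcast(pos, root=0)             # ensure identical coordinates everywhere

# Each rank is responsible for a contiguous block ("slab") of atom indices.
lo = rank * n_atoms // size
hi = (rank + 1) * n_atoms // size

def partial_forces(pos, lo, hi, eps=1.0, sigma=1.0):
    """Forces on atoms [lo, hi) due to all atoms: O(N^2 / P) work per rank."""
    f = np.zeros((hi - lo, 3))
    for i in range(lo, hi):
        r = pos[i] - pos                      # vectors to every other atom
        d2 = np.einsum("ij,ij->i", r, r)
        d2[i] = np.inf                        # skip self-interaction
        s2 = sigma**2 / d2
        f[i - lo] = np.sum((24 * eps * (2 * s2**6 - s2**3) / d2)[:, None] * r,
                           axis=0)
    return f

local = partial_forces(pos, lo, hi)
forces = np.empty_like(pos)
comm.Allgather(local, forces)       # assemble the full force array on every rank
if rank == 0:
    print("total force (should be ~0):", forces.sum(axis=0))
```

Launched with, e.g., `mpirun -n 4 python force_decomposition.py`, each rank performs roughly 1/P of the pairwise work per time step, which is where the shortening of the per-step time comes from.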
From the software perspective, the efficiency of parallelism is limited by the sequential portion of the program's execution, i.e., dependent calculations that must be performed in sequence (formalized in Amdahl's law [7]). Therefore, considerable effort has gone into the algorithmic computations of the simulation to optimize the level of software parallelism and achieve nearly linear speedup. MD packages, e.g., GROMACS [1], NAMD [83], CHARMM [17], and OpenMM [41], have been developed to ease the setup of parallel MD simulations and harness the full potential of HPC architectures. These packages provide various mechanisms to efficiently map and execute an MD simulation on different HPC architectures.

Storage Systems. For HPC systems, the concern for computing capability regularly goes along with the need to store computed results for later access and processing. This demand is met by the storage hierarchies of the system, which feature multiple tiers with a trade-off between latency and capacity. These storage tiers have different energy and monetary costs for reads and writes: a storage device offering lower latency tends to have a smaller capacity but incurs higher energy costs. Therefore, selecting the tier of the memory hierarchy that matches a particular storage purpose is crucial. Hot data, i.e., data currently in use or immediately needed for the next round of computation, are commonly kept in main memory, which is backed by dynamic random-access memory (DRAM) for volatile data. Unfortunately, these fast-access technologies are expensive due to the power needed to maintain them, which constrains the amount of data that can be kept in the main memory of a compute node. In a distributed memory system, each processor of a compute node is equipped with its own local memory, and memory access across processors is implemented over a high-speed interconnect. To further speed up the retrieval of data from main memory, modern CPUs provide multiple levels of caches based on static random-access memory (SRAM) to keep frequently accessed data, so that subsequent accesses can be served without reloading from memory. The last-level cache (LLC) is commonly shared between the cores of a multi-core CPU, while the lower-level caches (L1, L2) are private to each core. Since SRAM is more expensive than DRAM, cache sizes are usually limited to megabytes per CPU, and efficient cache management is therefore crucial to the performance of applications running on HPC systems. On the other hand, cold data intended for long-term retention is saved to higher-latency, persistent storage media. On the software side, the storage systems of most of today's HPC systems are managed by parallel file systems to enable storing and accessing data by parallel applications. Parallel file systems, e.g., Lustre and Spectrum Scale, are designed as an interface providing high-performance access through distributed storage nodes. Unfortunately, storage technologies advance at a slower pace than the computing capability of modern processors [64]. The slowly improving I/O bandwidth observed in recent HPC systems [48] creates a performance bottleneck that prevents many data-intensive applications from reading or writing large amounts of data from/to the storage system.
From the hardware perspective, the aforementioned stall in disk improvements has motivated new memory technologies that provide faster tiers in deep memory hierarchies to compensate for the I/O bottleneck incurred at other, higher-latency layers. Several new high-performance memory systems that reside closer to the computation units have been developed to improve data access performance (e.g., burst buffers [45], high-bandwidth memory (HBM) [103, 80], and non-volatile memory (NVM) [44]). These new manufacturing techniques create opportunities to overcome the expensive cost of I/O and offer higher I/O throughput thanks to the low-latency access capability of these advanced storage technologies. The emergence of these memory technologies enables processing-in-memory or in-memory computing [55] to mitigate the cost of data movement. Furthermore, alternative approaches focusing on new data processing models have recently come to light to overcome the barrier caused by the mismatch between the ability to compute data and the ability to store it.

1.1.3 In Situ Analytics
Recently, advances in science have increasingly been built on the use of HPC systems to run simulations and on the analysis of the computed simulation data. The MD simulation output is generated as trajectories, three-dimensional atom coordinates that describe the atomic-level state of the system at every simulation time step. The analysis of these trajectories generates knowledge of the simulated systems through the observation of rare events and state transformations, such as ligand binding, protein folding, and membrane transport, and can drive the simulation toward more promising regions of the simulation space. To obtain these outcomes, analysis tasks need to be integrated into the simulation pipeline. Traditionally, the frames, i.e., snapshots of atomic positions saved at fixed time intervals during the simulated period, produced by the simulation are stored in persistent storage, and visualization or analytics are performed post hoc, which is called post-processing (see Figure 1.2a). Such pipelines require considerable computing time and storage resources, which grow rapidly with the size and the timescale of the system.

Figure 1.2: Processing paradigms for the simulation-analysis pipeline: (a) post-processing; (b) in situ.

The improvements in both the computing capacity of computer hardware and MD software methods permit present-day simulations to attain ever larger system sizes and to lengthen achievable timescales. The longer the timescales attained by the simulation, or the more extensive the simulated systems under study, the larger the trajectory data set that needs to be stored for later analysis. Moreover, identifying the region of interest in advance, without prior knowledge of the simulated system, is challenging. Maximizing the useful insights extracted from the simulations demands observing and analyzing all available generated data. Even though in some cases scientists are interested only in a subset of the most relevant events, a typical MD simulation still requires tracking the trajectories of the entire molecular system over the full simulated time scale. The total output data footprint is therefore considerable. For example, trajectories produced over a duration of a microsecond to a millisecond potentially consist of between 10^5 and 10^8 frames, which occupy between 30 GB and 30 TB of disk space [22].
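A back-of-the-envelope calculation, with all inputs chosen purely for illustration so that the totals land in the range quoted above, shows how quickly trajectory output collides with the bandwidth figures discussed in the next paragraph:

```python
# Rough trajectory-footprint estimate; every input below is an assumed value.
n_atoms = 25_000                       # modest solvated molecular system
frame_bytes = n_atoms * 3 * 4          # x, y, z in single precision ~ 300 KB/frame
n_frames = 10**8                       # frames kept over a ~millisecond trajectory

trajectory_bytes = frame_bytes * n_frames
print(f"trajectory size: {trajectory_bytes / 1e12:.0f} TB")        # ~30 TB

# Writing it out: a file-system peak of a few TB/s is shared by all running
# jobs, so a single job often sustains only a few GB/s in practice.
for bw in (2.5e12, 1e10):              # bytes/s: system peak vs. a per-job share
    print(f"write time at {bw / 1e9:.0f} GB/s: "
          f"{trajectory_bytes / bw / 60:.1f} min")
```

At an assumed per-job share of 10 GB/s, flushing this single trajectory already costs on the order of an hour of pure I/O, which is exactly the pressure that motivates analyzing the data before it ever reaches the file system.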
Due to the slow growth of the disk write rate compared to computational growth, contemporary leadership computers mark a stagnation of file system throughput [48]. For example, Summit's parallel file system provides a peak write speed of 2.5 TB/s [2], while Frontera is provisioned with 2 TB/s of I/O bandwidth [95]. Hence, outputting such a large amount of trajectory data to the file system becomes an I/O bottleneck for post-processing, as saving data to storage is not as fast as the rate at which data is generated. Rather than post-processing data after the simulation completes, in situ methods allow scientists to process data while the simulation is running. Since the in situ analysis interleaves its execution with the simulation, running the analysis in situ reduces the time to solution of the pipeline. This approach offers many advantages for processing large volumes of simulated data and for efficiently utilizing computing resources. Additionally, with the development of processor technologies, moving data consumes more energy than performing the computing operations on the same amount of data [101]. This trend leads to the transition from compute-centric programming models, in which data are moved to where the computation takes place, to data-centric models, which locate computing components close to where data are placed. Following data-centric models, an alternative processing paradigm has recently emerged, called in situ, in which data are analyzed in situ, i.e., as soon as they are produced, to overcome the I/O limitations of saving massive data to persistent storage (see Figure 1.2b). In situ processing often implies that the simulation and the analysis reside on the same set of compute nodes and thus share the computing resources; the in situ analysis can then retrieve data directly from the simulation's memory, bypassing intermediate files and avoiding moving files across nodes. When the simulation and the analysis are executed on different nodes, instead of transferring data organized as files, data are staged memory-to-memory using Remote Direct Memory Access (RDMA). In terms of scientific contributions, performing analyses online helps to gain insight into phenomena of the molecular system in a timely fashion, which potentially leads to better scientific discovery. By analyzing near real-time data produced by the simulation, users can considerably reduce the final amount of data generated by the simulation by choosing on the fly which data are most essential to keep [48]. Furthermore, knowledge extracted from in situ data analytics can be used to enable on-the-fly tuning or runtime steering of the MD simulations (i.e., stop, start, restart, and fork MD jobs). In situ methods are commonly applied to analyzing [70, 29, 71] or visualizing [42, 52, 61] simulation data.

Figure 1.3: Scheduling placements of in situ processing: (a) co-scheduling (helper-core); (b) in transit.

Depending on how the in situ components are mapped onto the underlying computing resources, an in situ coupling between the simulation and the analysis falls into one of two scenarios: co-scheduling (helper-core) (Figure 1.3a) and in transit (Figure 1.3b). In the co-scheduling scenario, the simulation and the analysis share the same set of resources, so the analysis can extract data locally from the simulation, reducing data transfer overhead across compute nodes. The co-scheduling approach dedicates a set of cores in the shared compute nodes, called helper cores, to the analysis.
The simulation therefore does not use all the cores of the node, and the analysis competes with the simulation for shared resources, e.g., memory and network. In practice, the simulation may not fully exploit the whole compute node, so the co-scheduling method is often favored because it offers the advantage of data locality between the simulations and the analyses. On the other hand, the in transit placement uses dedicated computing resources for the in situ analysis, which incurs no contention between the simulation and the analysis but requires additional nodes for the analysis. To fully benefit from in situ coupling, the simulation and the analysis components have to be effectively managed to deal with performance interference between in situ components running on the same resources, so that they do not slow each other down.

1.1.4 Workflows and Workflow Ensembles

1.1.4.1 In situ Workflows
Traditionally, a workflow is represented as a directed acyclic graph (DAG), where the computational tasks are the nodes and the edges represent data dependencies between those tasks. Most current workflow management systems (WMSs) have adopted a loosely-coupled approach, in which tasks communicate through files: a task writes output files that are then used as input files by another task. However, this model does not facilitate inter-task communication during execution, which is at the heart of tightly-coupled approaches [107]. Most well-established WMSs were designed before the emergence of tightly-coupled workflows [66, 47]. Bringing tightly-coupled workflow support into current and emerging WMSs is, however, one key element on the road to enabling computational science at extreme scale [101, 47]. In addition, a tightly-coupled approach may provide periodic feedback to the user about the execution of tasks and the status of the simulation. Based on this feedback, the user might want to update the workflow; this behavior introduces cycles in the workflow [47]. Note that some traditional DAG representations have been extended with the concept of a bundle, representing a group of several tasks that need to be scheduled concurrently and allowing inter-task communication at runtime [26]. An in situ workflow describes a scientific workflow with multiple components (simulations with different parameters, visualization, analytics, etc.) running concurrently [101, 21], potentially coordinating their executions on the same allocated resources to minimize the cost of data movement [81]. These components periodically exchange data as the run progresses: data periodically produced by the main simulation are processed, analyzed, and visualized in situ at runtime rather than post-processed on dedicated resources. The main simulation periodically sends data, for example every k steps, to one or several analysis kernels running concurrently.

1.1.4.2 Workflow Ensembles
Advancements in computing power have increased the level of parallelism in scientific simulations, which not only speeds up the simulation itself but also opens an opportunity to run multiple simulations in parallel, a.k.a. ensembles. Instead of simulating one long MD trajectory, each job in an ensemble simulates the same molecular system starting from different initial conditions (e.g., positions, velocities) or similar systems under different conditions (e.g., temperature, protein mutants, drug variants). Ensemble-based simulation approaches, in which multiple simulations are run concurrently, potentially lead to more efficient sampling of the solution space.
For instance, multiple-walker methods [23, 86] employ multiple replicas of the system, known as walkers, where each walker simultaneously explores the same free energy landscape to improve sampling performance. Generalized ensembles [20, 76] allow sampling a broader configuration space by partitioning simulation states into ensembles with optimal weights to perform a random walk in potential energy space. The key challenge in enabling these approaches on large-scale systems is to efficiently execute these concurrent simulations structured as a single entity, an ensemble.
Figure 1.4: Three levels of a workflow ensemble: Ensemble Component, Ensemble Member, Ensemble Workflow.
A workflow ensemble is a collection of inter-related ensemble members/workflows executing in parallel. Each ensemble member may be comprised of multiple ensemble components; a component can be a simulation or an analysis, as is the case in our MD example (Figure 1.4). Note that even though a workflow ensemble can be comprised of parallel and sequential workflows, we can always group workflows (ensemble members) running in parallel into a workflow ensemble. We focus on the set of ensemble members running concurrently and starting their executions at the same time, to mimic how multiple MD simulations are executed simultaneously in ensemble methods [23, 20]. In this work, we restrict ourselves to a single simulation per ensemble member. This simulation is coupled with at least one analysis component. An analysis coupled with a simulation forms a coupling, and thus an ensemble member may include multiple couplings, defined by the number of analyses paired with the simulation of the given ensemble member. We assume that ensemble members do not exchange information and are independent of each other (i.e., the analysis component of a given ensemble member only requires data generated by the simulation of that ensemble member [10]). The type of coupling is defined by the ensemble components. In our MD application, the simulation periodically writes out the data, which is read synchronously by the analyses. Although the simulation can compute while the analyses are reading the data, the simulation does not write any new data until the data from the previous iteration are fully consumed.
1.2 Problems and Contributions
1.2.1 Understanding Execution of In situ Workflows
In this first work, we address the challenges of moving MD simulations to next-generation supercomputers by transforming the traditionally centralized MD analysis (i.e., first generating and saving all the trajectory data to storage, and then relying on a posteriori analysis) into an in situ analysis that analyzes data as they are generated. The ability to optimize an in situ workflow depends on an in-depth understanding of the in situ design space, which includes component placements, data coupling schemes, and synchronization methods. To enable efficient execution of such complex, heterogeneous workflows involving many components, workflow systems need to embed additional services that assist execution characterization, such as performance monitoring, I/O optimization, or online data reduction [21]. In this first contribution, we target the characterization aspect of in situ workflows to understand their execution. Efficient and accurate characterization practices suited to in situ workflows are required to capture the individual behavior of multiple coupled workflow components (i.e., concurrent executions of overlapped steps with inter-component data dependencies).
This characterization information is necessary for predicting the performance, resource usage, and overall workflow behavior. These data need to be accurate in order to produce meaningful predictions [47, 94]. Specifically, by reviewing the essential components of an in situ architecture, we model a framework that allows in situ analyses to be decoupled from the simulation. The modeled framework helps to resolve the complexities of enabling an in situ execution in terms of data coupling and data compatibility. Furthermore, by analyzing in situ systems, we derive a set of key metrics to characterize the behavior of in situ workflows and verify that they are collectable by state-of-the-art monitoring tools. Thanks to the modeled framework, we empirically classify the execution patterns to demonstrate two pertinent scenarios: (a) when simulations are idle waiting in I/O because the analysis is not able to consume MD frames at the same pace as the simulation; (b) when the resources used for the analysis are idle because the simulations are not able to produce MD frames at the same pace. This work has been published in:
"Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers". By Michela Taufer, Stephen Thomas, Michael Wyatt, Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Harel Weinstein, Michel A. Cuendet, Trilce Estrada, and Ewa Deelman. In: 2019 15th International Conference on eScience (eScience). Sept. 2019, pp. 188–198.
1.2.2 Performance Evaluation of In situ Workflows
Performance monitoring has been used to characterize workload behaviors for nearly three decades. Monitoring and performance profiling tools for HPC applications are a vibrant research field resulting in many contributions over the past decade. For scientific workflows, performance data collected by monitoring tools help scientists to better understand the behaviors of complex workflows, which is needed to efficiently manage resources. These data can be used as input for theoretical performance models or scheduling algorithms, for debugging and optimization purposes, and to provide users with feedback about the execution. With the advent of in situ workflows [12], new monitoring and profiling approaches targeting tightly-coupled workflows have been studied. An in situ workflow concurrently runs several components which continuously exchange data with each other during execution; therefore, the usage of monitoring tools requires a careful analysis of the potential overhead induced by such tools, which could slow down some components of the workflow. Moreover, next-generation workflows are becoming ever more complicated, incorporating more tightly-coupled components and integrating a variety of sophisticated data layers between these components. Therefore, monitoring each component independently is not sufficient to accurately characterize the entire workflow. Most current HPC monitoring tools, such as TAU [91], HPCToolkit [3], or CrayPat [28], aim to capture the performance profile of standalone applications, and thus their design is inadequate for in situ workflows. Moreover, in situ evaluation is challenging due to the lack of thorough guidelines for using performance data obtained from these monitoring tools to extract meaningful insights on in situ workflows. Only a few studies [21, 56] have addressed this challenging problem. In this contribution, we focus on how to use profiling data to evaluate in situ workflows.
Through the use of a theoretical framework that formalizes in situ workflow executions based on their iterative patterns, we develop a lightweight approach that is beneficial when comparing the efficiency of configuration variations in an in situ system. The aim is to provide a lightweight approach that is able to run concurrently with the in situ workflow and possibly enable its adaptation at runtime. This work has been published in:
"A lightweight method for evaluating in situ workflow efficiency". By Tu Mai Anh Do, Loïc Pottier, Silvina Caíno-Lores, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. In: Journal of Computational Science 48 (2021), p. 101259.
"A Novel Metric to Evaluate In Situ Workflows". By Tu Mai Anh Do, Loïc Pottier, Stephen Thomas, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. In: Computational Science – ICCS 2020. Cham: Springer International Publishing, 2020, pp. 538–553.
1.2.3 From In situ Workflows to Workflow Ensembles
Scientific breakthroughs in biomolecular methods and improvements in hardware technology have shifted the focus from a single long-running simulation to a large set of shorter simulations running simultaneously, called an ensemble. Since a workflow ensemble is multi-hierarchical, as described in Section 1.1.4, evaluating each component exclusively does not guarantee a thorough understanding of the workflow ensemble performance. Performance data at the component level yield insights into the characteristics of individual components, but fail to capture the overall workflow ensemble behavior. For example, analyses are commonly more memory-intensive than simulations, which leads to an increased cache miss ratio or higher memory interference. As a result, resource contention may arise when co-scheduling memory-intensive analyses, thereby not only increasing the execution time of these analysis components, but also increasing the ensemble member makespan (recall that the simulation and analyses execute synchronously and simultaneously). Consequently, the overall workflow ensemble makespan may be harmed by slow ensemble members. Therefore, identifying stragglers among the members requires diligently inspecting and relating independent measurements at multiple levels, i.e., ensemble components and ensemble members, to draw conclusions about the workflow ensemble performance. These arguments identify a need to develop a method that captures the performance within a workflow ensemble at multiple levels of granularity. To this end, for the third contribution, we extend and consolidate the notion of computational efficiency introduced in the previous contribution into novel performance indicators that allow us to assess the expected efficiency of a workflow ensemble from multiple resource perspectives: resource usage, resource allocation, and resource provisioning. This work has been published in:
"Performance assessment of ensembles of in situ workflows under resource constraints". By Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. In: Concurrency and Computation: Practice and Experience (June 2022), e7111.
"Assessing Resource Provisioning and Allocation of Ensembles of In Situ Workflows". By Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. In: 50th International Conference on Parallel Processing Workshop (ICPP Workshops '21).
Lemont, IL, USA: Association for Computing Machinery, 2021.
1.2.4 Co-scheduling and Resource Management for In situ Workflows
Designing scheduling algorithms for in situ workflows such that available resources are efficiently utilized is challenging, as optimizing individual workflow components does not ensure that the end-to-end performance of the workflow is optimal. The more components the workflow has, the larger the multi-parametric space that needs to be explored. Specifically, workflow ensembles consist of many simulations and in situ analyses running concurrently, thus the combination of their data coupling behaviors significantly increases the complexity of the scheduling decision. Few efforts have attempted to solve this problem for a single in situ workflow [70, 9, C97, 100]. The intersection of workflow ensembles and in situ processing is particularly challenging, as in situ processing itself has intricate, complex communication patterns, which are further complicated by the number of concurrent executions at scale. Co-scheduling strategies have recently been proposed to deliver better resource utilization and increase overall application throughput [16, 106, 62]. Incorporating co-scheduling into in situ workflows is promising, as co-scheduling the analyses with their associated simulation on the same computing resources improves data locality. However, the required resources may not be consistently available; thus, under resource constraints, how do we co-schedule ensembles of workflows on an HPC machine? Another major challenge is to determine resource requirements such that the resulting performance of executing them together in the workflow ensemble is maximized. Our last contribution proposes an approach to compute efficient co-scheduling strategies and optimal resource assignments for the workflow ensemble of simulations and in situ analyses. To the best of our knowledge, this last contribution is the first study of a co-scheduling problem for in situ workflows at the ensemble level. This work has been accepted for publication in:
"Co-scheduling Ensembles of In Situ Workflows". By Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Frédéric Suter, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. In: 17th Workshop on Workflows in Support of Large-Scale Science (WORKS'22). IEEE. 2022, To appear.
Chapter 2
Modeling Framework for Characterizing In situ Workflows
In situ workflows are the result of a shift in workflow design from a loosely-coupled representation to a tightly-coupled representation. In the traditional loosely-coupled representation, task dependencies are expressed via files: each task writes output files that are then used as input files for another task. In the more recent tightly-coupled approach, tasks are executed concurrently while data is exchanged iteratively between them. In situ tasks are usually scheduled on the same resources and leverage inter-task communications at runtime [26]. The behavior of in situ workflows is difficult to characterize, as their executions are usually complicated by the co-scheduling of in situ tasks running concurrently within the workflow. This chapter presents a characterization of in situ workflows' execution. Specifically, we analyze common characteristics of in situ systems and model a framework to enable in situ couplings between in situ components such as simulations and analyses. The framework considers the management of data staging and addresses data incompatibility between concurrent components (simulations, analyses) within the in situ workflow.
We also propose a set of performance metrics that need to be collected to understand in situ workflows. Based on these prerequisites, we then empirically classify the workflow's execution patterns to demonstrate three pertinent scenarios based on idle time spent waiting in I/O. Being able to detect the execution pattern of in situ workflows is helpful to better utilize the underlying resources and improve workflow performance. The rest of the chapter is organized as follows. Section 2.1 reviews the literature in the area of in situ workflow performance characterization. Section 2.2 details the proposed data management framework and in situ metrics for characterization. An in-depth classification of in situ patterns is then discussed in Section 2.3. Finally, Section 2.4 summarizes our main contributions and discusses directions for future work.
2.1 Related Work
Several general-purpose in situ libraries, such as DataSpaces [37], DIMES [108], Decaf [40], and Flexpath [25], provide in-memory data layers to couple data between the simulation and the analysis. DataSpaces [37] adopts in-memory computing to build a shared virtual space to store data staged between coupled applications. Specifically, data is kept in the memory of a centralized server, indexed under metadata, and queried through a high-level interface provided to ease data access from various components. DIMES [108] is a variation of DataSpaces [37] in which data is kept locally in the memory of the node on which the simulation is running and distributed over the network to other nodes upon request. The server only keeps metadata to direct operations on data stored in the shared memory space. This setup brings two advantages over DataSpaces: (1) increased data locality and (2) a lighter burden on the centralized server. Decaf [40] provides a message-driven dataflow middleware that supports both tight and loose coupling of in situ tasks. The middleware relies on the message passing interface (MPI) to enable the dataflow, the parallel data communication channel, between two data-coupled applications. For a better understanding of these in situ libraries, we refer interested readers to a thorough study [55] that compares these in situ middleware systems along many factors: implementation, design, performance, usability, and robustness.
2.2 In situ Workflow Architecture
In this section, we describe the architecture of an in situ workflow that underlies our study. We also define and motivate a non-exhaustive set of metrics that need to be captured for in situ workflow performance characterization.
2.2.1 General Architecture
Figure 2.1: A general in situ workflow software architecture.
We propose an in situ architecture that enables a variety of in situ placements to characterize the behavior of in situ couplings. Although we focus on a particular type of in situ workflow (composed of a simulation and data analytics), our approach is broader and applicable to a variety of in situ components. For example, in situ components could consist of an ensemble of independent simulations coupled together. The in situ workflow architecture (Figure 2.1) features three main components (a minimal sketch of how these roles fit together follows the list):
• A simulation component that performs computations and periodically generates a snapshot of scientific data.
• A data transport layer (DTL) that is responsible for efficient data transfer.
• An analyzer component that applies several analysis kernels to the data received periodically from the simulation component via the DTL.
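The sketch below is illustrative only (it is not the framework's code): a toy in-memory staging area that holds a single frame stands in for the DTL, an ingester publishes frames produced by a simulation loop, and a retriever hands them to an analysis stub. All names are assumptions made for this example.

import queue
import threading

staging_area = queue.Queue(maxsize=1)     # toy DTL: holds at most one frame

def simulation_with_ingester(n_frames):
    for step in range(n_frames):
        frame = {"step": step, "coords": [(0.0, 0.0, 0.0)]}   # toy frame
        staging_area.put(frame)           # ingester: blocks until the previous frame is read
        print("ingested frame", step)
    staging_area.put(None)                # sentinel: no more frames

def analyzer_with_retriever():
    while (frame := staging_area.get()) is not None:          # retriever: read next frame
        print("analyzed frame", frame["step"])                # analysis kernel stub

consumer = threading.Thread(target=analyzer_with_retriever)
consumer.start()
simulation_with_ingester(3)
consumer.join()

Because the staging area holds a single frame, the simulation cannot publish frame i+1 until the analyzer has consumed frame i, which mirrors the synchronous coupling used throughout this chapter.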
In general, the data transport layer (DTL) may be implemented using different technologies to enable data delivery between workflow components, e.g., the use of burst buffers for in transit processing [92], or complex memory hierarchies for in situ processing of small volumes of data and a large number of computing jobs [51]. We discuss the detailed implementation of the DTL in Section 2.2.2. On the data path "simulation-to-analyzer" depicted in Figure 2.1: (1) the ingester ingests data from a certain data source and stores them in the DTL, and (2) the retriever, in the reverse direction, gets data from the DTL to perform further operations. These two DTL plugins allow us to abstract and detach complex I/O management from the application code. This approach enables more control in terms of in situ coupling and is less intrusive than many current approaches. The ingester synchronously inputs data from the simulation by sequentially taking turns with the simulation on the same resource. The ingester is useful for attaching simple tasks that pre-process data (e.g., data reduction, data compression). The architecture allows in situ execution with various placements of the retriever. A helper-core retriever co-locates with the ingester on a subset of cores where the simulation is running; it asynchronously gets data from the DTL to perform an analysis. As the retriever is using the helper-core placement, the analysis should be lightweight to prevent simulation slowdown. An in transit retriever runs on dedicated resources (e.g., staging I/O nodes [110]), receives data from the DTL, and performs compute-intensive analysis tasks.
2.2.2 Data Management Framework
We developed a runtime system (Figure 2.2) that manages the execution of in situ couplings on a target HPC platform. We need to decouple in situ analyses from the simulation to take advantage of in transit nodes or helper-cores, so that their executions are effectively interleaved. Decoupling allows in situ analyses to better utilize idle resources and makes it more flexible to plug in different analyses. However, decoupling them requires enabling data staging to decoupled components (from the simulation to the in situ analyses) while they are running concurrently. Moreover, since the simulations and the analyses are iterative applications, we also need a mechanism to synchronize data operations between them over iterations, so that the analyses know when data is available from the simulations and, in the opposite direction, the simulations know when the data has been processed on the analysis side so that they can progress to the next iteration of data generation. One of the difficulties of decoupling the simulation is that data generated by the simulations could be in formats different from the data structures needed by the analyses, which creates an incompatibility when incorporating them together. To address these challenges, we model a framework that allows in situ analyses to be decoupled from the simulation, in which we design a data transport layer to handle the data staging and data coupling complexities discussed above. We leverage a data abstraction to solve the problem of data incompatibility. This framework includes two main components: (i) a data transport layer (DTL), and (ii) a DTL plugin. The former represents a variety of storage tiers, including in-memory staging [108], burst buffers [J46], or parallel file systems.
The latter acts as a middle layer between the ensemble components (simulations/analyses) and the underlying DTL and is responsible for data handling. The simulation uses the DTL plugin to write out data abstracted into a chunk, which is the base data representation manipulated within the entire runtime. This abstraction allows the system to be adaptable to a variety of simulations and eases the burden of developing special-purpose code to pair with diverse simulation types. The chunk also defines a unique data type standard for the analysis kernels, even though each of them may perform different computations. The DTL plugin performs data marshaling to support various DTL implementations. Specifically, the abstract chunk is serialized to a buffer of bytes, which is easy to manage for most DTLs. The DTL plugin interfaces also hide the complexities of managing different I/O staging protocols in the DTL. Ensemble components (simulations or analyses) are integrated with the DTL plugins for I/O operations from/to the DTL. The DTL plugin is usually pinpointed at the root process of the ensemble components. In this work, we target an in-memory DTL. We leverage DIMES [108] to deploy the in-memory staging area for the DTL. DIMES is an in situ implementation in which data is kept locally in the memory of the node on which the simulation is running and distributed over the network to other nodes upon request. Nevertheless, our DTL abstraction can be extended to other storage tiers, such as burst buffers or parallel file systems, via the provided interfaces. We use TAU [91] to collect runtimes, performance counters, and memory footprints.
Figure 2.2: Architectural overview of the proposed runtime system for managing workflow ensemble executions. The Data Transport Layer (DTL) represents the in-memory staging area, and the DTL plugins provide the interface between the ensemble components and the underlying DTL.
To optimize the in situ data processing, coupled components in an ensemble member are synchronized as they progress concurrently over time. For example, in an ensemble of simulations, analysis steps can only execute upon completion of the current simulation step. To that end, the coupling of data between an analysis and the corresponding simulation defines a synchronization behavior. Our runtime system follows this execution pattern.
2.2.3 In situ Metrics for Characterization
In the HPC domain, a substantial number of mature monitoring tools have been developed, most of them designed for standalone applications, but few studies have focused on how to use these monitoring tools for scientific workflows in general and in situ workflows in particular. This section studies how current HPC monitoring tools have been leveraged by scientists for in situ workflows and then further discusses how these tools support the set of proposed metrics for characterizing in situ workflows. To characterize in situ workflows, we have defined a foundational set of metrics (Table 2.1).
Table 2.1: Selected metrics for in situ workflow characterization
  Name                 Definition                           Unit
  Makespan             Total workflow execution time        s
  TimeSimulation       Total time spent in the simulation   s
  TimeAnalytics        Total time spent in the analysis     s
  TimeDTL              Total time spent in data transfers   s
  TimeSimulationIdle   Idle time during simulation          s
  TimeAnalyticsIdle    Idle time during analysis            s
As a first metric, it is natural to consider the makespan, complemented by three metrics corresponding to the time spent in each component: the simulation, the analyzer, and the DTL.
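As an illustration only (not a TAU interface), the Table 2.1 metrics can be derived from per-component start/end timestamps collected during a run; the timestamps below are made-up values in seconds.

events = {
    "simulation": (0.0, 95.0),                          # (start, end) of the simulation component
    "analytics": (2.0, 100.0),                          # (start, end) of the analysis component
    "dtl": [(10.0, 10.4), (30.0, 30.5), (60.0, 60.6)],  # individual data transfers
}

makespan = max(events["simulation"][1], events["analytics"][1]) \
    - min(events["simulation"][0], events["analytics"][0])
time_simulation = events["simulation"][1] - events["simulation"][0]
time_analytics = events["analytics"][1] - events["analytics"][0]
time_dtl = sum(end - start for start, end in events["dtl"])

print(makespan, time_simulation, time_analytics, time_dtl)

The two idle-time metrics require per-step instrumentation inside each component and are discussed next.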
The periodic pattern enacted by in situ workflows may impose data dependencies between the steps of coupled components, e.g., the analyzer may have to wait for data sent by the simulation to become available in the DTL for reading. Thus, we monitor the idle time of each individual component.
Table 2.2: State-of-the-art tools used for profiling in situ applications (columns: TAU, HPCToolkit, CrayPat, WOWMON, SOS).
  Makespan: [49]; [87]; [69]; [109]
  TimeSimulation: [109, 49, 69]; [87]; [69]; [109]
  TimeAnalytics: [109, 49, 69]; [87]; [69]; [109]
  TimeDTL: [109, 49, 84]; [87]; [109]; [84]
Most current HPC monitoring tools, such as TAU [91], HPCToolkit [3], or CrayPat [28], aim to capture the performance profile of standalone applications, and thus their design is inadequate for in situ workflows. Table 2.2 provides an overview of whether these tools can be used to capture the different metrics defined in Table 2.1. This literature search further underlines the novelty of our use of the idle time during the execution to characterize in situ workflows because, to the best of our knowledge, no study has used this approach. Moreover, in this work, we use TAU for monitoring purposes, mainly due to its versatility, as demonstrated by Table 2.2. Recent works (e.g., LDMS [4], SOS [105], and WOWMON [109]) have proposed general-purpose distributed approaches that provide global workflow performance information by aggregating data from each component. Thanks to those advanced frameworks, extracting meaningful profiling data from in situ workflows is efficient. However, in situ evaluation is still challenging due to the lack of thorough guidelines for using these data to extract meaningful insights on in situ workflows. Only a few studies [21, 56] have addressed this challenging problem. In this work, we focus on how to use profiling data to characterize in situ workflows. The motivation behind this study is to address the lack of characterization studies for in situ workflows.
2.3 Characterizing In situ Execution
2.3.1 MD Workflow: Example of Producer-consumer Patterns
Our workflow integrates MD simulations and in situ analyses following the general architecture of in situ workflows described in Section 2.2. The workflow can be depicted as an example of a producer-consumer pattern, with the MD simulation producing snapshots (x_i(t), y_i(t), z_i(t)) (i.e., frames) output at a regular number of steps (i.e., strides), and one or more analysis modules serving as the consumers. Figure 2.3 illustrates the workflow used in this study.
Simulation. Table 2.3 provides details about the simulated molecular system used in this experiment: the name of the molecular system (MD System); the number of atoms N; the number of alpha-carbon atoms N_Cα; and the estimated number of steps that the simulation can perform per second of wall clock time (TPS).
Table 2.3: Characteristics of our molecular system
  MD System    N        N_Cα    TPS
  Gltph [6]    270,088  1,266   318
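As a hedged worked example using the TPS value in Table 2.3, the wall-clock interval between consecutive frames for a given stride can be estimated as stride / TPS:

# Wall-clock time between frames for the Gltph system (Table 2.3),
# assuming the frame period is simply stride / TPS.
TPS = 318                                  # simulation steps per wall-clock second
for stride in (100, 500, 1000, 5000):
    print(stride, round(stride / TPS, 2))  # ~0.31 s, 1.57 s, 3.14 s, 15.72 s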
Thus, the rate at which frames are generated for the analysis depends on the molecular system size. For each molecular system, we select four strides that scale to system size and follow the ratio of 1 : 5 : 10 : 50. For example, for the stride values are: 100, 500, 1000 and 5000; a frame is output every 100 t, 500 t, 1000 t, and 5000 t, where t is the time step size and is computed as the inverse of the TPS. We use elements of Plumed, a plug-in software package compatible with many state-of-the-art MD packages for CV calculation [99], to capture a frame when generated by the MD code. Plumed serves as a plug-in and no changes to the MD code are needed: we engineer a Plumed function to read a frame from the MD simulation memory space and transfer it to the DataSpaces shared memory via an ingestor module. Data Transport Layer. DataSpaces [37], a memory-to-memory framework and remote direct memory access (RDMA), serves as our data transport layer (DTL). It enables the ecient and scalable data sharing (i.e., the sharing of trajectory frames) between the MD simulations and the analytics modules. It uses a client/server approach: the server is a virtual shared memory space that can be concurrently queried by multiple clients. The RDMA leverages scalable communications between the server and each client. The 27 Dataspaces shared memory is accessed by the ingestor fed by Plumed; a retriever module passes the frame to the analytics modules. Ingestor and retriever use a simple key-value representation to coordinate the data movement, where the key is the time step and value is the data. The size of the shared memory buer is xed. In this work, a Dataspaces buer size corresponding to the size of a single frame is considered. Dataspaces supports both a default setting blocking the producer from writing data from the next time step until the consumer nishes reading the current time step frame, and an asynchronous setting allowing for managing the synchronization between the producer and consumer by the user. We use both mechanisms in this study. 0 100 200 300 400 500 600 Segment Length (m) 10 0 10 1 10 2 10 3 10 4 10 5 # of Matrices ( N m ) ( N m 1) 2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Matrix Size (# of Elements) 1e6 (2m) 2 Figure 2.4: Number of distance matrices and their size as a function of the non-overlapping segments length in which a molecular system withN =1266 can be cut into. Analysis. The analytics modules used in this study arepython modules, but can easily be extended to use other modules. At a given timet an MD simulation writes a snapshot or frame of the molecular system to memory; the frame contains all atomic coordinates and complete structural information on them amino acidsk 1 ;k 2 ;:::;k m comprised in the structure of interest. Trajectory analysis commonly measures the structural changes of a frame with respect to past frames of the same trajectory, or to frames in other trajectories, without comparing the frame itself with the other frames. Structural changes we want to 28 capture are: changes within single amino acid segments and changes of two amino acid segments with respect to each other. To this end, we simplify the molecular system made ofm amino acids by extracting the positions of them-Carbon (C) backbone atoms (x i ,y i ,z i ) for 1im amino acids and using the backbone atoms to build distance matrices that, together with the matrix eigenvalues, are proxies for structural changes in the molecular system itself [57]. 
Given a frame, we build two types of distance matrices: (1) Euclidean distance matrices from the positions of the corresponding Cα atoms, D = [d_ij] with i, j = 1, ..., m and

    d_ij = sqrt((x_i − x_j)^2 + (y_i − y_j)^2 + (z_i − z_j)^2)    (2.1)

to capture changes within single amino acid segments, and (2) bipartite distance matrices B = [b_ij] to capture changes of two amino acid segments S_1 and S_2 with respect to each other. B is of size m × m, with elements defined by

    b_ij = { d_ij, if i ∈ S_1 and j ∈ S_2 (or i ∈ S_2 and j ∈ S_1)
           { 0,    otherwise.    (2.2)

The Euclidean distance matrix D and the bipartite distance matrix B have three key properties: they are symmetric, the diagonal is exactly zero, and the off-diagonal elements are non-negative. Johnston and co-authors show how the larger eigenvalues of each one of these matrices can be computed in isolation (i.e., without keeping other frames in memory) and can serve as a CV that, unlike other CVs, can identify structural changes of substructures [57]. By computing CVs on each frame, we drop the requirement to keep frames in memory as the simulation evolves. We leverage this work to build a suite of analysis scenarios with different numbers of matrices and different matrix sizes. The number of matrices and the matrix sizes depend on the system size and on the length of the non-overlapping segments into which we cut the molecular system as strings of amino acids. The scenarios range from the fine-grained study of as many substructures as possible (with segment lengths as small as two amino acids), in which we generate many small matrices, to the coarse-grained study of the entire molecular system through one single matrix whose size matches the number of amino acids in the system. Figure 2.4 shows the number of matrices (dotted line) and the matrix size (solid line) when bipartite matrices are generated. In this figure, the minimum segment length applicable to the analysis is two and the maximum segment length is N_Cα/2. In Figure 2.4, N_Cα = 1266 is used as an example. The number of matrices and the matrix size impact the cost of the analytics (i.e., the larger the number of matrices, the larger the number of eigenvalues computed) and the memory use (i.e., the larger the matrices, the larger the memory use).
2.3.2 Results
We measure and observe trends for the time spent waiting in I/O by the simulation and the idle time of the analysis under different MD workflow settings (i.e., in situ vs. in transit, different strides, and different segment lengths). We cut the molecular systems into equally sized segments; the segment lengths considered are m = [2, 13, 25, N_Cα/8, N_Cα/4, N_Cα/2], where N_Cα is the total number of Cα backbone atoms, one per amino acid, in the system. For each segment length, we build bipartite matrices for every one of the (N_Cα/m)((N_Cα/m) − 1)/2 pairs of segments. We then compute the largest eigenvalues for each of the generated bipartite matrices. Shown in Figure 2.5 are the measured time spent by the MD simulation waiting in I/O (simulation idle time) and the analysis idle time for different strides for the large molecular system. Similar outcomes are observable for the other two smaller systems. The first column of Figure 2.5 (2.5a-2.5g) shows the measurements obtained with the in situ configuration and the second column of Figure 2.5 (2.5b-2.5h) shows the in transit configuration.
In Figure 2.5, from top to bottom, the rows correspond to stride sizes of 100 (2.5a, 2.5b), 500 (2.5c, 2.5d), 1000 (2.5e, 2.5f), and 5000 (2.5g, 2.5h). The dots in the figures show the measured times (i.e., squares are the simulation idle times and circles are the analysis idle times). The error bars show the standard deviation of the measured times for five independent trajectories, each with 1000 frames. We assume a threshold, marked by the dashed horizontal line, which represents 5% of the largest idle time observed across all the tests. Specifically, when the analysis idle time is above the defined threshold and larger than the simulation idle time, we observe an execution where the analysis processing is faster than the production of MD frames, thereby imposing idle time on the analysis as it waits for the simulation to generate the frames. In contrast, when the simulation idle time is above the threshold and larger than the analysis idle time, the simulation maintains a fast production of frames while the analysis cannot process them at the same rate. MD frames are analyzed as soon as they are produced when both the simulation idle time and the analysis idle time are below the threshold. These three execution scenarios are shown by the blue, yellow, and green shaded regions in the figure. We do not observe a significant difference in the measured idle times, for both simulation and analysis, between the in situ and in transit configurations. This is due to low communication overhead: the two nodes in the in transit configuration are located in close proximity to each other, and the size of the frames being transferred is small (about 7 MB). As we move from top to bottom (Figure 2.5a to 2.5g), the simulation idle time, dominant for a stride of 100, becomes small with increasing stride size (with a reduction of 85% from 100 to 5000 for a segment length of 2). On the other hand, the analysis idle time is essentially negligible for small strides (the resources are fully utilized) and increases to become the dominant component when the stride increases to 5000 (with an increase of 99.99% from 100 to 5000 for a segment length of N_Cα/2). We observe a region in which the producer-consumer pattern is balanced. Such a region is predominant for small strides and shrinks as the stride increases.
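A minimal sketch of this classification (our code; the idle times and the largest observed idle time are made-up inputs) is:

def classify(sim_idle, ana_idle, max_observed_idle, fraction=0.05):
    """Classify a run using a threshold of 5% of the largest observed idle time."""
    threshold = fraction * max_observed_idle
    if ana_idle > threshold and ana_idle > sim_idle:
        return "analysis idles: frames are produced slower than they are analyzed"
    if sim_idle > threshold and sim_idle > ana_idle:
        return "simulation idles: frames are produced faster than they are analyzed"
    return "balanced: frames are analyzed as soon as they are produced"

print(classify(sim_idle=0.1, ana_idle=12.0, max_observed_idle=18.4))
print(classify(sim_idle=9.5, ana_idle=0.2, max_observed_idle=18.4))
print(classify(sim_idle=0.3, ana_idle=0.4, max_observed_idle=18.4))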
Figure 2.5: The observed analysis idle times and simulation idle times per frame as a function of the segment length, for the in situ configurations (left column) and the in transit configurations (right column): (a) in situ, stride 100; (b) in transit, 100; (c) in situ, 500; (d) in transit, 500; (e) in situ, 1000; (f) in transit, 1000; (g) in situ, 5000; (h) in transit, 5000. The dashed line marks the threshold of 5% of the largest observed idle time (0.92 s).
2.3.3 Execution Patterns
According to the behavior observed in Section 2.3.2, we classify our workflows in terms of their execution patterns into: (a) workflows with fast production of MD frames and slow analytic processing of the frames by the analytic modules; (b) workflows with slow production of MD frames and fast analytic processing of the frames by the analytic modules; and (c) balanced workflows, in which MD frames are analyzed as they are produced.
Figure 2.6: Execution patterns of in situ workflows. (a) Execution pattern associated with fast MD simulations and slow analytics, with S1, S2, and S3 referring to MD simulation times; W1, W2, and W3 to write times to DataSpaces; R1, R2, and R3 to read times from DataSpaces; A1, A2, and A3 to analytics times; and IS to simulation idle times. (b) Execution pattern associated with slow MD simulations and fast analytics, with S1, S2, and S3 referring to MD simulation times; W1 and W2 to write times to DataSpaces; R1 and R2 to read times from DataSpaces; A1 and A2 to analytics times; and IA to analysis idle times spent waiting for the MD simulation.
In the first type of workflow, because the analysis is not able to consume the frames in a timely manner, the MD simulation waits in I/O to write new frames to the in-memory staging area of DataSpaces (idle simulation time, or IS, in Figure 2.6a). This scenario negatively impacts the amount of science delivered by the MD workflow: a lower number of frames is produced and analyzed over the total simulation time. In the second type of workflow, because the MD simulation generates a new frame only after a large stride (taking longer than the time needed to analyze a frame), the analytics is waiting in I/O and the associated resources are idle (and thus may be available for other analyses or for the MD simulation itself). Figure 2.6b shows this scenario. We characterize this scenario by the analysis idle time, or IA. While there is no direct negative impact on the science delivered, this scenario can result in under-utilized resources.
2.4 Conclusion
This chapter presents the characterization of in situ and in transit analysis of MD simulations for supercomputers using our proposed metrics.
The characterization of workflows resulted in identifying regions in the parameter space with underutilized computational resources and wasted compute cycles, including cases where the simulation is idle waiting in I/O because the analysis is not able to consume MD frames at the same pace that the simulation produces them, and cases where the analysis sits idle due to the slow rate of frame generation by the simulation. Future work includes the study of our modeling for a diverse set of MD trajectories and the study of the impact of more complex analyses on a diverse set of molecular systems.
Chapter 3
Computational Efficiency Model of In situ Execution
In this chapter, we generalize the understanding of in situ behaviors into a computational efficiency model that is beneficial for assessing the performance of in situ workflows. The intuition is that, through the capability of in situ evaluation provided by the proposed efficiency model, we are able to determine how far the observed execution is from the optimal performance, and we can then use this information to refine the configuration of the in situ workflows toward the optimal point. Specifically, based on the characterization obtained in Chapter 2, we propose a framework to formalize in situ workflow executions based on their iterative patterns. Under the framework's constraints, we further develop a lightweight approach to evaluate the performance of in situ workflows that allows us to compare execution efficiency across different configuration variations of an in situ system. Note that all findings discussed in this chapter are for a single-coupling in situ workflow, i.e., a simulation couples data with one or more analyses using in situ processing. We extend the work to multiple in situ couplings and generalize the approach to workflow ensembles in Chapter 4. The rest of the chapter is organized as follows. Section 3.1 reviews the related work on performance evaluation of in situ workflows. Section 3.3 details the framework for in situ executions and expands it to define the notion of computational efficiency. Section 3.4 provides insights into the behaviors of in situ workflows by applying the proposed metric to characterize an MD synthetic workflow. In Section 3.5, we further examine the significance of in situ placements on coupling performance by employing the proposed efficiency metric in a practical workflow using a real high-performance MD application. Finally, Section 3.6 summarizes our main contributions and discusses directions for future work.
3.1 Related Work
Monitoring and performance profiling tools for HPC applications are a vibrant research field resulting in many contributions. Many such tools have been developed over the past decade, such as TAU [91], HPCToolkit [3], or CrayPat [28], to understand the execution of these applications. Information gathered by such tools, either at runtime or upon application completion, can be used as input for theoretical performance models or scheduling algorithms, for debugging and optimization purposes, and to provide users with feedback about the execution. In the context of workflow systems, monitoring capabilities are essential to map jobs onto resources and optimize the execution of a given workflow [58, 21] on specific HPC systems. With the advent of in situ workflows [12], new monitoring and profiling approaches targeting tightly-coupled workflows have been studied.
LDMS [4] is a loosely-integrated scalable monitoring infrastructure that targets general large-scale applications and delivers a low-overhead distributed solution, in contrast to TAU [91], which provides a deeper understanding of the application at a higher computational cost. SOS [105] provides a distributed monitoring platform, conceptually similar to LDMS but specifically designed for online in situ characterization of HPC applications. An SOS daemon running on each compute node intercepts events and registers them into a database; the monitored application may fetch data from that database to get feedback. Note that SOS relies on TAU [91] monitoring capabilities. TAU, in association with ML techniques, has been used to tune the parameters of in situ simulations and optimize the execution at runtime [109]. ADIOS [67], the next-generation I/O stack, is built on top of many in situ data transport layers, e.g., DataSpaces [37] or DIMES [108]. Savannah [48], a workflow orchestrator, has been leveraged to bundle a coupled workflow comprising two main simulations, multiple analysis kernels, and a visualization service [21]. The performance and monitoring service was provided via SOS and the I/O middleware via ADIOS. These works focus mainly on providing monitoring schemes for in situ workflows. Here we propose, instead, a novel method to extract useful knowledge from the captured performance data.
3.2 In Situ Execution Model
In this section, we propose a novel method to estimate and characterize in situ workflow behaviors from collected performance data. To this end, we develop a theoretical framework of in situ executions. In this study, we consider a dedicated failure-free platform without any interference (caches, I/O, network, etc.) from other applications.
3.2.1 Framework
In traditional workflows, the simulation and the post-processing analyzer are the typical components, in which the post-processing follows the simulation in a sequential manner. Let a stage be a part of a given component. We can identify two main stages for each component (see Figure 3.1):
• Simulation component: (S) is the computational stage that produces the scientific data, and (W) is the I/O (or DTL) stage that writes the produced data.
• Analytics component: (R) is the DTL stage that reads the data previously written, and (A) is the analysis stage.
Figure 3.1: Classic workflow design featuring two components (simulation and analytics) and four stages (S, W, R, A).
However, in situ workflows exhibit a periodic behavior: S, W, R, and A follow the same sequential order but, instead of operating on all the data, they operate only on a subset of it iteratively. Here, this subset of data is called a frame and can be seen as a snapshot of the simulation at a given time t. Let S_i, W_i, R_i, and A_i be, respectively, the simulation, the write, the read, and the analysis stage at step i. In other words, S_i produces frame i, W_i writes the produced frame i into a buffer, R_i reads frame i, and A_i analyzes frame i. Note that an actual simulation computes for a given number of steps, but only a subset of these steps are output as frames and analyzed [C97]. The frequency of simulation steps to be analyzed is defined by the stride. Let n be the total number of simulation steps, s the stride, and m the number of steps actually analyzed and output as frames. We have m = n/s.
However, we model the in situ workflow itself and not only the simulation time (i.e., the execution times of S_i, W_i, R_i, and A_i are known beforehand), thus the value of the stride does not impact our model. We set s = 1 (or m = n), so every step always analyzes a frame.
3.2.2 Execution Constraints
To ensure work conservation we define the following constraint: sum_{i=0}^{m} S_i = S (m is the number of produced frames). Obviously, we have identical constraints for R_i, A_i, and W_i. Similarly to the classic approach, we have the following precedence constraints for all i with 0 ≤ i ≤ m:

    S_i → W_i → R_i → A_i.    (3.1)

Figure 3.2: Dependency constraints (execution constraints and buffer constraints) within and across in situ steps with n = m = 3.
3.2.3 Buffer Constraints
The pipeline design of in situ workflows introduces new constraints. We consider that a frame is analyzed right after it has been computed. This implies that, for any given step i, the stage W_{i+1} can start if, and only if, the stage R_i has been completed. Formally, for all i with 0 ≤ i ≤ m:

    R_i → W_{i+1}.    (3.2)

Equation (3.1) and Equation (3.2) guarantee that we buffer at most one frame at a time (Figure 3.3). Note that this constraint can be relaxed such that up to k frames can be buffered at the same time, as follows: R_i → W_{i+k}, where 0 ≤ i ≤ m and 1 ≤ k ≤ m. In this work, we only consider the case k = 1 (red arrows in Figure 3.2).
3.2.4 Idle Stages
Due to the above constraints, the different stages are tightly coupled (i.e., the R_i and A_i stages must wait for S_i and W_i before starting their executions). Therefore, idle periods can arise during the execution (i.e., either the simulation or the analytics must wait for the other component). We can characterize two different scenarios, Idle Simulation and Idle Analyzer, in which idle time occurs. The former (Figure 3.3a) occurs when analyzing a frame takes longer to complete than a simulation cycle (i.e., S_i + W_i < R_i + A_i). The latter (Figure 3.3b) occurs when the simulation component takes longer to execute (i.e., S_i + W_i > R_i + A_i).
Figure 3.3: Two different execution scenarios for in situ workflow execution: (a) the Idle Simulation scenario, when S_i + W_i < R_i + A_i; (b) the Idle Analyzer scenario, when S_i + W_i > R_i + A_i.
Figure 3.2 provides a detailed overview of the dependencies among the different stages. Note that the concept of an in situ step is defined and explained later in this chapter. Intuitively, we want to minimize the idle time on both sides. If the idle time is absent, then it means that we reach the idle-free scenario: S_i + W_i = R_i + A_i. To ease the characterization of these idle periods, we introduce two idle stages, one per component. Let I_{S,i} and I_{A,i} be, respectively, the idle time occurring in the simulation and in the analysis component for step i. These two stages represent the idle time in both components; therefore, the precedence constraint defined in Equation (3.1) becomes:

    S_i → I_{S,i} → W_i → R_i → A_i → I_{A,i}.    (3.3)

In our framework, every simulation step is now divided into three fine-grained stages: a simulation stage S, an idle stage I_S, and a writing stage W, in that order, i.e., S occurs before I_S, and I_S happens before W. The simulation performs the computation during S, waits during I_S until the data can be staged, and then sends data to the analysis during W. Similarly, every analysis step is comprised of a reading stage R, an analyzing stage A, and an idle stage I_A, executed in that order.
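As a toy numeric illustration of these two scenarios (our code, with made-up per-step durations in seconds), the slower side of the coupling sets the pace of a step and the other side idles by the difference:

def idle_per_step(S, W, R, A):
    producer = S + W                       # simulation + write stages
    consumer = R + A                       # read + analysis stages
    I_S = max(0.0, consumer - producer)    # simulation idles when the analysis is slower
    I_A = max(0.0, producer - consumer)    # analysis idles when the simulation is slower
    return I_S, I_A

print(idle_per_step(S=3.0, W=1.0, R=0.5, A=8.0))    # (4.5, 0.0) -> Idle Simulation
print(idle_per_step(S=10.0, W=1.0, R=0.5, A=4.0))   # (0.0, 6.5) -> Idle Analyzer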
The analysis reads data sent by the simulation during R, performs certain analyses during A, and then waits until the next chunk of data is available for processing during I_A. These fine-grained stages can be organized into three sub-groups: computational stages (S, A), I/O stages (W, R), and idle stages (I_S, I_A). The synchronous communication pattern discussed in Sections 3.2.2 and 3.2.3 enforces coordination among the I/O stages such that W_i of step i occurs before R_i, and R_i happens before W_{i+1} of the next iteration (Figures 3.3a and 3.3b), so that the simulation does not overwrite data that have not been read yet.
3.2.5 Consistency Across Steps
This work is supported by the hypothesis that every execution of an in situ workflow under the above constraints reaches a consistent state after a finite number of warm-up steps. Thus, the time spent on each stage within an iteration can be considered constant over iterations. Formally, there exists j, where 0 ≤ j < m, such that for all i where j ≤ i ≤ m, we have S_i = S_j. The same holds for each stage W_i, R_i, A_i, I_{S,i}, and I_{A,i}. This hypothesis is confirmed in Sections 3.4 and 3.5, and in practice we observe that the cost of these non-consistent steps is negligible. Our experiments showed that, on average, j ≈ 3 for one hundred steps (m = 100). Therefore, we ignore the warm-up steps and consider j = 0. For the sake of simplicity, we generalize the in situ consistency behavior by denoting S* = S_i for all i ≥ j, with similar notations for R*, A*, I_S*, I_A*, and W*. As a result, rather than considering a particular step i for a given stage (e.g., W_i), we use a star symbol to denote steady-state stages. Then, S*, I_S*, W*, R*, A*, and I_A* denote the steady-state stages of S, I_S, W, R, A, and I_A, respectively. This hypothesis allows us to predict the performance of a given in situ workflow by monitoring a subset of steps, instead of the whole workflow. From the two constraints defined by Equation (3.3) and Equation (3.2), and our hypothesis, we define the following property:

    S* + I_S* + W* = R* + A* + I_A*.    (3.4)

The Idle Simulation scenario is when I_A* = 0, and I_S* = 0 for the Idle Analyzer scenario.
3.3 Efficiency Model
The challenge behind in situ workflow evaluation lies in collecting global information from multiple components (in our case, the simulation and the analytics) and using this information to derive meaningful characteristics about the execution. The complexity of such a task is correlated with the number of steps the workflow runs and the number of components involved. By leveraging the consistency hypothesis of Section 3.2.5, we propose to alleviate this cost with a metric that does not require data from all steps. The keystone of our approach is the concept of the in situ step.
3.3.1 In Situ Step
A given in situ workflow is composed of a single simulation Sim coupled with K analyses Ana_1, Ana_2, ..., Ana_K. An in situ workflow with one simulation and K analyses has K different couplings {(Sim, Ana_1), ..., (Sim, Ana_K)}, shortened in this work as (Sim, Ana_i) with 1 ≤ i ≤ K. As shown in Section 3.2.4, each of these couplings can be categorized as either an Idle Simulation or an Idle Analyzer scenario. For example, in Figure 3.4, the coupling of the simulation and analysis 1 falls into the Idle Simulation scenario, while the simulation and analysis 2 are paired under the Idle Analyzer scenario. An in situ step is defined as the duration between the beginning of the stage S in the simulation and the end of the stage I_A that finishes last among the K analyses. The in situ step concept helps us to manipulate
The in situ step concept helps us to manipulate 42 Figure 3.4: Example of ne-grained execution steps for a member of one ensemble. (Idle simulation and analyzer represent coupled simulation-analysis scenarios.) all the stages of a given step as one consistent task executing across components that can potentially run on dierent machines. Dierent in situ steps overlap each other due to concurrent executions, so we need to distinguish the part that is overlapped ( 0 i ) from the other part ( i ). Thus, i = i + 0 i . For example, in Figure 3.4, to compute the time elapsed between the start of 4 and the end of 5 , we need to sum the two steps and remove the overlapped execution time 0 4 . Thus, we obtain 4 + 5 0 4 . This simple example will give us the intuition behind the makespan computation in Section 3.3.2. Intuitively, the non-overlapped segment of a given in situ step is the section between two consecutive simulation stagesS (recall an in situ step starts with the stageS). According the property dened in Equation (3.4), the non-overlapped period is the aggregation of all stages belonging to one single component in an in situ step: =S +I S +W =R i +A i +I A i ;8i2f1; 2;:::;Kg (3.5) There are two possible scenarios: (i) the simulation and the write stage run longer (Idle Analyzer scenario), then the non-overlapped segment is equals toS +W ; or (ii) one of theK analysis,Ana i , has 43 the longest runtime (Idle Simulation scenario) then, the non-overlapped step is equals toR i +A i . Hence, the non-overlapped step can be calculated as = max(S +W ;R 1 +A 1 ;:::;R K +A K ): (3.6) 3.3.2 MakespanEstimation A rough estimation of the makespan of such workow would be the sum of the execution time for all the stages (i.e, sum up them in situ steps i ). But, recall that in situ steps interleave with each other, so we need to subtract the overlapped parts: Makespan =m 0 (m 1) =m + 0 : (3.7) From Equation (3.7), form large enough, the term 0 becomes negligible. Since in situ workows are executed with a large number of iterations, then Makespan = m (recall that = + 0 ). The makespan, therefore, is capable of being reproducible from the non-overlapped in situ step. This observation indicates that the non-overlapped part of an in situ step is enough to characterize a periodic in situ workow. From Equation (3.5), minimizing the makespan is equivalent to reducing the idle time to zero, which conrms the smallest makespan occurs at the equilibrium point (i.e., when the idle time is equal to zero). Therefore, the in situ execution is necessary to be driven to the equilibrium point. Using our framework and these observations, we further dene a metric to estimate the eciency of an in situ execution 3.3.3 ComputationalEciency To characterize the execution of an in situ run, in this section, we propose an indicator to capture the eciency of the execution of an in situ workow from a computational standpoint, where we want to 44 minimize the idle time, and as a result increase resource usage. To compute the idle time per in situ step, we use Equation (3.6) to derive the duration of the idle stage on the simulation component: I S = (S +W ): (3.8) The duration of the idle stage for the analysisi as I A i = (A i +R i ): (3.9) For each coupling (Sim;Ana i ), the portion of eective computation, i.e. not sitting idle, of an actual in situ step is dened as (I S +I A i ). 
Since the computational efficiency of an ensemble member depends on the amount of time the ensemble components are idle, we compute a computational efficiency E as the average fraction of effective computation over the actual in situ step of the K couplings in the ensemble member:

    E = \frac{1}{K} \sum_{i=1}^{K} \left( 1 - \frac{I_S^* + I_{A_i}^*}{\bar{\sigma}^*} \right) = \frac{S^* + W^* + \frac{1}{K}\sum_{i=1}^{K} (A_i^* + R_i^*)}{\bar{\sigma}^*} - 1.    (3.10)

Since this indicator is derived from sigma-bar*, which is used to estimate the makespan, maximizing E implies minimizing the idle time and thereby the makespan. This efficiency metric allows for performance comparison between in situ runs with different configurations. By examining only one non-overlapped in situ step, we provide a lightweight approach to observe the behavior of multiple components running concurrently in an in situ workflow. This metric can be used as an indicator to determine how far the in situ execution is from the equilibrium point, where the idle time is equal to zero, i.e., E = 1. We then get an idea of how to adjust the parameters to approach the equilibrium point, at which the makespan is minimized (see Section 3.3.2).

3.4 Molecular Dynamics Synthetic Workflow

MD is one of the most popular scientific applications executing on modern HPC systems. MD simulations reproduce the time evolution of molecular systems at a given temperature and pressure by iteratively computing inter-atomic forces and moving atoms over a short time step. The resulting trajectories allow scientists to understand molecular mechanisms and conformations. In particular, a trajectory is a series of frames, i.e., sets of atomic positions saved at fixed intervals of time. The stride is the number of time steps between frames considered for storage or further in situ analysis. For example, in our framework, for a simulation with 100 steps and a stride of 20, only 5 frames will be sent by the simulation to the analysis component. Since trajectories are high-dimensional objects and many atomic motions, such as high-frequency thermal fluctuations, are usually of no interest, scientists use well-chosen collective variables (CVs) to capture important molecular motions. Technically, a CV is defined as a function of the atomic coordinates in one frame. Reduced to time series of a small number of such CVs, simulated molecular processes are much more amenable to interpretation and further analysis. A CV can be as simple as the distance between two atoms, or can involve complex mathematical operations on a large number of atoms. An example of a complex CV that we use in this work is the Largest Eigenvalue of the Bipartite Matrix (LEBM). Given two amino acid segments A and B, if d_ij is the Euclidean distance between the C-alpha atoms i and j, then the symmetric bipartite matrix B_AB = [b_ij] is defined as follows:

    b_{ij} = \begin{cases} d_{ij} & \text{if } i \in A \text{ and } j \in B \\ d_{ij} & \text{if } i \in B \text{ and } j \in A \\ 0 & \text{otherwise.} \end{cases}    (3.11)

Note that B_AB is symmetric and has zeroes on its diagonal. Johnston et al. [57] showed that the largest eigenvalue of B_AB is an efficient proxy to monitor changes in the conformation of A relative to B.

Figure 3.5: Synthetic Workflow: the Extractor (1) sleeps during the emulated delay, then (2) extracts a snapshot of atomic states from existing trajectories and (3) stores it into a synthetic frame. The Ingestor (4) serializes the frame as a chunk and stages it in memory, then the Retriever (5) gets the chunk from the DTL and deserializes it into a frame. Eventually, the MD Analytics performs a certain analysis algorithm on that frame.
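For concreteness, the LEBM CV of Equation (3.11) can be computed with a few lines of numpy. The sketch below is illustrative only and assumes the C-alpha coordinates of the two segments are available as arrays; it is not the workflow's actual analysis code.

    # A minimal numpy sketch of the LEBM collective variable in Eq. (3.11).
    import numpy as np

    def lebm(coords_a, coords_b):
        """Largest eigenvalue of the bipartite distance matrix B_AB built from
        the C-alpha coordinates of two non-overlapping segments A and B."""
        na, nb = len(coords_a), len(coords_b)
        n = na + nb
        B = np.zeros((n, n))
        # Pairwise Euclidean distances between atoms of A and atoms of B.
        d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
        B[:na, na:] = d          # i in A, j in B
        B[na:, :na] = d.T        # i in B, j in A (symmetric); diagonal stays 0
        return np.linalg.eigvalsh(B)[-1]   # eigvalsh returns ascending eigenvalues

    # Example: two random 16-residue segments (coordinates are made up).
    rng = np.random.default_rng(0)
    seg_a = rng.random((16, 3)) * 10.0
    seg_b = rng.random((16, 3)) * 10.0 + 20.0
    print(lebm(seg_a, seg_b))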
3.4.1 Workflow Description

In order to study the complex behavior resulting from coupling an MD simulation with an analysis component in the previously discussed parameter space, we have designed a synthetic in situ MD workflow (Figure 3.5). The Synthetic Simulation component extracts frames from previously computed MD trajectories instead of performing an actual, compute-intensive MD simulation.

Synthetic Simulation. The Synthetic Simulation emulates the process of a real MD simulation by extracting frames from trajectories generated previously by an MD simulation engine. The Synthetic Simulation enables us to tune and manage many simulation parameters (discussed in detail in Section 3.4.2), including the number of atoms and the stride, which helps it mimic the behavior of a real molecular dynamics simulation. Note that, since the Synthetic Simulation does not emulate the computational part of a real MD simulation, it only mimics the behavior of the simulation's I/O processes. Thus, we define the emulated simulation delay, which is the period of time corresponding to the computation time in the real MD simulation. To estimate this delay for a given stride and number of atoms, we use recent benchmarking results from the literature obtained by running the well-known NAMD [74] and Gromacs [79] MD engines. We considered the benchmarked performance for five practical system sizes, 81K, 2M, and 12M atoms from Gromacs [79] and 21M and 224M atoms from NAMD [74], and interpolated to obtain the simulation performance for the desired number of atoms (Figure 3.6a). The interpolated value is then multiplied by the stride to obtain the delay (i.e., a function of both the number of atoms and the stride). In Section 3.4.2, we run the synthetic workflow with 200K, 400K, 800K, 1.6M, 3.2M, and 6.4M protein atoms. Figure 3.6b shows the emulated simulation delay when varying the stride for different numbers of atoms. Strides between 4K and 16K deliver a wide range of emulated simulation delays, up to 40 s.

Figure 3.6: MD benchmarking results from the literature, obtained using 512 NVIDIA K20X GPUs. (a) Benchmarked performance (ns/day) interpolated to estimate the performance for the desired number of atoms; (b) emulated simulation delay obtained by combining the estimated performance with the stride.

Data Transport Layer (DTL). The DTL Server leverages DataSpaces [37] to deploy an in-memory staging area for coupling data between the Synthetic Simulation and the Analyzer. DataSpaces follows the publish/subscribe model in terms of data flow, and the client/server paradigm in terms of control flow. The workflow system has to run a DataSpaces server to manage data requests, keep metadata, and create in-memory data objects.

Analyzer. The Analyzer plays the role of the analytics component in the synthetic in situ workflow. More specifically, the Retriever subscribes to a chunk from the in-memory staging area and deserializes it into a frame. The MD Analytics then performs a given type of analysis on this frame. Recall that, in our model, only one frame at a time can be stored by the DTL (see Section 3.2.2).
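The single-frame constraint can be pictured with the small schematic below. It is not DataSpaces code; it only mimics, with a Python condition variable, the coordination that the DTL enforces: a new chunk may be staged only after the previous one has been read (Equation (3.2)).

    # A schematic illustration (not DataSpaces code) of the single-frame
    # staging constraint: a writer may only overwrite the staged chunk after
    # the reader has consumed the previous one.
    import threading

    class SingleFrameStaging:
        def __init__(self):
            self._cond = threading.Condition()
            self._chunk = None
            self._consumed = True          # slot starts empty

        def put(self, chunk):              # called by the simulation (W stage)
            with self._cond:
                self._cond.wait_for(lambda: self._consumed)
                self._chunk, self._consumed = chunk, False
                self._cond.notify_all()

        def get(self):                     # called by the analysis (R stage)
            with self._cond:
                self._cond.wait_for(lambda: not self._consumed)
                chunk, self._consumed = self._chunk, True
                self._cond.notify_all()
                return chunk

    # Example: a producer thread staging two frames read by the main thread.
    staging = SingleFrameStaging()
    producer = threading.Thread(
        target=lambda: [staging.put(f"frame-{i}") for i in range(2)])
    producer.start()
    print(staging.get(), staging.get())
    producer.join()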
We leverage DataSpaces' built-in locks to ensure that a write to the in-memory staging area can only happen once the read of the previous step is complete (the constraint modeled by Equation (3.2)). Thus, the Analyzer is instructed by DataSpaces to wait for the next chunk to become available in the in-memory staging area. Once a chunk has been received and is being processed, the Synthetic Simulation can send another chunk to the Analyzer.

3.4.2 Experimental Setups

For our experiments we use Tellico (UTK), an IBM POWER9 system that includes four 32-core nodes (two of them compute nodes) with 256 GB of RAM each. The compute nodes are connected through an InfiniBand interconnect network. Since the Synthetic Simulation only emulates the I/O operations of an MD simulation without replicating the actual computation, resource contention is not expected to produce a disparity in execution performance between different component placements. Moreover, the main benefit of the Synthetic Simulation is its ability to mimic the actions of a real simulation engine with fewer resource requirements. For these reasons, we leverage this synthetic workflow to (i) validate the accuracy of the proposed in situ metrics and (ii) characterize the behavior of coupling simulations of a variety of system sizes with in situ analytics. The Synthetic Simulation runs on one physical core of a single compute node, as it only mimics the behavior of a real simulation. The Analyzer and the DataSpaces server are co-located on another, dedicated node. In particular, the Analyzer computes bipartite matrices (see Section 3.4) using multiple parallel processes, which improves the efficiency of the CV calculation. After experimenting with different numbers of Analyzer processes, we fixed that number at 16 processes (the number of cores of an IBM AC922) to reach a good speedup and to fit the entire Analyzer within one compute node.

Parameter space. Table 3.1 describes the parameters used in the experiments. For the Synthetic Simulation, we study the impact of the number of atoms (the size of the system) and the stride (the frequency at which the Synthetic Simulation sends a frame to the Analyzer through the DTL). We consider a constant number of 100 frames to be analyzed, due to the time constraint and the consistency in behavior between in situ steps. For the DTL, we use the staging method DATASPACES for all the experiments, in which the staging area resides in the main memory of the node assigned to the DataSpaces server. For the Analyzer, we compute a compute-intensive set of CVs (LEBM, Section 3.4) for each possible pair of non-overlapping segments of length 16.

Table 3.1: Parameters used in the experiments
    Synthetic simulation:
        #atoms    Number of atoms     [2 x 10^5, 4 x 10^5, 8 x 10^5, 16 x 10^5, 32 x 10^5, 64 x 10^5]
        #strides  Stride              [1000, 4000, 16000]
        #frames   Number of frames    100
    Data transport layer:
        SM        Staging method      DATASPACES
    Analyzer:
        CV        Collective variable       LEBM
        lsegment  Length of segment pairs   16

If there are n amino acids (alpha amino acids) in the system, there are N = floor(n/16) segments, which amounts to N(N-1)/2 LEBM calculations, i.e., O(n^2). To fairly interpret the complexity of this analysis algorithm relative to the system size, we set the number of amino acids proportional to the number of atoms. For example, the system of 400K atoms yields a 2-fold larger number of segments compared to the system of 200K atoms.
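The quadratic growth of the analysis cost can be illustrated with a short calculation; the 0.00469 amino-acid fraction is the one reported with Figure 3.7, and the snippet is purely illustrative.

    # A small illustration of the analysis cost model: the number of LEBM
    # evaluations grows quadratically with the number of (alpha) amino acids.
    def lebm_pair_count(n_amino_acids, segment_length=16):
        n_segments = n_amino_acids // segment_length      # N = floor(n / 16)
        return n_segments * (n_segments - 1) // 2         # N(N-1)/2 pairs

    # Assuming the amino-acid count scales linearly with the number of atoms
    # (fraction 0.00469 of the atoms, cf. Figure 3.7):
    for atoms in (2e5, 4e5, 8e5):
        n_aa = int(atoms * 0.00469)
        print(int(atoms), n_aa, lebm_pair_count(n_aa))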
Figure 3.7 illustrates the LEBM runtime with respect to the number of atoms. We leverage user-defined events to collect the proposed metrics using TAU [91]. We focus on two different levels of information: the workflow level and the in situ step level (the time taken by the workflow to compute and analyze one frame). At the in situ step level, each value is averaged over three runs for each step. At the workflow level, each value is averaged over all in situ steps at steady state and then averaged over the three runs. We also report the standard deviation to assess the statistical significance of the results. There are two levels of statistical error: for averages across the three trials at the in situ step level, and for averages over 94 in situ steps (excluding the three first steps and the three last steps) in each run at the workflow level.

Figure 3.7: Execution time of LEBM on 16 cores, using a segment length of 16. The fraction of alpha amino acids in the entire system is equal to 0.00469.

3.4.3 Results

3.4.3.1 In Situ Step Characterization

We study the correlation of individual stages in each in situ step. Due to lack of space, the discussion is limited to a subset of the parameter space that is representative of the two characterized idle-based scenarios. Figure 3.8 shows the execution time per step for each component with stride values of 1000 and 4000. Confirming the consistency hypothesis across steps discussed in Section 3.2.5, we observe that the execution time per step is nearly constant, except for a few warm-up and wrap-up steps. Figure 3.8a falls under the Idle Simulation (IS) scenario, as the I_{S_i} stage only appears in the Synthetic Simulation step. Similarly, in Figure 3.8b, we observe the Idle Analyzer (IA) scenario because of the presence of I_{A_i}. These findings verify the existence of the two idle-based scenarios discussed in Section 3.2.1. Since the Synthetic Simulation and the Analyzer are nearly synchronized, we also underline that the per-step execution times of the two components are equal to each other. This confirms the property of in situ workflows in Equation (3.4).

Figure 3.8: Execution time per step for each component at (a) stride 1000 and (b) stride 4000, with 8 x 10^5 atoms. The Synthetic Simulation stages are on the left and the Analyzer stages are on the right (lower is better).

Overall, we observe that the I/O stages (W_i and R_i) take an insignificant portion of time compared to the full step. This negligible overhead verifies the advantage of leveraging in-memory staging for exchanging frames between coupled components.

3.4.3.2 Idle Time Observation

By examining the total idle time, we study the impact of the number of atoms and the stride on the performance of the entire workflow and of each component, for the different scenarios.

Accuracy of estimated idle time. For different system sizes, Figure 3.9 demonstrates the similarity between Measured I*, the idle time measured in one in situ step, and Estimated I*, the idle time estimated using Equations (3.8) and (3.9).

Figure 3.9: Left y-axis: total idle time I* using a helper-core placement at stride 16000 (lower is better).
Estimated I* is computed from Equations (3.8) and (3.9), and Measured I* is the idle time measured in one in situ step. Right y-axis: ratio Estimated I* / Measured I* (the closer to 1 the better).

Figure 3.10: Detailed idle time I* for three component placements at different strides, (a) 1000, (b) 4000, and (c) 16000, when varying the number of atoms (lower is better).

The ratio between estimated and measured idle time is close to 1, confirming the accuracy of Equations (3.8) and (3.9) in estimating the idle times I_S* and I_A* for each in situ step, which allows us to apply this relationship to identify the execution scenario that the workflow is following.

Execution Scenarios. Figure 3.10 shows that the workflow execution follows our model (Figure 3.3). The blue regions in Figure 3.10 represent the Idle Simulation scenario, where S* + W* < R* + A*, and the yellow regions indicate the Idle Analyzer scenario, where S* + W* > R* + A*. As the number of atoms increases, which increases both the simulation time and the chunk size, the total idle time I* decreases in the Idle Simulation scenario and increases in the Idle Analyzer scenario. Every in situ step exhibits a similar pattern in which, at a certain system size, the workflow execution switches from one scenario to the other. We notice that with a larger stride, the equilibrium point occurs at a larger system size: as the stride increases, the Synthetic Simulation sends frames to the Analyzer less often, which increases S* + W* relative to R* + A*, so a larger number of atoms is needed to reach the equilibrium point. With a stride of 4000, the equilibrium point occurs at #atoms = 8 x 10^5, but it occurs at #atoms = 16 x 10^5 with a stride of 16000. At a stride of 1000, the execution follows the Idle Simulation scenario for all observed numbers of atoms, and the equilibrium point cannot be reached in this range of system sizes.

3.4.3.3 Estimated Makespan

The goal is to verify the assertion made by Equation (3.7), which states that the makespan of an in situ workflow can simply be expressed as the product of the number of steps and the duration of one non-overlapped step, m sigma-bar*. A typical MD simulation can easily feature more than 10^7 in situ steps, so a metric requiring only a few steps to be accurate is valuable. Figure 3.11 demonstrates the strength of our approach to estimate the makespan (maximum error 5%) using our definition of in situ steps, in addition to the accuracy of our model.

Figure 3.11: Makespan estimated from 100 non-overlapped steps at stride 16000; the yellow region represents the error. The ratio of Estimated Makespan to Measured Makespan uses the second y-axis on the right (closer to 1 is better).

Figure 3.12: Computational efficiency (higher is better).

In in situ workflows running a larger number of steps, monitoring the entire system increases the pressure on it and slows down the execution.
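Given steady-state measurements, classifying a run into one of the two scenarios (or detecting the equilibrium point) reduces to comparing S* + W* against R* + A*. The sketch below is illustrative only; the cost models and constants are made up and merely mimic the linear-versus-quadratic growth discussed above.

    # A hedged sketch of scenario classification for a coupling, following
    # Section 3.2.4 and Equation (3.6). Thresholds and inputs are illustrative.
    def classify_scenario(S, W, R, A, tol=1e-3):
        """Return which side of the equilibrium point a coupling falls on."""
        gap = (S + W) - (R + A)
        if abs(gap) <= tol:
            return "equilibrium"          # zero idle time on both sides
        return "Idle Analyzer" if gap > 0 else "Idle Simulation"

    # Example: sweep hypothetical system sizes at a fixed stride; the analysis
    # cost grows quadratically with size, the simulation cost roughly linearly.
    for atoms in (2e5, 4e5, 8e5, 16e5, 32e5):
        S = 3.0e-5 * atoms          # emulated simulation delay (s), illustrative
        W = R = 0.2                 # I/O stages are small with in-memory staging
        A = 2.0e-11 * atoms**2      # LEBM cost, O(n^2), illustrative
        print(int(atoms), classify_scenario(S, W, R, A))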
Thus, in the absence of failures and external load, looking only at a single non-overlapped step is a scalable, accurate, and lightweight approach.

3.4.3.4 Computational Efficiency

We use the efficiency metric E given by Equation (3.10) to evaluate an in situ configuration, with the objective of providing a metric that allows users to characterize in situ workflows. Figure 3.12 shows that the efficiency E increases up to a maximum in the Idle Simulation scenario and decreases after this maximum in the Idle Analyzer scenario. Thus, an in situ run is most efficient at the equilibrium point, where E is close to 1. If a run is less efficient and classified under the Idle Analyzer scenario, it has more freedom to perform additional analyses or to increase the complexity of the analysis algorithm. In the Idle Simulation scenario, the simulation can afford to simulate the motions of a larger number of atoms to make better use of the resources that would otherwise sit idle.

Figure 3.13: Practical Workflow: GROMACS (1) simulates the motion of the atomic system in steps, and Plumed (2) intervenes at every stride to gather the updated coordinates and store them into a frame. The Ingestor (3) serializes the frame as a chunk and stages it in memory, then the Retriever (5) gets the chunk from the DTL and deserializes it into a frame. Eventually, the MD Analytics performs the same CV-calculation analysis on that frame as in the Synthetic Workflow.

3.5 Molecular Dynamics Realistic Workflow

In this section, in order to observe performance interference between applications running in situ, we replace the Synthetic Simulation with a high-performance molecular dynamics simulation engine. We use this realistic workflow to study the effect of different component placements on the characterization of the workflow execution.

3.5.1 Workflow Description

The simulation component in this practical workflow performs MD computation instead of sitting idle as the Synthetic Simulation does; it therefore consumes memory and contends with a co-located Analyzer executing on the same resources.

Practical Simulation. The Practical Simulation (see Figure 3.13) uses the MD software package GROMACS [79] to simulate biomolecular processes. GROMACS enables diverse levels of parallelism, i.e., multithreading and process communication via message passing. However, this MD engine does not provide a way to extract in-memory frames during the course of a run without manually modifying the source code. To offer a non-intrusive approach, we use Plumed [99], which periodically intercepts function calls made by GROMACS to obtain a snapshot of the system state in memory. The additional layer implemented by Plumed sits on top of the corresponding simulation as an external library. Therefore, this Plumed-based approach to obtaining in-memory frames is not restricted to a specific simulation engine, but is applicable to a variety of MD codes, as long as Plumed supports being incorporated into them. In particular, a Plumed kernel function is called at every interval of time determined by the stride to collect the atomic coordinates at the corresponding simulation step. Molecular positions are then serialized into an abstract chunk compatible with the data abstraction provided by the Ingestor interface. Once the chunk is reachable by the Ingestor, the data flow proceeds as in the Synthetic Workflow.

Data Transport Layer and Analyzer. In this workflow, the DTL and the Analyzer remain the same as in the synthetic workflow discussed in Section 3.4.
3.5.2 Experimental Setups

In this experiment, we study two component placements: (i) helper-core, where the simulation, the DataSpaces server, and the Analyzer are co-located on the same compute node; and (ii) in transit, where the Analyzer and the DataSpaces server are co-located on one node while the simulation runs on a dedicated node. We use the same machine, Tellico, described in Section 3.4.2. The experimental plan is designed to yield baseline insights on a simulation coupled with an in situ analysis task in the context of ensemble workflows. An ensemble workflow is comprised of many small-scale simulations [77] that run independently of each other; hence, insights from a single simulation-analysis coupling scale up to the entire workflow. On the simulation side, we run GROMACS on 24 cores of a compute node, while the remaining cores of the node are assigned to the Analyzer and the DataSpaces server in the helper-core placement. Specifically, we run the analytics on 4 physical cores and dedicate 1 core to the DataSpaces server. In contrast to the helper-core placement, the Analyzer and the DTL server reside on a separate node in the in transit placement. We keep the resources assigned to each component comparable in both placements to isolate the impact of the component placement on the execution of the in situ workflow. The details of the parameter space used in this experiment are given in Table 3.2.

Table 3.2: Parameters used in the experiments
    Practical simulation (GltPh system):
        #atoms    Number of atoms               268552
        #strides  Stride                        [1000, 2000, 3000, 4000, 5000]
        #steps    Number of simulation steps    45000
        #frames   Number of frames              #steps / #strides
    Data transport layer:
        SM        Staging method                DATASPACES
    Analyzer:
        CV            Collective variable            LEBM
        lsegment      Length of segment pairs        2
        #threads      Number of threads              4
        #repetitions  Number of times computing CV   10

For the Practical Simulation, we selected a medium-scale all-atom system that we used in a previous publication [6] on the molecular mechanism of active neurotransmitter transport across the cellular membrane. That study focused on the GltPh transporter protein, an archaeal homolog of the human excitatory amino acid transporter (EAAT) family of proteins, which are implicated in many neurological disorders and are responsible for permanent neurological damage after strokes. The 268,552-atom model system contains the GltPh transporter protein (three identical chains of 605 amino acids, X-ray structure from PDB entry 2NWX [15]) embedded in a lipid bilayer and surrounded by water molecules and a physiological concentration of Na+ and Cl- ions. Molecular interactions are parameterized with the CHARMM36 force field [13] implemented in GROMACS [14], with standard simulation settings and a time step of 2 fs. To test our analysis workflow with GltPh, we explore strides of 1000, 2000, 3000, 4000, and 5000 time steps, at which Plumed generates in-memory frames for later processing. Thus, for each stride, the number of frames generated on the simulation side, which is also the number of frames analyzed on the analysis side, is #steps / #strides. In this experimental setup, we vary the stride as the configurable parameter used to find the equilibrium point; however, the approach is not restricted to the stride, and the equilibrium point can also be sought by tuning other parameters.
On the Analyzer side, the analysis kernel computes a set of 10 LEBM collective variables (Section 3.4) for each possible pair of non-overlapping segments of length 2. The complexity of this CV computation with respect to the predefined segment length is discussed in Section 3.4.2. For the DTL, the in-memory staging method DATASPACES is used for all subsequent experiments in this section, and data resides in the memory of the node where the DataSpaces server runs. Due to the difficulty of linking TAU to Plumed, in this experiment we manually inserted timers to collect the performance data needed for the proposed in situ metrics. Similar statistical methods (Section 3.4.2) are applied, so we accumulate experimental error across both trials and in situ steps at the same time. We still discard the first three in situ steps and the last in situ step to ensure that the in situ metrics are collected in the steady state, where consistent behavior is observed across in situ steps.

3.5.3 Results

3.5.3.1 In Situ Step Characterization

We examine each in situ step for different placements of the in situ tasks. Figure 3.14 illustrates the time spent in each stage on both the simulation and the analysis side at stride 1000. Since the practical workflow satisfies the dependency constraints within an in situ step and across the aforementioned stages, the behavior is approximately stable, as expected in the steady-state regime. The results are shown at stride 1000 only due to lack of space, but we note that the same consistency is present for every stride.

Figure 3.14: Execution time per in situ step for each component with (a) the helper-core and (b) the in transit placement at stride 1000. The Practical Simulation stages are on the left and the Analyzer stages are on the right (lower is better).

This experiment confirms the applicability of our proposed in situ metrics. In addition, the in transit scheme results in less bursty execution behavior than the helper-core scheme; the fluctuations observed with the helper-core placement are due to resource contention between co-located applications.

3.5.3.2 Makespan Estimation

In Section 3.4.3.3, we confirmed the assumption of consistency across steps, and thus our in situ metric, for the synthetic scenario and for one component placement. Here, we further estimate the makespan from the non-overlapped step based on Equation (3.7) using different component placements. In terms of execution scenarios, an in situ run is classified as either Idle Simulation or Idle Analyzer (Section 3.2.1). In this experiment, we are able to determine which range of strides leads to which scenario, as shown in Figure 3.15. We define the equilibrium point as the inflection point where the transition from Idle Simulation to Idle Analyzer happens (i.e., the equilibrium point corresponds to a perfect execution with zero idle time). Figure 3.15 compares the makespan between the helper-core and in transit placements. At first glance, the estimated makespan is close to the measured makespan in both scenarios.
The equilibrium point occurs between stride 2000 and stride 3000 in the in transit scheme, whereas the equilibrium point of the helper-core placement occurs at a larger stride (between stride 3000 and stride 4000). This finding confirms that the in transit configuration allows more frequent analyses to be executed at better efficiency than the helper-core configuration does. In the Idle Analyzer case, there is no significant difference in makespan between the helper-core and the in transit placements. Put differently, component placement matters more in the Idle Simulation scenario, which corresponds to the case where the analytics are performed at high frequency. Although the experiment is conducted on a single simulation coupled with an in situ analysis task, the trend observed here sets the foundation for scaling up to many simulations in the context of ensemble workflows.

Figure 3.15: Makespan estimated from Equation (3.7) for the helper-core and in transit component placements; the yellow region represents the error of the Estimated Makespan with respect to the Measured Makespan.

3.5.3.3 Computational Efficiency

As discussed in Section 3.4.3.2, evaluating the coupling performance of different component placements using the idle time in an in situ step is challenging, due to the involvement of multiple concurrent tasks competing for computing resources and the large parameter space of each component. In this section, we leverage the efficiency metric E, Equation (3.10), to determine how efficient an in situ run is for a given configuration. Figure 3.16 shows this efficiency value for different strides and for the helper-core and in transit placements. The comparison between two given in situ runs becomes straightforward using E as the indicator: the higher the E value, the more efficient the in situ execution in terms of resource usage. The helper-core case reaches its best resource usage efficiency (close to 100%) at a larger stride, i.e., at a lower Analyzer frequency, than the in transit case. Resource contention between co-located applications in the helper-core placement degrades efficiency when the analytics are performed at high frequency. This finding introduces a trade-off between computing resource cost and analysis frequency when designing an in situ system. Finally, as expected, a run with a stride close to the equilibrium point gives better resource usage efficiency.

Figure 3.16: Computational efficiency of the practical workflow over a variety of strides (higher is better).

3.6 Conclusion

In this study, we have designed lightweight metrics for the makespan and the computational efficiency of the workflow, based on the consistency of behavior across in situ steps under our constrained in situ execution model. We have validated the usefulness of these metrics with a set of experiments using a synthetic in situ MD workflow. Using a realistic MD workflow, we have compared two different placements for the workflow components: a helper-core placement and an in transit placement, in which the DTL server is co-located with different components.
When resources are not constrained, allocating dedicated nodes for in transit analytics allows the in situ coupling to perform the analysis more frequently. On the other hand, when resources are limited, running the helper-core placement at the equilibrium point is the ideal scenario for optimizing resource utilization. Future work will study models in which the constraints are relaxed, for example allowing the workflow to buffer multiple frames in memory instead of the single frame currently supported. We also plan to generalize the constraints of the proposed framework to support more communication protocols, e.g., message-driven data flow, multiple data transport paths, or other data transport layers. Another promising research direction is to extend our theoretical framework to take into account multiple analysis methods, which is often the case for MD trajectory data; in that case, the time taken by the analysis could vary depending on the method used. Finally, as more complex workflows arise to serve various in situ analysis requirements, the performance evaluation of in situ workflows should be analyzed in the setting of an ensemble of workflows.

Chapter 4

Performance Indicators to Evaluate Performance of In Situ Workflow Ensembles

In Chapter 3, we focused on the performance evaluation of in situ workflows comprised of a single simulation coupled with one or more analyses. However, many interesting events in large-scale molecular systems require running simulations for a long time, consuming vast amounts of resources, which is out of reach for many scientists. To overcome this timescale problem, a family of enhanced-sampling methods replaces the simulation of one long trajectory with multiple short-range simulations that are executed simultaneously in an ensemble. Ensemble-based computational methods [86, 23] allow different regions of the conformational space to be explored more efficiently, and are thereby gaining popularity in many scientific domains that use computational simulations. Given the need to execute ensembles of simulations, in this chapter we extend the single-workflow method to evaluate the performance of in situ workflow ensembles. To this end, we introduce a set of performance metrics that quantify the benefits of co-location between components sharing the same computing allocation. Specifically, we leverage the framework proposed in Chapter 3 to propose novel performance indicators that allow us to assess the expected efficiency of a given configuration of a workflow ensemble from multiple resource perspectives: resource usage, resource allocation, and resource provisioning.

The rest of the chapter is organized as follows. First, we discuss related work in Section 4.1. In Section 4.2, we show that collecting traditional metrics such as makespan, instructions per cycle, memory usage, or cache miss ratio is not sufficient to characterize the complex behaviors of workflow ensembles. The problem of optimizing the performance of a workflow ensemble is formalized and the performance indicators are defined in Section 4.4. Section 4.5 validates our proposed indicators and empirically demonstrates the feasibility of using our methods. Finally, we conclude and provide directions for future work in Section 4.6.

4.1 Related Work

Modern scientific workflows commonly feature multiple coupled components, which need to be monitored at the same time to understand the global performance of the workflow. Recent monitoring systems for scientific workflows use system-level information to extract insights into the execution of the workflows.
LDMS [4] provides distributed profiling services that periodically sample resource utilization metrics of the compute nodes running the workflow. SOS [105] leverages conventional HPC monitoring tools to build an online performance profile that can run alongside the workflow execution to analyze workflow behaviors. However, traditional performance tools are not designed for modern workflows featuring in situ processing: they collect potentially unnecessary data and may incur significant overhead during profiling. Several works have addressed monitoring overhead by introducing specific methods to evaluate a subset of desired features of the workflows. Taufer et al. [97] leveraged domain-specific metrics, such as lost frames, to characterize in situ analytics tasks under various job mappings. Zacarias et al. [106] estimated the performance degradation arising from co-located applications using a machine learning model. SeeSAw [71] maximized the performance of in situ analysis under power constraints using energy management approaches. WOWMON [109] implemented a runtime that monitors scientific workflows composed of in situ tasks by collecting a set of proposed metrics, and applies a machine learning-based performance diagnosis to determine whether the collected metrics are necessary or redundant. While these works focus on in situ workflows, evaluating the performance of workflow ensembles is not a straightforward extension of evaluating individual workflows. Our work defines the performance of ensembles of in situ workflows.

Ensemble-based methods [23, 20, 76] have recently gained attention in computational science, mainly due to the growing computing power of large-scale systems, which allows more simulations to run in parallel. Ensembles are an efficient approach for enhancing sampling techniques, exploring a broader configuration space, and overcoming the local-minima problem observed in scientific simulations. Multiple-walker methods [23] achieve faster convergence and better sampling by exploiting multiple replicas that simultaneously explore the free-energy landscape in addition to the transition coordinates of the system. Generalized ensembles [20] explore multiple states of a simulation in ensembles with a probability weight factor, so that a random walk in a particular state can escape an energy barrier. Several recent efforts attempt to efficiently manage the execution of ensemble-based simulations combined with analysis tasks. John et al. [77] proposed a workflow management system that stores task provenance to enable adaptive ensemble simulations. EnTK [10] is a general-purpose toolkit that abstracts components and tasks in an ensemble-based workflow to support scenarios in which the number of tasks or the task dependencies vary; both of these works build on RADICAL-Pilot as the runtime system [73]. However, they focus on workflow ensembles with traditional data coupling among tasks, not on workflow ensembles of in situ tasks as the proposed work does. Recent studies [82, 5] have also aimed to prepare the HPC software stack to sustain the concurrent execution of multiple simulations and in situ analyses.

4.2 Workflow Ensemble

In this section, we conduct several experiments using a realistic use case of molecular dynamics ensembles executing on a large-scale HPC platform. We characterize the behavior of the ensemble use case using traditional metrics and discuss their limitations. The analysis of the obtained results demonstrates the need for new metrics that can accurately capture the performance behaviors of ensemble-based computations.
Based on these results, we developed new metrics that better capture ensemble behavior.

4.2.1 Experimental Setup

In situ processing, combined with in-memory computing, has emerged as a solution to overcome I/O bottlenecks in large-scale systems, because moving data in memory rather than via the file system provides much better performance. However, using in situ processing often implies that the communicating components share a node on an HPC system (in the case of a distributed-memory architecture). This co-location can also lead to resource contention and reduce the benefit of in situ communications. In the context of workflow ensembles, a large number of components sharing resources may exacerbate resource contention. To measure the impact of resource contention, we monitor a set of traditional metrics (see Table 4.1) classified into three levels of granularity: (i) ensemble component, (ii) ensemble member/workflow, and (iii) workflow ensemble.

Table 4.1: Set of metrics. (LLC stands for last-level cache.)
    Ensemble component:
        Execution time          Time spent in one component (e.g., simulation or analyses)
        LLC miss ratio          Number of LLC misses / Number of LLC references
        Memory intensity        Number of LLC misses / Number of instructions
        Instructions per cycle  Number of instructions / Number of cycles
    Ensemble member:
        Member makespan         Timespan between the simulation start time and the latest analysis end time
    Workflow ensemble:
        Ensemble makespan       Maximum makespan among all ensemble members in the workflow

At the ensemble component level, the cache miss ratio and memory intensity [24] indicate the degree of resource contention, while instructions per cycle shows the raw performance of the ensemble component. At the ensemble member level, we calculate the turnaround time (makespan) of each member as the difference between the end time of the latest analysis and the start time of the simulation. The ensemble makespan is defined as the maximum makespan over all ensemble members. (Recall that all members run concurrently and all simulations start simultaneously.)

4.2.1.1 Application Settings

In this experiment, an ensemble member is comprised of an MD simulation coupled with analysis kernels using in situ processing. Specifically, the simulation models a medium-scale all-atom system containing the GltPh transporter protein [6]. Molecular interactions are implemented in GROMACS [14], with standard simulation settings and a time step of 2 femtoseconds. The simulation periodically sends in-memory generated frames, i.e., atomic positions, to the analyses coupled with it. In our application, the analysis computes the largest eigenvalue of bipartite matrices [57] as a collective variable [11] of the frames, which captures the molecular motions of the system. The frequency at which data is sent for analysis is determined by the stride, which is the number of simulation steps computed before a frame is generated.

4.2.1.2 Experimental Platform

Our execution platform is Cori [75], a Cray XC40 supercomputer located at the National Energy Research Scientific Computing Center (NERSC). Each compute node is equipped with two Intel Xeon E5-2698 v3 processors (16 cores each) sharing 128 GB of DRAM, and nodes are connected through a Cray Aries dragonfly topology. To test the impact of co-locating the analyses and the simulation, we set the simulation to a predefined stride and choose settings for the analysis that satisfy two conditions:
1. A simulation step takes longer than an analysis step, so that the analysis does not slow down the simulation.

2. The idle time in the analysis (waiting for the simulation's chunks) is minimized, so that we maximize the time during which the analyses and the simulations are running simultaneously.

Section 4.3.1 provides more details about this approach. For our experiments, the two constraints are satisfied by the following resource allocation: every simulation runs on 16 physical cores of a compute node with a stride of 2,000 and 30,000 simulation steps, and each analysis uses 8 physical cores. We leverage DIMES [108] to deploy the in-memory staging area for the DTL. DIMES is an in situ implementation in which data is kept locally in the memory of the node on which the simulation is running and is distributed over the network to other nodes upon request. We use TAU [91] to collect execution times, performance counters, and memory footprints. Measurements are averaged over five trials.

4.2.1.3 Workflow Ensemble Configurations

In this work, we experiment with workflow ensembles under different configurations (e.g., number of ensemble members, component placements) to study co-location behaviors. Table 4.2 shows the seven configurations used in our experiments. Each configuration specifies the number of ensemble members, the number of compute nodes allocated to the entire workflow ensemble, and the node indexes in the allocation on which each ensemble component runs. Every ensemble member is comprised of one simulation coupled with one analysis. C_f and C_c are two elementary configurations, each with a single ensemble member: C_f describes a co-location-free placement, i.e., the simulation and the analysis are located on two separate nodes, while C_c co-locates the simulation and the analysis on a single compute node. The configurations with two ensemble members explore a number of co-location scenarios for ensemble components. In C1.1, the two analyses run on the same node and each simulation on a dedicated node; in C1.2, both simulations share a node and the analyses run on dedicated nodes. In C1.3, the simulation and the analysis of the first ensemble member share the same node, while the other ensemble member has its simulation and analysis running on two different nodes. In C1.4, the two simulations share a node and the two analyses share another node. Finally, C1.5 represents the setup where each simulation shares a node with its corresponding analysis.

Table 4.2: Experimental scenario configuration settings (node indexes per component).
    Configuration  #nodes  #members  Simulation 1  Analysis 1  Simulation 2  Analysis 2
    C_f            2       1         n0            n1          -             -
    C_c            1       1         n0            n0          -             -
    C1.1           3       2         n0            n2          n1            n2
    C1.2           3       2         n0            n1          n0            n2
    C1.3           3       2         n0            n0          n1            n2
    C1.4           2       2         n0            n1          n0            n1
    C1.5           2       2         n0            n0          n1            n1
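The component-level metrics of Table 4.1 are simple ratios of hardware-counter totals (e.g., as reported by TAU with PAPI counters), and the member- and ensemble-level makespans are aggregations of timestamps. The sketch below shows one possible derivation; the counter values and timestamps are illustrative only.

    # Illustrative derivation of the Table 4.1 metrics from raw measurements.
    def llc_miss_ratio(misses, references):        # ensemble component level
        return misses / references

    def memory_intensity(misses, instructions):    # ensemble component level
        return misses / instructions

    def instructions_per_cycle(instructions, cycles):  # ensemble component level
        return instructions / cycles

    def member_makespan(sim_start, analysis_ends):     # ensemble member level
        return max(analysis_ends) - sim_start

    def ensemble_makespan(member_makespans):           # workflow ensemble level
        return max(member_makespans)

    # Hypothetical values for one analysis component and a two-member ensemble.
    print(llc_miss_ratio(12e6, 30e6),
          memory_intensity(12e6, 8e9),
          instructions_per_cycle(8e9, 5e9))
    print(ensemble_makespan([member_makespan(0.0, [1510.0, 1495.0]),
                             member_makespan(0.0, [1602.0, 1588.0])]))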
4.2.2 Analysis of Workflow Ensemble Co-location

Figures 4.1 to 4.3 show the measurements obtained with the set of traditional metrics (Table 4.1) for the various configuration settings (Table 4.2).

Figure 4.1: Metrics at the ensemble component level (execution time, LLC miss ratio, memory intensity, and instructions per cycle for the simulations and analyses of each configuration).

Figure 4.2: Ensemble member makespan.

Figure 4.3: Workflow ensemble makespan.

Higher LLC miss ratios in Figure 4.1 (compared to the co-location-free configuration C_f) capture the cache misses in C_c and C1.1 to C1.5 due to resource contention from the co-located ensemble components. In our application, the analyses are more memory-intensive than the simulations, thus co-locations of the analyses, i.e., C1.1 and C1.4, result in higher cache misses than the co-location of the simulations, i.e., C1.2. The co-location of heterogeneous tasks (the simulation and the analysis) leads to higher miss rates in C1.3 and C1.5 compared to C1.1, C1.2, and C1.4. That said, C1.5 yields the shortest member makespan among all configurations (Figures 4.2 and 4.3). We argue that co-locating coupled components within an ensemble member leads to execution efficiency despite the elevated degree of LLC interference; however, only simulations and analyses that exchange data should be co-located.

The overall conclusion is that evaluating each set of metrics exclusively does not guarantee a thorough understanding of workflow ensemble performance. Metrics at the component level yield insights into the characteristics of individual components, but fail to capture the overall workflow ensemble behavior. For example, in our case, analyses are more memory-intensive than simulations, which leads to an increased cache miss ratio or higher memory interference. As a result, resource contention may arise due to co-located analyses, leading not only to increased execution time of these components, but also to an increased ensemble member makespan (recall that the simulation and analyses execute synchronously). Consequently, the overall workflow ensemble makespan may be harmed by slow ensemble members. Therefore, in order to identify stragglers among the members, one would need to diligently inspect and relate the independent measurements to draw conclusions about the workflow ensemble performance. We argue that there is a need for a method that captures the performance within a workflow ensemble at multiple levels of granularity. To this end, in the next section, we present an efficiency metric that indicates effective computation during the execution of an ensemble member. We then consolidate measurements collected at the ensemble member level into an indicator of overall workflow ensemble efficiency.

4.3 Discussion

4.3.1 Choice of Settings

In this section, we use our efficiency model to substantiate the choice of settings (i.e., number of cores) used to run the experiments shown in Section 4.2.1. Recall that for that set of experiments, we consider one MD simulation coupled with one in situ analysis. The parameter space is intractable, as we can vary the number of cores per component, their respective placements, and the stride of the simulation. Thus, an exhaustive search is out of reach.
However, we can define a heuristic that seeks parameters that minimize the makespan and maximize the computational efficiency of an ensemble member. In this context, we make the following assumptions:

• The simulation settings are considered an input of the problem and are provided by the user. In most cases, scientists have a rough estimate of the best settings for their simulations, but not for the analyses.

• Although our theoretical framework supports coupling to different types of analyses simultaneously, we limit our experiments to identical analyses only.

We first consider the scenario without co-location, and we argue that the settings provisioned for the simulation and the analysis in that context act as a baseline when contrasting with scenarios with co-location. In this experiment, to ensure that there is no co-location, we consider a simple coupling of a single MD simulation with one analysis executed on one dedicated compute node. Based on our first assumption, we arbitrarily fix the settings of the simulation as follows: 8 cores and a stride of 2000. We then vary the number of cores allocated to the analysis to determine for which number of cores the makespan is minimized and the computational efficiency E is maximized in that configuration (recall that our execution platform has compute nodes with 32 cores). We know that minimizing the makespan is equivalent to minimizing the non-overlapped step sigma-bar*. Thus, given an ensemble member with a certain simulation and a predefined configuration coupled with in situ analyses, in order to minimize the makespan we need to assign a number of cores to the analysis such that R_i* + A_i* <= S* + W* for all i in {1, 2, ..., K}. This inequality implies that each of the K couplings (Sim, Ana_i) falls into the Idle Analyzer scenario, so that the analysis steps are hidden by the simulation steps and do not increase the makespan. Figure 4.4 shows the impact, when the number of cores assigned to the analysis ranges from 1 to 32, on the non-overlapped in situ step sigma-bar*, the simulation component S* + W*, the analysis component R* + A*, and the computational efficiency E. The analysis step when using 1 or 2 cores takes longer than the simulation step, i.e., R* + A* > S* + W*, thus sigma-bar* = R* + A*. The inequality is satisfied once the analysis uses between 4 and 32 cores, which minimizes sigma-bar* = S* + W* and thereby the member makespan. Among the executions whose makespan is minimized, we optimize the computational efficiency by selecting the configuration that leads to max(E). Hence, we decide to assign 4 cores to each analysis, which results in the highest computational efficiency, i.e., the smallest amount of idle time.

Figure 4.4: Execution time of the in situ step and computational efficiency when varying the number of cores assigned to the analysis with a fixed simulation setting.
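The heuristic can be written down compactly: among candidate analysis core counts, discard those for which the analysis dominates the step, then keep the one with the highest efficiency. The sketch below is illustrative, not the dissertation's implementation; `measure` stands in for running (or modeling) one steady-state in situ step with the candidate setting, and the cost model is made up.

    # A sketch of the core-selection heuristic of Section 4.3.1: keep every
    # coupling in the Idle Analyzer regime, then pick the candidate with the
    # highest computational efficiency E.
    def choose_analysis_cores(candidates, measure):
        best = None
        for cores in candidates:
            S, W, R, A = measure(cores)             # steady-state stage times
            if R + A > S + W:                       # analysis would dominate the step
                continue
            sigma_bar = max(S + W, R + A)           # here equal to S + W
            E = (S + W + A + R) / sigma_bar - 1     # Eq. (3.10) with K = 1
            if best is None or E > best[1]:
                best = (cores, E)
        return best

    # Illustrative model: analysis time scales as 1/cores, simulation is fixed.
    model = lambda cores: (100.0, 1.0, 0.5, 240.0 / cores)  # (S*, W*, R*, A*)
    print(choose_analysis_cores([1, 2, 4, 8, 16, 32], model))  # -> 4 cores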
A compute node has 32 cores and, for all configurations, every simulation is assigned 8 cores and every analysis 4 cores. Recall from Section 4.3.1 that this setting minimizes idle time when there is no co-location between the simulation and the analyses, for ensembles with one simulation coupled with one analysis. In this experiment, we increase the number of analyses per ensemble member to observe the impact of co-location among ensemble components.

Table 4.3: Experimental configurations with two ensemble members (node indexes per component; each simulation is coupled with one or more analyses).
    Configuration  #nodes (N)  #members  Simulation 1  Analyses 1.x   Simulation 2  Analyses 2.x
    C_colocated    2           2         n0            n0, ..., n0    n1            n1, ..., n1
    C_dedicated    2           2         n0            n1, ..., n1    n0            n1, ..., n1
    C_hybrid       2           2         n0            n1, ..., n1    n1            n0, ..., n0

Figure 4.5 shows the computational efficiency corresponding to the three configurations when the number of analyses per ensemble member ranges from 1 to 4; each data point is the result of 5 trials. Overall, we observe higher efficiencies in C_colocated and C_hybrid than in C_dedicated for small numbers of analyses per ensemble member (1 and 2). Several studies have demonstrated the benefits of co-locating compute-intensive with memory-intensive applications [78, 111]. Our findings confirm the benefits of co-locating heterogeneous applications, i.e., a compute-intensive simulation and memory-intensive analyses, which triggers less resource contention (recall from Section 4.2.2 that the analysis is memory-intensive while the simulation is compute-intensive).

Figure 4.5: Computational efficiency when varying the number of analyses per ensemble member.

Note that the missing values for the C_dedicated configuration when running with 4 analyses per ensemble member are due to out-of-memory errors on the node. Executions with 3 analyses per ensemble member result in unexpectedly high efficiency values for C_dedicated, which is caused by the dramatic slowdown of the analyses once a large enough number of memory-intensive analyses (6 in total for the 2 ensemble members) are co-located on a single node. This increase in efficiency for C_dedicated indicates a small amount of idle time during the execution; however, compared to the other two configurations, it also implies that fewer idle resources remain to accommodate more analyses. The apparently good efficiency of C_dedicated with 3 analyses is thus due to co-locating many memory-intensive analyses together, which leads to performance degradation from competition for shared resources: the analysis side slows down and its step time approaches the simulation step time, yielding an overall smaller idle time and higher efficiency that can be seen as a "negative improvement". Inappropriate co-location strategies can lead to poor performance of in situ workflow ensembles, but, even more dramatically, looking at efficiency alone can lead to poor choices of placement strategies.
These findings call for broader considerations, beyond computational efficiency, when designing workflow ensemble placement strategies, to avoid such misleading negative improvements. Computational efficiency is not sufficient without considering the resource specification of a given configuration, which reflects how efficiently the underlying resources are utilized. Without considering resource aspects, i.e., the number of cores assigned to each ensemble component, the total number of nodes, or the placement of ensemble components, the executions of two workflow ensembles are not comparable. Moreover, as demonstrated in Figure 4.5, different ensemble members exhibit different computational efficiencies; thus, synthesizing the overall performance of a workflow ensemble with many members requires summarizing a large number of efficiency values, which is not straightforward. We acknowledge these limitations and, in the next section, we introduce the concept of performance indicators to address them.

4.4 Performance Indicators

Traditionally, scientists running on HPC machines want to optimize application performance while using as few resources as possible. When considering multiple concurrent components, as in workflow ensembles, even defining the notion of resource usage and its perimeter is challenging (e.g., each ensemble member
For example, let c_i be the number of cores allocated to EM_i, to be minimized; then, to design a performance indicator that accounts for the number of cores, we would define R_i = c_i and P_i = E_i / c_i (recall that we maximize efficiency while minimizing the resource indicator, so we divide in that situation). With these performance indicators, Equation (4.1) can be rewritten simply as:

maximize P_i, ∀ i = 1, ..., N.   (4.2)

Now that we have a framework to evaluate executions of workflow ensembles, we need a method to aggregate the information from each performance indicator into one coherent measure. To synthesize the performance of a workflow ensemble with potentially many members, we propose a method that accumulates the performance indicators P_i of every ensemble member using an objective function F (defined in Section 4.4.6). Therefore, the problem stated in Equation (4.2) can be transformed into its final form:

maximize F(P_1, ..., P_N).   (4.3)

The goal of this whole process is to provide a methodology to assess the impact of each layer of resource information R_i and obtain an overall indicator that characterizes the performance of the entire workflow ensemble. We discuss the procedure for calculating performance indicators and for building the objective function that aggregates them in the following sections. First, we define a set of notations (Table 4.4) used to define the indicators. Then, we present three resource indicators R^U, R^A, R^P corresponding, respectively, to resource usage (U), resource allocation (A), and resource provisioning (P).

4.4.2 Notations

Given a workflow ensemble with N ensemble members {EM_1, ..., EM_N}, let P_i be the performance indicator of ensemble member EM_i and E_i its computational efficiency. The ensemble member EM_i contains a simulation Sim_i coupled with K_i analyses, Ana_i^1, ..., Ana_i^{K_i}; thus EM_i has K_i couplings (Sim_i, Ana_i^j), where j ∈ {1, ..., K_i}. Let cs_i be the number of cores used by Sim_i; these cores belong to nodes whose indexes are listed in the set s_i. Similarly, the analysis Ana_i^j uses ca_i^j cores of nodes whose indexes are listed in the set a_i^j. For example, in Table 4.2, C1.1 has s_1 = {0}, a_1^1 = {2}, s_2 = {1}, a_2^1 = {2}. Let c_i denote the total number of cores assigned to all ensemble components, i.e. the simulation Sim_i and the K_i analyses Ana_i^j, in a given ensemble member EM_i. We have c_i = cs_i + Σ_{j=1..K_i} ca_i^j. Let d_i be the number of computing nodes allocated to ensemble member EM_i; it is given by d_i = |s_i ∪ (∪_{j=1..K_i} a_i^j)|. If the simulation and some analyses share compute nodes, we have d_i ≤ |s_i| + Σ_{j=1..K_i} |a_i^j|. (This inequality becomes an equality if each component runs on dedicated nodes.) Let M be the total number of computing nodes used by the entire workflow ensemble of N ensemble members. Similarly, we have M ≤ Σ_{i=1..N} d_i. In the absence of resource sharing (i.e., each ensemble member runs on dedicated nodes), we have M = Σ_{i=1..N} d_i.

4.4.3 Member Resource Usage (U)

The first performance indicator, P_i^U, considers the underlying computing units, i.e. cores, to model the efficiency of an ensemble member in terms of resource usage. Our goal is to build an indicator that can compare different executions of workflow ensembles using different amounts of resources (e.g., different numbers of cores).
Precisely, P_i^U maximizes the computational efficiency E_i of an ensemble member EM_i while minimizing the total number of cores c_i used by EM_i. We therefore define the resource usage indicator R_i^U = c_i. Since R_i^U is to be minimized, P_i^U is computed as follows:

P_i^U = E_i / R_i^U = E_i / c_i   (4.4)

Notation | Description
Workflow ensemble:
  N      | Number of ensemble members
  M      | Number of nodes used by the workflow ensemble
Ensemble member:
  EM_i   | Ensemble member i
  R_i    | Resource constraint affecting EM_i
  P_i    | Performance indicator of EM_i
  K_i    | Number of couplings in EM_i
  c_i    | Total number of cores used by components of EM_i
  d_i    | Number of nodes allocated to EM_i
Ensemble component:
  Sim_i  | Simulation of EM_i (one simulation per member)
  Ana_i^j | Analysis j of EM_i (K_i analyses for each EM_i)
  cs_i   | Number of cores used by Sim_i of EM_i
  ca_i^j | Number of cores used by Ana_i^j of EM_i
  s_i    | Set of node indexes on which Sim_i of EM_i is executed
  a_i^j  | Set of node indexes on which Ana_i^j of EM_i is executed
Table 4.4: Notations.

P_i^U represents the smallest unit of efficiency in terms of single-core usage. Recall that maximizing E_i is equivalent to minimizing the idle time and the makespan. High values of P_i^U indicate that a large portion of the execution on the assigned resources is spent computing (as opposed to idling), and thus that the ensemble member makespan is reduced.

4.4.4 Member Resource Allocation (A)

Since an ensemble member can execute multiple components concurrently, these components can be co-located on the same node or distributed across nodes, and finding an optimal placement among the numerous possible configurations is challenging. We therefore propose a second stage, P_i^A, to quantify the level of data locality of a given placement. Consider the coupling (Sim_i, Ana_i^j) of ensemble member EM_i: Sim_i is co-located with Ana_i^j if and only if |s_i| = |s_i ∪ a_i^j|. Otherwise, if |s_i| < |s_i ∪ a_i^j|, they are assigned to different nodes. Based on this observation, we define a placement indicator from the ratio 0 < |s_i| / |s_i ∪ a_i^j| ≤ 1. Let CP_i be the placement indicator of ensemble member EM_i:

CP_i = (1/K_i) ( |s_i|/|s_i ∪ a_i^1| + ... + |s_i|/|s_i ∪ a_i^{K_i}| ) = (|s_i|/K_i) Σ_{j=1..K_i} 1/|s_i ∪ a_i^j|.   (4.5)

Intuitively, CP_i describes the placement of EM_i: it decreases with the number of computing nodes used for a given coupling. CP_i = 1 indicates that the components of EM_i are all co-located, and a CP_i value near 0 indicates that more dedicated resources are used and that the components of EM_i are distributed across them. Maximizing the placement indicator for each ensemble member therefore prioritizes placements that minimize the number of computing resources used by that ensemble member. As a result, the placement indicator reflects not only placement characteristics but also the number of resources used at the ensemble member level. To evaluate the efficiency of a placement (i.e., a mapping between ensemble members and available resources), we include the placement indicator as the resource indicator R_i^A and weigh the computational efficiency by R_i^A = CP_i as follows:

P_i^A = E_i * R_i^A = E_i * CP_i = E_i * (|s_i|/K_i) Σ_{j=1..K_i} 1/|s_i ∪ a_i^j|.   (4.6)

Based on the insight derived from the placement indicator, maximizing the performance indicator at this stage favors resource configurations that occupy a small number of compute nodes while maximizing the effectiveness of the execution.
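For illustration, the two single-stage indicators defined so far can be computed directly from the per-member measurements. The following minimal Python sketch mirrors Equations (4.4)-(4.6); the efficiency value used in the example is hypothetical, and the node-index sets correspond to member EM_1 of configuration C1.1 as described above.

    def placement_indicator(sim_nodes, analysis_nodes_list):
        """CP_i from Equation (4.5): sim_nodes is the set s_i, analysis_nodes_list the sets a_i^j."""
        k = len(analysis_nodes_list)
        return (len(sim_nodes) / k) * sum(1.0 / len(sim_nodes | a) for a in analysis_nodes_list)

    def member_indicators(efficiency, sim_cores, analysis_cores, sim_nodes, analysis_nodes_list):
        total_cores = sim_cores + sum(analysis_cores)                            # c_i
        p_u = efficiency / total_cores                                           # Equation (4.4)
        p_a = efficiency * placement_indicator(sim_nodes, analysis_nodes_list)   # Equation (4.6)
        return p_u, p_a

    # EM_1 of C1.1: s_1 = {0}, a_1^1 = {2}, 16 cores for the simulation, 8 for the analysis,
    # and a hypothetical computational efficiency of 0.6.
    print(member_indicators(0.6, 16, [8], {0}, [{2}]))

Here CP_1 = 0.5 because the single coupling spans two nodes, so P_1^A is half of E_1, while P_1^U divides E_1 by the 24 cores used by the member.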
4.4.5 Ensemble Resource Provisioning (P)

Finally, considering execution features at the ensemble member level alone might not be sufficient to capture the overall performance of the entire workflow ensemble. To that end, we extend the performance indicator with the number of resources provisioned for the entire workflow ensemble, i.e. the number of computing nodes the workflow ensemble resides on. When comparing two executions that achieve the same performance on different numbers of computing nodes, the run using fewer nodes should be considered more efficient. Therefore, to obtain the last stage P_i^P, we weigh the efficiency indicator by R_i^P = M, where M is the total number of compute nodes, so that the number of compute nodes provisioned for the entire workflow ensemble is minimized while P_i^P is maximized:

P_i^P = E_i / R_i^P = E_i / M.   (4.7)

Finally, depending on the resource aspects of interest, the performance indicator P_i can represent either a single-stage indicator (P_i^U, P_i^A, P_i^P) or a multi-stage indicator (P_i^{U,A}, P_i^{U,P}, P_i^{A,P}, P_i^{U,A,P}). For example, P_i^{U,A,P} = E_i * R_i^A / (R_i^U * R_i^P).

4.4.6 Objective Function

In this section, we propose a method for aggregating indicator values from individual ensemble members into a global indicator at the workflow ensemble level. To compute a global indicator, we synthesize the performance indicators of every ensemble member. A simple approach would take the average of all P_i; however, the large variation between these values may lead to an inaccurate assessment of the overall performance. To penalize variability in performance among ensemble members, we consider the mean performance P̄ from which we subtract the standard deviation:

F(P_1, ..., P_N) = P̄ − sqrt( (1/N) Σ_{i=1..N} (P_i − P̄)^2 ),   where P̄ = (1/N) Σ_{i=1..N} P_i.   (4.8)

The intuition behind Equation (4.8) is to favor workflow ensemble configurations with good makespan, i.e. configurations with low variability between workflow ensemble members (recall that the makespan of a workflow ensemble is defined as the maximum completion time among its members). The goal of an efficient configuration, as defined in this work, is to maximize the objective function F. The higher the value of the objective function, the better the performance of the entire workflow ensemble in terms of efficiency, makespan, resource usage, and component placement.

4.5 Experimental Evaluation

In this section, we evaluate the ability of the proposed performance indicators to characterize the execution performance of workflow ensembles. Having specified our performance indicators and how to compute them, we are now ready to evaluate their usefulness for characterizing workflow ensembles. We also extend our previous experimental configuration settings (Section 4.2.1.3) with scenarios in which multiple analyses are coupled with each simulation.

4.5.1 Configuration Exploration

4.5.1.1 Workflow Ensemble Configurations

In this work, we apply our multi-stage performance indicators to two sets of configurations; each set specifies the number of ensemble members and the node assignment for each ensemble component. In this paper, we consider only workflow ensembles comprised of 2 ensemble members.
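Before describing the configurations, the aggregation step of Equation (4.8) can be summarized with a short sketch; the per-member indicator values below are hypothetical and only illustrate how the mean-minus-standard-deviation aggregation penalizes imbalance between members.

    import statistics

    def objective(indicators):
        """F from Equation (4.8): mean of the per-member indicators minus their
        population standard deviation, which penalizes imbalance across members."""
        return statistics.fmean(indicators) - statistics.pstdev(indicators)

    # Two ensemble members with hypothetical P_i^{U,A,P} values.
    print(objective([0.15, 0.12]))   # 0.135 - 0.015 = 0.12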
The first set of configurations includes C1.1 to C1.5 (Table 4.2). For every configuration in this set, each ensemble member is a single coupling of a simulation and an in situ analysis. The second set consists of configurations C2.1 to C2.8 (Table 4.5); for these, the simulation of each ensemble member is coupled with two analyses.

Configuration | Nodes (N) | Ensemble members | Sim. 1 | Analysis 1.1 | Analysis 1.2 | Sim. 2 | Analysis 2.1 | Analysis 2.2
C2.1 | 3 | 2 | n0 | n2 | n2 | n1 | n2 | n2
C2.2 | 3 | 2 | n0 | n1 | n1 | n0 | n2 | n2
C2.3 | 3 | 2 | n0 | n1 | n2 | n0 | n1 | n2
C2.4 | 3 | 2 | n0 | n0 | n2 | n1 | n1 | n2
C2.5 | 3 | 2 | n0 | n1 | n2 | n1 | n0 | n2
C2.6 | 2 | 2 | n0 | n1 | n1 | n0 | n1 | n1
C2.7 | 2 | 2 | n0 | n0 | n1 | n1 | n0 | n1
C2.8 | 2 | 2 | n0 | n0 | n0 | n1 | n1 | n1
Table 4.5: Experimental configurations with two ensemble members; each ensemble member has two analyses per simulation.

For each configuration in both sets, every simulation runs on 16 cores and every analysis is assigned 8 cores, a setting obtained by following a procedure similar to the one described in Section 4.3.1 to minimize the idle time incurred in the coupling. With this setting, the configurations of the second set use all cores of each compute node, thus saturating the computing resources (recall that each compute node has 32 cores). Since we propose a multi-stage method for evaluating the performance of an ensemble member as well as of the entire workflow ensemble, we examine the impact of each stage, and of the order in which stages are added, on the quality of the performance indicator P_i, accumulated by the objective function F(P_i) as the performance of the entire workflow ensemble. To this end, we explore two feasible paths for concatenating performance indicator stages: (1) P_i^U → P_i^{U,P} → P_i^{U,P,A}; or (2) P_i^U → P_i^{U,A} → P_i^{U,A,P}. For path (1), P_i^{U,P} = P_i^U / M, where M is the total number of nodes used by the workflow ensemble (see Table 4.4), and P_i^{U,P,A} = P_i^{U,P} * CP_i, where CP_i is the placement indicator defined in Section 4.4.4. Note that P_i^{U,P,A} = P_i^{U,A,P}. Specifically, we observe the changes in F(P_i) when adding a new stage (i.e., resource usage U, resource provisioning P, resource allocation A) to the performance indicator P_i, which can be P_i^U, P_i^{U,P}, P_i^{U,A}, P_i^{U,P,A}, or P_i^{U,A,P}.

4.5.1.2 Results

[Figure 4.6: objective function F(P_i) for each configuration C1.1 to C1.5, one panel per indicator: P_i^U, P_i^{U,P}, P_i^{U,P,A} = P_i^{U,A,P}, and P_i^{U,A}.]
Figure 4.6: F(P_i) for different P_i orders over configurations with one analysis per simulation (the higher the better).

[Figure 4.7: objective function F(P_i) for each configuration C2.1 to C2.8, one panel per indicator: P_i^U, P_i^{U,P}, P_i^{U,P,A} = P_i^{U,A,P}, and P_i^{U,A}.]
Figure 4.7: F(P_i) for different P_i orders over configurations with two analyses per simulation (the higher the better).

Figure 4.6 shows the values of the objective function at each of the stages of P_i over the configurations of the first set. After the initial stage P_i^U (Figure 4.6, left), a new layer is added, either P (middle top) or A (middle bottom), to form the next stage. On the contrary, neither P_i^{U,A} nor P_i^{U,P} is able to differentiate the performance of C1.4 from C1.5, as these two configurations both use 2 compute nodes.
Recall that in C1.4 the two simulations share one node while the two analyses share another node. As shown in Figures 4.1 and 4.2, C1.4 does not lead to a small member makespan, due to the contention caused by co-locating the two analyses. With P_i^{U,A,P}, the performance of C1.4 drops below that of C1.5, but remains higher than C1.1, C1.2, and C1.3. Finally, our performance indicator confirms that C1.5 is the best choice, as shown by the traditional metrics in Figures 4.2 and 4.3: C1.5 has the smallest makespan. C1.5 outperforms the other configurations, which also validates the common intuition associated with in situ processing that simulations and analyses should be co-located when possible. Since the in-memory staging mechanism in this work is implemented by DIMES [108] (data resides in the memory of the simulation node), co-locating the analysis with the simulation benefits from data locality: the time for staging data is significantly shortened.

In contrast to the first set of configurations, we do not show the results of the traditional metrics (described in Table 4.1) for the second set due to lack of space. Moreover, for the second set, inferring the best configuration from the monitored traditional metrics is not as straightforward as for the first set. The increased number of analyses per ensemble member complicates the performance evaluation using traditional metrics. Using all cores of the compute nodes in several configurations, e.g. C2.6, C2.7, C2.8, likely saturates the resources, which makes it difficult to compare them with configurations where compute nodes are not entirely occupied by ensemble components. This scenario motivates the need for a performance indicator able to select the most promising configuration in terms of workflow ensemble efficiency. Figure 4.7 shows the values taken by the objective function for the different configurations of the second set. In this case, P_i^{U,P} separates the configurations into two groups defined by the number of compute nodes used by the workflow ensemble (C2.6, C2.7 and C2.8 use 2 nodes while the other configurations use 3 nodes). Then, P_i^{U,P,A} keeps this distinction but additionally indicates that configuration C2.8 should deliver better performance than the others. On the other hand, when adding layer A first, we isolate C2.8 from the other configurations, and further differentiate C2.6, C2.7 from C2.1, C2.2, C2.4 at the last stage. Note that, similarly to the conclusions reached in the previous setup, the chosen configuration C2.8 is also the optimal configuration in terms of co-location (i.e., each simulation is co-located with its analyses), which again confirms the benefits of co-locating the coupled components of an ensemble member.

Configuration | Sim. 1 | Analysis 1.1 | Sim. 2 | Analysis 2.1 | Sim. 3 | Analysis 3.1 | Sim. 4 | Analysis 4.1
C_colocated | n0 | n0 | n0 | n0 | n1 | n1 | n1 | n1
C_dedicated | n0 | n1 | n0 | n1 | n0 | n1 | n0 | n1
C_hybrid    | n0 | n1 | n0 | n1 | n1 | n0 | n1 | n0
Table 4.6: Experimental configurations for the first 4 ensemble members allocated on 2 compute nodes; each ensemble member has one simulation and one analysis. To increase the number of ensemble members, these settings can be replicated with a larger number of nodes (e.g., 8 ensemble members on 4 compute nodes, 16 ensemble members on 8 compute nodes).
4.5.2 Increased Number of Ensemble Members

4.5.2.1 Workflow Ensemble Configurations

In this section, we refine the three configurations described in Table 4.3 to scale up the number of ensemble members. Our goal is to increase the load on the compute nodes' resources and to increase network communications. We pile up 4 ensemble members on 2 nodes and explore different component placements (see Table 4.6). Specifically, C_colocated co-locates the ensemble components of every two ensemble members on the same node, to guarantee data locality among the components of each ensemble member. On the other hand, C_dedicated co-locates the simulations of every four ensemble members on a single node while the corresponding analyses are placed on another, dedicated node. With the C_hybrid configuration, the ensemble components of pairs of ensemble members are interchanged, i.e. the analyses of two ensemble members are co-located with the simulations of another two ensemble members. To accommodate sufficient cores for 4 ensemble members on 2 compute nodes, we couple each simulation with one analysis per ensemble member, where the simulation uses 8 cores and the analysis 4 cores. To increase the number of ensemble members per workflow ensemble, we replicate the placement described for 4 ensemble members. In this experiment, the number of ensemble members varies between 4 and 32.

[Figure 4.8: objective function F(P_i) versus the number of ensemble members (4 to 32) for C_colocated, C_dedicated, and C_hybrid, one panel per indicator: P_i^U, P_i^{U,A}, and P_i^{U,A,P}.]
Figure 4.8: F(P_i) for different P_i orders over configurations with one analysis per simulation (the higher the better). Each configuration is measured over 5 trials.

4.5.2.2 Results

Figure 4.8 shows the values of the objective function for the performance indicators P_i^U, P_i^{U,A}, and P_i^{U,A,P}. With P_i = P_i^U, since the number of cores used by every ensemble member is identical across configurations, P_i^U indirectly reflects the computational efficiency of each configuration. Recall from Section 4.3.2 that the computational efficiencies of C_colocated and C_hybrid are approximately comparable when only one simulation and one analysis are co-located on a single node. We note that C_colocated surpasses C_hybrid when a larger number of ensemble components competes for a given amount of resources. This observation highlights the significance of data locality when allocating the ensemble components of a workflow ensemble on shared resources. Overall, C_dedicated exhibits the lowest value of the objective function in most cases, making it an example of poor placement. We also notice a decline of C_colocated at high numbers of ensemble members, which closes the gap with the other two configurations. This may be due to congestion caused by the large number of data requests to the staging server (recall from Section 4.2.1 that the in-memory staging area is implemented by DIMES), as numerous concurrent ensemble members communicate data at the same time. We leave the investigation of this behavior for future work. P_i^{U,A} helps distinguish C_colocated from C_dedicated and C_hybrid, as it favors configurations with a higher level of data locality. Finally, P_i^{U,A,P} groups executions by the number of compute nodes utilized, so that the performance evaluation also considers the resource cost defined by the node count.
This remark is consistent, as C_colocated still offers the highest objective value among the configurations for a given number of ensemble members.

4.6 Conclusion

In this paper, we characterize an ensemble of in situ workflows using multiple configurations and placements. Based on the insights gained from this characterization, we introduce a theoretical framework that models the execution of workflow ensembles when multiple simulations are coupled with multiple analyses using in situ techniques. We define the notion of computational efficiency for workflow ensembles at the component level, and then extend this notion to the member and ensemble levels by designing several performance indicators. These indicators capture the performance of a workflow ensemble by aggregating several metrics of the given workflow ensemble in terms of resource usage efficiency and of the resources allocated to components, members, and the entire ensemble. By evaluating these indicators on a real molecular dynamics simulation use case, we show the advantages of data locality when co-locating the simulation with the corresponding analyses in an ensemble member. This finding allows us to schedule each ensemble member of the workflow ensemble individually on a distinct allocation, targeting co-location among the ensemble components of each member. Future work will consider leveraging the proposed indicators for scheduling the in situ components of a workflow ensemble under resource constraints. The performance indicators appear to be beneficial for comparing different scheduling decisions to optimize scientific discovery. Another future direction is adapting our performance framework to more complex domain-specific use cases of workflow ensembles, e.g. adaptive sampling [10], in which simulations are periodically executed and restarted.

Chapter 5
Co-scheduling Ensembles of In situ Workflows

Executing a workflow ensemble of simulations and their in situ analyses requires efficient co-scheduling strategies and careful management of computational resources so that they do not slow each other down. However, sustaining complex co-scheduling for massive workflow ensembles of concurrent applications requires sophisticated orchestration tailored to the management of workflow ensembles [5]. The key challenge when orchestrating workflow ensembles at scale is to efficiently schedule the simulations and in situ analyses as they continuously exchange data. The intersection of workflow ensembles and in situ processing is particularly challenging, as in situ itself has intricate communication patterns, which are made more complex by the number of concurrent executions at scale. Intuitively, co-scheduling the analyses with their associated simulation on the same computing resources improves data locality, but the required resources may not always be available. Thus, under resource constraints, how do we co-schedule workflow ensembles on an HPC machine? In this chapter, we aim to guide such emerging orchestration runtimes for in situ workflow ensembles. A key distinguishing factor is to efficiently generate large ensembles to optimize scientific discoveries. Another major challenge is to determine resource requirements such that the resulting performance of executing them together in the workflow ensemble is maximized. Therefore, we design efficient co-scheduling strategies and resource assignments for workflow ensembles of simulations and in situ analyses. The rest of the chapter is organized as follows.
Section 5.1 provides an overview of related work on scheduling for in situ workflows. Section 5.2 formally defines the important terms and presents a mathematical model that characterizes the iterative execution of a workflow ensemble. Section 5.3 discusses two scheduling problems and our proposed co-scheduling strategies and resource allocations. The proposed solutions are evaluated through simulations in Section 5.5. Section 5.6 summarizes our conclusions and provides ideas for future work.

5.1 Related Work

Designing scheduling algorithms for in situ workflows such that the available resources are efficiently utilized is challenging, as optimizing individual workflow components does not ensure that the end-to-end performance of the workflow is optimal. The larger the number of components in the workflow, the larger the multi-parametric space that needs to be explored. Specifically, workflow ensembles consist of many simulations and in situ analyses running concurrently, and the combination of their data coupling behaviors significantly increases the complexity of the scheduling decisions. A few efforts have attempted to solve this problem for a single in situ workflow. Malakar et al. formulated the in situ coupling between the simulation and the analysis as a mixed-integer linear programming problem [70] and derived the frequency at which to schedule the analysis executed in situ on a different set of nodes. Aupy et al. proposed a greedy algorithm [9] to schedule a set of in situ analyses on available resources with memory constraints. Venkatesh et al. generated scheduling recommendations [100] showing that in situ workflows benefit from using Intel Optane Persistent Memory for I/O streaming. Taufer et al. introduced a 2-step model [C97] to predict the frequency of frames to be analyzed in situ such that computing resources are not underutilized while the simulation is not slowed down by the analysis. Our proposed scheduling approach targets ensembles of in situ workflows.

Co-scheduling strategies have recently been proposed to deliver better resource utilization and increase overall application throughput [16, 106, 62]. However, only a few works have incorporated co-scheduling into in situ workflows, where the components (simulations and analyses) of the workflow are tightly coupled. Sewell et al. proposed a thorough performance study [90] for co-scheduling in situ components that leverages the memory system for data staging with low latencies. Aupy et al. introduced an optimization-based approach [8] to determine the optimal fraction of caches and cores allocated to in situ tasks. Due to the lack of capability to submit a batch of concurrent applications with a conventional scheduler, co-scheduling was not fully supported for workflow ensembles at production scale. To overcome this limitation, Flux [5] was designed as a hierarchical resource manager to meet the needs of co-scheduling large-scale ensembles of various simulations and in situ analyses. To the best of our knowledge, this paper is the first effort that addresses the co-scheduling problem for in situ workflows at the ensemble level.

5.2 Model

The goal of this paper is to study the scheduling of complex ensembles of in situ workflows [J33] on parallel machines, where every workflow within an ensemble has several concurrent jobs coupled together using in situ processing, each being either a simulation or an analysis.
Unfortunately, one might not have enough compute nodes to run all simulations and analyses concurrently, and will have to make scheduling choices: (i) which simulation and analysis should run together on the same resources, and (ii) how many resources should be allocated to each.

Our approach. The core of this work is a powerful yet versatile model to express complex data dependencies between simulations and in situ analyses in a workflow ensemble and, based on it, the design of appropriate scheduling strategies. The chosen approach is to model the scheduling problem under simplified conditions, where we make several assumptions to reduce the large exploration space of an ensemble of in situ workflows. We then demonstrate that our solutions, designed for this simplified world, also perform well in realistic settings.

5.2.1 Couplings

A simulation S_i represents a job in an ensemble that simulates a state of interest in the molecular system. An analysis, denoted A_j, represents a computational kernel suitable for analyzing in situ the data produced by a simulation, e.g. transforming high-dimensional molecular structures into eigenvalue metadata [57]. The outputs of in situ analyses are typically post-processed later, e.g. synthesized to infer conformational changes of the molecular system in a post-hoc analysis. However, in the scope of this paper, we focus on the part of the simulation-analysis pipeline where they are executed simultaneously in an ensemble. We denote the sets of simulations and analyses in an ensemble by S and A, respectively. In this work, each workflow of an ensemble is composed of one simulation and at least one analysis. The analyses are executed in parallel with the simulation and periodically process the simulation data generated after each simulation step, in an iterative fashion. This essence of in situ processing expresses the important notion of coupling. Coupling indicates which simulation and analyses have a data dependency, i.e. data produced by the simulation is analyzed by the corresponding analyses. Scientists usually decide beforehand which analyses have to process the data from a given simulation; we call this mapping a coupling (see Figure 5.1). Therefore, couplings are decided beforehand and given as an input of the problem. More formally, a coupling p: S → A defines which analyses in A are coupled with which simulation in S; then, for every S_i ∈ S, let p(S_i) be the set of analyses that couple data with the simulation S_i. For example, in Figure 5.1, we have p(S_1) = {A_1, A_3}. Now that we have defined the notion of couplings, we need to define several notions around co-scheduling.

Figure 5.1: Illustration of co-scheduling and coupling notions.

5.2.2 Co-scheduling

The notion of co-scheduling is at the core of this work: co-scheduling is defined as running multiple applications concurrently on the same set of compute nodes, each application using a fraction of the number of cores per node. We assume that the total number of cores assigned to each application is distributed evenly among the compute nodes the application is co-scheduled on, which is similar to the way resources are distributed by emerging schedulers, e.g. Flux [5], for co-scheduling the concurrent jobs of a massive ensemble on next-generation computers.
In this paper, we distinguish two scenarios for mapping the simulations and analyses to computational resources:
• Co-scheduling: S_i and A_j run on the same compute nodes and thus have access to the shared memory;
• In transit: S_i and A_j run on two different sets of nodes, and sending data from S_i to A_j involves the network.
To avoid performance degradation in co-scheduling, in this work we consider that simulations, which are compute-intensive, are not co-scheduled together [24]. We now define several other important terms related to co-scheduling.

Co-scheduling mapping. A co-scheduling mapping m defines how S and A are co-scheduled together. For a given mapping m, we denote by m(S_i) the set of analyses co-scheduled with S_i. For example, with the mapping illustrated in Figure 5.1, m(S_1) = {A_1, A_4} and m(S_3) = {A_2}.

Assumption 1. In this model, we ignore co-scheduling interference, such as cache sharing, that could degrade performance when multiple applications share resources [24].

Co-scheduling allocation. A co-scheduling allocation represents a set of applications that are co-scheduled together (i.e., that share computing resources). Let ⟨{S_i, A_j}⟩ denote a co-scheduling allocation in which simulation S_i is co-scheduled with analysis A_j. Recall that we have at most one simulation per allocation, but possibly multiple analyses. Formally, for a given co-scheduling mapping m, a simulation S_i and the analyses m(S_i) co-scheduled with S_i, the corresponding co-scheduling allocation is denoted ⟨S_i ∪ m(S_i)⟩. However, not all analyses are required to be co-scheduled with a simulation; there exist co-scheduling allocations in which only analyses are co-scheduled together, called analysis-only co-scheduling allocations. For example, in Figure 5.1, ⟨{A_3, A_5}⟩ is an analysis-only co-scheduling allocation. A co-scheduling allocation also determines the amount of computing resources assigned to each co-scheduled application. Let n_x denote the number of nodes assigned to application x. For instance, if S_i and A_j are co-scheduled on the same co-scheduling allocation, then n_{S_i} = n_{A_j}. Let c_x be the number of cores per node assigned to x. As each co-scheduled application takes a portion of the C cores of a node, for every co-scheduling allocation ⟨X⟩ we have Σ_{x∈X} c_x ≤ C. In Figure 5.1, since A_1 and A_4 are co-scheduled with S_1 on a co-scheduling allocation spread over 2 compute nodes, n_{S_1} = n_{A_1} = n_{A_4} = 2; and since c_{S_1} = 8, S_1 occupies 8 cores on each node, so 16 cores in total.

We consider a traditional parallel computing platform with N identical compute nodes; each node has C identical cores, a shared memory of size M, and a bandwidth of B per node. In addition, we consider that the interconnect network between nodes is a fully connected topology, hence the communication time is the same for any pair of nodes.

5.2.3 Application model

Because of the in situ execution, simulations and analyses are executed iteratively, and iterations exhibit consistent behavior [J31]. Specifically, the simulation executes certain compute-intensive tasks each iteration and then generates output data. The analysis, in every iteration, subscribes to the data produced by the simulation and performs several specific computations. Since the complexity of the computation and the amount of data are identical across iterations, the time to execute one iteration is approximately the same across iterations.
Based on this reasoning, and by assuming that the simulations and analyses have the same number of iterations (n_steps), we consider only the execution time of a single iteration, as the entire execution time can be derived by multiplying the time taken by a single iteration by n_steps [J31].

Computation model. Let us now denote by t(x) the time to execute one iteration of the iterative application x (x can be either a simulation or an analysis). For readability, we pay particular attention to the class of perfectly parallel applications, even though the approach can be generalized to other models, such as Amdahl's Law [7].

Assumption 2. Every simulation and analysis follows a perfectly parallel speedup model [7]. In other words, a job x that runs in time t_x(1) on one core will run in time t_x(1)/c when using c cores.

Thus, we express the time to execute one iteration of the simulation as follows:

t(S_i) = t_{S_i}(1) / (n_{S_i} c_{S_i}),   (5.1)

where S_i uses n_{S_i} nodes with c_{S_i} cores per node.

Communication model. In in situ processing, data is usually stored on node-local storage [P30] or in local memory [59] to reduce the overhead associated with data movement. In this work, we process data residing in memory to allow near real-time data analysis. Data communications are modeled as follows: (i) S_i writes its results into the local memory of the nodes it runs on, and (ii) each analysis coupled with S_i reads from that memory the data it needs, locally or remotely depending on whether A_j is co-scheduled with S_i or not.

Assumption 3. Intra-node communications (i.e., shared memory) are considered negligible compared to inter-node communications (i.e., distributed memory).

In this model, we consider that writing to and reading from the local memory of a node (e.g., the RAM) is negligible; therefore the cost of communicating data between a simulation and an analysis co-scheduled together is approximately zero. However, communications going over the interconnect have to be modeled. For example, in Figure 5.1, A_3, A_4 and A_5 are not co-scheduled with the simulations they couple with, hence they incur the overhead of remote transfers. To model these inter-node communications, we adopt a classic linear bandwidth model [104] in which it takes S/B to communicate a message of size S, where B is the bandwidth. Let V_{A_j} be the amount of data received and processed by A_j. Remember that A_j runs on n_{A_j} compute nodes with c_{A_j} cores per node. Then, based on the linear bandwidth model, the I/O time of A_j can be expressed as V_{A_j} / (B n_{A_j}), where B is the maximum bandwidth of one compute node. Even though the bandwidth actually varies, as it is shared by multiple I/O operations during the execution, we use the maximum bandwidth of a compute node as an optimistic estimate.

Execution time. Finally, we combine the computation and communication models to obtain the actual time to execute one iteration:

t(A_j) = t_{A_j}(1) / (n_{A_j} c_{A_j})                        if A_j is co-scheduled with the simulation it couples with,
t(A_j) = t_{A_j}(1) / (n_{A_j} c_{A_j}) + V_{A_j} / (B n_{A_j})   otherwise,   (5.2)

while t(S_i) is simply given by Equation (5.1), as communications to local memory are free (cf. Assumption 3). Table 5.1 summarizes the notation introduced in this section.
Notation | Description
S_i, A_j | Sets of simulations and analyses (1 ≤ i, j ≤ n)
N, C | Total number of nodes and number of cores per node
B | Maximum bandwidth per node
p(S_i) | Set of analyses that couple data with simulation S_i
V_{A_j} | Amount of data received by analysis A_j
m(S_i) | Set of analyses co-scheduled with simulation S_i
⟨{S_i, A_j}⟩ | Co-scheduling allocation where S_i and A_j are co-scheduled
n_x, c_x | Number of nodes and cores per node assigned to x
t_x(1) | Time to execute one iteration of x on one core
t(x) | Actual time to execute one iteration of x on n_x nodes with c_x cores per node
P_NC | Set of analyses co-scheduled on analysis-only allocations
Table 5.1: Notations.

5.2.4 Makespan

We have defined the execution time of the simulation and of the analysis and, armed with the notion of co-scheduling mapping discussed above, we are ready to define the makespan of an ensemble of in situ workflows. Let n_steps denote the number of iterations; recall that we assume the simulations and analyses of an ensemble have the same number of iterations. The makespan can then be expressed as:

Makespan = max_{S_i ∈ S, A_j ∈ A} ( t(S_i), t(A_j) ) * n_steps.   (5.3)

We are now ready to define the scheduling problems tackled in this work, whose goal is to minimize this makespan.

5.3 Scheduling Problems

5.3.1 Problems

We consider two problems in this work. The first problem, Co-Sched, consists in finding a mapping that minimizes the makespan of an ensemble of workflows given a resource allocation, i.e. knowing, for each simulation and analysis in the workflow, how many nodes and cores it runs on. The second problem, Co-Alloc, is the reverse problem: given a mapping, compute an optimal resource allocation scheme.

Problem 1 (Co-Sched). Given the numbers of nodes and cores assigned to the simulations and analyses, find a co-scheduling mapping m that minimizes the makespan of the entire ensemble.

Problem 2 (Co-Alloc). Given a co-scheduling mapping m, compute the amount of resources n_{S_i}, c_{S_i} assigned to each simulation S_i ∈ S and n_{A_j}, c_{A_j} to each analysis A_j ∈ A such that the makespan of the entire ensemble is minimized.

We consider simulations and analyses to be moldable jobs [43] (i.e., the resources allocated to a job are determined before its execution starts). In addition, the results presented in this section first use rational amounts of resources, i.e. the numbers of nodes and cores are rational. Our approach is to design optimal but rational solutions and then adapt these results to a more realistic setup with integer resources, using the rounding heuristics discussed at the end of this section. Rounding heuristics are a well-known practice in resource scheduling [96, 9] for rounding to integer values the rational values assigned to integer variables, which in our case are the numbers of nodes and cores.

Ideal scenario. To find a co-scheduling mapping that minimizes the makespan for Co-Sched, we start with an ideal co-scheduling mapping, defined as a mapping where all analyses are co-scheduled with their coupled simulations. This scenario is called ideal because it implies that there are enough compute nodes to accommodate such a mapping. We show that the makespan is minimized under an ideal co-scheduling mapping by the following theorem:

Theorem 1. The makespan is minimized if, and only if, each analysis is co-scheduled with its coupled simulation.

Proof. The proof is given in Appendix A.1.
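To make the per-iteration model and the makespan of Equations (5.1)-(5.3) concrete, the following minimal Python sketch evaluates them for a single coupling; the application profile values (single-core times, data volume, bandwidth) are hypothetical and only illustrate how the quantities combine.

    def t_sim(t1, nodes, cores):
        return t1 / (nodes * cores)                        # Equation (5.1)

    def t_ana(t1, nodes, cores, volume, bandwidth, colocated):
        io = 0.0 if colocated else volume / (bandwidth * nodes)
        return t1 / (nodes * cores) + io                   # Equation (5.2)

    def makespan(step_times, n_steps):
        return max(step_times) * n_steps                   # Equation (5.3)

    # One simulation (2 nodes x 8 cores) coupled with one remote (not co-scheduled) analysis
    # (1 node x 4 cores) that reads 4 GB per iteration over a 10 GB/s per-node link.
    steps = [t_sim(3200, 2, 8),
             t_ana(800, 1, 4, volume=4e9, bandwidth=10e9, colocated=False)]
    print(makespan(steps, n_steps=100))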
Based on Theorem 1, we have a solution for Co-Sched, but under the strict assumption that we have enough resources to achieve such an ideal mapping (i.e., to co-schedule all analyses with their respective coupled simulations). However, due to resource constraints — for instance, insufficient memory to accommodate all the analyses together with their coupled simulations, or bandwidth congestion when multiple analyses perform I/O at the same time — that ideal co-scheduling mapping might be unachievable, and some analyses cannot be co-scheduled with their coupled simulations.

Constrained resources scenario. We therefore have to consider co-scheduling mappings in which some analyses are not co-scheduled with their coupled simulations. There are numerous such co-scheduling mappings to explore: each analysis that is not co-scheduled with its coupled simulation could potentially be co-scheduled with any other simulation or analysis. To reduce the number of co-scheduling mappings to consider, we show that the makespan is minimized if, and only if, the analyses that are not co-scheduled with their coupled simulations are co-scheduled inside analysis-only co-scheduling allocations, i.e. without the presence of any simulation.

Theorem 2. Given a set of analyses that are not co-scheduled with their coupled simulations, the makespan of the ensemble is minimized when these analyses are co-scheduled within analysis-only co-scheduling allocations.

Proof. The proof is given in Appendix A.2.

Note that, since analyses co-scheduled on analysis-only allocations have no data dependency among each other (they only communicate with their simulations), adjusting the co-scheduling placements among them has no impact on the makespan.

5.4 Resource Allocation

5.4.1 Rational Allocation

We now show how to compute the (rational) numbers of nodes and cores assigned to each simulation and analysis in the workflow ensemble, i.e. how to solve Co-Alloc. The idea is to allocate resources to co-scheduled applications such that the differences among the execution times of the co-scheduling allocations are minimized, which leads to a minimal makespan. The intuition is that if one allocation has a smaller execution time than another, we can take resources (thanks to rational numbers of nodes and cores) from the faster allocation to accelerate the slower one, until all allocations finish at approximately the same time, hence improving the overall makespan.

Analysis-only allocations. Based on the above reasoning, we first compute the resource allocation for the analysis-only co-scheduling allocations (cf. Theorem 3). Let P_NC denote the set of analyses that are not co-scheduled with their coupled simulations. Based on Theorem 2, we have to find a resource allocation for the co-scheduling mapping in which the analyses in P_NC are co-scheduled on analysis-only co-scheduling allocations. Assume that the analyses in P_NC are distributed among L analysis-only co-scheduling allocations, each allocation being a non-intersecting subset of analyses denoted P_NC_i, where i ∈ {1, ..., L}. Given a set X of jobs (simulations or analyses), for ease of presentation we define the following quantities:
• Q(X) = Σ_{x∈X} t_x(1), the time to execute one iteration of X sequentially, i.e. on a single core;
• U(X) = Σ_{x∈X} c_x V_x, the communication cost of processing data remotely for X.
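For reference, these two quantities are straightforward to compute from the application profiles, as in the short sketch below; the profile values are hypothetical, and note that in the actual allocation procedure the c_x of the analyses in P_NC are themselves unknowns, so U(P_NC_i) is obtained by solving Equation (5.6) below rather than evaluated directly.

    def Q(jobs):
        """Sequential time of one iteration of a set of jobs (sum of single-core times t_x(1))."""
        return sum(job["t1"] for job in jobs)

    def U(jobs):
        """Remote-processing communication cost: sum of c_x * V_x over the jobs."""
        return sum(job["cores"] * job["volume"] for job in jobs)

    # Hypothetical profiles for two analyses placed on the same analysis-only allocation.
    p_nc_i = [{"t1": 800, "cores": 4, "volume": 4e9},
              {"t1": 1200, "cores": 8, "volume": 4e9}]
    print(Q(p_nc_i), U(p_nc_i))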
Following the aforementioned intuition, to minimize the makespan, the optimal resources for each allocation are shown to be proportional to the computational needs (Q) and communication costs (U) of the allocation, as follows.

Theorem 3. To minimize the makespan, the numbers of nodes and cores assigned to each analysis in an analysis-only co-scheduling allocation ⟨P_NC_i⟩ are:

n_{⟨P_NC_i⟩} = [ B Q(P_NC_i) + U(P_NC_i) ] / [ B Q(S ∪ A) + U(P_NC) ] * N,   ∀ i ∈ {1, ..., L}   (5.4)

c_{A_k} = B Q(A_k) / [ B Q(P_NC_i) + U(P_NC_i) − C V_{A_k} ] * C,   ∀ A_k ∈ P_NC_i   (5.5)

Proof. The proof is given in Appendix A.3.

The remaining step is to find U(P_NC_i). To avoid underutilized resources, for each analysis-only co-scheduling allocation containing P_NC_i we require Σ_{A_k ∈ P_NC_i} c_{A_k} = C. Substituting Equation (5.5) into this equality yields:

Σ_{A_k ∈ P_NC_i}  Q(A_k) / [ B Q(P_NC_i) + U(P_NC_i) − C V_{A_k} ]  =  1/B.   (5.6)

Because the left-hand side of Equation (5.6) is strictly decreasing in U(P_NC_i), Equation (5.6) has a single real root U*(P_NC_i). In the experiments, we use SymPy, a Python package for symbolic computing, to solve it numerically. Substituting U*(P_NC_i) into Equations (5.4) and (5.5) determines the resource assignment for the analysis-only co-scheduling allocations.

Simulation-based allocations. In a second step, we find a resource allocation for the remaining co-scheduling allocations, in which every simulation is co-scheduled with a subset of the analyses it couples with. Specifically, we distribute the remaining N − Ñ nodes, where Ñ = Σ_{i=1..L} n_{⟨P_NC_i⟩} (with n_{⟨P_NC_i⟩} computed from Theorem 3), among the simulations S_i ∈ S and the analyses A_j ∈ A \ P_NC. Recall that m(S_i) is the set of analyses co-scheduled with S_i.

Theorem 4. To minimize the makespan, the numbers of nodes and cores assigned to the simulation and analyses co-scheduled on allocation ⟨S_i ∪ m(S_i)⟩ are:

n_{S_i} = Q(S_i ∪ m(S_i)) / Q(S ∪ A \ P_NC) * (N − Ñ)   (5.7)
c_{S_i} = Q(S_i) / Q(S_i ∪ m(S_i)) * C   (5.8)
c_{A_j} = Q(A_j) / Q(S_i ∪ m(S_i)) * C,   ∀ A_j ∈ m(S_i)   (5.9)

Proof. The proof is given in Appendix A.4.

5.4.2 Integer Allocation

We now have a solution to Co-Alloc with rational amounts of resources; a more practical solution, however, must use integer numbers of cores and nodes. We therefore relax the rational solution to an integer one by applying a resource-preserving rounding heuristic. Two levels of rounding must be handled: node-level rounding and core-level rounding. Node-level rounding rounds the number of nodes assigned to each co-scheduling allocation, while core-level rounding rounds the number of cores per node assigned to each application co-scheduled within the same allocation. The objective is to keep the makespan as small as possible after rounding. These rounding problems are all instances of sum-preserving rounding, since the sum of nodes (or of cores per node) must be the same after rounding. Formally, assume we want to round each x_i to an integer x_i^I such that Σ x_i^I = Σ x_i = s. For every x_i, one way to round is to determine δ(x_i) ∈ {0, 1} such that x_i^I = ⌊x_i⌋ + δ(x_i); hence Σ δ(x_i) = s − Σ ⌊x_i⌋. The idea is to pick s − Σ ⌊x_i⌋ of the x_i to round up, and to round the others down, so that their sum is preserved. The remaining task is to decide which jobs' numbers of nodes or cores to round up, based on the aforementioned objective of minimizing the makespan, i.e., minimizing the differences between the new
times to execute one iteration after rounding (see Equation (5.3)). From Equations (5.1) and (5.2), the time to execute one iteration is proportional to the single-core execution time and inversely proportional to the assigned resources (i.e., number of nodes and number of cores per node); we therefore use a rounding heuristic based on single-core execution time. Specifically, at core level, among the applications x of a co-scheduling allocation, we prioritize rounding up c_x for the applications with the largest t_x(1). At node level, among the co-scheduling allocations X, we prioritize rounding up n_{⟨X⟩} for the allocations with the largest Q(X). The node-level procedure following this heuristic is described in Algorithm 1.

Algorithm 1: Rounding algorithm at node level
Input: N, Ñ, n_{S_i}, Q(S_i ∪ m(S_i))
Output: n^I_{S_i}
 1:  k ← N − Ñ
 2:  dict ← {}
 3:  for S_i ∈ S do
 4:      k ← k − ⌊n_{S_i}⌋
 5:      dict ← dict ∪ {S_i : Q(S_i ∪ m(S_i))}
 6:  end for
 7:  sorted_dict ← sortByValue(dict)    // allocations ordered so that the largest Q(S_i ∪ m(S_i)) are rounded up first
 8:  j ← 1
 9:  for S_i ∈ sorted_dict do
10:      if j > k then
11:          n^I_{S_i} ← ⌊n_{S_i}⌋
12:      else
13:          n^I_{S_i} ← ⌈n_{S_i}⌉
14:      end if
15:      j ← j + 1
16:  end for
17:  return n^I_{S_i}

5.5 Evaluation

In this section, we evaluate our proposed co-scheduling strategies developed to solve Co-Sched and show the effectiveness of the resource allocations computed by the approach developed to solve Co-Alloc.

Simulator. We implement a simulator based on WRENCH [18] and SimGrid [19] to accurately emulate (thanks to the I/O and network models validated by SimGrid) the execution of simulations and in situ analyses in a workflow ensemble (see Figure 5.2). To emulate the behavior of simulations and analyses, every simulation and analysis step is divided into two fine-grained stages: a computational stage that performs the simulation or analysis work, and an I/O stage that publishes data to, or subscribes data from, the coupled application. The executions of these fine-grained stages depend tightly on each other, following a validated in situ execution model [J31], to reflect the coupling behavior between the simulation and the in situ analyses. These task dependencies serve as input workloads for the simulator. Since a job in WRENCH is simulated on a single node, we emulate multi-node jobs by replicating the stages of a job on multiple nodes and extrapolating the task dependencies so that the execution order prescribed by the in situ model is satisfied. The simulator is open source and available online (https://github.com/Analytics4MD/insitu-ensemble-simulator).

Figure 5.2: The simulation setup used for the experiments.

5.5.1 Experimental Setup

Setup. The simulator takes the following inputs: (1) a workflow ensemble configuration (number of simulations, number of coupled analyses per simulation, coupling relations), (2) application profiles (the simulations' and analyses' sequential execution times), (3) resource constraints (number of nodes, number of cores per node, processor speed, network bandwidth), (4) a co-scheduling mapping (which applications are co-scheduled together) derived from Co-Sched, and (5) a resource allocation (the number of nodes and cores assigned to each application) derived from Co-Alloc. We vary the workflow ensemble configuration to evaluate the impact of different configurations on the efficiency of our proposed scheduling choices. We fix the simulations' profiles, e.g. their sequential execution times, to study the impact of the analyses on the scheduling decisions.
The sequential execution time of each analysis is generated randomly, relative to the sequential execution time of its coupled simulation, within the range 50% to 150%. The amount of data processed by each analysis is kept fixed across all simulations, even though that amount varies across workflow ensemble configurations.

Scenarios. We compare a variety of scenarios corresponding to different co-scheduling mappings (see Table 5.3). Ideal is the co-scheduling mapping where all analyses are co-scheduled with their respective simulations. In-transit represents the scenario where all analyses are co-scheduled together on an analysis-only co-scheduling allocation. We also compose hybrid scenarios where some analyses are co-scheduled with the simulation they couple with while the others are co-scheduled on an allocation without any simulation. For Increasing-x% (resp. Decreasing-x%), we pick the largest x% of the analyses, sorted in ascending (resp. descending) order of sequential execution time, and do not co-schedule them with the simulation they couple with (i.e. we place them on the analysis-only co-scheduling allocation). In this evaluation, we consider three percentages of the number of analyses: 25%, 50% and 75%. For each of these co-scheduling mappings, we compute the resource allocation of the ensemble using Co-Alloc, as described in Section 5.4.

Bandwidth calibration. For the simulator to accurately reflect the execution platform, we have to calibrate the bandwidth per node used in our model, so as to improve the precision of our Co-Alloc solution. In our model, the bandwidth model is simple and optimistic, as it does not account for concurrent accesses (i.e., the bandwidth available to each application is the maximum bandwidth B). The idea is to compare the makespan given by our performance model (i.e., Theorem 3) to the makespan given by our simulation, as WRENCH/SimGrid offers fairly complex and accurate communication models. In Figure 5.3, the theoretical makespan estimated by our model using the maximum bandwidth B exhibits a large difference from the makespan produced by the simulator (up to 30 times). The simulated bandwidth is smaller than B during the execution, as it is shared among concurrent I/O operations. Therefore, we explore different bandwidth values (see Table 5.2) and re-estimate the makespan with our model, based on the resource allocation computed at maximum bandwidth.

Model | Bandwidth
Baseline | B
B'_1 | B / |P_NC|
B'_2 | B / Ñ
B'_3 | B / (Ñ |P_NC|)
Table 5.2: Bandwidth models used to calibrate the simulator. Ñ is the number of nodes assigned to the analyses in P_NC.

[Figure 5.3: makespan normalized to the simulator versus the number of simulations (1 to 32), comparing the simulator with the model under the bandwidth values B, B'_1, B'_2, and B'_3.]
Figure 5.3: Makespan estimated by the model for the Increasing-50% scenario with different values of the bandwidth per node (see Table 5.2).

Figure 5.3 confirms our hypothesis that the bandwidth is shared among concurrent I/Os and that the accuracy of Co-Alloc depends on choosing an appropriate bandwidth in the model. B'_3 provides an approximation
Table 5.3: Experimental scenarios, where A(x%) denotes the x% of analyses in A with the largest sequential execution times.
Scenario | Analyses co-scheduled with their coupled simulation (P_C) | Analyses not co-scheduled with their coupled simulation (P_NC)
Ideal | A | ∅
In-transit | ∅ | A
Increasing-x% | A \ A(x%) | A(x%)
Decreasing-x% | A(100−x%) | A \ A(100−x%)

that is close to the makespan given by the simulator; we therefore choose B'_3 as the calibrated bandwidth for the experiments hereafter.

5.5.2 Results

In this subsection, we explore the configuration space of the workflow ensemble. We aim to characterize the impact of the different co-scheduling strategies by comparing the makespan given by the simulator over the scenarios described in Table 5.3. To ensure the reliability of the results, each value is averaged over 5 trials.

Implications of Co-Sched. We vary one parameter of the workflow ensemble configuration at a time while keeping the others fixed. Figure 5.4a shows the makespan of the workflow ensemble running on 16 nodes with 4 simulations and 4 analyses per simulation when varying the size of the data processed by each analysis at each iteration. Figure 5.4b shows the results with 4 simulations, 4 analyses per simulation and 4 GB of data when varying the number of nodes. In Figure 5.4c, the number of analyses per simulation is varied while keeping the number of simulations fixed at 4, with 4 GB of data, on 16 nodes. In Figure 5.4d, the number of simulations is varied while keeping the number of analyses per simulation fixed at 4, with 4 GB of data and 64 nodes.

Figure 5.4 shows the following order when the scenarios are sorted by increasing makespan: Ideal, x-25%, x-50%, x-75%, In-transit, where x is either Increasing or Decreasing, as these two scenarios' makespans are approximately the same. Note that for Ideal, 0% of the analyses are not co-scheduled with their coupled simulation, while that percentage is 100% for In-transit. This observation aligns with our finding in Theorem 1 that not co-scheduling an analysis with its coupled simulation yields a higher makespan: the more analyses are not co-scheduled with their coupled simulation, the larger the makespan. Since Ideal outperforms the other scenarios, ideal co-scheduling mappings should be favored whenever the available resources can sustain them. In Figures 5.4a, 5.4c and 5.4d, the makespan scales linearly with the size of the data processed by the analyses per iteration, the number of analyses per simulation, and the number of simulations, respectively, with Ideal showing the smallest growth as these parameters increase. In Figure 5.4b, only the makespan of Ideal decreases when the number of nodes increases. This is because the more nodes are used, the more communications are required to exchange the data, and the communication bandwidth is shared among them: the communication stages of each step become slower even though the computational stages become faster thanks to the additional computing resources. Only Ideal, which incurs no remote communication, can fully take advantage of the available resources.

Efficiency of Co-Alloc. We compare the resource allocation computed by Co-Alloc with a naive approach that distributes the resources evenly, referred to as Ev-Alloc. Specifically, at node level, Ev-Alloc divides the number of nodes equally among the co-scheduling allocations, while at core level, within each co-scheduling allocation, it divides the number of cores per node equally among the applications co-scheduled in that allocation.
Let (n: x, c: y) denote a manner of computing resource allocations in which method x is applied at the node level and method y at the core level. We study four possible combinations: (n: Co-Alloc, c: Co-Alloc), (n: Co-Alloc, c: Ev-Alloc), (n: Ev-Alloc, c: Co-Alloc), and (n: Ev-Alloc, c: Ev-Alloc). Even though the experiment is conducted over all the scenarios in the configuration space, due to lack of space we selectively present a subset of scenarios in which the allocations differ significantly between the combinations under observation. Figure 5.5 shows the makespan, normalized to (n: Co-Alloc, c: Co-Alloc), in two representative extreme scenarios, Ideal and In-transit, when the amount of data processed each iteration is varied. In the Ideal scenario, (n: Co-Alloc, c: Co-Alloc) surpasses the other combinations, except (n: Ev-Alloc, c: Co-Alloc). This is because Ideal involves no remote communication, so the execution time of each application's step is dominated by the computational stages, which makes performance more sensitive to changes at the core level than at the node level. For In-transit, using Ev-Alloc at the node level results in a slower makespan (up to two times higher) than using Co-Alloc, as off-node communications dominate in this scenario. For larger data sizes, the I/O stages account for a large proportion of each step in In-transit, which makes the computing time negligible; thus, there is no clear difference among the discussed combinations.

5.6 Conclusion

In this chapter, we have introduced an execution model of the coupling behavior between in situ jobs in a complex workflow ensemble. Based on the proposed model, we theoretically characterized the computation and communication characteristics of each in situ component. We further relied on this model to determine co-scheduling policies and resource profiles for each simulation and in situ analysis such that the makespan of the workflow ensemble is minimized under the given available computing resources. By evaluating the scheduling solutions via simulation, we validated our proposed approach and confirmed its correctness. We were also able to confirm the relevance of data locality, as well as the need for well-managed shared resources (e.g., bandwidth per node) when co-scheduling concurrent applications together. For future work, we plan to extend the scheduling model to include contention and interference effects between in situ jobs sharing the same underlying resources. Another promising research direction is to consider more complex use cases leveraging ensembles, e.g., adaptive sampling [10], in which more coordination will be needed to achieve better performance as the simulations are periodically stopped and restarted.

Figure 5.4: Makespan of the workflow ensemble when varying various configurations. Each subfigure reports the makespan (in seconds) and the makespan normalized to Ideal for the scenarios of Table 5.3. (a) The data size (1–32 GB) is varied; the workflow ensemble runs on 16 nodes and has 4 simulations with 4 analyses per simulation. (b) The number of compute nodes (8–128) is varied; the workflow ensemble has 4 simulations with 4 analyses per simulation, and each analysis processes 4 GB of data each iteration. (c) The number of analyses per simulation (1–32) is varied; the workflow ensemble runs on 16 nodes and has 4 simulations, and each analysis processes 4 GB of data each iteration. (d) The number of simulations (1–32) is varied; the workflow ensemble runs on 64 nodes and has 4 analyses per simulation, and each analysis processes 4 GB of data each iteration.
Figure 5.5: Comparison of the normalized makespan of Co-Alloc to Ev-Alloc, where resources are evenly distributed, for the Ideal and In-transit scenarios as the data size (1–32 GB) processed each iteration is varied. The makespan is normalized to (n: Co-Alloc, c: Co-Alloc); (n: Co-Alloc, c: Ev-Alloc) indicates that Co-Alloc is applied at the node level while Ev-Alloc is applied at the core level.

Summary

With the continuous growth of computational power offered by present-day supercomputers and anticipated in next-generation computer systems, scientific workflows are becoming larger in terms of the number of tasks and more complex in terms of execution patterns. The emergence of in situ workflows and workflow ensembles is one of the advancements in workflows, driven by the demand not only to compute more but also to analyze data at runtime in order to obtain near-real-time feedback. However, most of the current effort in optimizing the execution of in situ workflows is done on a case-by-case basis with a focus on application-specific use cases and therefore lacks generalization. In this thesis, we studied novel approaches to optimize the execution of in situ workflows from two perspectives: performance evaluation and resource scheduling. From the point of view of performance evaluation, we provided a capability to determine the efficiency of an in situ workflow based on quantifying the idle time during its execution. The intuition is to maximize resource utilization by minimizing the time spent sitting idle. Moreover, the notion of computational efficiency is defined and proposed as a standard metric for evaluating the performance of in situ workflows. From the point of view of resource management, we proposed a method to schedule resources and placements for an in situ workflow such that the idle time within the workflow is minimized, thereby minimizing the makespan of the workflow. In the end, optimizing the execution of those workflows offers a better chance to maximize scientific discovery while leveraging the available computing resources in an optimal way. We further studied the execution of in situ workflows in the context of workflow ensembles, which consist of many more concurrent components and complex data dependencies and therefore require sophisticated co-scheduling and resource management algorithms. From a broader point of view, the solutions proposed in this thesis are not restricted to in situ workflows, but can also be generalized to other types of modern workflows whose tasks exhibit tightly coupled data and iterative behavior.
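The idle-time intuition stated above can be illustrated with a minimal sketch that treats efficiency as the fraction of accumulated core time spent computing rather than sitting idle. This is only an illustration of the idea with hypothetical numbers; the precise definition of computational efficiency used in this thesis is the one given in the corresponding chapter.

```python
# Illustrative only: efficiency as the share of accumulated core-time spent computing
# rather than idling (all values below are hypothetical).

def computational_efficiency(compute_times, idle_times):
    """compute_times / idle_times: per-component totals in core-seconds."""
    busy = sum(compute_times)
    idle = sum(idle_times)
    return busy / (busy + idle)

if __name__ == "__main__":
    # e.g., one simulation and two analyses over a run (made-up values)
    eff = computational_efficiency(compute_times=[900.0, 300.0, 250.0],
                                   idle_times=[50.0, 200.0, 260.0])
    print(f"efficiency = {eff:.2f}")   # ~0.74
```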
One of the important lessons learned while conducting research on optimizing the execution of in situ workflows is that data locality matters greatly, because the tasks in such workflows are tightly coupled through data communication and staging. Therefore, resource and job management solutions for in situ workflows should prioritize co-scheduling components that have tight data relationships, i.e., placing the analyses with the simulation whose generated data they analyze.

As future directions, we plan to expand the scheduling problems in Chapter 5 to take into account several objectives for optimizing the execution of workflow ensembles. In addition to minimizing makespan, another goal is the ability to run as many simulations and analyses as possible under given resource constraints, so as to maximize the insights and discoveries obtained. We can optimize the execution of in situ ensembles toward that aim so that the throughput, i.e., the number of simulations and analyses running at a time, is maximized. Moreover, since the main idea of running simulations in an ensemble is to increase the probability of observing rare events, we can further consider science-oriented objectives, such as minimizing the time needed to detect conformational changes (protein folding) or to explore 95% of the configurational space [54]. On the co-scheduling side, another promising direction is to consider co-scheduling interference, e.g., cache interference, to model the execution of in situ jobs sharing the same resources, and then leverage cache-partitioning and bandwidth-partitioning technologies [89] to reduce that interference. Even though modeling interference is challenging, there have been attempts [72, 62] to estimate cross-application interference. It would be interesting to validate our co-scheduling solutions for in situ workflows in a context where interference is integrated into the execution model.

From the system architecture perspective, as the thesis currently focuses only on many-core architectures, we may also examine the execution of in situ workflows on computer systems equipped with heterogeneous CPU-GPU processors and deep memory hierarchies that use different types of storage technologies, such as non-volatile, solid-state, and flash memory [50].

Bibliography

[1] Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C. Smith, Berk Hess, and Erik Lindahl. “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers”. In: SoftwareX 1-2 (2015), pp. 19–25.

[2] J. Paul Abston, Ryan M. Adamson, Scott Atchley, Ashley D. Barker, David E. Bernholdt, Katie L. Bethea, Kevin S. Bivens, Arthur S. Bland, Christopher Brumgard, Reuben Budiardja, Adam Carlyle, Christopher Fuson, Rachel M. Harken, Jason J. Hill, Judith C. Hill, Jonathan D. Hines, Katie Jones, Jason C. Kincl, Graham Lopez, Stefan A. Maerz, George Markomanolis, Don Maxwell, Veronica Melesse Vergara, Bronson Messer II, Ross Miller, Jack Morrison, Sarp Oral, William Renaud, Andrea Schneibel, Mallikarjun Shankar, Woong Shin, Preston Shires, Hyogi Sim, Scott Simmerman, James A. Simmons, Suhas Somnath, Kevin G. Thach, Georgia Tourassi, Coury Turczyn, Sudharshan S. Vazhkudai, Feiyi Wang, Bing Xie, and Christopher Zimmer. “US Department of Energy, Office of Science High Performance Computing Facility Operational Assessment 2019 Oak Ridge Leadership Computing Facility”. In: (Apr. 2020).

[3] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R.
Tallent. “HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs Http://Hpctoolkit.Org”. In: Concurr. Comput.: Pract. Exper. 22.6 (Apr. 2010), pp. 685–701. [4] Anthony Agelastos, Benjamin Allan, Jim Brandt, Paul Cassella, Jeremy Enos, Joshi Fullop, Ann Gentile, Steve Monk, Nichamon Naksinehaboon, Je Ogden, Mahesh Rajan, Michael Showerman, Joel Stevenson, Narate Taerat, and Tom Tucker. “The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’14. New Orleans, Louisana: IEEE Press, 2014, pp. 154–165. [5] Dong H. Ahn, Ned Bass, Albert Chu, Jim Garlick, Mark Grondona, Stephen Herbein, Helgi I. Ingólfsson, Joseph Koning, Tapasya Patki, Thomas R. W. Scogland, Becky Springmeyer, and Michela Taufer. “Flux: Overcoming scheduling challenges for exascale workows”. In: Future Generation Computer Systems 110 (2020), pp. 202–213. 117 [6] Nurunisa Akyuz, Elka R. Georgieva, Zhou Zhou, Sebastian Stolzenberg, Michel A. Cuendet, George Khelashvili, Roger B. Altman, Daniel S. Terry, Jack H. Freed, Harel Weinstein, Olga Boudker, and Scott C. Blanchard. “Transport domain unlocking sets the uptake rate of an aspartate transporter”. In: Nature 518.7537 (2015), pp. 68–73. [7] Gene M. Amdahl. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities”. In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS ’67 (Spring). Atlantic City, New Jersey: Association for Computing Machinery, 1967, pp. 483–485. [8] Guillaume Aupy, Anne Benoit, Brice Goglin, Loïc Pottier, and Yves Robert. “Co-scheduling HPC workloads on cache-partitioned CMP platforms”. In: The International Journal of High Performance Computing Applications 33.6 (June 2019), pp. 1221–1239. [9] Guillaume Aupy, Brice Goglin, Valentin Honoré, and Bruno Ran. “Modeling high-throughput applications for in situ analytics”. In: The International Journal of High Performance Computing Applications 33.6 (June 2019), pp. 1185–1200. [10] Vivek Balasubramanian, Travis Jensen, Matteo Turilli, Peter Kasson, Michael Shirts, and Shantenu Jha. “Adaptive Ensemble Biomolecular Applications at Scale”. In: SN Computer Science 1.2 (2020), p. 104. [11] Alessandro Barducci, Massimiliano Bonomi, and Michele Parrinello. “Metadynamics”. In: WIREs Computational Molecular Science 1.5 (2011), pp. 826–843. [12] A. C. Bauer, H. Abbasi, J. Ahrens, H. Childs, B. Geveci, S. Klasky, K. Moreland, P. O’Leary, V. Vishwanath, B. Whitlock, and E. W. Bethel. “In Situ Methods, Infrastructures, and Applications on High Performance Computing Platforms”. In: Computer Graphics Forum 35.3 (2016), pp. 577–597. [13] Robert B. Best, Xiao Zhu, Jihyun Shim, Pedro E. M. Lopes, Jeetain Mittal, Michael Feig, and Alexander D. MacKerell. “Optimization of the Additive CHARMM All-Atom Protein Force Field Targeting Improved Sampling of the Backboneφ,ψand Side-Chainχ 1 andχ 2 Dihedral Angles”. In: Journal of Chemical Theory and Computation 8.9 (Sept. 2012), pp. 3257–3273. [14] Pär Bjelkmar, Per Larsson, Michel A. Cuendet, Berk Hess, and Erik Lindahl. “Implementation of the CHARMM Force Field in GROMACS: Analysis of Protein Stability Eects from Correction Maps, Virtual Interaction Sites, and Water Models”. In: Journal of Chemical Theory and Computation 6.2 (Feb. 2010), pp. 459–466. [15] Olga Boudker, Renae M. 
Ryan, Dinesh Yernool, Keiko Shimamoto, and Eric Gouaux. “Coupling substrate and ion binding to extracellular gate of a sodium-dependent aspartate transporter”. In: Nature 445.7126 (2007), pp. 387–393. [16] Jens Breitbart, Simon Pickartz, Stefan Lankes, Josef Weidendorfer, and Antonello Monti. “Dynamic Co-Scheduling Driven by Main Memory Bandwidth Utilization”. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER). Sept. 2017, pp. 400–409. 118 [17] B R Brooks, 3rd Brooks C L, Jr Mackerell A D, L Nilsson, R J Petrella, B Roux, Y Won, G Archontis, C Bartels, S Boresch, A Caisch, L Caves, Q Cui, A R Dinner, M Feig, S Fischer, J Gao, M Hodoscek, W Im, K Kuczera, T Lazaridis, J Ma, V Ovchinnikov, E Paci, R W Pastor, C B Post, J Z Pu, M Schaefer, B Tidor, R M Venable, H L Woodcock, X Wu, W Yang, D M York, and M Karplus. “CHARMM: the biomolecular simulation program”. In: Journal of computational chemistry 30.10 (July 2009), pp. 1545–1614. [18] Henri Casanova, Rafael Ferreira da Silva, Ryan Tanaka, Suraj Pandey, Gautam Jethwani, William Koch, Spencer Albrecht, James Oeth, and Frédéric Suter. “Developing Accurate and Scalable Simulators of Production Workow Management Systems with WRENCH”. In: Future Generation Computer Systems 112 (2020), pp. 162–175. [19] Henri Casanova, Arnaud Giersch, Arnaud Legrand, Martin Quinson, and Frédéric Suter. “Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms”. In: Journal of Parallel and Distributed Computing 74.10 (June 2014), pp. 2899–2917. [20] Riccardo Chelli and Giorgio F. Signorini. “Serial Generalized Ensemble Simulations of Biomolecules with Self-Consistent Determination of Weights”. In: Journal of Chemical Theory and Computation 8.3 (Mar. 2012), pp. 830–842. [21] Jong Youl Choi, Choong-Seock Chang, Julien Dominski, Scott Klasky, Gabriele Merlo, Eric Suchyta, Mark Ainsworth, Bryce Allen, Franck Cappello, Michael Churchill, Philip Davis, Sheng Di, Greg Eisenhauer, Stephane Ethier, Ian Foster, Berk Geveci, Hanqi Guo, Kevin Huck, Frank Jenko, Mark Kim, James Kress, Seung-Hoe Ku, Qing Liu, Jeremy Logan, Allen Malony, Kshitij Mehta, Kenneth Moreland, Todd Munson, Manish Parashar, Tom Peterka, Norbert Podhorszki, Dave Pugmire, Ozan Tugluk, Ruonan Wang, Ben Whitney, Matthew Wolf, and Chad Wood. “Coupling Exascale Multiphysics Applications: Methods and Lessons Learned”. In: 2018 IEEE 14th International Conference on e-Science (e-Science). Oct. 2018, pp. 442–452. [22] E. Chow, J. L. Klepeis, C. A. Rendleman, R. O. Dror, D. E. Shaw, and Edward H. Egelman. “9.6 New Technologies for Molecular Dynamics Simulations”. In: Comprehensive Biophysics. Amsterdam: Elsevier, 2012, pp. 86–104. [23] Jerey Comer, James C. Phillips, Klaus Schulten, and Christophe Chipot. “Multiple-Replica Strategies for Free-Energy Calculations in NAMD: Multiple-Walker Adaptive Biasing Force and Walker Selection Rules”. In: Journal of Chemical Theory and Computation 10.12 (Dec. 2014), pp. 5276–5285. [24] Daniel Dauwe, Ryan Friese, Sudeep Pasricha, Anthony Maciejewski, Gregory Koenig, and Howard Siegel. “Modeling the Eects on Power and Performance from Memory Interference of Co-located Applications in Multicore Systems”. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications. WorldComp, July 2014. [25] Jai Dayal, Drew Bratcher, Greg Eisenhauer, Karsten Schwan, Matthew Wolf, Xuechen Zhang, Hasan Abbasi, Scott Klasky, and Norbert Podhorszki. 
“Flexpath: Type-Based Publish/Subscribe System for Large-Scale Science Analytics”. In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. May 2014, pp. 246–255. 119 [26] Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, and Kent Wenger. “Pegasus, a Workow Management System for Science Automation”. In: Future Gener. Comput. Syst. 46.C (May 2015), pp. 17–35. [27] Peter J Denning. “Is computer science science?” In: Communications of the ACM 48.4 (2005), pp. 27–31. [28] Luiz DeRose, Bill Homer, Dean Johnson, Steve Kaufmann, and Heidi Poxon. “Cray Performance Analysis Tools”. In: Tools for High Performance Computing. Ed. by Michael Resch, Rainer Keller, Valentin Himmler, Bettina Krammer, and Alexander Schulz. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 191–199. [29] Estelle Dirand, Laurent Colombet, and Bruno Ran. “TINS: A Task-Based Dynamic Helper Core Strategy for In Situ Analytics”. In: Supercomputing Frontiers. Ed. by Rio Yokota and Weigang Wu. Cham: Springer International Publishing, 2018, pp. 159–178. [30] Tu Mai Anh Do, Ming Jiang, Brian Gallagher, Albert Chu, Cyrus Harrison, Karan Vahi, and Ewa Deelman. “Enabling Data Analytics Workows using Node-Local Storage”. In: International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX. 2018. [31] Tu Mai Anh Do, Loïc Pottier, Silvina Caíno-Lores, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. “A lightweight method for evaluating in situ workow eciency”. In: Journal of Computational Science 48 (2021), p. 101259. [32] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Assessing Resource Provisioning and Allocation of Ensembles of In Situ Workows”. In: 50th International Conference on Parallel Processing Workshop. ICPP Workshops ’21. Lemont, IL, USA: Association for Computing Machinery, 2021. [33] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Performance assessment of ensembles of in situ workows under resource constraints”. In: Concurrency and Computation: Practice and Experience (June 2022), e7111. [34] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Frédéric Suter, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Co-scheduling Ensembles of In Situ Workows”. In: 17th Workshop on Workows in Support of Large-Scale Science (WORKS’22). IEEE. 2022, To appear. [35] Tu Mai Anh Do, Loïc Pottier, Stephen Thomas, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. “A Novel Metric to Evaluate In Situ Workows”. In: Computational Science –ICCS 2020. Cham: Springer International Publishing, 2020, pp. 538–553. [36] Tu Mai Anh Do, Loïc Pottier, Orcun Yildiz, Karan Vahi, Patrycja Krawczuk, Tom Peterka, and Ewa Deelman. “Accelerating Scientic Workows on HPC Platforms with In Situ Processing”. In: 2022 IEEE 22nd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE. 2022, pp. 1–10. 120 [37] Ciprian Docan, Manish Parashar, and Scott Klasky. “DataSpaces: an interaction and coordination framework for coupled simulation workows”. In: Cluster Computing 15.2 (2012), pp. 163–181. 
[38] Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jerey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. “The International Exascale Software Project Roadmap”. In: Int. J. High Perform. Comput. Appl. 25.1 (Feb. 2011), pp. 3–60. [39] Jack Dongarra, Laura Grigori, and Nicholas J. Higham. “Numerical algorithms for high-performance computational science”. In: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378.2166 (2020), p. 20190066. eprint: https://royalsocietypublishing.org/doi/pdf/10.1098/rsta.2019.0066. [40] M. Dreher and T. Peterka. Decaf: Decoupled Dataows for In Situ High-Performance Workows. Tech. rep. United States: Argonne National Lab.(ANL), Argonne, IL (United States), 2017. [41] Peter Eastman, Jason Swails, John D. Chodera, Robert T. McGibbon, Yutong Zhao, Kyle A. Beauchamp, Lee-Ping Wang, Andrew C. Simmonett, Matthew P. Harrigan, Chaya D. Stern, Rafal P. Wiewiora, Bernard R. Brooks, and Vijay S. Pande. “OpenMM 7: Rapid development of high performance algorithms for molecular dynamics”. In: PLOS Computational Biology 13.7 (July 2017), pp. 1–17. [42] N. Fabian, K. Moreland, D. Thompson, A. C. Bauer, P. Marion, B. Gevecik, M. Rasquin, and K. E. Jansen. “The ParaView Coprocessing Library: A scalable, general purpose in situ visualization library”. In: 2011 IEEE Symposium on Large Data Analysis and Visualization. 2011, pp. 89–96. [43] Dror G Feitelson. “Job scheduling in multiprogrammed parallel systems”. In: IBM Research Report 19790 (1997). [44] Pradeep Fernando, Ada Gavrilovska, Sudarsun Kannan, and Greg Eisenhauer. “NVStream: Accelerating HPC Workows with NVRAM-based Transport for Streaming Objects”. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. HPDC ’18. Tempe, Arizona: ACM, 2018, pp. 231–242. [45] Rafael Ferreira da Silva, Scott Callaghan, and Ewa Deelman. “On the Use of Burst Buers for Accelerating Data-Intensive Scientic Workows”. In: 12th Workshop on Workows in Support of Large-Scale Science (WORKS’17). 2017. 121 [46] Rafael Ferreira da Silva, Scott Callaghan, Tu Mai Anh Do, George Papadimitriou, and Ewa Deelman. “Measuring the impact of burst buers on data-intensive scientic workows”. In: Future Generation Computer Systems 101 (2019), pp. 208–220. [47] Rafael Ferreira da Silva, Rosa Filgueira, Ilia Pietri, Ming Jiang, Rizos Sakellariou, and Ewa Deelman. “A characterization of workow management systems for extreme-scale applications”. In: Future Generation Computer Systems 75 (2017), pp. 228–238. 
[48] Ian Foster, Mark Ainsworth, Bryce Allen, Julie Bessac, Franck Cappello, Jong Youl Choi, Emil Constantinescu, Philip E. Davis, Sheng Di, Wendy Di, Hanqi Guo, Scott Klasky, Kerstin Kleese Van Dam, Tahsin Kurc, Qing Liu, Abid Malik, Kshitij Mehta, Klaus Mueller, Todd Munson, George Ostouchov, Manish Parashar, Tom Peterka, Line Pouchard, Dingwen Tao, Ozan Tugluk, Stefan Wild, Matthew Wolf, Justin M. Wozniak, Wei Xu, and Shinjae Yoo. “Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales”. In: Euro-Par 2017: Parallel Processing. Ed. by Francisco F. Rivera, Tomás F. Pena, and José C. Cabaleiro. Cham: Springer International Publishing, 2017, pp. 3–19. [49] Yuankun Fu, Feng Li, Fengguang Song, and Zizhong Chen. “Performance Analysis and Optimization of In-Situ Integration of Simulation with Data Analysis: Zipping Applications Up”. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. HPDC ’18. Tempe, Arizona: Association for Computing Machinery, 2018, pp. 192–205. [50] Marc Gamell, Ivan Rodero, Manish Parashar, and Stephen Poole. “Exploring energy and performance behaviors of data-intensive scientic workows on systems with deep memory hierarchies”. In: 20th Annual International Conference on High Performance Computing. Dec. 2013, pp. 226–235. [51] Devarshi Ghoshal and Lavanya Ramakrishnan. “MaDaTS: Managing Data on Tiered Storage for Scientic Workows”. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. HPDC ’17. Washington, DC, USA: Association for Computing Machinery, 2017, pp. 41–52. [52] Alan Heirich, Elliott Slaughter, Manolis Papadakis, Wonchan Lee, Tim Biedert, and Alex Aiken. “In Situ Visualization with Task-Based Parallelism”. In: Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ISAV’17. Denver, CO, USA: Association for Computing Machinery, 2017, pp. 17–21. [53] Adam Hospital, Josep Ramon Goñi, Modesto Orozco, and Josep L Gelpí. “Molecular dynamics simulations: advances and applications”. In: Advances and applications in bioinformatics and chemistry : AABC 8 (Nov. 2015), pp. 37–47. [54] Eugen Hruska, Jayvee R. Abella, Feliks Nüske, Lydia E. Kavraki, and Cecilia Clementi. “Quantitative comparison of adaptive sampling methods for protein dynamics”. In: The Journal of Chemical Physics 149.24 (2022/10/03 2018), p. 244119. 122 [55] D. Huang, Z. Qin, Q. Liu, N. Podhorszki, and S. Klasky. “A Comprehensive Study of In-Memory Computing on Large HPC Systems”. In: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). Nov. 2020, pp. 987–997. [56] Ramin Izadpanah, Nichamon Naksinehaboon, Jim Brandt, Ann Gentile, and Damian Dechev. “Integrating Low-Latency Analysis into HPC System Monitoring”. In: Proceedings of the 47th International Conference on Parallel Processing. ICPP 2018. Eugene, OR, USA: Association for Computing Machinery, 2018. [57] Travis Johnston, Boyu Zhang, Adam Liwo, Silvia Crivelli, and Michela Taufer. “In situ data analytics and indexing of protein trajectories”. In: Journal of Computational Chemistry 38.16 (2017), pp. 1419–1430. [58] Gideon Juve, Benjamin Tovar, Rafael Ferreira Da Silva, Dariusz Krol, Douglas Thain, Ewa Deelman, William Allcock, and Miron Livny. “Practical Resource Monitoring for Robust High Throughput Computing”. In: 2015 IEEE International Conference on Cluster Computing. Sept. 2015, pp. 650–657. 
[59] Hyojong Kim, Hyesoon Kim, Sudhakar Yalamanchili, and Arun F. Rodrigues. “Understanding Energy Aspects of Processing-near-Memory for HPC Workloads”. In: Proceedings of the 2015 International Symposium on Memory Systems. MEMSYS ’15. Washington DC, DC, USA: Association for Computing Machinery, 2015, pp. 276–282. [60] Patrycja Krawczuk, George Papadimitriou, Ryan Tanaka, Tu Mai Anh Do, Srujana Subramanya, Shubham Nagarkar, Aditi Jain, Kelsie Lam, Anirban Mandal, Loïc Pottier, and Ewa Deelman. “A Performance Characterization of Scientic Machine Learning Workows”. In: 2021 IEEE Workshop on Workows in Support of Large-Scale Science (WORKS). Nov. 2021, pp. 58–65. [61] James Kress, Matthew Larsen, Jong Choi, Mark Kim, Matthew Wolf, Norbert Podhorszki, Scott Klasky, Hank Childs, and David Pugmire. “Comparing the Eciency of In Situ Visualization Paradigms at Scale”. In: High Performance Computing. Ed. by Michèle Weiland, Guido Juckeland, Carsten Trinitis, and Ponnuswamy Sadayappan. Cham: Springer International Publishing, 2019, pp. 99–117. [62] Ruslan Kuchumov and Vladimir Korkhov. “Analytical and Numerical Evaluation of Co-Scheduling Strategies and Their Application”. In: Computers 10.10 (2021). [63] Yu Li, Xiaohong Zhang, Ashwin Srinath, Rachel B. Getman, and Linh B. Ngo. “Combining HPC and Big Data Infrastructures in Large-Scale Post-Processing of Simulation Data: A Case Study”. In: Proceedings of the Practice and Experience on Advanced Research Computing. PEARC ’18. Pittsburgh, PA, USA: Association for Computing Machinery, 2018. [64] Weihao Liang, Yong Chen, and Hong An. “Interference-Aware I/O Scheduling for Data-Intensive Applications on Hierarchical HPC Storage Systems”. In: 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). Aug. 2019, pp. 654–661. 123 [65] Xiang-ke Liao, Kai Lu, Can-qun Yang, Jin-wen Li, Yuan Yuan, Ming-che Lai, Li-bo Huang, Ping-jing Lu, Jian-bin Fang, Jing Ren, and Jie Shen. “Moving from exascale to zettascale computing: challenges and techniques”. In: Frontiers of Information Technology & Electronic Engineering 19.10 (2018), pp. 1236–1244. [66] Ji Liu, Esther Pacitti, Patrick Valduriez, and Marta Mattoso. “A Survey of Data-Intensive Scientic Workow Management”. In: Journal of Grid Computing 13.4 (2015), pp. 457–493. [67] Jay F. Lofstead, Scott Klasky, Karsten Schwan, Norbert Podhorszki, and Chen Jin. “Flexible IO and Integration for Scientic Codes through the Adaptable IO System (ADIOS)”. In: Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments. CLADE ’08. Boston, MA, USA: Association for Computing Machinery, 2008, pp. 15–24. [68] Jakob Luttgau, Michael Kuhn, Kira Duwe, Yevhen Alforov, Eugen Betke, Julian Kunkel, and Thomas Ludwig. “Survey of Storage Systems for High-Performance Computing”. In: Supercomput. Front. Innov.: Int. J. 5.1 (Mar. 2018), pp. 31–58. [69] Preeti Malakar, Christopher Knight, Todd Munson, Venkatram Vishwanath, and Michael E. Papka. “Scalable In Situ Analysis of Molecular Dynamics Simulations”. In: Proceedings of the In Situ Infrastructures on Enabling Extreme-Scale Analysis and Visualization. ISAV’17. Denver, CO, USA: Association for Computing Machinery, 2017, pp. 1–6. [70] Preeti Malakar, Venkatram Vishwanath, Christopher Knight, Todd Munson, and Michael E. Papka. 
“Optimal Execution of Co-Analysis for Large-Scale Molecular Dynamics Simulations”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’16. Salt Lake City, Utah: IEEE Press, 2016. [71] I. Marincic, V. Vishwanath, and H. Homann. “SeeSAw: Optimizing Performance of In-Situ Analytics Applications under Power Constraints”. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). New Orleans, LA, USA: IEEE, 2020, pp. 789–798. [72] Maicon Melo Alves and Lúcia Maria de Assumpção Drummond. “A multivariate and quantitative model for predicting cross-application interference in virtual environments”. In: Journal of Systems and Software 128 (2017), pp. 150–163. [73] Andre Merzky, Matteo Turilli, Manuel Maldonado, Mark Santcroos, and Shantenu Jha. “Using Pilot Systems to Execute Many Task Workloads on Supercomputers”. In: Job Scheduling Strategies for Parallel Processing. Ed. by Dalibor Klusáček, Walfredo Cirne, and Narayan Desai. Cham: Springer International Publishing, 2019, pp. 61–82. [74] NAMD Performance. https://www.ks.uiuc.edu/Research/namd/benchmarks/. [75] NERSC, Lawrence Berkeley National Laboratory’s Supercomputer Cori. https://www.nersc.gov/users/computational-systems/cori. [76] Yuko Okamoto. “Generalized-ensemble algorithms: enhanced sampling techniques for Monte Carlo and molecular dynamics simulations”. In: Journal of Molecular Graphics and Modelling 22.5 (2004). Conformational Sampling, pp. 425–439. 124 [77] John Ossyra, Ada Sedova, Arnold Tharrington, Frank Noé, Cecilia Clementi, and Jeremy C. Smith. “Porting Adaptive Ensemble Molecular Dynamics Workows to the Summit Supercomputer”. In: High Performance Computing. Ed. by Michèle Weiland, Guido Juckeland, Sadaf Alam, and Heike Jagode. Cham: Springer International Publishing, 2019, pp. 397–417. [78] Mark A. Oxley, Eric Jonardi, Sudeep Pasricha, Anthony A. Maciejewski, Howard Jay Siegel, Patrick J. Burns, and Gregory A. Koenig. “Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers”. In: Journal of Parallel and Distributed Computing 112 (2018), pp. 126–139. [79] Szilárd Páll, Mark James Abraham, Carsten Kutzner, Berk Hess, and Erik Lindahl. “Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS”. In: Solving Software Challenges for Exascale. Ed. by Stefano Markidis and Erwin Laure. Cham: Springer International Publishing, 2015, pp. 3–27. [80] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. “Exploring the Performance Benet of Hybrid Memory System on HPC Environments”. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). May 2017, pp. 683–692. [81] Tom Peterka, Deborah Bard, Janine Bennett, Ron Oldeld, Line Pouchard, Christine Sweeney, and Matthew Wolf. ASCR Workshop on In Situ Data Management: Enabling Scientic Discovery from Diverse Data Sources. Tech. rep. United States, 2019. [82] J. Luc Peterson, Ben Bay, Joe Koning, Peter Robinson, Jessica Semler, Jeremy White, Rushil Anirudh, Kevin Athey, Peer-Timo Bremer, Francesco Di Natale, David Fox, Jim A. Ganey, Sam A. Jacobs, Bhavya Kailkhura, Bogdan Kustowski, Steven Langer, Brian Spears, Jayaraman Thiagarajan, Brian Van Essen, and Jae-Seung Yeom. “Enabling Machine Learning-Ready HPC Ensembles with Merlin”. In: Future Gener. Comput. Syst. 131.C (June 2022), pp. 255–268. 
[83] James C Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D Skeel, Laxmikant Kalé, and Klaus Schulten. “Scalable molecular dynamics with NAMD”. In: Journal of computational chemistry 26.16 (Dec. 2005), pp. 1781–1802. [84] Line Pouchard, Kevin Huck, Gyorgy Matyasfalvi, Dingwen Tao, Li Tang, Huub Van Dam, and Shinaje Yoo. “Prescriptive provenance for streaming analysis of workows at scale”. In: 2018 New York Scientic Data Summit (NYSDS). Aug. 2018, pp. 1–6. [85] Prakash Prabhu, Thomas B. Jablin, Arun Raman, Yun Zhang, Jialu Huang, Hanjun Kim, Nick P. Johnson, Feng Liu, Soumyadeep Ghosh, Stephen Beard, Taewook Oh, Matthew Zoufaly, David Walker, and David I. August. “A Survey of the Practice of Computational Science”. In:State of the Practice Reports. SC ’11. Seattle, Washington: Association for Computing Machinery, 2011. [86] Paolo Raiteri, Alessandro Laio, Francesco Luigi Gervasio, Cristian Micheletti, and Michele Parrinello. “Ecient Reconstruction of Complex Free Energy Landscapes by Multiple Walkers Metadynamics”. In: The Journal of Physical Chemistry B 110.8 (Mar. 2006), pp. 3533–3539. 125 [87] David Rogers, Kenneth D. Moreland, Ron A. Oldeld, and Nathan D. Fabian. Data co-processing for extreme scale analysis level II ASC milestone (4745). Tech. rep. United States, Mar. 1, 2013. [88] David Schneider. “The Exascale Era is Upon Us: The Frontier supercomputer may be the rst to reach 1,000,000,000,000,000,000 operations per second”. In: IEEE Spectrum 59.1 (Jan. 2022), pp. 34–35. [89] Vicent Selfa, Julio Sahuquillo, Lieven Eeckhout, Salvador Petit, and María E. Gómez. “Application Clustering Policies to Address System Fairness with Intel’s Cache Allocation Technology”. In: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). Sept. 2017, pp. 194–205. [90] Christopher Sewell, Katrin Heitmann, Hal Finkel, George Zagaris, Suzanne T. Parete-Koon, Patricia K. Fasel, Adrian Pope, Nicholas Frontiere, Li-ta Lo, Bronson Messer, Salman Habib, and James Ahrens. “Large-Scale Compute-Intensive Analysis via a Combined in-Situ and Co-Scheduling Workow Approach”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’15. Austin, Texas: Association for Computing Machinery, 2015. [91] Sameer S. Shende and Allen D. Malony. “The Tau Parallel Performance System”. In: The International Journal of High Performance Computing Applications 20.2 (2006), pp. 287–311. [92] Rafael Ferreira da Silva, Scott Callaghan, and Ewa Deelman. “On the Use of Burst Buers for Accelerating Data-Intensive Scientic Workows”. In: Proceedings of the 12th Workshop on Workows in Support of Large-Scale Science. WORKS ’17. Denver, Colorado: Association for Computing Machinery, 2017. [93] Nikolay A. Simakov, Joseph P. White, Robert L. DeLeon, Steven M. Gallo, Matthew D. Jones, Jerey T. Palmer, Benjamin D. Plessinger, and Thomas R. Furlani. “A Workload Analysis of NSF’s Innovative HPC Resources Using XDMoD”. In: CoRR abs/1801.04306 (2018). arXiv: 1801.04306. [94] Alok Singh, Arvind Rao, Shweta Purawat, and Ilkay Altintas. “A Machine Learning Approach for Modular Workow Performance Prediction”. In: Proceedings of the 12th Workshop on Workows in Support of Large-Scale Science. WORKS ’17. Denver, Colorado: Association for Computing Machinery, 2017. [95] Dan Stanzione, John West, R. Todd Evans, Tommy Minyard, Omar Ghattas, and Dhabaleswar K. Panda. 
“Frontera: The Evolution of Leadership Computing at the National Science Foundation”. In: Practice and Experience in Advanced Research Computing. PEARC ’20. Portland, OR, USA: Association for Computing Machinery, 2020, pp. 106–111. [96] Mark Stillwell, David Schanzenbach, Frédéric Vivien, and Henri Casanova. “Resource allocation algorithms for virtualized service hosting platforms”. In: Journal of Parallel and Distributed Computing 70.9 (2010), pp. 962–974. 126 [97] Michela Taufer, Stephen Thomas, Michael Wyatt, Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Harel Weinstein, Michel A. Cuendet, Trilce Estrada, and Ewa Deelman. “Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers”. In: 2019 15th International Conference on eScience (eScience). Sept. 2019, pp. 188–198. [98] Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. Workows for E-Science: Scientic Workows for Grids. Springer Publishing Company, Incorporated, 2014. [99] Gareth A. Tribello, Massimiliano Bonomi, Davide Branduardi, Carlo Camilloni, and Giovanni Bussi. “PLUMED 2: New feathers for an old bird”. In: Computer Physics Communications 185.2 (2014), pp. 604–613. [100] Ranjan Sarpangala Venkatesh, Tony Mason, Pradeep Fernando, Greg Eisenhauer, and Ada Gavrilovska. “Scheduling HPC Workows with Intel Optane Persistent Memory”. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). June 2021, pp. 56–65. [101] Jerey S. Vetter, Ron Brightwell, Maya Gokhale, Pat McCormick, Rob Ross, John Shalf, Katie Antypas, David Donofrio, Travis Humble, Catherine Schuman, Brian Van Essen, Shinjae Yoo, Alex Aiken, David Bernholdt, Suren Byna, Kirk Cameron, Frank Cappello, Barbara Chapman, Andrew Chien, Mary Hall, Rebecca Hartman-Baker, Zhiling Lan, Michael Lang, John Leidel, Sherry Li, Robert Lucas, John Mellor-Crummey, Paul Peltz Jr., Thomas Peterka, Michelle Strout, and Jeremiah Wilke. “Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity”. In: (Dec. 2018). [102] Dimitrios Vlachakis, Elena Bencurova, Nikitas Papangelopoulos, and Sophia Kossida. “Chapter Seven - Current State-of-the-Art Molecular Dynamics Methods and Applications”. In: ed. by Rossen Donev. Vol. 94. Advances in Protein Chemistry and Structural Biology. Academic Press, 2014, pp. 269–313. [103] Andrey Vladimirov and Ryo Asai. MCDRAM as High-Bandwith Memory (HBM) in Knights Landing Processors: Developer’s Guide. Tech. rep. Colfax International, May 2016. [104] Samuel Williams, Andrew Waterman, and David Patterson. “Rooine: An Insightful Visual Performance Model for Multicore Architectures”. In: Commun. ACM 52.4 (Apr. 2009), pp. 65–76. [105] Chad Wood, Sudhanshu Sane, Daniel Ellsworth, Alfredo Gimenez, Kevin Huck, Todd Gamblin, and Allen Malony. “A Scalable Observation System for Introspection and in Situ Analytics”. In: Proceedings of the 5th Workshop on Extreme-Scale Programming Tools. ESPT ’16. Salt Lake City, Utah: IEEE Press, 2016, pp. 42–49. [106] Felippe Vieira Zacarias, Vinicius Petrucci, Rajiv Nishtala, Paul Carpenter, and Daniel Mossé. “Intelligent Colocation of Workloads for Enhanced Server Eciency”. In: 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). Oct. 2019, pp. 120–127. 127 [107] Fan Zhang, Ciprian Docan, Manish Parashar, Scott Klasky, Norbert Podhorszki, and Hasan Abbasi. 
“Enabling In-situ Execution of Coupled Scientic Workow on Multi-core Platform”. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium. May 2012, pp. 1352–1363. [108] Fan Zhang, Tong Jin, Qian Sun, Melissa Romanus, Hoang Bui, Scott Klasky, and Manish Parashar. “In-memory staging and data-centric task placement for coupled scientic simulation workows”. In: Concurrency and Computation: Practice and Experience 29.12 (2017). e4147 cpe.4147, e4147. [109] Xuechen Zhang, Hasan Abbasi, Kevin Huck, and Allen D. Malony. “WOWMON: A Machine Learning-based Proler for Self-adaptive Instrumentation of Scientic Workows”. In: Procedia Computer Science 80 (2016). International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, pp. 1507–1518. [110] Hongbo Zou, Yongen Yu, Wei Tang, and Hsuan-Wei Michelle Chen. “FlexAnalytics: A Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement”. In: Big Data Research 1 (2014), pp. 4–13. [111] Pengfei Zou, Xizhou Feng, and Rong Ge. “Contention Aware Workload and Resource Co-Scheduling on Power-Bounded Systems”. In: 2019 IEEE International Conference on Networking, Architecture and Storage (NAS). Aug. 2019, pp. 1–8. 128 ListofPublications ArticlesinInternationalRefereedJournals [J31] Tu Mai Anh Do, Loïc Pottier, Silvina Caíno-Lores, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. “A lightweight method for evaluating in situ workow eciency”. In: Journal of Computational Science 48 (2021), p. 101259. [J33] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Performance assessment of ensembles of in situ workows under resource constraints”. In: Concurrency and Computation: Practice and Experience (June 2022), e7111. [J46] Rafael Ferreira da Silva, Scott Callaghan, Tu Mai Anh Do, George Papadimitriou, and Ewa Deelman. “Measuring the impact of burst buers on data-intensive scientic workows”. In: Future Generation Computer Systems 101 (2019), pp. 208–220. ArticlesinInternationalRefereedConferences [C34] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Frédéric Suter, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Co-scheduling Ensembles of In Situ Workows”. In: 17th Workshop on Workows in Support of Large-Scale Science (WORKS’22). IEEE. 2022, To appear. 129 [C35] Tu Mai Anh Do, Loïc Pottier, Stephen Thomas, Rafael Ferreira da Silva, Michel A. Cuendet, Harel Weinstein, Trilce Estrada, Michela Taufer, and Ewa Deelman. “A Novel Metric to Evaluate In Situ Workows”. In: Computational Science –ICCS 2020. Cham: Springer International Publishing, 2020, pp. 538–553. [C36] Tu Mai Anh Do, Loïc Pottier, Orcun Yildiz, Karan Vahi, Patrycja Krawczuk, Tom Peterka, and Ewa Deelman. “Accelerating Scientic Workows on HPC Platforms with In Situ Processing”. In: 2022 IEEE 22nd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE. 2022, pp. 1–10. [C97] Michela Taufer, Stephen Thomas, Michael Wyatt, Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Harel Weinstein, Michel A. Cuendet, Trilce Estrada, and Ewa Deelman. “Characteriz- ing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers”. In: 2019 15th International Conference on eScience (eScience). Sept. 2019, pp. 188– 198. 
Articles in International Refereed Workshops

[W32] Tu Mai Anh Do, Loïc Pottier, Rafael Ferreira da Silva, Silvina Caíno-Lores, Michela Taufer, and Ewa Deelman. “Assessing Resource Provisioning and Allocation of Ensembles of In Situ Workflows”. In: 50th International Conference on Parallel Processing Workshop. ICPP Workshops ’21. Lemont, IL, USA: Association for Computing Machinery, 2021.

[W60] Patrycja Krawczuk, George Papadimitriou, Ryan Tanaka, Tu Mai Anh Do, Srujana Subramanya, Shubham Nagarkar, Aditi Jain, Kelsie Lam, Anirban Mandal, Loïc Pottier, and Ewa Deelman. “A Performance Characterization of Scientific Machine Learning Workflows”. In: 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS). Nov. 2021, pp. 58–65.

Research Posters

[P30] Tu Mai Anh Do, Ming Jiang, Brian Gallagher, Albert Chu, Cyrus Harrison, Karan Vahi, and Ewa Deelman. “Enabling Data Analytics Workflows using Node-Local Storage”. In: International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX. 2018.

Appendix A

Proofs for Chapter 5

Definition 1. Let $X$ be a set of jobs, which are simulations or analyses. We define the following:
• $T(X) = \max_{x \in X} t(x)$ is the time to execute one iteration of $X$ concurrently;
• $Q(X) = \sum_{x \in X} t_x(1)$ is the time to execute one iteration of $X$ sequentially, i.e., on a single core;
• $U(X) = \sum_{x \in X} c_x V_x$ is the communication cost of processing data remotely for $X$.

Remark. To minimize $T(X) = \max_{x \in X} t(x)$ given that resources are finite, we allocate resources such that every $t(x)$, $x \in X$, is equal, which minimizes the difference among the $t(x)$. We use this remark to minimize the execution time of concurrent applications. The remark holds when the amounts of resources (i.e., numbers of compute nodes and cores) are rational numbers; all the following proofs are therefore given in the context of rational allocations. We discuss how to relax to integer allocations in Section 5.4.2. Based on this remark, we compute the minimal execution time of one iteration of a co-scheduling allocation, which is the foundation for the following proofs.

Lemma 1. Given a co-scheduling allocation $\langle S_i, m(S_i) \rangle$, and $P_i$ the set of analyses that are co-scheduled with $S_i$ but not coupled with $S_i$, the execution time of one iteration is minimized at:
$T(S_i \cup m(S_i)) = \frac{B\,Q(S_i \cup m(S_i)) + U(P_i)}{B\,C\,n_{S_i}}$   (A.1)

Proof of Lemma 1. For every analysis $A_j \in m(S_i) \setminus P_i$ (the set of analyses co-scheduled with $S_i$ and coupled with $S_i$) and $A_k \in P_i$ (the set of analyses co-scheduled with $S_i$ but not coupled with $S_i$), $T(S_i \cup m(S_i))$ is minimized when:
$T(S_i \cup m(S_i)) = t(S_i) = t(A_j) = t(A_k)$   (A.2)
Since $A_k \in P_i$ is not coupled with $S_i$, from Equations (5.1) and (5.2) we have:
$\frac{t_{S_i}(1)}{n_{S_i} c_{S_i}} = \frac{t_{A_j}(1)}{n_{A_j} c_{A_j}} = \frac{t_{A_k}(1)}{n_{A_k} c_{A_k}} + \frac{V_{A_k}}{B\,n_{A_k}} = \frac{B\,Q(S_i \cup m(S_i)) + U(P_i)}{B\left(n_{S_i} c_{S_i} + \sum_{A_j \in m(S_i)} n_{A_j} c_{A_j}\right)}$   (A.3)
We note that $n_{S_i} = n_{A_j}$, as they are co-scheduled on the same co-scheduling allocation. To avoid underutilizing resources, we expect all cores of every allocation to be used, i.e.:
$c_{S_i} + \sum_{A_j \in m(S_i)} c_{A_j} = C$   (A.4)
Substituting Equation (A.4) into Equation (A.3), we obtain Equation (A.1), which is what we needed to prove.

A.1 Proof of Theorem 1

We compare the minimized makespans of the two following co-scheduling mappings: (1) the original co-scheduling mapping $m$, and (2) the co-scheduling mapping $m'$ formed from $m$ by not co-scheduling an analysis $A_k$ with its coupled simulation. We are going to prove that the makespan of $m'$ is greater than that of $m$, which implies that not co-scheduling an analysis with its coupled simulation increases the makespan; we thereby indirectly prove the theorem.

Since $A_k$ is co-scheduled with its coupled simulation in the co-scheduling mapping $m$, from the remark above and Equation (5.3), $T(S \cup A)$ of the co-scheduling mapping $m$ is minimized when:
$T(S \cup A) = t(A_k) = \frac{t_{A_k}(1)}{n_{A_k} c_{A_k}}$   (A.5)
On the other hand, in the co-scheduling mapping $m'$, $A_k$ is not co-scheduled with its coupled simulation, so $T'(S \cup A)$ is minimized when:
$T'(S \cup A) = t'(A_k) = \frac{t_{A_k}(1)}{n'_{A_k} c'_{A_k}} + \frac{V_{A_k}}{B\,n'_{A_k}}$   (A.6)
Proof by contradiction. Assume $T(S \cup A) \geq T'(S \cup A)$; then $t(A_k) \geq t'(A_k)$. From Equations (A.5) and (A.6), we have:
$\frac{t_{A_k}(1)}{n_{A_k} c_{A_k}} \geq \frac{t_{A_k}(1)}{n'_{A_k} c'_{A_k}} + \frac{V_{A_k}}{B\,n'_{A_k}} > \frac{t_{A_k}(1)}{n'_{A_k} c'_{A_k}} \implies n_{A_k} c_{A_k} < n'_{A_k} c'_{A_k}$   (A.7)
Let $r(x) = n_x c_x$ denote the amount of resources, in terms of total cores, assigned to $x$. Equation (A.7) indicates that more resources are assigned to $A_k$ in $m'$ than in $m$. However, as the total number of cores is limited to $NC$, in the co-scheduling mapping $m'$ there must exist an application $I$, either a simulation in $S$ or an analysis in $A \setminus \{A_k\}$, that is assigned fewer resources than in the co-scheduling mapping $m$. Hence $T'(I) > T(I)$, which makes $T(S \cup A) < T'(S \cup A)$. This contradicts our assumption. Hence, from Equation (5.3), the makespan of the co-scheduling mapping $m$ is smaller than the makespan of $m'$.

A.2 Proof of Theorem 2

We compare the minimized makespans of the two following co-scheduling mappings: (1) the co-scheduling mapping $m^*$ in which all analyses not co-scheduled with their coupled simulation, denoted by $P^{NC}$, are co-scheduled on analysis-only co-scheduling allocations, and (2) the co-scheduling mapping $m'$ in which not all analyses in $P^{NC}$ are co-scheduled on analysis-only co-scheduling allocations. In the mapping $m'$, $P^{NC}$ can therefore be partitioned into two parts: $P'^{NC}$, the set of analyses that are co-scheduled on analysis-only co-scheduling allocations, and $P'^{C}$, the set of analyses that are co-scheduled on allocations containing simulations they are not coupled with. Then $P'^{C} \cup P'^{NC} = P^{NC}$. We also denote by $P^{C}$ the set of analyses co-scheduled with their coupled simulation in these co-scheduling mappings; thus $P^{C} \cup P^{NC} = A$. We are going to prove that the makespan of $m'$ is greater than the makespan of $m^*$.

Co-scheduling mapping $m^*$. For the mapping $m^*$, for every co-scheduling allocation $\langle S_i \cup m^*(S_i) \rangle$, from Lemma 1, $T^*(S_i \cup m^*(S_i))$ is minimized at:
$T^*(S_i \cup m^*(S_i)) = \frac{B\,Q(S_i \cup m^*(S_i))}{B\,C\,n^*_{S_i}}$   (A.8)
Based on the remark above, $T^*(S \cup P^{C})$ is minimized when every $T^*(S_i \cup m^*(S_i))$ is equal, i.e.:
$T^*(S \cup P^{C}) = \frac{B \sum_{S_i \in S} Q(S_i \cup m^*(S_i))}{B\,C \sum_{S_i \in S} n^*_{S_i}} = \frac{B\,Q(S \cup P^{C})}{B\,C \sum_{S_i \in S} n^*_{S_i}}$   (A.9)
Following the same procedure for the analysis-only co-scheduling allocations in $m^*$, from Lemma 1, $T^*(P^{NC})$ is minimized at:
$T^*(P^{NC}) = \frac{B\,Q(P^{NC}) + U^*(P^{NC})}{B\,C\,\tilde{N}^*}$   (A.10)
where $\tilde{N}^*$ is the total number of nodes assigned to analysis-only allocations. Now, combining the simulation-present co-scheduling allocations and the analysis-only co-scheduling allocations, $T^*(S \cup A)$ is minimized, with minimum denoted by $t^*$, when $t^* = T^*(S \cup P^{C}) = T^*(P^{NC})$. From Equations (A.9) and (A.10):
$t^* = \frac{B\,Q(S \cup P^{C})}{B\,C \sum_{S_i \in S} n^*_{S_i}} = \frac{B\,Q(P^{NC}) + U^*(P^{NC})}{B\,C\,\tilde{N}^*} = \frac{B\,Q(S \cup A) + U^*(P^{NC})}{N\,B\,C}$   (A.11)
$U^*(P^{NC}) = \frac{B\,Q(S \cup P^{C})}{\sum_{S_i \in S} n^*_{S_i}}\,N - B\,Q(S \cup A)$   (A.12)

Co-scheduling mapping $m'$. Similarly, with the mapping $m'$, $T'(S \cup A)$ is minimized, with minimum denoted by $t'$, when $t' = T'(S \cup P^{C} \cup P'^{C}) = T'(P'^{NC})$. For the mapping $m'$, we use primed notation to differentiate it from the mapping $m^*$:
$t' = \frac{B\,Q(S \cup P^{C} \cup P'^{C}) + U'(P'^{C})}{B\,C \sum_{S_i \in S} n'_{S_i}} = \frac{B\,Q(P'^{NC}) + U'(P'^{NC})}{B\,C\,\tilde{N}'} = \frac{B\,Q(S \cup A) + U'(P^{NC})}{N\,B\,C}$   (A.13)
$U'(P^{NC}) = \frac{B\,Q(S \cup P^{C} \cup P'^{C}) + U'(P'^{C})}{\sum_{S_i \in S} n'_{S_i}}\,N - B\,Q(S \cup A)$   (A.14)

Comparing $t'$ and $t^*$. For the sake of simplicity, we denote:
• $n' = \sum_{S_i \in S} n'_{S_i}$ and $n^* = \sum_{S_i \in S} n^*_{S_i}$;
• $U^* = U^*(P^{NC})$ and $U' = U'(P^{NC})$.
From Equations (A.11) and (A.13), we have:
$t' - t^* = \frac{U'(P^{NC}) - U^*(P^{NC})}{N\,B\,C}$   (A.15)
Thus, to compare $t^*$ and $t'$, we only need to compare $U^*(P^{NC})$ and $U'(P^{NC})$. Moreover, from Equations (A.12) and (A.14):
$U' - U^* = \frac{N}{n'}\left(\frac{B\,Q(S \cup P^{C})}{n^*}(n^* - n') + U'(P'^{C}) + B\,Q(P'^{C})\right)$   (A.16)
Again, from Lemma 1, we have:
$t^* = \frac{B\,Q(S \cup P^{C})}{B\,C\,n^*} \implies \frac{B\,Q(S \cup P^{C})}{n^*} = t^*\,B\,C$   (A.17)
Substituting Equation (A.17) into Equation (A.16):
$U' - U^* = \frac{N}{n'}\left(t^*\,B\,C\,(n^* - n') + U'(P'^{C}) + B\,Q(P'^{C})\right)$   (A.18)
Let us define $S'^{C}$ as the set of simulations with which the analyses in $P'^{C}$ are co-scheduled. We note that $|S'^{C}| \leq |P'^{C}|$; then, again from Lemma 1:
$t' = T'(S'^{C} \cup P'^{C}) = \frac{B\,Q(P'^{C}) + U'(P'^{C})}{B\,C \sum_{S_l \in S'^{C}} n'_{S_l}}$   (A.19)
By substituting Equation (A.15) into Equation (A.19):
$B\,Q(P'^{C}) + U'(P'^{C}) = \left(t^*\,B\,C + \frac{U' - U^*}{N}\right) \sum_{S_l \in S'^{C}} n'_{S_l}$   (A.20)
By substituting Equation (A.20) into Equation (A.18), we have:
$(U' - U^*)\left(n' - \sum_{S_l \in S'^{C}} n'_{S_l}\right) = N\,t^*\,B\,C\left(n^* + \sum_{S_l \in S'^{C}} n'_{S_l} - n'\right)$   (A.21)
Since the total number of compute nodes used for co-scheduling $S$ and $P^{C}$, plus the number of nodes of the co-scheduling allocations containing analyses in $P'^{C}$, is always greater than the total number of nodes used for co-scheduling $S$, $P^{C}$, and $P'^{C}$, we have:
$n^*_{\langle S, P^{C} \rangle} + \sum_{S_l \in S'^{C}} n'_{\langle S_l, m'(S_l) \rangle} > n'_{\langle S, P^{C}, P'^{C} \rangle}$, i.e., $n^* + \sum_{S_l \in S'^{C}} n'_{S_l} > n'$   (A.22)
We also note that, as $S'^{C} \subseteq S$:
$n' = \sum_{S_i \in S} n'_{S_i} > \sum_{S_l \in S'^{C}} n'_{S_l}$   (A.23)
From Equations (A.21) to (A.23), we have $U' > U^*$, i.e., $t' > t^*$.

A.3 Proof of Theorem 3

Let $\tilde{N}$ denote the total number of nodes assigned to analysis-only co-scheduling allocations. Hence $\tilde{N} = \sum_{i=1}^{L} n_{\langle P^{NC}_i \rangle}$, where $\bigcup_{i=1}^{L} P^{NC}_i = P^{NC}$. We determine the resources assigned to each analysis-only co-scheduling allocation at the node level and at the core level, respectively.

Node-level allocation. To minimize the makespan of the entire ensemble (see Equation (5.3)), we need to minimize $T(S \cup A)$ by ensuring that the $T$ of the analysis-only co-scheduling allocations equals the $T$ of the simulation-based co-scheduling allocations (according to the remark above), i.e., $T(S \cup P^{C}) = T(P^{NC})$. By applying Lemma 1:
$\frac{Q(S \cup P^{C})}{C\,(N - \tilde{N})} = \frac{B\,Q(P^{NC}) + U(P^{NC})}{B\,C\,\tilde{N}} \implies \tilde{N} = \frac{B\,Q(P^{NC}) + U(P^{NC})}{B\,Q(S \cup A) + U(P^{NC})}\,N$   (A.24)
Among the analysis-only co-scheduling allocations $\langle P^{NC}_i \rangle$, from the remark above, to minimize $T(P^{NC})$ we ensure that $T(P^{NC}) = T(P^{NC}_i)$ for all $i \in \{1, \dots, L\}$. Similarly, applying Lemma 1:
$\frac{B\,Q(P^{NC}) + U(P^{NC})}{B\,C\,\tilde{N}} = \frac{B\,Q(P^{NC}_i) + U(P^{NC}_i)}{B\,C\,n_{\langle P^{NC}_i \rangle}}$   (A.25)
Substituting Equation (A.24) into Equation (A.25), the number of nodes assigned to each analysis-only co-scheduling allocation is computed as:
$n_{\langle P^{NC}_i \rangle} = \frac{B\,Q(P^{NC}_i) + U(P^{NC}_i)}{B\,Q(S \cup A) + U(P^{NC})}\,N$

Core-level allocation. Similarly, based on the remark above, for each analysis-only co-scheduling allocation $\langle P^{NC}_i \rangle$, $T(P^{NC}_i)$ is minimized when $T(P^{NC}_i) = t(A_k)$ for every $A_k \in P^{NC}_i$. From Equations (5.2) and (A.25):
$\frac{B\,Q(P^{NC}_i) + U(P^{NC}_i)}{B\,C\,n_{\langle P^{NC}_i \rangle}} = \frac{t_{A_k}(1)}{n_{A_k} c_{A_k}} + \frac{V_{A_k}}{B\,n_{A_k}}$   (A.26)
Note that $n_{\langle P^{NC}_i \rangle} = n_{A_k}$, as $A_k$ is co-scheduled on $\langle P^{NC}_i \rangle$. Hence, from Equation (A.26), the number of cores assigned to each analysis in an analysis-only co-scheduling allocation is expressed as:
$c_{A_k} = \frac{B\,Q(A_k)}{B\,Q(P^{NC}_i) + U(P^{NC}_i) - C\,V_{A_k}}\,C$

A.4 Proof of Theorem 4

Similarly to the method used to prove Theorem 3, we solve for the number of nodes and the number of cores per node assigned to each application in the simulation-based co-scheduling allocations, respectively.

Node-level allocation. The remark above shows that $T(S \cup P^{C})$ is minimized when $T(S \cup P^{C}) = T(S_i \cup m(S_i))$ for every $S_i \in S$. From Lemma 1, the number of nodes assigned to each simulation-based co-scheduling allocation $\langle S_i \cup m(S_i) \rangle$ is computed as follows:
$\frac{Q(S \cup P^{C})}{C\,(N - \tilde{N})} = \frac{Q(S_i \cup m(S_i))}{C\,n_{\langle S_i \cup m(S_i) \rangle}} \implies n_{\langle S_i \cup m(S_i) \rangle} = \frac{Q(S_i \cup m(S_i))}{Q(S \cup P^{C})}\,(N - \tilde{N})$   (A.27)

Core-level allocation. For each co-scheduling allocation, $T(S_i \cup m(S_i))$ is minimized when $T(S_i \cup m(S_i)) = t(S_i)$. Applying Lemma 1:
$\frac{Q(S_i \cup m(S_i))}{C\,n_{\langle S_i \cup m(S_i) \rangle}} = \frac{Q(S_i)}{n_{S_i} c_{S_i}}$   (A.28)
As $S_i$ is co-scheduled in $\langle S_i \cup m(S_i) \rangle$, we have $n_{\langle S_i \cup m(S_i) \rangle} = n_{S_i}$. From Equation (A.28), the number of cores per node assigned to each simulation $S_i$ co-scheduled in $\langle S_i \cup m(S_i) \rangle$ is:
$c_{S_i} = \frac{Q(S_i)}{Q(S_i \cup m(S_i))}\,C$
For every analysis $A_j \in m(S_i)$ co-scheduled on the co-scheduling allocation $\langle S_i \cup m(S_i) \rangle$, $T(S_i \cup m(S_i))$ is minimized when $T(S_i \cup m(S_i)) = t(A_j)$. From Lemma 1, the number of cores per node assigned to each analysis $A_j$ co-scheduled in $\langle S_i \cup m(S_i) \rangle$ is:
$\frac{Q(S_i \cup m(S_i))}{C\,n_{\langle S_i \cup m(S_i) \rangle}} = \frac{Q(A_j)}{n_{A_j} c_{A_j}} \implies c_{A_j} = \frac{Q(A_j)}{Q(S_i \cup m(S_i))}\,C$
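To make the closed-form node- and core-level allocations of Theorem 4 above concrete, the following sketch evaluates them on a small hypothetical example. It assumes rational allocations as in the proofs, takes the number of nodes reserved for analysis-only allocations (determined by Theorem 3) as a given input, and is only an illustration of the formulas, not the Co-Alloc implementation evaluated in Chapter 5.

```python
# Illustrative evaluation of the Theorem 4 formulas (rational allocations, made-up inputs).
# N_tilde: nodes reserved for analysis-only allocations (from Theorem 3), given here.
# Q: sequential execution time per iteration for each application.

def simulation_allocations(N, C, N_tilde, mapping, Q):
    """mapping: {simulation: [its co-scheduled coupled analyses]}.
    Returns nodes per allocation and cores per node per application."""
    # Q(S ∪ P^C): total sequential time over all simulation-based allocations
    Q_total = sum(Q[s] + sum(Q[a] for a in analyses) for s, analyses in mapping.items())
    nodes, cores = {}, {}
    for s, analyses in mapping.items():
        Q_alloc = Q[s] + sum(Q[a] for a in analyses)   # Q(S_i ∪ m(S_i))
        nodes[s] = Q_alloc / Q_total * (N - N_tilde)   # Eq. (A.27)
        cores[s] = Q[s] / Q_alloc * C                  # c_{S_i}
        for a in analyses:
            cores[a] = Q[a] / Q_alloc * C              # c_{A_j}
    return nodes, cores

if __name__ == "__main__":
    Q = {"S1": 100.0, "A1": 60.0, "A2": 40.0, "S2": 200.0, "A3": 200.0}
    nodes, cores = simulation_allocations(
        N=16, C=48, N_tilde=4, mapping={"S1": ["A1", "A2"], "S2": ["A3"]}, Q=Q)
    print(nodes)   # {'S1': 4.0, 'S2': 8.0}
    print(cores)   # S1: 24.0, A1: 14.4, A2: 9.6, S2: 24.0, A3: 24.0 (cores sum to C per allocation)
```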
Abstract
Advances in high-performance computing (HPC) allow scientific simulations to run at an ever-increasing scale, generating large amounts of data that need to be analyzed over time. Conventionally, the simulation writes the entire simulated data set to the file system for later post-processing. Unfortunately, the slow growth of I/O technologies relative to the computing capability of present-day processors creates an I/O bottleneck for post-processing, since data cannot be saved to storage as fast as it is generated. Following data-centric models, a new processing paradigm called in situ has recently emerged, in which simulation data is analyzed on the fly to reduce the expensive I/O cost of saving massive data for post-processing. Since an in situ workflow usually consists of co-located tasks running concurrently on the same resources in an iterative manner, its execution yields complicated behaviors that make it challenging to evaluate the efficiency of an in situ run. To enable efficient execution of in situ workflows, this dissertation proposes a framework that enables in situ execution between simulations and analyses and introduces a computational efficiency model to characterize the efficiency of an in situ execution. By extending the proposed performance model to resource-aware performance indicators, we introduce a method to assess resource usage, resource allocation, and resource provisioning for in situ workflow ensembles. Finally, we discuss the design of effective scheduling for a workflow ensemble by determining appropriate co-scheduling strategies and resource assignments for each simulation and analysis in the ensemble.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Cyberinfrastructure management for dynamic data driven applications
Workflow restructuring techniques for improving the performance of scientific workflows executing in distributed environments
Resource management for scientific workflows
Efficient data and information delivery for workflow execution in grids
A resource provisioning system for scientific workflow applications
Schema evolution for scientific asset management
Efficient processing of streaming data in multi-user and multi-abstraction workflows
Scientific workflow generation and benchmarking
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Simulation and machine learning at exascale
Data-driven methods for increasing real-time observability in smart distribution grids
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
Hardware-software codesign for accelerating graph neural networks on FPGA
Provenance management for dynamic, distributed and dataflow environments
Autotuning, code generation and optimizing compiler technology for GPUs
An end-to-end framework for provisioning based resource and application management
Process implications of executable domain models for microservices development
Adaptive and resilient stream processing on cloud infrastructure
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Scalable exact inference in probabilistic graphical models on multi-core platforms
Asset Metadata
Creator
Do, Tu Mai Anh
(author)
Core Title
Optimizing execution of in situ workflows
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-12
Publication Date
12/14/2022
Defense Date
11/21/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
high-performance computing, in situ analysis, OAI-PMH Harvest, scientific workflows
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Deelman, Ewa (committee chair), Nakano, Aiichiro (committee member), Prasanna, Viktor (committee member), Taufer, Michela (committee member)
Creator Email
tudo@usc.edu, tumaianhdo@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112620744
Unique identifier
UC112620744
Identifier
etd-DoTuMaiAnh-11365.pdf (filename)
Legacy Identifier
etd-DoTuMaiAnh-11365
Document Type
Dissertation
Format
theses (aat)
Rights
Do, Tu Mai Anh
Internet Media Type
application/pdf
Type
texts
Source
20221214-usctheses-batch-996 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu