INTROSPECTIVE RESILIENCE FOR EXASCALE HIGH-PERFORMANCE COMPUTING SYSTEMS

by

Saurabh Hukerikar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2015

Copyright 2015 Saurabh Hukerikar

For my Aai (my mom), Veena Hukerikar

Acknowledgments

This dissertation would never have been accomplished without the help, guidance and support of my advisors, collaborators, friends and family. First and foremost, I thank Bob Lucas, my advisor and dissertation committee chair. Throughout my graduate studies at USC and ISI, Bob has been an incredible source of inspiration, encouragement and wisdom. His mentorship and insights were an immense asset to my development as a researcher and as a person. I feel privileged to have had the opportunity to learn from and work with him. I am grateful to Jeff Draper for serving as the co-chair of my defense committee and for his continued support and helpful advice. I would like to thank Pedro Diniz for his guidance and support on my research. I have learned a lot from my interactions with him. I would also like to thank Jacqueline Chame for providing valuable feedback and critique on my research. Special thanks to Hans Zima, who provided invaluable suggestions that substantially improved the work on the language extensions. I am indebted to Alice Parker for giving me the confidence and encouragement to pursue my PhD. I am especially grateful to Aiichiro Nakano for all his advice and his feedback on this work. I also thank Viktor Prasanna for kindly agreeing to be part of my dissertation committee, and for his time and advice. I am thankful to Janice Wheeler for her help in refining the final draft of this dissertation. Thanks also to my colleagues and friends in the CS&T division at ISI.

I would like to thank my colleagues in the Scalable Modeling and Analysis department at Sandia National Laboratories, with whom I spent two summers: Nicole Lemaster Slattengren, Janine Bennett, Keita Teranishi, Craig Ulmer, Jeremiah Wilke and Robert Clay. I am also grateful to Srimat Chakradhar and Michela Becchi for giving me the opportunity to learn from them during my internship at NEC Labs.

I would also like to acknowledge the team at USC's HPCC facility for the computing resources that I used extensively for the experiments presented in this work, as well as for their promptness in providing technical support. This research has been supported by the Semiconductor Research Corporation's MuSyC program, the U.S. Department of Energy's SciDAC SUPER and the U.S. Army Research Laboratory's Army Research Office (ARO). I would like to thank the numerous people I met and interacted with during the review meetings, who provided valuable feedback on my research and helped shape several ideas presented in this dissertation.

My deepest gratitude is to my family for instilling in me a love of learning. My grandmother, Vidya Laud, and my uncle, Parag Laud, have been steady sources of encouragement. My parents, Veena and Ramesh Hukerikar, and my brother, Saumil, have been unwavering in their love and support. Without their confidence in me, this dissertation would not have been possible. And to my beloved wife Tanu, who may just be the most loving, kind and caring person on earth. Her unconditional love and patience made my journey to the finish line much more enjoyable.
Contents

Dedication
Acknowledgments
List of Figures
Abstract
1 Introduction
  1.1 Brief Overview
  1.2 Contributions
  1.3 Organization
2 Preliminaries: The Taxonomy of Resilient Computing
  2.1 Dependability
  2.2 Relationship between Fault, Error and Failure
  2.3 Fault Tolerance and Resilience
  2.4 Classification of Faults
    2.4.1 Based on the Impact on the Application
    2.4.2 Based on Duration of the Fault
  2.5 Recovery Techniques
    2.5.1 Fault Recovery Techniques
    2.5.2 Error Recovery Techniques
  2.6 Reliability Metrics
    2.6.1 Mean Time to Failure (MTTF)
    2.6.2 Mean Time to Repair (MTTR)
    2.6.3 Mean Time Between Failures (MTBF)
    2.6.4 Reliability
    2.6.5 Availability
    2.6.6 Failure in Time (FIT)
    2.6.7 Workload Efficiency
  2.7 The Resilience Challenge for Future Exascale High-Performance Computing Systems
3 Survey of Resilience Approaches in High-Performance Computing
  3.1 Checkpoint and Rollback Techniques
  3.2 Redundancy-Based Approaches
  3.3 Algorithm-Based Fault Tolerance
  3.4 Programming Model-Based Resilience Techniques
4 Rolex: Resilience-Oriented Language Extensions
  4.1 Overview
  4.2 Leveraging Programmer Knowledge
  4.3 Design of the Resilience-Oriented Language Extensions
    4.3.1 Goals for Resilience-Oriented Language Extensions
    4.3.2 Description of Syntactic Structure of Rolex
    4.3.3 Rolex Keywords
  4.4 Rolex: Syntax and Semantics
    4.4.1 Tolerance
    4.4.2 Robustness
    4.4.3 Amelioration
  4.5 Compiler and Runtime Support
    4.5.1 Compiler Infrastructure
    4.5.2 Runtime Inference System
    4.5.3 Workflow of a Resilient Execution Environment
  4.6 Experimental Evaluation
    4.6.1 Fault Injection Framework
    4.6.2 Accelerated Fault Injection Experiments
    4.6.3 Performance Evaluation
  4.7 Summary
5 Application-Level Fault Detection and Correction Through Adaptive Redundant Multithreading
  5.1 Overview
  5.2 The Redundancy Solution for Fault Detection/Correction
  5.3 Programmer Managed Scoping of Redundancy
  5.4 Compiler Support for Adaptive Redundant Multithreading
    5.4.1 Compiler Support for Application Programmer Scoped Spheres of Replication
    5.4.2 Compiler Support for Adaptive Execution of Spheres of Replication
  5.5 Runtime Adaptation
    5.5.1 Compiler-Runtime Interface
    5.5.2 System Fault Event Indicators for Runtime Adaptation
    5.5.3 Redundancy Adaptation Algorithm
  5.6 Examples: Error Detection/Correction Through Adaptive Redundant Multithreading
    5.6.1 Double Precision General Matrix-Matrix Multiplication (DGEMM)
    5.6.2 Sparse Matrix Vector Multiplication (SpMV)
  5.7 Optimization Strategies for Adaptive RMT
    5.7.1 Lazy Fault Detection
    5.7.2 Thread Clustering
  5.8 Experimental Evaluation
    5.8.1 Fault Injection Framework
    5.8.2 Application Codes
    5.8.3 Performance Evaluation
  5.9 Summary
6 An Introspective Runtime Framework for Resilience
  6.1 Overview
  6.2 The State of the Art in HPC Execution Environments
  6.3 Introspection-Based Runtime System
    6.3.1 Reflecting on Faults
    6.3.2 Algorithm for Self-Reflection
  6.4 Resilience-Aware Scheduling
    6.4.1 Fault-Aware Thread Assignment Based on Runtime Introspection
    6.4.2 Implementation
  6.5 Resilience-Aware Dynamic Voltage and Frequency Scaling
    6.5.1 The Dynamic Voltage and Frequency Scaling Solution
    6.5.2 Reliability Impact of DVFS
    6.5.3 Resilience-Driven Policies for DVFS
    6.5.4 Implementation
  6.6 Integrating the Rolex Programming Model
    6.6.1 Leveraging Rolex Features
    6.6.2 Extending the Compiler-Runtime Interface
  6.7 Experimental Evaluation
    6.7.1 Fault Injection Framework
    6.7.2 Evaluation of Resilience-Aware Thread Assignment
    6.7.3 Evaluating Resilience-Aware DVFS
    6.7.4 Evaluation of Rolex-Based Runtime Introspection
  6.8 Summary
7 Conclusions and Future Work
  7.1 Contributions
    7.1.1 Resilience-Oriented Programming Model
    7.1.2 Adaptive Redundant Multithreading for Error Detection and Correction
    7.1.3 Introspective Runtime Framework for Resilience-Aware Execution Model
  7.2 Recommendations for Future Work
    7.2.1 Enhanced Programming Model Features
    7.2.2 Expansion of Runtime Management Capabilities
    7.2.3 Integration with Checkpoint and Roll-back Libraries
Reference List

List of Figures

2.1 Relationship between fault, error and failure
2.2 Propagation of errors in HPC systems: (a) Distribution of root causes of failures; (b) Distribution of system downtimes
2.3 Analyzing sources of failure in HPC systems: (a) Distribution of root causes of failures; (b) Distribution of system downtimes
2.4 Estimates of the relative increase in error rates as a function of process technology
2.5 Projections for exascale system features and mean time to failure
4.1 Themes of programmer knowledge to enhance application resilience
4.2 IEEE 754 floating point representation
4.3 Unsigned integer (32-bit) representation
4.4 Compiler infrastructure for Rolex
4.5 Decision tree for error management by runtime inference system
4.6 Overview of application compilation and execution with Rolex
4.7 Evaluation of tolerance Rolex extensions: Accelerated fault injection results
4.8 Evaluation of robustness Rolex extensions: Accelerated fault injection results
4.9 Evaluation of amelioration Rolex extensions: Accelerated fault injection results
5.1 Sphere of replication
5.2 Compiler infrastructure for redundant multithreading
5.3 Code outlining for redundant multithreaded execution
5.4 Code outlining and transformations for adaptive redundant multithreaded execution
5.5 Timeline view of adaptive redundant multithreading
5.6 Use of adaptive RMT for double precision matrix-matrix multiplication
5.7 Application of RMT directive for DGEMM
5.8 Use of adaptive RMT for sparse matrix vector multiplication
5.9 Application of RMT directive for SpMV
5.10 Timeline view of adaptive redundant multithreading with lazy evaluation
5.11 Results: Comparison of performance overhead of adaptive redundant multithreading with process replication
5.12 Results: Fault detection with adaptive redundant multithreading
5.13 Results: Fault detection and correction with adaptive redundant multithreading
5.14 Results: Fault detection with aRMT with lazy evaluation
5.15 Results: Fault detection with aRMT with thread clustering
6.1 Integration of Rolex with introspective runtime system
6.2 Results: Fault-aware thread scheduling
6.3 Results: Fault-aware dynamic voltage frequency scaling
6.4 Results: Fault-aware thread scheduling with Rolex
6.5 Results: Fault-aware dynamic voltage frequency scaling with Rolex

Abstract

Future exascale high-performance computing (HPC) systems will be constructed using VLSI devices with smaller feature sizes that will be far less reliable than those used today. Furthermore, in the pursuit of higher floating point operations per second (FLOPS), these systems are projected to exponentially increase the number of processor cores and memory chips used. Unfortunately, the mean time to failure (MTTF) of the system scales inversely in relation to the number of components. Therefore, faults and resultant system-level failures will become the norm, not the exception. This will pose significant problems for system designers and programmers who, for half a century, have enjoyed an execution model that assumed correct behavior by the underlying computing system. However, not every error detected needs to result in catastrophic failure. Many HPC applications are inherently fault resilient, but lack convenient mechanisms to express their resilience features to the execution environments, which are designed to be fault oblivious.

In this dissertation work, we propose an execution model based on the notion of introspection. We develop a set of resilience-oriented language extensions that facilitate the incorporation of fault resilience as an intrinsic property of scientific application codes. These extensions are supported by a simple compiler infrastructure and a runtime system that reasons about the context and significance of faults to the outcome of an application's execution. We extend the compiler infrastructure to provide an application-level methodology for fault detection and correction that is based on redundant multithreading (RMT). We also propose an introspective runtime framework that continuously observes and reflects upon the system-level fault indicators to assess the vulnerability of the system's resources. The introspective runtime system provides a unified execution environment that reasons about the implications of resource management actions for the resilience and performance of the application processes. Our results, which cover several high-performance computing applications and different fault types and distributions, demonstrate that a resilience-aware execution environment is important in order to solve the most demanding computational challenges using future extreme scale HPC systems.

Chapter 1

Introduction

By the end of this decade, exascale-class high-performance computing (HPC) systems promise to push the frontiers of scientific and engineering research in a broad range of disciplines including high energy and nuclear physics, chemistry and material sciences, nanotechnology, astrophysics, biology and medicine [1].
These next-generation exascale systems, capable of performing a quintillion (10^18) operations per second, will enable vastly more accurate predictive models and the analysis of massive data sets through simulation and modeling of physical phenomena.

There was a paradigm shift in the early 1990s from vector supercomputers to cluster-based parallel machines, built from commodity off-the-shelf (COTS) components [2]. The bulk synchronous parallel model based on the message passing interface (MPI) [3] libraries became the programming model of choice for writing parallel programs for these systems. The HPC community has persisted with this model over the past two decades, deriving performance from faster interconnects and rising clock frequencies [4]. While the growth of clock frequencies tapered off by the early 2000s with the end of Dennard scaling [5], Moore's law continues to enable shrinking transistor geometries in successive process technology generations. Recent supercomputing systems have seen their performance driven by massive increases in processor core and memory chip counts [4]. Based on these trends, various studies [6] [7] have projected that future systems will require an exponential increase in compute and memory resources. For exascale-class machines, the sheer scale of the system presents several important technical challenges for its designers, programmers and users. Notable among these is the system's ability to maintain operation in the presence of faults. The bounty provided by Moore's law has been accompanied by reliability challenges as devices have become smaller [8] [9]. The reliability of individual components is projected to continue decreasing in relation to the shrinking transistor geometries in successive process technology generations due to device manufacturing challenges, as well as variation in operational behavior [10]. Furthermore, the scale of an exascale-class supercomputer will amplify the reliability challenges caused by the probabilistic nature of transistor behavior. With the expectation that faults will become the norm and not the exception, a system's ability to tolerate faults and maintain service will become an increasingly difficult challenge.

1.1 Brief Overview

Traditionally, HPC systems have viewed faults as anomalous events that are dealt with, through the process of checkpoint and rollback recovery, only when they result in catastrophic failure. This requires that the state of the application is periodically written to persistent storage (referred to as a checkpoint). Recovery from failures is handled by restoring the application state to the last committed checkpoint and restarting the application. This approach has served the HPC community well over the past two decades.

As systems scale, the fraction of computing capacity dedicated to checkpoint and restart grows proportionally [11], which adds an increasing overhead to the applications' time-to-solution. Additionally, for the projected rates of faults in future exascale systems, HPC applications may also need to contend with multiple errors within a checkpointing interval. If the interval between errors happens to be less than the time required to create a checkpoint or recover state from persistent storage, then the checkpoint and rollback approach will no longer be a viable solution [12].
Moreover, checkpoint and restart approaches apply rollback recovery without evaluating the root cause of the failure (whether it resulted from a processor, memory, network or I/O fault), its persistence (whether it resulted from a transient, intermittent or permanent fault) or its impact on the application outcome (whether it affects an instruction, a pointer variable or an application data structure). Therefore, due to the diversity and frequency of faults expected in future exascale-class HPC systems, C/R approaches will not be viable or sufficient.

The application developers and users of HPC systems have grown accustomed to an execution model in which the application software can always assume correct behavior. In today's systems, the occurrence of any error is often masked by hardware-based mechanisms or by the operating system (OS) to provide the notion of continuous operation for an HPC application. It is only when the error cannot be masked by the hardware or the OS that the system is halted, which also terminates the application processes. With the rising rate of faults in future systems, present hardware-based masking techniques will prove inadequate, leading to fewer application runs reaching completion. In many cases, the application execution will complete, but with incorrect or invalid results.

The HPC workload consists of applications that are predominantly used for simulation and modeling in a variety of scientific and engineering disciplines. These applications consist of numerical computations that can often cope with the presence of errors, since they are designed to live with errors introduced by discretization, or errors introduced through assumptions and simplifications in the computational model. The applications tend to be based on various numerical algorithms that solve problems through an iterative process. Therefore, the algorithm may smooth over the effects of occasional errors. Certain codes allow detection and correction of errors with very low overheads at the application level through simple algorithmic techniques. However, by perpetuating the notion of a fault-oblivious execution environment, such application-level knowledge cannot be leveraged to enable the scientific applications to converge to reasonable solutions in the presence of errors.

1.2 Contributions

In this dissertation we put forth plausible new software-based strategies to manage the fault resilience of HPC applications. It is our hypothesis that the combination of an introspective runtime system and modest extensions to existing programming models, together with compiler-based transformations, can enable HPC applications to leverage their inherent resilience properties. This will enable applications to detect the presence of faults in their program state, reason about their significance to an application's outcome, and in certain cases, manage recovery. This approach enables an application and the execution environment to collaboratively manage the effect of system faults on the application's outcomes. It also enables the execution environment to become aware of its own fault-tolerant state, thus allowing reasoned utilization of algorithmic resilience strategies. Introspection also permits a system resource allocation that is cognizant of application-level fault resilience features.
Programmers of scientific applications are often well positioned to understand the various fault resilience features of their codes through their domain expertise and through experience gained while optimizing the code's performance. However, there are no convenient mechanisms to convey their fault tolerance knowledge to the system. While we cannot expose the users and programmers to the complexity of HPC system architectures, we can capture their knowledge through simple and minimal extensions to the existing programming models. We present a set of resilience-oriented language extensions (Rolex) [13] [14] [15] [16] that allow HPC programmers to specify their fault-tolerance knowledge and expectations as intrinsic features of the application code. In concert with compiler-based transformations and a runtime system, these extensions enhance the ability of HPC applications to tolerate, mask or ameliorate errors in the program state. For various HPC codes, our results show that errors which would otherwise be fatal to the application can be survived.

HPC applications running on future exascale-class systems will experience significantly higher rates of faults. Therefore, we believe that the applications themselves should be enabled with capabilities to actively search for and correct errors in their computations. Redundancy-based solutions often require complete duplication of the program data and/or code execution. Although such macroscale redundancy enables transparent error detection and recovery through the process of comparison and majority voting among the replicas of program state, it incurs significant overhead to the overall application performance. We present an application-level fault detection and correction method that is based on adaptive redundant multithreading (RMT) [17] [18] [19] [20]. We extend Rolex to enable the scope of redundant computation to be tailored by the application programmer. This approach allows a programmer to tune the redundancy to the extent required by an application's algorithmic features. The compiler outlines the programmer-defined code blocks to support detection/correction of errors in the computation. The outlined code sections are set up to be executed either in serial mode or by redundant thread instances. The runtime allows a trade-off between overhead to application performance and fault resilience by adapting the use of RMT to the rate of faults in the system. Our experimental results show that this approach yields much lower overheads to the application performance in comparison to macroscale process-level redundant execution.

The fault-oblivious nature of current execution models implies that resource management decisions are rarely affected by the reliability concerns of the application. The software stacks contain no feedback loop to observe and respond to fault events in the system. We present an introspective runtime system that monitors and uses the rate of fault events in the system [21] to assess the vulnerability of system resources. This enables proactive reliability management through dynamic reconfiguration of the execution environment to enhance its error resilience. We demonstrate a reliability-aware scheduling strategy in the context of symmetric multiprocessor (SMP) systems. We also demonstrate an introspection-based methodology that contemplates the reliability impact of thermal management techniques such as DVFS. The introspective runtime system also leverages the application's fault resilience features that are made explicit through the Rolex programming model.
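As a rough conceptual sketch of the redundant multithreading idea summarized above (and only a sketch: this is not the dissertation's compiler-generated code or Rolex directive syntax, and the kernel name and values are hypothetical), a programmer-selected computation can be executed by two redundant threads whose outputs are then compared to detect a silent error:

```c
/* Conceptual sketch only: run a programmer-scoped kernel twice and compare
 * the replicas. compute_kernel() and all values are hypothetical. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1024

/* The computation the programmer chooses to protect. */
static void compute_kernel(const double *in, double *out)
{
    for (int i = 0; i < N; i++)
        out[i] = sqrt(in[i]) * 2.0 + 1.0;
}

int main(void)
{
    double in[N], out_a[N], out_b[N];
    for (int i = 0; i < N; i++) in[i] = (double)i;

    /* Redundant execution: the same kernel runs in two thread instances
     * (the pragmas degrade gracefully to serial execution without OpenMP). */
    #pragma omp parallel sections
    {
        #pragma omp section
        compute_kernel(in, out_a);
        #pragma omp section
        compute_kernel(in, out_b);
    }

    /* Comparing the replicas detects a divergence, i.e., a silent error. */
    if (memcmp(out_a, out_b, sizeof out_a) != 0) {
        fprintf(stderr, "replica mismatch: recompute or roll back\n");
        return EXIT_FAILURE;
    }
    puts("replicas agree");
    return EXIT_SUCCESS;
}
```

With a third replica, majority voting among the outputs would also allow the error to be corrected, at correspondingly higher cost; Chapter 5 describes how the actual adaptive RMT mechanism scopes, outlines and throttles this redundancy.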
1.3 Organization

The remainder of this dissertation is organized as follows: In Chapter 2 we discuss the reliability challenges for future exascale supercomputing systems. We also provide some background on reliability terminology and metrics useful for understanding our resiliency techniques. Chapter 3 presents a survey of fault tolerance approaches used in HPC systems. We discuss their respective strengths and shortcomings and analyze their relevance and effectiveness in the presence of extremely high rates of faults, errors and failures. Our resilience-oriented language extensions (Rolex) are described in Chapter 4. We describe the design goals and philosophies behind Rolex as well as their formal syntax and semantics. We also explain the role of the compiler and runtime inference engine and describe several representative HPC application codes to demonstrate the viability of applying these language extensions. The application-level fault detection and correction strategy based on RMT is described in Chapter 5. In Chapter 6, we present the design and implementation of the introspective runtime system. We demonstrate how the introspective reliability management of system resources permits reasonable trade-offs between application resilience and performance. Finally, Chapter 7 concludes with a summary of our research contributions and recommended future directions for this work.

Chapter 2

Preliminaries: The Taxonomy of Resilient Computing

Fault tolerance, as an essential part of the computing system design process, was emphasized by John von Neumann in 1952 in "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components" [24]. For over half a century, reliability has been an active area of study and practice for specific computing applications - such as on-board computing for space missions, aviation control systems, automobile electronics and medical equipment - all of which require fault-tolerant operation regardless of the likelihood of faults. In high-performance computing, the most widely used fault tolerance techniques are based on coordinated checkpointing and rollback recovery. However, the performance and recoverability of these techniques are inadequate for modern supercomputing systems that are built by clustering thousands of NUMA and SMP nodes and use commodity off-the-shelf components. The analysis of system logs of large-scale HPC systems [25] [26] points to a rising rate of errors resulting in failures as systems scale out in the pursuit of higher performance. Reliable operation in future systems will also be affected by the decrease in dependability of components due to shrinking transistor devices. Based on the current trends and forward-looking projections on the reliability of semiconductor devices and large-scale system architectures, studies indicate the need for fault resilience as an essential capability for emerging high-performance computing platforms and execution environments [7] [11] [27] [28] [29].

In this chapter we review the background and the terminology for understanding the reliability challenge (Section 2.1 - Section 2.6).
Much of the description follows from work by the committees of the IEEE Computer Society's Technical Committee on Fault Tolerant Computing and the IFIP (International Federation for Information Processing) Working Group on Dependable Computing and Fault Tolerance presented in [30] [31] [32]. For more detailed background reading, we refer the reader to classic texts [33] [34] [35] [36] [37] and survey papers [38] [39] [40]. In Section 2.7, we seek to understand the reliability challenges for future exascale-class supercomputing systems.

2.1 Dependability

Dependability is defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance [41] as the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers. Reliability, availability, safety and security are specific attributes of the dependability of the computing system. Relevant to this dissertation are:

• Availability, which is defined as readiness for correct service.

• Reliability, which is defined as continuity of correct service.

2.2 Relationship between Fault, Error and Failure

The terms fault, error and failure are sometimes used interchangeably. However, in the fault tolerance literature, these terms are associated with distinct formal concepts, which are defined as follows [32]:

A fault is an underlying flaw or defect in a system that has the potential to cause problems. A fault can be dormant and have no effect, e.g., incorrect program code that lies outside the execution path. When activated during system operation, a fault leads to an error. Fault activation may be due to triggers that are internal or external to the system.

Errors result from the activation of a fault and cause an illegal system state. For example, a faulty assignment to a loop counter variable may result in an error characterized by an illegal value for that variable. When such a variable is used to control a for-loop's execution, it may lead to incorrect program behavior.

A failure occurs if an error reaches the service interface of a system, resulting in system behavior that is inconsistent with the system's specification. For example, a faulty assignment to a pointer variable leads to erroneous accesses to a data structure or a buffer overflow, which in turn may cause the program to crash due to an attempt to access an out-of-bounds memory location.

There is a causality relationship between fault, error and failure, as shown in Figure 2.1. When the system is composed of multiple components, the failure of a single component causes a permanent or transient external fault for the other components that receive service from the failed component. Therefore, errors may be transformed into other errors and propagate through the system, generating further errors, as illustrated in Figure 2.2. For example, a faulty procedure argument leads to erroneous computation and may manifest as a failure in the form of an illegal procedure return value. To the caller of the function, this may activate a chain of errors that propagate until service failure occurs, i.e., a program crash.

Figure 2.1: Relationship between fault, error and failure
Figure 2.2: Propagation of errors in multi-component systems
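To make the causality chain concrete, the following minimal C fragment (an illustration constructed for this discussion, not an example drawn from the cited literature) shows a dormant fault, its activation as an error, and the resulting failure:

```c
/* Illustrative only: the off-by-one loop bound is the (dormant) fault; the
 * out-of-range index it produces at run time is the error; the crash or the
 * corrupted output that reaches the user is the failure. */
#include <stdio.h>

#define LEN 8

int main(void)
{
    double buf[LEN] = {0};

    /* Fault: the bound should be i < LEN, not i <= LEN; it lies dormant
     * until this path executes. */
    for (int i = 0; i <= LEN; i++) {
        /* Error: when i == LEN, the index is illegal program state. */
        buf[i] = 1.0;            /* out-of-bounds write */
    }

    /* Failure: the erroneous state reaches the service interface, e.g. as a
     * crash (buffer overflow) or as an incorrect printed result. */
    double sum = 0.0;
    for (int i = 0; i < LEN; i++) sum += buf[i];
    printf("sum = %f\n", sum);
    return 0;
}
```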
2.3 Fault Tolerance and Resilience

Fault tolerance is an aspect of reliability that entails avoiding service failures in the presence of faults. This usually involves determining the possible causes and/or consequences of failure and providing mechanisms that prevent the faults from causing failures [40].

Resilience is an approach to fault tolerance that places emphasis on the correctness of the applications. In the context of high-performance computing, the term was defined by various technical studies and working groups tasked with identifying the key challenges to exascale computing [28]: Resilience is an approach to fault tolerance for High End Computing (HEC) systems that seeks to keep the application workloads running to correct solutions in a timely and efficient manner in spite of frequent errors.

In other words, resiliency techniques embrace the fact that the underlying fabric of hardware and system software will be unreliable and seek to enable effective and resource-efficient use of the platform in the presence of system degradations and failures. In the context of HPC, the need to define resilience as distinct from fault tolerance arose because the workload consists of scientific simulations and data analysis applications. These applications typically represent continuous variables as discrete values, and there is already some discretization error in the data. The computations often contain solvers for systems of linear equations, and for partial and ordinary differential equations. At the core of these applications are numerical analysis methods that do not seek exact answers, because exact answers are often impossible to obtain in practice. Iterative algorithms start from an initial guess, and successive approximations converge to a solution which is within the bounds of a certain error. The expectation of correctness for different regions of the program state of these applications may therefore be different. For certain regions, correctness may imply bit reproducibility, while for others it may be defined within the bounds of a rounding error [29]. Resiliency techniques focus on the fidelity of the application outcome based on these definitions of correctness, rather than seeking to maintain the notion of perfect program state throughout the application execution.

2.4 Classification of Faults

Faults may be categorized based on the impact on the application's address space and the effect on the eventual outcome of the execution. They may also be categorized on the basis of their persistence in time. We detail the categories for these classifications in Section 2.4.1 and Section 2.4.2, since these are most relevant to the resiliency techniques presented in this dissertation. Other classifications based on various considerations such as root cause, instantiation with respect to system boundaries, dimension, etc., are possible, and [32] contains a complete taxonomy and descriptions of each class of faults. In designing HPC resilience strategies, while we must be aware of the different types of faults and the trends behind their rising numbers, it is not necessary to understand the root cause of every possible fault and its activation sequence. Rather, a more general concept called the fault model describes the impact of the fault on the HPC application independent of the underlying phenomena causing it.

2.4.1 Based on the Impact on the Application

• Detected and Corrected Errors (DCE): Such errors are detected and potentially corrected transparently without affecting the application's execution. Example: a single bit error detected and corrected by hardware-based mechanisms such as error correction codes, parity, etc.
Such mechanisms usually raise notifications to the operating system for logging purposes, but the hardware-based correction ensures that the application's address space is not affected.

• Detected but Unrecoverable Errors (DUE): These errors are detected but cannot be masked. DUEs typically lead to a catastrophic crash of the application run. Example: a double bit error on a DRAM line which is protected by ECC may be detected, but the single-bit error correction, double-bit error detection (SECDED) capability provided by ECC is inadequate to mask the error. A notification is raised to the operating system via a non-maskable interrupt, which in turn forces a system shutdown, terminating the application processes.

• Silent Errors (SE): A silent fault is one which remains undetected by any detection mechanism. It may affect part of the program state, but its impact is not visible at the application level, i.e., it is benign. If the fault causes an error in the application's visible state, it is referred to as a silent data corruption (SDC). SDC errors do not always lead to an application crash; in certain cases, the application may complete, but with incorrect results.

This classification serves to select a strategy to manage the impact on the application. For example, SDC requires error detection as well as correction capabilities, whereas a DUE requires understanding the part of the application state affected and determining whether the error can be masked or whether the application cannot continue. DCE notifications can be used to effect proactive migrations or future resource assignments.

2.4.2 Based on Duration of the Fault

Faults and errors can be transient, permanent, or intermittent in nature.

• Transient: A transient fault occurs once but does not persist; the error resulting from the transient fault is called a soft error or single-event upset (SEU).

• Intermittent: An intermittent fault occurs repeatedly at irregular intervals in a system that otherwise operates normally; it manifests itself as an intermittent error.

• Permanent: A permanent fault persists after it occurs, and is called a "hard fault". The resulting error usually manifests itself repeatedly until the component is repaired or replaced.

The persistence of the fault determines the recovery strategy. Permanent faults usually require that the faulty component be avoided until it is repaired or replaced. On the other hand, transient faults do not require repair/replacement of the component, but the impact of the resulting soft error needs to be masked. Intermittent faults may be treated as transient or permanent depending on their location and frequency.

2.5 Recovery Techniques

Fault masking entails affecting an anomaly in the system so that a fault does not become an error. Masking may be performed at the logical level, in the architecture state, or even at the application level. Recovery techniques seek to prevent a fault or error from causing a catastrophic failure of the system.

2.5.1 Fault Recovery Techniques

Fault recovery prevents the activation of faults that lead to error states; in certain cases, such techniques also seek to prevent the reactivation of faults. These techniques include:

• Diagnosis: identification of the location and type of fault.

• Isolation: exclusion of a faulty component to prevent its interaction with other system components; this avoids further fault activations in the other components.
• Reconfiguration: online replacement of a faulty component or migration of tasks to other fault-free components in the system.

• Reinitialization: reset of the system to a pristine state by reinitialization of the faulty components.

2.5.2 Error Recovery Techniques

Error recovery techniques eliminate the error condition and attempt to place the system in a correct state:

• Roll-back recovery: the system is restored to a known correct state using a previously created checkpoint, or by using retry mechanisms.

• Roll-forward recovery: the system state is moved forward to a new correct state.

• Compensation recovery: the error is corrected by maintaining redundant state information.

2.6 Reliability Metrics

2.6.1 Mean Time to Failure (MTTF)

MTTF is the measure of the average time elapsed until failure of the system. Often a component failure notification is communicated via an interrupt, and the equivalent term "mean time to interrupt (MTTI)" is also found in the literature. The HPC community also defines the terms "job mean time to interrupt (JMTTI)", "node mean time to interrupt (NMTTI)" and "system mean time to interrupt (SMTTI)" [29] [42].

2.6.2 Mean Time to Repair (MTTR)

The "mean time to repair (MTTR)" is the average length of an unscheduled system downtime period necessary to repair or replace component(s) in the system and restore its operational status.

2.6.3 Mean Time Between Failures (MTBF)

"Mean time between failures (MTBF)" is the average elapsed time between failures. It is similar to MTTF, except that it also considers the time to repair the failed component(s):

MTBF = MTTF + MTTR    (2.1)

2.6.4 Reliability

Reliability is qualitatively defined as the ability of a system to operate correctly under a given set of conditions for a specified period of time. The reliability of a system at time t is the probability that the system has been operating correctly from t_0 until time t. If the system consists of m components, and the MTBF of any component i is independent of all other components, the reliability R of the system is:

R = 1/MTBF_1 + 1/MTBF_2 + 1/MTBF_3 + ... + 1/MTBF_m    (2.2)

2.6.5 Availability

Availability is a measure of the system's reliability. The availability of a system at time t is the probability that the system is operating correctly at time t. It is also expressed as the fraction of the total time that the complete system, or at least one of its components, is in functioning condition:

A = Uptime / (Uptime + Downtime)    (2.3)

Availability is specified in terms of a "number of nines." For example, when a system has availability of "six nines," it has 99.9999% availability. The availability may also be expressed in terms of the MTTF, MTTR and MTBF:

Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF    (2.4)

2.6.6 Failure in Time (FIT)

The failures in time (FIT) rate of a component or a system is the number of failures incurred over one billion (10^9) hours of operation. The FIT rate is inversely proportional to MTTF. When calculating the FIT rate of the complete system, the FIT rates of the individual components are additive:

FIT_system = FIT_A + FIT_B + ... + FIT_N    (2.5)

such that the failure of any one component results in failure of the system.

2.6.7 Workload Efficiency

In order to quantify the merit of any resilience mechanism, its overhead to the application's time-to-solution is measured by the workload efficiency [29]. The workload efficiency for an application is the ratio of the ideal time-to-solution on a fault-free system to the actual running time in the presence of faults:

Efficiency = t_fault-free / t_actual-in-presence-of-faults    (2.6)

The difference between t_fault-free and t_actual-in-presence-of-faults is the overhead associated with dealing with faults, errors, and failures; this includes the time for fault detection, diagnosis, repair and/or recovery and compensation, as well as any application time lost due to degraded performance of any system resource.
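As a quick numerical illustration of how these metrics compose (the figures used below are arbitrary assumptions, not measurements from this dissertation), the following small C program evaluates Equations 2.1, 2.4, 2.5 and 2.6 for a hypothetical component and node:

```c
/* Worked example of the Section 2.6 metrics with assumed values:
 * 24 h MTTF, 1 h MTTR, a node built from 16 identical parts, and a job whose
 * fault-free time-to-solution is 100 time units but which takes 125. */
#include <stdio.h>

int main(void)
{
    double mttf_hours = 24.0;                       /* mean time to failure */
    double mttr_hours = 1.0;                        /* mean time to repair  */
    double mtbf_hours = mttf_hours + mttr_hours;    /* Eq. 2.1              */
    double availability = mttf_hours / mtbf_hours;  /* Eq. 2.4              */

    /* FIT: failures per 10^9 hours; component FIT rates add up (Eq. 2.5). */
    double fit_component = 1.0e9 / mttf_hours;
    double fit_node      = 16.0 * fit_component;

    /* Workload efficiency (Eq. 2.6). */
    double t_ideal = 100.0, t_actual = 125.0;
    double efficiency = t_ideal / t_actual;

    printf("MTBF = %.1f h, availability = %.4f\n", mtbf_hours, availability);
    printf("node FIT = %.3g, workload efficiency = %.2f\n", fit_node, efficiency);
    return 0;
}
```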
Figure 2.3: Analyzing sources of failure in HPC systems: (a) Distribution of root causes of failures; (b) Distribution of system downtimes

2.7 The Resilience Challenge for Future Exascale High-Performance Computing Systems

Various studies have sought to understand failures in large-scale HPC systems, including their root cause analysis and whether the causality can be statistically modeled. A study at the Los Alamos National Laboratory (LANL) [26], which analyzes the failure log data of their production HPC systems, attributes 64% of failures to hardware errors, as indicated by the summary of their results in Figure 2.3. As many as 40% of these failures are attributed to memory subsystem errors. Similar analyses of the operational logs from the IBM Blue Gene/L (BG/L) supercomputer at the Lawrence Livermore National Laboratory (LLNL) and the RedStorm supercomputer at the Sandia National Laboratories (SNL) reveal that 62% and 93% of the events, respectively, were attributed to hardware anomalies, with root causes ranging from bus parity errors, data TLB and address exceptions, to disk errors that affect the file system [25].

Figure 2.4: Estimates of the relative increase in error rates as a function of process technology

DRAM errors are one of the leading causes of HPC system failures, and studies of the system logs of the IBM Blue Gene/L (BG/L) at LLNL and the Blue Gene/P (BG/P) at the Argonne National Laboratory (ANL) note that more than 50% of the nodes are affected by more than 100 DRAM errors per month, while 5% of the nodes experience in excess of 1 million DRAM errors per month. A significant fraction of errors on the Blue Gene/P (22%) are multi-bit errors (MBE), and 17% of all errors are not correctable with the SECDED capability offered by ECC [43]. The Jaguar supercomputer at the Oak Ridge National Laboratory (ORNL) experienced about 20 faults per hour, with as many as 861 machine exceptions and 12 kernel panics (node crashes) over a 72-hour observation period [44]. The analysis of the memory-related errors on Jaguar estimated a failure rate of 0.057-0.071 FIT/Mbit, which for the total system memory of 584 TB translated to one DRAM-related failure every three to four hours [45].

These studies indicate that faults are not rare events in large-scale systems and that the distribution of failure root causes is dominated by faults that originate in hardware. These may include faults due to radiation-induced effects such as particle strikes from cosmic radiation, circuit aging-related effects, and faults due to chip manufacturing defects and design bugs that remain undetected during post-silicon validation and manifest themselves during system operation. With aggressive scaling of CMOS devices, the amount of charge required to upset a gate or memory cell is decreasing with every process shrink. For very fine transistor feature sizes, the lithography used in patterning transistors causes variations in transistor geometries such as line-edge roughness, body thickness variations and random dopant fluctuations.
These lead to variations in the electrical behavior of individual transistor devices, and this manifests itself at the circuit level in the form of variations in circuit delay, power, and robustness [46]. It is expected that future exascale-capability systems will use components that have transistor feature sizes between 5 nm and 7 nm, and that these effects will become more prevalent, thereby causing the system components to be increasingly unreliable. The estimates of error rates as a function of semiconductor process technology are shown in Figure 2.4 [27]. The modeling and mitigation of these effects through improved manufacturing processes and circuit-level techniques might prove too difficult or too expensive.

Today's petascale-class HPC systems already employ millions of processor cores and memory chips to drive HPC application performance. The recent trends in system architectures suggest that future exascale-class HPC systems will be built from hundreds of millions of components organized in complex hierarchies. However, with the growing number of components, the overall reliability of the system decreases proportionally. If p is the probability of failure of an individual component and the system consists of N components, the probability that the complete system works is (1 - p)^N when the component failures are independent. It may therefore be expected that some part of an exascale-class supercomputing system will always be experiencing failures or operating in a degraded state. The drop in the MTTF of the system is expected to be dramatic based on the projected system features (Figure 2.5) [7].

Figure 2.5: Projections for exascale system features and mean time to failure

In future exascale-class systems, the unreliability of chips due to transistor scaling issues will be amplified by the large number of components. For long-running scientific simulations and analysis applications that will run on these systems, the accelerated rates of system failures will mean that their executions will often terminate abnormally, or in many cases, complete with incorrect results. Finding solutions to these challenges will therefore require a concerted and collaborative effort on the part of all the layers of the system stack.
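The following back-of-the-envelope sketch works through this scaling argument with assumed values for p and for the per-component MTBF. The approximation that the system MTBF shrinks as the per-component MTBF divided by N follows from the additivity of failure rates of independent components (cf. Equations 2.2 and 2.5); the numbers are illustrative assumptions, not projections taken from [7].

```c
/* Illustrative scaling sketch: P(system works) = (1 - p)^N for N independent
 * components, and system MTBF ~ per-component MTBF / N. All values assumed. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 1.0e-6;                    /* per-component failure probability */
    double mtbf_component_years = 5.0;    /* assumed per-component MTBF        */

    for (double n = 1e3; n <= 1e8; n *= 10.0) {
        double p_system_ok   = pow(1.0 - p, n);
        double mtbf_system_h = mtbf_component_years * 24.0 * 365.0 / n;
        printf("N = %8.0e  P(system works) = %.4f  system MTBF = %10.4f h\n",
               n, p_system_ok, mtbf_system_h);
    }
    return 0;
}
```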
Chapter 3

Survey of Resilience Approaches in High-Performance Computing

The concept of fault tolerance was formulated by Avizienis [34] and refers to a system capability that permits programs to be correctly executed despite the occurrence of logic faults. The designers of early computers used practical techniques to increase reliability: redundant structures to mask failed components, error-control codes and duplication or triplication with voting to detect or correct information errors, diagnostic techniques to locate failed components, and automatic switchovers to replace failed subsystems [40].

Most innovations and advances in HPC are driven by the quest for higher system performance. For the designers and practitioners of HPC, incorporating fault-tolerance capabilities has traditionally been an afterthought rather than an essential part of the system design process. While performance has risen from gigaflops to petaflops over the past two decades, the state of the art in reliability for these large-scale HPC systems has barely advanced beyond checkpoint and rollback (C/R) techniques during that period of time. C/R approaches require that the application be paused while the checkpoint is committed to stable storage; recovery entails replaying all computation and communication from the last stable checkpoint. This scheme incurs much overhead on the application's performance, due to large program address spaces and limited disk bandwidths.

In this chapter we survey the various fault-tolerance techniques relevant to HPC. These include practical C/R approaches that are widely used in production HPC systems today (Section 3.1), as well as promising research proposals. The latter include redundancy-based approaches (Section 3.2), algorithm-based fault tolerance (ABFT) (Section 3.3), and approaches that propose modifications to existing programming models to include fault-resilience capabilities (Section 3.4).

3.1 Checkpoint and Rollback Techniques

C/R approaches are based on the concept of capturing the state of the application at key points of the execution, which is then saved to persistent storage. Upon detection of a failure, the application state is restored from the latest disk-committed checkpoint, and execution resumes from that point. The Condor standalone checkpoint library [48] was developed to provide checkpointing for UNIX processes, while the Berkeley Labs C/R library [49] was developed as an extension to the Linux OS. The libckpt library [50] provided similar OS-level process checkpointing, albeit based on programmer annotations.

In the context of parallel distributed computing systems, checkpointing requires global coordination, i.e., all processes on all nodes are paused until all messages in flight and those in queue are delivered, at which point all the processes' address spaces, register states, etc., are written to stable storage, generally a parallel file system, through dedicated I/O nodes. The significant challenge in these efforts is the coordination among processes so that later recovery restores the system to a consistent state. These approaches typically launch daemons on every node that form and maintain communication groups, which allow tracking and managing recovery by maintaining the configuration of the communication system. The failure of any given node in the group is handled by restarting the failed process on a different node, by restructuring the computation, or through transparent migration to another node [51] [52] [53].

Much work has also been done to optimize the process of C/R. A two-level recovery scheme proposed optimizing the recovery process for the more probable failures, so that these incur a lower performance overhead while the less probable failures incur a higher overhead [54]. The scalable checkpoint/restart (SCR) library [55] proposes multilevel checkpointing, where checkpoints are written to storage that uses RAM, flash, or a local disk drive, in addition to the parallel file system, to achieve much higher I/O bandwidth. Oliner et al. propose an opportunistic checkpointing scheme that writes checkpoints that are predicted to be useful - for example, when a failure in the near future is likely [56]. Incremental checkpointing dynamically identifies the changed blocks of memory since the last checkpoint through a hash function [57] in order to limit the amount of state required to be captured per checkpoint. Data aggregation and compression also help reduce the bandwidth requirements when committing the checkpoint to disk [58]. Plank et al. eliminate the overhead of writing checkpoints to disk altogether with a diskless in-memory checkpointing approach [59].
In general, automatic application-oblivious checkpointing approaches suffer from scaling issues due to the considerable I/O bandwidth required for writing to persistent storage. Also, practical implementations tend to be fragile [11]. Therefore, several MPI libraries have been enabled with capabilities for C/R [60]. The CoCheck MPI [61], based on the Condor library, uses synchronous checkpointing in which all MPI processes commit their message queues to disk to prevent messages in flight from getting lost. The FT-MPI [62], Open MPI [63], MPICH-V [64] and LAM/MPI [65] implementations followed suit by incorporating similar capabilities for C/R. In these implementations, the application developers do not need to concern themselves with failure handling; the failure detection and application recovery are handled transparently by the MPI library, in collaboration with the OS.

Petascale systems today spend as much as 20% or more of their computing capacity [11] on checkpoint and rollback recovery. Future exascale systems will need to gather state from thread counts on the order of a billion threads. There is concern that the ratio of computing capacity spent on C/R may rise to the point that the applications themselves make very little forward progress [66]. Some studies [12] even predict that future exascale systems might experience failures so frequently that the MTTF of the system might become smaller than the time needed to create the checkpoint and commit it to persistent storage, and/or the time required to restore state from disk and restart the application, in which case C/R will no longer be a sufficient mechanism to recover applications from node failures.
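For reference, the basic mechanism that all of these C/R systems elaborate on can be sketched at the application level in a few lines of C. This single-process sketch is illustrative only (the file name, state and checkpoint interval are arbitrary assumptions); coordinated multi-node implementations such as those cited above must additionally quiesce communication and maintain a globally consistent state.

```c
/* Minimal application-level checkpoint/rollback sketch: the solver state is
 * written to a file every few iterations; on restart, the last committed
 * checkpoint is read back and execution resumes from that point. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000
#define STEPS 500
#define CKPT_INTERVAL 50
#define CKPT_FILE "state.ckpt"   /* hypothetical checkpoint file name */

static int try_restore(double *x, int *step)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;
    int ok = fread(step, sizeof *step, 1, f) == 1 &&
             fread(x, sizeof *x, N, f) == (size_t)N;
    fclose(f);
    return ok;
}

static void checkpoint(const double *x, int step)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) return;
    fwrite(&step, sizeof step, 1, f);
    fwrite(x, sizeof *x, N, f);
    fclose(f);                    /* commit the checkpoint to storage */
}

int main(void)
{
    static double x[N];
    int start = 0;

    if (!try_restore(x, &start))                  /* rollback on restart    */
        for (int i = 0; i < N; i++) x[i] = 2.0;   /* fresh initial state    */

    for (int step = start; step < STEPS; step++) {
        for (int i = 0; i < N; i++)               /* one solver iteration   */
            x[i] = 0.5 * (x[i] + 1.0 / x[i]);
        if ((step + 1) % CKPT_INTERVAL == 0)
            checkpoint(x, step + 1);
    }
    printf("x[0] = %f\n", x[0]);
    return 0;
}
```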
Single bit-error correction and double bit-error detection (SECDED) is the most widely used variant of ECC, while researchers have also explored Bose-Chaudhuri-Hocquenghem (BCH) codes and double-bit error correction with triple-bit error detection (DECTED) [73] for multi-bit detection and correction. Chipkill [74] is a stronger memory protection scheme that is widely used in production HPC systems. It accommodates the failure of a single DRAM memory chip, as well as multi-bit errors from any portion of a single memory chip, by interleaving bit error-correcting codes across multiple memory chips. In order to enhance Chipkill with more robust error detection and correction schemes, memory designers will need to incorporate further redundant information in the memory lines, which increases chip area as well as the overheads to power and data access latencies.

Software-based redundancy promises to offer more flexibility and tends to be less expensive in terms of silicon area as well as chip development and verification costs; it also eliminates the need for modifications to architectural specifications. Process-level redundancy (PLR) [75] creates a set of redundant application processes whose output values are compared. The scheduling of the redundant processes is left to the operating system (OS). Software redundancy through multithreading has also been explored. SWIFT [76] is a compiler-based transformation which duplicates all program instructions and inserts comparison instructions during code generation so that the duplicated instructions fill the scheduling slack. The DAFT [77] approach uses a compiler transformation that duplicates the entire program in a redundant thread that trails the main thread and inserts instructions for error checking. SRMT [78] uses compiler analysis to generate redundant threads mapped to different cores in a chip multiprocessor and optimizes performance by minimizing data communication between the main thread and the trailing redundant thread. The process-level redundancy approach has also been evaluated in the context of an MPI library implementation [79], where each MPI rank in the application is replicated and the replica takes the place of a failed rank, allowing the application to continue. The RedMPI library [80] replicates MPI tasks and compares the received messages between the replicas in order to detect corruptions in the communication data. However, with the growing complexity of long-running scientific applications, complete multi-modular redundancy, whether through hardware- or software-based approaches, will incur exorbitant overheads in cost, performance and energy, and is not a scalable solution for wide use in future exascale-class HPC systems.

3.3 Algorithm-Based Fault Tolerance

Algorithm-based fault tolerance (ABFT) schemes encode the application data to detect and correct errors, e.g., through the use of checksums on dense matrix structures. The algorithms are modified to operate on the encoded data structures. ABFT was shown to be an effective method for application-layer detection and correction by Huang and Abraham [81] for a range of basic matrix operations, including addition, multiplication, scalar product and transposition. Such techniques were also proven effective for LU factorization [82], Cholesky factorization [83] and QR factorization [84].
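To illustrate the Huang-Abraham style of encoding, the following sketch (written for this survey; the matrix size and tolerance are arbitrary) maintains an extra column of row sums and an extra row of column sums for a dense matrix, and uses them to locate and repair a single corrupted element.

    #include <math.h>
    #include <stddef.h>

    #define N   4        /* illustrative matrix dimension */
    #define TOL 1e-9     /* illustrative comparison tolerance */

    /* a is (N+1) x (N+1): a[i][N] holds the row-i checksum, a[N][j] the column-j checksum. */
    static void encode_checksums(double a[N + 1][N + 1])
    {
        for (size_t i = 0; i < N; i++) {
            a[i][N] = 0.0;
            for (size_t j = 0; j < N; j++) a[i][N] += a[i][j];
        }
        for (size_t j = 0; j < N; j++) {
            a[N][j] = 0.0;
            for (size_t i = 0; i < N; i++) a[N][j] += a[i][j];
        }
    }

    /* Returns 1 if the matrix is consistent, possibly after repairing a single
     * corrupted element located at the intersection of the mismatching row and
     * column; returns 0 if the damage cannot be localized this way. */
    static int detect_and_correct(double a[N + 1][N + 1])
    {
        int bad_row = -1, bad_col = -1;
        double delta = 0.0;
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j < N; j++) s += a[i][j];
            if (fabs(s - a[i][N]) > TOL) { bad_row = i; delta = a[i][N] - s; }
        }
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int i = 0; i < N; i++) s += a[i][j];
            if (fabs(s - a[N][j]) > TOL) bad_col = j;
        }
        if (bad_row < 0 && bad_col < 0) return 1;      /* no error detected  */
        if (bad_row >= 0 && bad_col >= 0) {
            a[bad_row][bad_col] += delta;              /* repair the element */
            return 1;
        }
        return 0;                                      /* checksum itself corrupted */
    }

The scheme is attractive for matrix multiplication in particular because the product of a column-checksummed A and a row-checksummed B is itself a fully checksummed matrix, so the encoding is preserved by the operation and can be verified cheaply afterwards.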
Several papers propose improvements for better scalability on parallel systems, providing broader error detection and correction coverage with lower application overheads [85] [86] [87]. The checksum-based detection and correction methods tend to incur very high performance overheads in sparse matrix-based applications. Sloan et al. [89] have proposed fault detection techniques that employ approximate random checking and approximate clustered checking by leveraging the diagonal, banded diagonal, and block diagonal structures of sparse problems. Algorithm-based recovery for sparse matrix problems has been demonstrated through error localization and re-computation [90] [91].

Various studies have evaluated the fault resilience of solvers of linear algebra problems [92]. Iterative methods, including Jacobi, Gauss-Seidel and its variants, the conjugate gradient, the preconditioned conjugate gradient, and multigrid, begin with an initial guess of the solution and iteratively approach a solution by reducing the error in the current guess until a convergence criterion is satisfied. Such algorithms have proved to be tolerant of errors, on a limited basis: depending on the magnitude of the perturbation, the calculations typically require a larger number of iterations to converge, but eventual convergence to a correct solution is possible. Algorithm-based error detection in the multigrid method, shown by Mishra et al. [93], uses invariants that enable checking for errors in the relaxation, restriction and interpolation operators.

For fast Fourier transform (FFT) algorithms, an error-detection technique called the sum-of-squares (SOS) was presented by Reddy et al. [94]. This method is effective for a broader class of problems called orthogonal transforms and is therefore applicable to QR factorization, singular-value decomposition, and least-squares minimization. Error detection in the result of the FFT is also possible using weighted checksums on the input and output [95].

While the previously discussed methods are primarily for numerical algorithms, fault tolerance for other scientific application areas has also been explored. In molecular dynamics (MD) simulations, the property that pairwise interactions are anti-symmetric (F_ij = -F_ji) may be leveraged to detect errors in the force calculations [97]. The resilience of the Hartree-Fock algorithm, which is widely used in computational chemistry, can be significantly enhanced through checksum-based detection and correction for the geometry and basis set objects. For the two-electron integrals and Fock matrix elements, knowing their respective value bounds allows for identifying outliers and correcting them with reasonable values from a range of known correct values. The iterative nature of the Hartree-Fock algorithm helps to eliminate the errors introduced by the interpolated values [98]. The fault-tolerant version of the 3D protein reconstruction algorithm (FT-COMAR) proposed by Vassura et al. [99] is able to recover from errors in as many as 75% of the entries of the contact map.

3.4 Programming Model-Based Resilience Techniques

HPC programs usually deploy a large number of nodes to implement a single computation and use MPI with a flat model of message exchange in which any node can communicate with any other. Every node that participates in a computation acquires dependencies on the states of the other nodes.
Therefore, the failure of a single node results in the failure of the entire computation, since the message passing model lacks well-defined failure containment capabilities [11]. User-level failure mitigation (ULFM) [100] extends MPI by encouraging programmer involvement in failure detection and recovery through a fault-tolerance API for MPI programs. The error handling of the communicator is changed from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN so that error recovery may be handled by the user. The proposed API includes MPI_COMM_REVOKE and MPI_COMM_SHRINK to enable reconstruction of the MPI communicator after process failure, and MPI_COMM_AGREE as a consistency check to detect failures when the programmer deems such a sanity check necessary in the application code.

The abstraction of the transaction has also been proposed to capture a programmer's fault-tolerance knowledge. This entails dividing the application code into blocks whose results are checked for correctness before proceeding. If the code block execution's correctness criteria are not met, the results are discarded and the block can be re-executed. Such an approach was explored for HPC applications through a programming construct called Containment Domains by Sullivan et al. [101], which is based on weak transactional semantics. It enforces the check for correctness of the data values generated within the containment domain before they are communicated to other domains. These containment domains can be hierarchical and provide the means to locally recover from an error within that domain. A compiler technique that, through static analysis, discovers regions that can be freely re-executed without checkpointed state or side-effects, called idempotent regions, was proposed by de Kruijf et al. [102]. Their original proposal [103], however, was based on language-level support for C/C++ that allowed the application developer to define idempotent regions through the specification of relax blocks and recover blocks that perform recovery when a fault occurs. The FaultTM scheme adapts the concept of hardware-based transactional memory, where atomicity of computation is guaranteed. The approach requires the application programmer to define vulnerable sections of code. For such sections, a backup thread is created; the original and the backup thread are executed as an atomic transaction, and their respective committed result values are compared [104].

Complementary to approaches that focus on the resiliency of computational blocks, the Global View Resilience (GVR) project [105] concentrates on application data and guarantees resilience through multiple snapshot versions of the data, whose creation is controlled by the programmer through application annotations. Bridges et al. [106] proposed malloc_failable, which uses a callback mechanism to handle memory failures on dynamically allocated memory, so that the application programmer can specify recovery actions. The Global Arrays implementation of the Partitioned Global Address Space (PGAS) model presents a global view of multidimensional arrays that are physically distributed among the memories of processes. Through a set of library APIs for checkpoint and restart with bindings for C/C++/FORTRAN, the application programmer can create checkpoints of array structures. The library guarantees that updates to the global shared data are fully completed and any partial updates are prevented or undone [107].
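The following sketch illustrates the ULFM usage pattern described at the beginning of this section: errors are returned to the caller, a communicator of survivors is rebuilt, and the failed step is retried. It uses the MPIX_-prefixed names of the ULFM reference implementation rather than the proposed MPI_COMM_* names, and the work function is a placeholder; exact names and error codes may differ across MPI versions.

    #include <mpi.h>

    static void do_step(MPI_Comm comm, int step) { (void)comm; (void)step; /* application work; placeholder */ }

    void resilient_loop(MPI_Comm world, int nsteps)
    {
        MPI_Comm comm;
        MPI_Comm_dup(world, &comm);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   /* report errors instead of aborting */

        for (int step = 0; step < nsteps; step++) {
            int rc = MPI_Barrier(comm);                      /* stands in for real communication  */
            if (rc != MPI_SUCCESS) {
                MPI_Comm survivors;
                MPIX_Comm_revoke(comm);                      /* interrupt any pending operations  */
                MPIX_Comm_shrink(comm, &survivors);          /* communicator of surviving ranks   */
                MPI_Comm_free(&comm);
                comm = survivors;
                step--;                                      /* retry the failed step             */
                continue;
            }
            do_step(comm, step);
        }
        MPI_Comm_free(&comm);
    }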
Most programming model approaches advocate collaborative management of the reliability requirements of applications through a programmer interface in conjunction with compiler transformations, a runtime framework and/or library support. Each approach requires a different level of programmer involvement, which has an impact on the amount of effort needed to re-factor the application code, as well as on the portability of the application code to different platforms.

Chapter 4

Rolex: Resilience-Oriented Language Extensions

4.1 Overview

In today's high-performance computing (HPC) systems, we enjoy a model of execution in which the application presumes correct behavior by the underlying fabric of hardware and system software, i.e., the execution environment. Errors are typically masked by hardware-based mechanisms, and error events that cannot be handled by the system layers usually cause a fatal system crash, which is catastrophic for the application processes. For the projected fault rates in future exascale-class HPC systems, these mechanisms will prove inadequate, leading to frequent application failures or incorrect results.

However, not all faults and errors need to result in a catastrophic crash. Many of the scientific applications that run on these systems contain features that allow the effects of certain faults and errors to be managed at the application level through algorithmic methods. Since error information is rarely communicated to the upper layers, i.e., the applications and libraries, they are insufficiently equipped to handle the errors. Programmers of scientific applications, through their domain expertise and familiarity with the application codes, gained through code optimization efforts, are usually well positioned to understand such fault-resilience features. However, they lack convenient mechanisms to express such knowledge to the system.

In this chapter we investigate whether a combination of simple language-level extensions, in concert with a compiler infrastructure and a runtime inference framework, can enhance the ability of HPC applications to manage the effects of faults and errors in their state. We propose Rolex, a set of Resilience-Oriented Language Extensions that allow HPC programmers to specify their knowledge of the fault-tolerance features of the program code and their expectations of application outcomes. Using this application-level knowledge, the execution environment can reason about the significance of errors to application correctness. We define the syntax of the resilience-oriented language extensions, describe their fault-resilience semantics, and describe their integration with the compiler infrastructure and runtime system. We also describe our experience of applying Rolex to several common HPC application codes and evaluate the application resilience using accelerated fault injection experiments.

Figure 4.1: Themes of programmer knowledge to enhance application resilience

The remainder of this chapter is organized as follows: Section 4.2 explains the basis of our approach, which manages the application's fault resilience using programmer knowledge captured through simple language extensions. Section 4.3 describes the design goals and philosophies behind Rolex, and Section 4.4 presents the syntax and semantics. In Section 4.4 we also provide several examples that demonstrate the viability of applying these language extensions in the context of real HPC applications. Section 4.5 elaborates on the role of the compiler and runtime inference system.
Section 4.6 presents the results of our fault injection experiments and also studies the impact on application performance.

4.2 Leveraging Programmer Knowledge

The HPC workload consists of scientific computations, many of which are naturally tolerant to data errors. Their algorithmic behavior might simply filter the occasional incorrect value, as is the case with many numerical iterative algorithms, or they might rely on pseudorandom processes, as is the case with Monte Carlo techniques. Several applications that use numerical analysis methods can tolerate limited loss of floating point precision. In certain applications, the impact of errors in the data or computation can even be trivially healed through simple algorithmic methods. For example, parity and checksums can be applied to specific data structures or procedure executions to detect the presence of data corruptions within the application's address space. However, part of the variable state, especially that which affects program control flow and pointer arithmetic, is very sensitive to errors. Therefore, for certain parts of the program state, the notion of correctness may be defined within the bounds of a certain rounding error, while for others it may require precise, bit-reproducible correctness [29]. However, such application-level fault-tolerance knowledge can rarely be exploited by the execution environment, since there are no convenient mechanisms available to capture it. Also, current execution environments are not designed to accept guidance from the application layer while managing error states.

HPC application programmers are well positioned to understand this fault-tolerance knowledge because they tend to be experts in their respective scientific domains. They also tend to spend large amounts of time and effort optimizing and tuning their code to achieve better performance and therefore understand the nuances of their program code structure. We believe that, given appropriate interfaces to express their fault-tolerance knowledge, programmers can contribute to enhancing the execution environment's management of application resilience.

Programming constructs can support the fault-tolerance capabilities, namely error detection, containment and recovery (previously discussed in Section 2.5), at the application level. Such mechanisms can utilize the fault-tolerance features of the algorithm and/or the program code structure in an effort to prevent application failure for every error instance in the system. The programming model features can also ensure that errors do not impact other application data structures, and they can provide containment for computational errors at the application level. Certain parts of the program code may be re-executed and/or specific variable state may be re-initialized through the creation of well-defined code regions and data scoping constructs. Broadly, programmers can express knowledge on three major themes (illustrated in Figure 4.1):

• Tolerance: A programmer may choose to tolerate limited loss of floating point precision for certain program values, or allow occasional perturbations of certain data values. The programmer may also be aware of regions of computation that employ iterative refinement, such that errors which cause anomalous intermediate results may be tolerated without affecting the correctness of the final application outcome.
• Robustness: Certain data structures and computations, notably those related to program control flow and pointer arithmetic, need bit-level correctness. The programmer may identify application-level constructs that require stronger checks for error detection, as well as masking mechanisms, to guarantee deterministic program behavior.

• Amelioration: A variety of algorithmic techniques exist that not only detect but also heal the effects of errors in data structures. Such techniques maintain redundant information, such as checksums, to recover erroneous values. They may also use value re-initialization to repair variable state. Certain applications even allow compensating erroneous values by interpolating neighboring values. The programmer may be able to provide the appropriate methods to ameliorate program state.

Programming model extensions may be designed to enable the execution environment to capture application-level features on each of these themes of knowledge. When such a programming API is supported by a compiler infrastructure and a runtime system, the execution environment is able to maintain a knowledge base of programmer-specified resilience features for the various program-level data structures and functions. Such an approach supports a fault-aware execution environment that can provide error-resilient operation for HPC application processes without compromising the application performance or the productivity of programmers.

4.3 Design of the Resilience-Oriented Language Extensions

The aim of the programming model based on resilience-oriented language extensions is to succinctly capture the fault-tolerance knowledge needed to equip HPC applications with capabilities to deal with specific error states. In this section we describe the design goals and provide a notational description of the syntactic structures of Rolex.

4.3.1 Goals for Resilience-Oriented Language Extensions

In designing the language extensions, we sought to capture each of the flavors of knowledge described in Section 4.2 and, in the process, also enable each of the aspects of fault management (detection, containment, recovery) described in Section 2.5. Broadly, our goals for the resilience-aware programming model extensions are:

1. It is our goal to retain the familiarity of current programming paradigms. The dominant choice of programming languages for HPC application programmers over the past few decades has been FORTRAN/C/C++. Therefore, we aim to adopt a simple C/C++ syntax for the resilience-oriented language constructs that allows embedding resilience capabilities within existing programming model features.

2. We seek to minimize the time and effort required by programmers to learn and adopt the language extensions; therefore, these resilience-oriented language extensions must provide a concise and elegant syntax and include a small set of new language keywords for expressing the resilience features.

3. We also seek a fair division of work between the language extensions and the compiler and runtime framework, such that the programmer does not need to be exposed to the complexity of the HPC execution environment, yet is provided with sufficient abstractions to concisely convey fault management knowledge related to application-level constructs.
4. Recognizing that HPC programmers are very reluctant to trade off performance, which is usually achieved by investing much time and effort in hand-tuning the code, we seek to ensure that the resilience-oriented language extensions and compiler transformations do not drastically affect the code structure.

5. As HPC systems grow increasingly large in pursuit of higher performance, they are becoming increasingly heterogeneous and topologically complex; therefore, they need to harness a variety of novel parallel programming frameworks. Yet applications seek to retain the well-understood foundation of the Message Passing Interface (MPI), as well as certain well-tuned productivity libraries such as BLAS and LAPACK written in C and FORTRAN. It is also our goal to ensure that the resilience-oriented language extensions integrate seamlessly with these language features and library frameworks.

4.3.2 Description of the Syntactic Structure of Rolex

The language extensions include a collection of features that extend the base language, as well as compiler directives and runtime library routines. The functionality provided by these extensions enables the execution environment to manage the application's error resilience. For the purpose of providing error resilience, the program state can be classified into three aspects [108]:

• The computational environment, which includes the data needed to perform the computation, i.e., the program code, environment variables, etc.

• The static data, which represents the data that is computed once in the initialization phase of the application and is unchanged thereafter.

• The dynamic data, which includes all the data whose value may change during the computation.

Rolex extends the C/C++ base language with constructs that provide application-level error detection, error containment and recovery for each of these aspects of the program state. These extensions fully comply with the syntactic structure of the base language grammar. Certain Rolex constructs serve as directives for the compiler to automatically generate code to manage resilience, whereas the Rolex routines support application resilience through the runtime environment.

Type Qualifiers

We provide type qualifiers which enable attaching a specific resilience attribute to functions, data variables and other entities. The programmer can specify, through explicit association, an error detection and/or tolerance feature for specific identifiers in the program code. Rolex introduces some new symbols and makes use of existing symbols, but guarantees that the extended grammar production rules are compliant with existing C/C++ implementations. Listing 4.1 shows the rules for declarations that include the resilience-oriented type qualifiers.

    declaration_specifiers : storage_class_specifier
                           | storage_class_specifier declaration_specifiers
                           | type_specifier
                           | type_specifier declaration_specifiers
                           | type_qualifier
                           | type_qualifier declaration_specifiers ';'

    storage_class_specifier : TYPEDEF | EXTERN | STATIC | AUTO | REGISTER ';'

    type_specifier : VOID | CHAR | SHORT | INT | LONG | FLOAT | DOUBLE
                   | SIGNED | UNSIGNED | struct/union_specifier
                   | enum_specifier | TYPE_NAME ';'

    type_qualifier : CONST | VOLATILE | resilience_type_qualifier ';'

    resilience_type_qualifier : TOLERANT
                              | TOLERANT '(' tolerance_limit ')'
                              | ROBUST '(' robust_strength ')'
                              | HEAL '(' function_declaration ')' ';'

    tolerance_limit : PRECISION '=' CONSTANT | MAXIMUS '=' CONSTANT

    robust_strength : DETECT | CORRECT

Listing 4.1: Rules for resilience type qualifiers
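To make the production rules concrete, the following declarations are well formed under this grammar; the identifiers and the repair routine are purely illustrative, and the semantics of each qualifier are discussed in Section 4.4.

    tolerant(PRECISION = 10) double residual_norm;     /* low-order mantissa bits may be lost      */
    tolerant(MAXIMUS = 4096) unsigned int bin_count;   /* value is known never to exceed 4096      */
    robust(CORRECT) int *col_index;                    /* pointer kept bit-exact via triplication  */
    heal(fix_matrix()) double *stiffness_matrix;       /* programmer-supplied repair routine       */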
Through these qualifiers, the programmer can specify how the program variables are dealt with when the associated variable is in an error state. The error detection and correction are handled through bit manipulation on the low-level representation of the variables.

Directives

The language extensions also support error-resilient execution of specific parts of the application program code. For the purpose of this description, we define a structured block as a C/C++ executable statement, which may be a compound statement, but which has a single point of entry at the top and a single point of exit at the bottom. A compound statement is enclosed within a pair of { and }. The point of entry cannot be the target of a branch and the point of exit cannot be a branch out. No branch is allowed from within the structured block, except for program exit. Instances of the structured block can be compound statements including iteration statements, selection statements, or try blocks.

The application programmer can impose rules for fault-tolerant execution of the structured block using directives. In C/C++, a #pragma directive specifies program behavior. Listing 4.2 shows the rules for the extensions to the base language grammar for the resilience-oriented directives. The redundancy directives enable error detection and/or correction for the computation contained in the structured block. The strength clause indicates whether dual or triple modular redundant execution of the structured code block should be applied. The recovery directives offer error containment, since any fault that is activated and leads to an error state during the execution of the structured block is not allowed to propagate outside the block. Error recovery is performed by rolling forward or rolling back the execution of the structured block. The roll-forward and roll-back semantics on the structured code blocks require explicit specification of the data scoping to comply with the C/C++ memory consistency model. The rules for the data management and scoping clauses are also shown in Listing 4.2. The data management attribute clauses permit the variable state to be restored when execution is rolled forward or rolled back. For the redundancy directives, the data management clauses ensure that there are no races on the shared data used by the redundant copies of the structured block.
    statement-list : statement
                   | resilience-directive
                   | statement-list statement
                   | statement-list resilience-directive

    statement : labeled_statement | compound_statement | expression_statement
              | selection_statement | iteration_statement | jump_statement
              | resilience-construct ';'

    resilience-construct : redundancy-construct | recovery-construct

    redundancy-construct : redundancy-directive structured-block

    recovery-construct : recovery-directive structured-block

    structured-block : statement

    recovery-directive : #pragma resilience recover-rollback recovery-data-clause(opt) new-line
                       | #pragma resilience recover-rollforward recovery-data-clause(opt) new-line

    redundancy-directive : #pragma resilience robust robust-strength redundancy-data-clause(opt) new-line

    robust_strength : DETECT | CORRECT

    recovery-data-clause : data-default-clause | data-private-clause | data-share-clause
                         | data-reinitialize-clause | data-ameliorate-clause

    redundancy-data-clause : data-default-clause | data-private-clause | data-share-clause
                           | data-compare-clause

    data-default-clause : default '(' shared ')' | default '(' none ')'
    data-private-clause : private '(' variable-list ')'
    data-share-clause : share '(' variable-list ')'
    data-reinitialize-clause : reinitialize '(' variable-list ')'
    data-ameliorate-clause : ameliorate '(' function_declaration ')'
    data-compare-clause : compare '(' variable-list ')'

Listing 4.2: Rules for resilience directives

    return_type var = resilience_libraryfunc_capability ( arguments );

Listing 4.3: Resilience-oriented runtime routines API format

Runtime Library Routines

Certain aspects of the resiliency of the execution environment can be controlled through runtime library routines. Also, some of the existing standard library calls may be extended to provide resilience capabilities. For example, the memory management library calls are equipped with error detection, correction and recovery capabilities on the allocated memory blocks. These routines are external C functions whose identifiers are prefixed with the resilience keyword. The routine identifier is suffixed with the fault management capability, as illustrated in Listing 4.3.

4.3.3 Rolex Keywords

In order to support the resilience semantics on the C/C++ constructs, we introduce a set of keywords that are distinct from the existing set of C/C++ reserved keywords. The Rolex directives and routines are identified by the resilience keyword. Additionally, the keywords tolerant, robust and heal are used as qualifiers in type declarations. The keywords detect and correct specify the strength of redundancy, i.e., whether dual or triple modular redundancy is required, for the type qualifiers as well as for the directives. For the Rolex directives, the keyword robust is used to specify redundant execution for the associated code block. The keywords recover-rollback and recover-rollforward are used to attach a recovery method to the structured code block following the directive. The keywords default, private and share express the scope of the data variables, whereas the keywords reinitialize and ameliorate are used for data management during the roll-forward or roll-back provided by the recovery-based directives. The redundancy directives use the detect and correct keywords to specify the strength of redundancy, and the compare keyword for a list of variables that are generated by the redundant copies in order to detect and possibly correct errors.

4.4 Rolex: Syntax and Semantics

This section describes the Rolex programming language extensions and their relationship to the runtime system.
It describes the lexical syntax (i.e., how real programs are built based on these extensions), the semantics (i.e., what each extension means), and how Rolex affects program structure. We also provide examples that demonstrate how the language extensions enable error resilience for scientific application codes. The extensions cover each of the previously described themes of knowledge, i.e., tolerance, robustness and amelioration.

4.4.1 Tolerance

The tolerance language extensions are used to specify data variables or code block executions that can ignore the presence of a bit corruption and continue execution with the confidence that the algorithm can absorb the error or mask it through localized recovery. These language extensions assume that error detection is provided by the hardware or system software, and that the error notification is communicated to the runtime system via an interrupt mechanism. The runtime reverse-maps the error's physical location to the application-level constructs. For errors that are detected and happen to map to locations which have been explicitly specified as tolerant using a Rolex language extension, the runtime system ignores the error notification and allows the application execution to continue. For instances of errors which map to locations on which tolerance is not specified, the runtime terminates the application execution, as is the standard behavior for unrecoverable errors.

Type Qualifiers

Syntax

The tolerant type qualifier can be applied to primitive as well as compound data structures. For floating point variables, the qualifier contains an additional specifier for precision. These qualifiers can be applied to global variables and local automatic variables, and can cover static as well as dynamic program state. The syntax of the qualifier for floating point variable declarations is:

    tolerant(PRECISION=...) float low_precision_32;
    tolerant(PRECISION=...) double low_precision_64;

For integer values, the tolerant type qualifier has an additional specifier for the maximum value. The syntax of the tolerant qualifier for integer type declarations is:

    tolerant unsigned int rgb[X_RES][Y_RES];
    tolerant (MAXIMUS = 1023) unsigned int counter;

Semantics

When the runtime is informed of the presence of an error that maps to a tolerant-qualified data variable, the runtime ignores the error notification and allows the application execution to continue. The purpose of the additional PRECISION and MAXIMUS specifiers is to enable the runtime to manipulate the bit representation of the data values in order to mask the presence of the error when possible. The standard IEEE floating point representation is (-1)^sign * significand * base^exponent (as shown in Figure 4.2 for the 32-bit floating point representation). The constant PRECISION value specifies the minimum floating point precision that the programmer expects, i.e., it indicates the amount of precision loss the programmer is willing to tolerate. Bit perturbation errors on the sign and exponent bits fundamentally alter the variable value, and the application is usually intolerant to such errors (shown in green in Figure 4.2). However, bit perturbations in the lower significand/mantissa bits may be ignored by the runtime and result only in a truncation error in the value of the floating point variable (shown in grey in Figure 4.2).
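As an illustration of the kind of bit-level reasoning the runtime applies, the helpers below (written for this discussion; they are not part of the Rolex runtime interface) decide whether a reported single-bit upset can be ignored for a tolerant-qualified double, and mask the unused upper bits of a MAXIMUS-qualified unsigned integer.

    #include <math.h>
    #include <stdint.h>

    /* A 64-bit IEEE 754 double has 52 mantissa bits (0-51), 11 exponent bits (52-62)
     * and a sign bit (63). Roughly 3.32 mantissa bits are needed per decimal digit,
     * so with tolerant(PRECISION = digits) the lowest (52 - 3.32*digits) bits are
     * considered expendable; sign and exponent bits never are. */
    static int double_bit_is_tolerant(int bit, int digits)
    {
        int needed = (int)ceil(digits * 3.3219280948873623);
        if (needed > 52) needed = 52;
        return bit < 52 - needed;
    }

    /* For tolerant(MAXIMUS = max) unsigned integers, any set bit above the highest
     * bit required to represent 'max' is spurious and can simply be cleared. */
    static uint32_t mask_to_maximus(uint32_t value, uint32_t max)
    {
        uint32_t mask = 1;
        while (mask < max) mask = (mask << 1) | 1u;
        return value & mask;
    }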
Figure 4.2: IEEE 754 floating point representation

Figure 4.3: Unsigned integer (32-bit) representation

Based on the maximum value contained by an integer variable, only certain lower significant bits in the bit representation are intolerant, i.e., these bits cannot accept bit perturbations without altering the value of the variable (shown in green in Figure 4.3). The upper significant bits are unused and are meant to always remain '0' (for unsigned integers in the binary representation). These bits can be treated as error tolerant, since perturbations on them can be masked by simply resetting them and allowing the application process to resume execution.

Examples

Scientific modeling entails the representation of continuous problems in terms of finite precision values, which incurs some discretization error. Certain data structures in these applications may accept bit perturbations that result in round-off errors without affecting the validity of the simulation. These applications consist of numerical analysis algorithms which often yield approximate solutions. For example, iterative methods for solving systems of linear equations, such as the conjugate gradient method and the generalized minimal residual method (GMRES), progressively improve an initial approximate solution and terminate only when the solution error is below a certain error norm. Direct methods such as Gaussian elimination and the QR factorization method terminate in a finite number of steps, but still yield an approximate solution. For applications that employ these algorithms, bit corruption errors that only cause a loss of floating point precision in the intermediate solution state barely affect the correctness of the final solution. The tolerant(PRECISION=) qualifier may be used for the declaration of the data structures corresponding to the solution vectors in these application codes.

Visualization applications contain large data structures, such as the frame buffer, that can tolerate arbitrary bit flips without the effect being noticed by the human user. Although these structures may be of integer data type, the graphics rendering pipeline can account for the incorrect pixel attributes. Even if the error were to propagate into the final rendered scene, the anomaly is often imperceptible due to the limitations of human visual perception. Such data structures may be declared with the tolerant type qualifier.

The MAXIMUS construct is useful for static state in the application address space, i.e., variables that are initialized and then hold constant integer values for the lifetime of the application. Certain variables in the dynamic state, such as counter variables, matrix dimension variables, etc., whose upper-bound data value is known a priori by the application programmer, may also be qualified using the tolerant(MAXIMUS=) type qualifier.

Directives

Syntax

The tolerance directives provide a limited, localized recovery capability from errors in the computation for programmer-defined code regions. When a detected error maps to code sections, i.e., the instruction memory of the application address space, or to the variables manipulated by the code region, the tolerance directive offers roll-back and roll-forward capabilities for the affected structured code block.
The syntax of the tolerance roll-forward and roll-back directives is:

    #pragma resilience recover-rollback share( variable_list ) private( variable_list )
    {
        /* code block */
    }

    #pragma resilience recover-rollforward share( variable_list ) private( variable_list )
    {
        /* code block */
    }

The only constraint imposed by the roll-forward/roll-back capabilities is that the new program state must be consistent, i.e., the variable state upon rollback must be the same as that during the initial entry. Therefore, we provide an optional share data scoping clause that lists the variables that are initialized prior to the entry of the structured block and are operated on by the code statements within the block. The variables specified within the private clause are not pre-initialized.

Semantics

When the runtime is informed of the presence of an error that maps to the instruction memory of the tolerant structured code block, or to one of the data structure variables specified in the data clauses, the execution is rolled forward or rolled back depending on the directive used. When the execution is rolled back, the structured block is re-entered. When the execution is rolled forward, the remaining code block is skipped and execution resumes at the end of the code block. If the program was in the middle of the structured block and was writing a specific data structure, then initiating a roll-forward or roll-back immediately upon detection of the error would leave the data structure in an inconsistent state. Therefore, prior to the original entry into the structured code block, the state of the variables specified in the share clause is saved; upon roll-forward or roll-back recovery, the variable state is restored to the previously preserved values. The variables in the private clause are not restored and are treated much like local automatic variables declared inside a function.

The tolerance directives provide the application with error containment capabilities by limiting the scope of the error to the structured block and preventing its propagation to the rest of the execution environment. The roll-back directive also provides limited error recovery through retry semantics for the structured block. The roll-forward directive provides compensation-based recovery by restoring part of the variable state.

Examples

The fault-tolerant generalized minimal residual method (FT-GMRES) algorithm [109] allows selectively reliable computation. The algorithm uses inner-outer iterations, where the inner solver step preconditions the outer iteration. The inner solver step may be treated as an unreliable phase, since it may return an incorrect solution without affecting the outer solver step. Therefore, the inner solver may be included in a block following the #pragma resilience recover-rollforward directive. The outer iteration requires reliable computation and typically scans for invalid values in the returned solution vector. This code region may be enclosed within the code block following the #pragma resilience recover-rollback directive.

Since neutron transport (NT) simulations use the Monte Carlo method, we may leverage their stochastic nature along with the fact that the simulation of every particle is independent. The application code that supports the creation and simulation of individual particles can be included in the structured block following the #pragma resilience recover-rollforward directive. This allows the simulation to selectively discard the particles whose computation experienced errors. Also, the local data pertaining to individual particles does not need to be preserved. A sketch of this usage pattern is shown below.
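The following schematic sketch applies the directive to a Monte Carlo particle loop; particle_t, init_particle(), transport_step() and the tally array are placeholders standing in for the application's own data structures.

    double tally[NUM_BINS];                     /* global scores; restored on recovery */

    for (long p = 0; p < num_particles; p++) {
        particle_t part;                        /* per-particle scratch state */
        #pragma resilience recover-rollforward share( tally ) private( part )
        {
            init_particle(&part, seed + p);     /* sample a source particle            */
            while (particle_alive(&part))
                transport_step(&part, tally);   /* history abandoned if an error hits  */
        }
        /* On an error inside the block, execution resumes here: the affected
         * particle simply contributes nothing to the tallies. */
    }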
Runtime Library Routines

Syntax

The tolerant runtime routine for memory allocation extends the standard functionality provided by malloc() by allowing the errors detected on the allocated memory region on the heap to be ignored.

    float* intermediate_sol_array = (float*) resilience_malloc_tolerant( N * sizeof(float), NULL );

    float* molecule_position = (float*) resilience_malloc_tolerant( N * sizeof(float), (rolex_precision)(6) );  /* PRECISION = 6 */

    unsigned int* true_color_pixel_buffer = (unsigned int*) resilience_malloc_tolerant( N * N * sizeof(unsigned int), (rolex_precision)(16777216) );  /* MAXIMUS = 16,777,216 */

The routine accepts an additional parameter of type rolex_precision to specify the MAXIMUS or PRECISION for the individual primitive types (when the routine is used to allocate arrays of primitive integer or floating point type). This argument supports tolerance, i.e., it allows the application to ignore the error and resume execution as long as the precision and maximum values requested by the application programmer on the individual primitive data elements in the allocated memory block are not perturbed.

Semantics

The behavior of the tolerant version of malloc is identical to the standard malloc: it allocates a block of memory of size (in bytes) equal to the function argument and returns a void* pointer to the beginning of the block. The address bounds of the newly allocated block of memory are registered with the runtime system. Since such error-tolerant memory was explicitly requested, the runtime ignores notifications of any errors detected on this memory block and allows the application execution to resume. For compound data structures composed of floating point or integer primitive types, the precision argument allows error tolerance to be specified on the bit representation of each individual primitive type, similar to the specification of MAXIMUS and PRECISION in the type qualifiers.

Examples

The algebraic multigrid method, which is used to solve various elliptic partial differential equations, is based on a hierarchy of discretization steps that involve smoothing, restriction and interpolation stages. A single multigrid "V-cycle" repeats the relaxation and restriction steps m - 1 times through m coarsening levels. The multilevel structure affords fault resilience, since the occurrence of errors increases the impact of high-frequency components of the residual, which may be corrected during the relaxation phase. The data structures that contain the intermediate solution of the iterative method at level m, the vector u_m, as well as the residual for the intermediate solution at level m, r_m, may be allocated using the resilience_malloc_tolerant() runtime routine. This permits memory errors on the intermediate solution state to be ignored by the runtime, since the relaxation on the coarser-level grids corrects the errors. Often the algorithm converges in the same number of V-cycles as a fault-free execution; in specific cases, however, convergence to a correct solution is accomplished at the cost of a few additional V-cycles [110].

Molecular dynamics (MD) simulations can maintain numerical stability with limited loss of floating point precision for various constant-energy and constant-temperature simulations.
The deviations in the force calculations are often small enough that the particle trajectories are nearly identical, in terms of numerical stability, to full-precision calculations. In large systems, the loss of precision in the lower-significand floating point bits results in a negligible difference in the coordinates of the simulation over millions of time steps [111]. The programmer can allocate these data structures using the resilience_malloc_tolerant() runtime routine together with the PRECISION argument.

In graphical applications that use a specific color depth, the allocation of memory for large data structures such as the frame buffers can use the MAXIMUS argument to specify error tolerance. For example, true color supports 24 bits for representing and storing the RGB color of each pixel. When represented by a 32-bit integer, each pixel representation contains at least 25% error-tolerant bits.

4.4.2 Robustness

The robustness language extensions are used to specify data variables or code blocks that require error detection and correction at the application level. In general, application code sections (instruction memory), pointer variables, and array index references require bit-precise correctness. Variables that affect control flow decisions also require bit-level correctness. Otherwise, we cannot make a deterministic assertion about the correctness of the application outcome, even if it runs to completion without exceptions or abnormal termination. The robustness of these aspects of the program state may be guaranteed by the use of redundancy. This entails replicating part of the variable state, or specific portions of the program code execution, or at times both. The replicated part of the program state is compared to check for the presence of errors in the application's address space, or to filter errors through majority voting. In most cases, the use of redundant data or computation burdens the application performance, since it places additional demands on the compute and/or memory resources. Using the language extensions that guarantee robustness, the redundancy is selectively applied only to the sensitive data variables and computations whose correctness is critical to producing a correct application outcome.

Type Qualifiers

Syntax

The robust type qualifier may be applied to declarations of primitive as well as compound data structures. The syntax for the robust type qualifier is:

    robust (CORRECT) int* csr_matrix[row_offsets];
    robust (DETECT) int* graph_edge_list[N];

Semantics

The type qualifier serves as a directive to the compiler, which performs a source-to-source translation to duplicate or triplicate (based on the strength specified in the qualifier) the variable declaration. For pointer variables, this amounts to creating aliases to the object being referenced. The compiler also duplicates/triplicates the statements in the program source that operate on the robust-qualified variables, and inserts statements that compare the redundant variable values. Through statement-level DMR, the program detects errors on the robust-annotated variable values and informs the runtime in case of a mismatch. For correction, statement-level TMR uses majority voting to provide the correct value to subsequent program statements.

Examples

Scientific applications that employ data structures which make heavy use of pointer references are known to be highly sensitive to memory failures [112]. Even single-bit upsets in pointer variables may lead to invalid references, causing segmentation faults.
Using the type qualifier robust on all pointer variable declarations guarantees that potential error states (due to invalid pointer references) arising from bit corruptions are detected or even corrected. Additionally, application-level variables that affect the program control flow, such as loop condition and if-else condition variables, also require robust bit correctness. Such variable state is particularly sensitive to perturbations, since its correctness affects program control flow. These variable declarations may also be annotated with the robust qualifier.

Directives

Syntax

The robustness directive provides application-level detection/correction for specific regions of computation, whose scope is defined by the structured code block following the directive. The syntax for the directives is:

    #pragma resilience robust detect share( variable_list ) private( variable_list ) compare( variable_list )
    {
        /* code block */
    }

    #pragma resilience robust correct share( variable_list ) private( variable_list ) compare( variable_list )
    {
        /* code block */
    }

The strength attribute specifies whether detection or correction is required, thus implicitly specifying the number of redundant copies of the structured block. The directive uses the data management clauses share and private to specify the data-sharing attributes for the variables listed in the respective clauses. The compare clause specifies the list of variables produced by the structured blocks that need to be compared or majority-voted on to detect or correct an error in the computation.

Semantics

When the compiler encounters the robust directive, it outlines the application code contained in the structured code block following the directive. It inserts statements that guarantee the redundant execution of the outlined code block. The strength clause directs the compiler on the number of copies that need to be executed. The data variables appearing in the data-sharing clauses are either private, i.e., each redundant code block instance owns a separate copy of the variable, or shared, i.e., all the redundant code block instances access the same instance of the data and the programmer needs to guarantee synchronization. The values produced by the redundant code instances (the variables specified in the compare clause) are compared.

Examples

For various scientific applications, the statements that perform address calculations or those that manipulate control variables can be enclosed within the structured code block following the #pragma resilience robust directive.

For molecular dynamics (MD) simulations, the correctness of the calculation of the forces between particles is critical in order to maintain numerical stability and for the calculation of the positions and velocities of the atoms at subsequent time steps. The MD codes perform pairwise force calculations between particles at every time step. The force calculation step may be contained in a structured block following the robust-detect directive to leverage the anti-symmetric property of the forces (for particles i and j, F_ij = -F_ji). This allows errors in the force calculation [113] to be detected.

We can also make the case for the use of the robust-detect or robust-correct directives in algorithms that require selective reliability. The self-stabilizing conjugate gradient method [114] expects reliable computation only for the iteration step that promises to restore the stability of the algorithm. The FT-GMRES [109] partitions the computation into reliable and unreliable phases. For such algorithms, the application code that requires reliable execution may be enclosed in the #pragma resilience robust directive. A sketch of this usage for the MD force calculation is shown below.
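In the following sketch the robust-detect directive is applied to the pairwise force loop; pos, pair_force() and the accumulator arrays are placeholders, and the compare clause names the values that the redundant copies must agree on.

    double fx[N] = {0}, fy[N] = {0}, fz[N] = {0};   /* force accumulators */

    #pragma resilience robust detect share( pos ) compare( fx, fy, fz )
    {
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++) {
                double f[3];
                pair_force(pos, i, j, f);                    /* compute F_ij */
                fx[i] += f[0]; fy[i] += f[1]; fz[i] += f[2];
                fx[j] -= f[0]; fy[j] -= f[1]; fz[j] -= f[2]; /* F_ji = -F_ij */
            }
    }
    /* A mismatch between the redundant copies of fx/fy/fz is reported to the
     * runtime; with 'correct' in place of 'detect', a third copy enables voting. */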
Runtime Library Routines

Syntax

The robust version of the memory allocation routine supports error detection and/or correction for dynamically allocated memory on the heap section of the application address space. The routine prototypes are:

    float* problem_matrix = (float*) resilience_malloc_robust( N * sizeof(float), STRENGTH );

    void resilience_validate_detect( void* problem_matrix );
    void resilience_validate_correct( void* problem_matrix );

Semantics

The resilience_malloc_robust() routine allocates redundant copies of the memory requested by the programmer. The STRENGTH macro specifies whether the memory needs to be duplicated (for detection) or triplicated (for correction through comparison and majority voting). The routines resilience_validate_detect() and resilience_validate_correct() may be inserted by the programmer to detect and/or correct any errors. The pointers to the replicated memory are also replicated at the source level, as are any program statements that manipulate the memory.

Examples

The resilience_malloc_robust() routine is not suitable for heavyweight data structures such as those representing large sparse problems; such structures are often compressed for savings in memory requirements and for efficient memory accesses. Structured formats such as dictionary of keys (DOK), list of lists (LIL), coordinate list (COO), compressed sparse row (CSR) and compressed sparse column (CSC) tend to use complementary structures that refer to the non-zero elements (NNZ) of the sparse matrix. For example, in the CSR format, the NNZ in a value array are accessed using a column index array and a row pointer array. The resilience_malloc_robust() routine can be used to allocate memory for such structures that exclusively contain address references and for which bit-level correctness is critical.

Large-scale, data-driven analysis applications that solve discrete math problems are becoming increasingly important parts of the supercomputing workload. These applications tend to use the graph abstraction for data analysis, search and knowledge discovery. With graph nodes as objects and edges as pointers, large, sparse graph structures contain large numbers of pointer references. These are sensitive to bit corruptions since they represent the graph edges. Irrespective of the graph representation (adjacency lists, sparse adjacency matrix), the application-level detection and correction provided by resilience_malloc_robust() when allocating these structures is important for robust traversal, search, sort, clustering, and other graph algorithms.

4.4.3 Amelioration

The amelioration language extensions are used to specify how data variables or code block executions may be repaired during program execution using knowledge provided by the programmer. The amelioration knowledge is based on algorithmic features of the application that allow mitigating the effects of errors on the program state. Such methods may compensate for the presence of errors by maintaining redundant information about the variables, or they may reconstruct incorrect values by interpolating from neighboring values that are known to be correct. Certain amelioration approaches cause limited information loss, which may be acceptable to the user, but they seek to keep the application running toward a solution by avoiding catastrophic failure of the application. Also, these lossy amelioration approaches do not burden the application performance when there are no errors, unlike system-level checkpoint and roll-back recovery techniques. For these amelioration methods, the error detection may be algorithm-based, included within the method provided by the application programmer, or it may be hardware-based.
Also, these lossy amelioration approaches do not burden the application performance when there are no errors, unlike system-level checkpoint and roll-back recovery techniques. For these amelioration methods, 61 the error detection may be algorithm-based that is included within the applica- tion programmer provided method or may be hardware-based. For the latter, the notification needs to be communicated to the runtime system via an interrupt. Type Qualifiers Syntax The fault amelioration type qualifier heal includes a construct to specify a rou- tine which may be invoked to repair anomalies in the annotated data structure. The heal may be applied to declarations of primitive as well as compound data structures. The syntax for the qualifiers is: heal (recovery_func()) float* matrix_A[N][N]; Semantics Thereferencetotherecoveryfunctionspecifiedinthehealqualifierfortheidentifier in the type declaration is maintained by the runtime system. When the runtime receives an error notification, it invokes an event handler function to which it passes the recovery function pointer. The recovery function seeks to repair the data structure before resuming the application execution. Examples The use of checksum schemes for linear algebra methods encodes the applica- tion data structures and the algorithm uses the redundant information to detect and correct errors. For example, in a matrix-matrix multiplication, the redundant information is in the form of checksum vectors for the operand and result matrices. Invocationoftherecovery_func()candetecterrorsonamatrixbycomputingthe 62 sum (S1) of the elements in each row and comparing it to the corresponding check- sum (S2) for the column. The error location is identified by the recovery_func() as the intersection of (S1∩ S2), while the corrupted element E is repaired by E = E’ +(S2 - S1) where E’ is the element in error state. This technique is useful for a variety of matrix-based operations besides matrix multiplication including Cholesky factorization, LU factorization, and QR factorization. Directives Syntax The amelioration-based directives provide limited localized recovery for regions of computation that are contained in the structured block following the directive and the associated data structures. The syntax for the amelioration directive is: #pragma resilience recover-rollback reinitialize ( variable_list ) { /* code block * } #pragma resilience recover-rollforward reinitialize ( variable_list ) { /* code block * } #pragma resilience recover-rollback ameliorate ( recovery_func() ) { /* code block */ } #pragma resilience recover-rollforward ameliorate ( recovery_func() ) { /* code block */ } The list in the reinitialize and ameliorate clauses may include variable identifiers, an expression list, or a recovery_func() provided by the application programmer. Semantics When the error notification to the runtime system finds that the error location is mapped to the program code contained in the structured block, or on the data variables manipulated by the statements in the block, the runtime initiates the 63 recovery. This entails restoring the variable state for the variable identifiers spec- ified in the reinitialize clause, or invocation of the recovery function specified in the ameliorate clause through an event handler. Then, the runtime affects a roll-back (re-entry of the code block) or a roll-forward (resume execution at the end of the code block). 
Examples

The roll-back/roll-forward directives with the reinitialize clause may be applied to iterative solver codes for linear equations. The solver iterations may be enclosed in the #pragma resilience recover-rollback or #pragma resilience recover-rollforward code blocks, and the problem matrix can be included in the reinitialize clause. This enables the incorrect iterations to be discarded and keeps the solver on the path to correct completion. For problems in which the recovery of state needs to be more nuanced than simply reinitializing the variable list specified in the data scoping clause, the rollforward/rollback directive can use the ameliorate clause along with a recovery function. For example, the intermediate solution in Krylov subspace solvers may be recovered using interpolation of the error-free values. The interpolated solution is used as a new initial guess before resuming the Krylov iterations. For recovery based on linear interpolation, the least-squares interpolation has been demonstrated to be effective while maintaining the monotonic decrease in the residual norm [115]. These interpolation methods may be included in the recovery_func() while the Krylov iterations are part of the structured block of the recover-rollback ameliorate or recover-rollforward ameliorate #pragma directive. For various linear algebra methods, including matrix-vector multiplication and conjugate gradient solvers, partial recomputation methods [116] enable recovery of the specific part of the program output that is in error state. Such methods may be included in the recovery_func() in the ameliorate clause.

For the self-stabilizing version of the conjugate gradient method (SS-CG) [114], the use of a periodic correction step ensures that the algorithm remains in a valid state, i.e., converges to a correct solution in a finite number of iterations despite the presence of faults in the program state. The CG iterations may be included in a structured block following the #pragma resilience recover-rollforward directive, while the correction step that restores the stability of the algorithm may be part of the function specified in the ameliorate clause.

Runtime Library Routines

Syntax

float* problem_matrix = (float*) resilience_malloc_repairable ( N * sizeof(float), checksum_func_pointer );
void resilience_ameliorate_heal ( void* problem_matrix );

The resilience_malloc_repairable() routine accepts a size argument and a pointer reference to a user-defined recovery function. The amelioration routines are often expensive operations; therefore, to allow the frequency of their invocation to be controlled by the programmer, the runtime library routine resilience_ameliorate_heal() is provided.

Semantics

When an error is detected on the memory allocated on the heap using this routine, the recovery function is invoked through an event handler routine. When the recovery function is able to heal the memory block, the runtime allows the application execution to resume. In case the recovery function is unable to correct the error, the runtime gracefully terminates the application process. The runtime library routine resilience_ameliorate_heal() invokes the recovery function.

Examples

For matrix-based problems, such as matrix multiplication, LU factorization, Cholesky factorization, etc., the matrix data structures, when allocated using resilience_malloc_repairable(), may provide checksum-based detection/recovery functions to the routine.
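A minimal sketch of this usage pattern is shown below. The routine dgemm_checksum_recovery() is an assumed user-defined function that recomputes the row and column checksums and repairs a corrupted element; it is not part of the Rolex library.

/* Assumed user-defined recovery routine for an operand matrix. */
void dgemm_checksum_recovery( void* matrix );

float* A = (float*) resilience_malloc_repairable( N * N * sizeof(float),
                                                  dgemm_checksum_recovery );

/* ... computation phases that read A ... */

/* Programmer-controlled repair point between phases, since the checksum
   validation and repair may be expensive. */
resilience_ameliorate_heal( A );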
For sparse matrix based problems, low overhead error detection and correction is possible by leveraging the structural properties of the matrix (diagonal, banded diagonal, block diagonal) using techniques such as approximate random (AR) checking and approximate clustered (AC) checking [117]. Such methods may be associated with the memory allocated for the matrix data structures using resilience_malloc_repairable().

Various methods for ameliorating the application data structures in the Hartree-Fock algorithm involve the use of heuristic application knowledge to develop bounds for the data values [118]. In certain cases, the values in error state may be ameliorated by replacing them with reasonable values within the bounds of the user's expectation. For orthonormalization, density matrix, matrix exponential and orbital transformation structures, exact bounds conditions are available for recovery from the impact of perturbations. Even in cases where sharp bounds are not known, such as the element values of the Fock matrix, a heuristic bound may be defined. Such heuristics are developed based on the insight that the Fock matrix elements do not change dramatically from one iteration to the next. Therefore, values that move out of bound may be truncated to the boundary values. The value amelioration leverages the self-corrective nature of the iterative algorithms to account for any residual errors caused by the interpolation. The recovery function, whose reference is passed to the resilience_malloc_repairable() library routine, may contain such application-specific heuristics to define the bounds for the element values of the data structure.

Figure 4.4: Compiler infrastructure for Rolex

4.5 Compiler and Runtime Support

4.5.1 Compiler Infrastructure

We envisage the role of the compiler as that of a key intermediary that allows fully propagating the fault-resilience knowledge expressed by the programmer to the generated target code as well as to the execution environment through the runtime system. We have developed a compiler front-end that parses the qualifiers and directives to generate code that is equipped with the error detection, containment and correction capabilities specified by the Rolex annotations. The Rolex front-end parses the resilience knowledge into a profile file which is used by the runtime system. The Rolex front-end performs source-to-source code transformations that permit the application to manage error states during execution in collaboration with the runtime system. Since these are source-level transformations, the generated application program code uses only base language (C/C++) constructs and calls to the runtime library. A native C/C++ compiler may still be used to generate code for the target platform. The overview of the compilation process is illustrated in Figure 4.4. The front-end source-to-source translators are built using the ROSE compiler infrastructure [119].

Since HPC compilers need to concern themselves with extracting inherent parallelism in the application code as well as guaranteeing portability, a two-stage compilation process enables incorporating the error-resilience oriented transformations in the front-end while leveraging standard C/C++ compiler infrastructures to generate the target platform code. Many implementations available from HPC vendors produce highly performant and portable executables.
Additionally, this modular approach permits bypassing the front-end compilation phase altogether by ignoring all the resilience-oriented annotations inserted by the application programmer.

For the type qualifiers, the front-end compiler parses all the declarations in the program code that have been annotated with the Rolex qualifiers. For the tolerant qualifiers, the detection and correction semantics are based on manipulating the bit-level representations of the data variables. For data structures with these qualifiers, the compiler produces bit masks for the corresponding data types which are included in the resilience profile file. For the integer type tolerant<MAX_VALUE=...> qualifier, it generates a detection bit mask whose unused upper significant bits are all 1's. Performing a bitwise AND operation reveals whether any of the upper significant bits were perturbed. The front-end phase also generates correction masks to reset any perturbation among the upper significant bits of integer representations. Similarly, for the floating point tolerant<PRECISION=...> annotation, a round-off mask is created to reset the bits of the mantissa.

For the robust type qualifiers, the Rolex front-end duplicates or triplicates the declarations of these variables, based on the choice of the strength clause (DETECT/CORRECT). The compiler front-end pass also traverses the uniform abstract syntax tree (AST) to discover the statements that perform operations on the robust qualified variables and inserts identical statements that operate on the redundant copies, as well as statements that compare the redundant variable values. The recovery function specified in the heal type qualifier is registered as a callback handler function to be invoked when the runtime system signals the application to initiate the amelioration for the data structure. The handler registration statements are inserted by the Rolex front-end compiler.

The front-end compiler pass processes the Rolex directives and uses them to create computational blocks for which the error detection, containment and correction behavior is explicitly defined. The front-end pass outlines the code in a structured block that follows the Rolex directive. This entails extraction of the code segment, i.e., the statement list in the structured block, into a new function and replacement of the original code segment with a call to the outlined function [120]. The front-end pass also inserts code to effect the resilience behavior by associating runtime library routines and internal control variables (ICV) with the calls to the outlined function. The ICVs affect the application program behavior but are manipulated by the runtime system. The ICVs are given values at various times during the execution of the program, and therefore their values control when and how the outlined functions are executed. ICVs are initialized by the runtime, and the application is signaled when their value is modified.

For the tolerant directives, the rollback and rollforward semantics are managed by Rolex library routine calls which test the values of ICVs. The program context is preserved prior to the call to the outlined function; this is also handled by runtime library routines. The compiler inserts calls to these routines, including the list of variables in the share clause as arguments. The amelioration directives #pragma recover-rollback and rollforward are managed in a similar manner. The compiler, however, also inserts statements for the expression list in the reinitialize clause.
For the ameliorate clause, it inserts a call to the corresponding recovery function. For the #pragma robust directives, the compiler duplicates or triplicates the call to the outlined function based on the STRENGTH clause. The compiler also duplicates the declarations of the data variables in the share and compare clauses so that each redundant execution operates on an independent copy of the data. The compiler also inserts statements that compare the data values of the variables in the compare clause and report any mismatch to the runtime system.

4.5.2 Runtime Inference System

In order to support a resilient execution environment, the role of the runtime system is to, whenever possible, manage the outcome of the error states in the application process and seek to prevent catastrophic process failure. The runtime system maintains a resilience knowledge base, called the Dynamic Resilience Map (DRM). It contains the list of Rolex annotated data structures, their address offsets in the address space and their error-management strategies. The rules for error detection, containment and recovery strategies are those inferred from the Rolex annotations in the program source and parsed by the compiler into the profile file. These are populated into the DRM at the commencement of the application process execution. Entries are also dynamically added and removed through the runtime routines.

The runtime also provides an interface to the compiler front-end that includes routines which are visible only to the compiler framework. These routines are associated with the outlined structured blocks, but they may not be invoked through simple function calls in the user code. The runtime library includes routines for initialization (__rolex_initialize()) and termination (__rolex_finalize()) that allocate and populate the DRM and subsequently manage the shutdown and clean-up of the runtime system. In order to support the #pragma resilience recover-rollback and recover-rollforward directives, the runtime interface includes routines that preserve information about the program's current state, environment and point in execution, as well as routines that allow restoring the program state to the same point in execution (__rolex_preserve_state(), __rolex_jmp_fwd() and __rolex_jmp_back()).

The runtime interface also includes routines that support data scoping. Such routines allow preserving and restoring the variable state by creating a checkpoint copy of all the variable state listed in the share clause of these directives (__rolex_checkpoint()) and restoring it (__rolex_restore()) when a rollback/roll-forward is initiated. It also provides routines to copy private data into the outlined function. For the robust directives, it also supports a routine that compares values from the redundant executions of the outlined function (__rolex_compare()). The dynamic memory allocation routines for tolerance, robustness and amelioration are also supported by the runtime system. These library routines not only allocate the requested memory, but also add the offset address and size into the DRM along with the appropriate method for detection, recovery or bit manipulations for the constituent primitive data types.

When the runtime is notified of the presence of an error state in the application address space, it queries the DRM to find the specific application-level construct that is in error state. The DRM contains the error management actions for each of the Rolex annotated application constructs.
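The sketch below suggests what a DRM entry and its lookup might look like; the structure fields, type names and lookup routine are illustrative assumptions rather than the actual runtime implementation.

#include <stddef.h>

/* Illustrative DRM entry: one record per Rolex-annotated construct. */
typedef enum { ROLEX_TOLERANT, ROLEX_ROBUST, ROLEX_HEAL } rolex_policy_t;

typedef struct drm_entry {
    void*             addr_offset;            /* start of the annotated construct */
    size_t            size;                   /* extent in the address space      */
    rolex_policy_t    policy;                 /* error-management strategy        */
    void            (*recovery_func)(void*);  /* amelioration callback, if any    */
    struct drm_entry* next;
} drm_entry_t;

/* Locate the entry covering a faulting address reported via the OS signal. */
static drm_entry_t* drm_lookup( drm_entry_t* head, void* fault_addr )
{
    for ( drm_entry_t* e = head; e != NULL; e = e->next )
        if ( (char*) fault_addr >= (char*) e->addr_offset &&
             (char*) fault_addr <  (char*) e->addr_offset + e->size )
            return e;
    return NULL;   /* no knowledge available: the runtime terminates gracefully */
}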
For tolerant qualified data variables, the runtime library routines provide detection and correction through bit manipulations. When the affected region in the address space maps to computation contained in the tolerant directives, the runtime initiates a roll-forward or roll-back recovery. This enables the execution to resume in a partially restored computational state. For amelioration-based recovery, the runtime signals the application, which in turn invokes the user provided reinitialization or repair functions. When the runtime is able to account for the error states in the application, it allows the process to resume execution. Based on the application construct in error state and the knowledge available in the DRM, the runtime invokes the appropriate routines that seek to compensate for the perturbations in the variable state and then roll-forward/roll-back the execution. The runtime effectively traverses the decision tree in Figure 4.5 while managing the error states in the application program execution. When no error tolerance or amelioration knowledge is available for the application-level construct in the DRM, the runtime gracefully terminates the application process.

Figure 4.5: Decision tree for error management by runtime inference system

Figure 4.6: Overview of application compilation and execution with Rolex

4.5.3 Workflow of a Resilient Execution Environment

With the incorporation of Rolex, we allow several changes to the programming model and the execution environment. These are captured in Figure 4.6. With the annotation of HPC application codes with the Rolex qualifiers and pragma directives, the compiler needs to parse these extensions and generate source code that is compliant with the base language (C/C++ or FORTRAN). The front-end compiler restructures the application source code to incorporate the Rolex-driven resiliency features. This includes additional declarations of redundant variables, outlining of blocks of code and creation of additional functions, and the installation of handler functions. Therefore, the program control flow and function call graph may differ from those intended by the application programmer, but these modifications are transparent to the user.

When the application program is executed, we include a pre-execution stage where the linkages of the application-level constructs from the compiled binary are discovered through a binary disassembly library. During this phase the DRM is also populated with the address offsets and error-handling actions. This stage is also transparent to the system user.

In current HPC execution models, the presence of a hardware detected error causes a machine check exception which raises an interrupt to the operating system. When the error state is uncorrectable, the kernel enters panic mode, which leads to node shutdown. Therefore, all errors lead to failure, and these are dealt with in a fail-stop manner which is catastrophic to the HPC application processes running on the node. In the context of a parallel application based on the bulk synchronous programming paradigm, the failure of a single process prevents the remaining processes from synchronizing, causing the application to hang.

In support of the Rolex-based programming model, we require our execution environment to include a runtime inference system. The runtime is linked with the application code and therefore tightly coupled, providing an API for specifying the error detection, containment and masking for application data structures.
In our model, the operating system contains a kernel module that intercepts the inter- rupts and passes them into the user space, i.e., to the runtime system through the signaling mechanism. The runtime contains a signal handler that contains the logic to query the DRM and to determine the best recourse for dealing with the error state. When the error state can be tolerated or ameliorated, the run- time allows the application execution to resume using the knowledge in the DRM. When no knowledge can be inferred, the runtime framework will terminate the application, as is the norm for unrecoverable errors in current systems. Further- more, the Rolex-based execution model is not solely reliant on hardware based 75 error detection. The robustness-based extensions and related compiler-driven code transformations enable the application to actively search for errors in the address space and inform the runtime system. Since the error-handling component of the runtime system is interrupt-driven, the runtime system does not add significant overhead to the application performance during error free execution. For certain Rolex extensions, the runtime relies on hardware-based detection mechanisms to be informed of the presence of errors. The runtime system is notified by the operating system via a kernel signal. For certain Rolex extensions, the detection and notification capability is embedded in the application code by the compiler transformation and communicated to the runtime using the library routines. The Rolex-based programming model makes the HPC applications fault-aware as well as fault-tolerant (for certain error types). The resulting execution model also allows for an interchange of error information between layers of the system stack. This prevents each error instance from causing a fatal application crash by reasoning about the significance of the error using the programmer’s knowledge on the application’s correctness expectations. 4.6 Experimental Evaluation In this section we seek to experimentally evaluate the benefits of using Rolex to describe the resilience properties of scientific application codes. For each cate- gory of resilience knowledge, we demonstrate at least two application codes. We describe the inherent resilience properties of each code and explain how the Rolex constructs expose these to the compiler and runtime system. We also evaluate the 76 performance properties of our implementation in terms of the overhead introduced by the runtime inference system. 4.6.1 Fault Injection Framework WhileRolexofferscapabilitiesforerrortoleranceandamelioration, theseexten- sions rely on hardware-based detection mechanisms that cause an interrupt to be raised to the OS. The Rolex robustness-based extensions provide application- level detection, and for such robust annotated application constructs we make no assumptions about hardware-level detection and notification mechanisms. There- fore, we require a flexible fault injection framework that simulates the different error behaviors. We have developed a software-based fault injection framework that runs as an independent process which does not interfere with the applica- tion execution between injection intervals. The framework is based on the ptrace library. The faults are delivered to the application process via signals. 
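A simplified sketch of this injection mechanism on Linux is shown below. The function name, the choice of a single bit flip and the omission of error checking are assumptions made for illustration; the actual framework adds randomized scheduling of injections, address selection within the target's address space, and logging.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>

/* Flip one bit of a word in the target process and notify it via a signal. */
void inject_bitflip( pid_t pid, void* addr, int bit )
{
    ptrace( PTRACE_ATTACH, pid, NULL, NULL );        /* stop the target         */
    waitpid( pid, NULL, 0 );
    long word = ptrace( PTRACE_PEEKDATA, pid, addr, NULL );
    word ^= ( 1L << bit );                           /* single-bit perturbation */
    ptrace( PTRACE_POKEDATA, pid, addr, (void*) word );
    ptrace( PTRACE_DETACH, pid, NULL, NULL );        /* let execution resume    */
    kill( pid, SIGINT );  /* mimic the notification of a hardware-detected error;
                             this signal is omitted when simulating silent
                             data corruption                                     */
}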
The advan- tage of this stand-alone fault injection methodology is that it avoids intrusive injection which typically involves modification of the application program code, or compiler-based insertion of additional instructions. For simulating hardware-detected errors such as ECC SECDED errors, the fault injection framework sends the SIGINT signal to the target process. Any aspect of the active program state may be perturbed by the framework. The fault injections are bit-flip perturbations which may be introduced in any region of the application’s address space. Since the framework sends the signal to a user- space handler routine, this mimics the behavior of hardware detected errors whose interrupt notifications are passed to the user space via a kernel handler. The fault injection framework can also simulate silent corruptions (SDC). For these injections, the target application process is intercepted via a SIGINT and 77 bit-flip perturbations are introduced in the address space. No notification is raised to the runtime system, and the fault injection framework allows the application process to resume execution. With certain debug information included in the application executable binary, we are also able to tie the fault injections to specific data variables and functions in the application source code. The fault injection framework does not inject errors into the system state since our emphasis is on evaluating the impact of the language extensions on the application program state. We also do not inject errors into the register state or any library headers. The framework can generate the fault notification signals at arbitrary instances and intervals during the execution of the application process. 4.6.2 Accelerated Fault Injection Experiments We evaluate the application resilience of a range of scientific codes when their code is equipped with Rolex by injecting errors into their execution. The faults manifest themselves as corruptions in the address space of the application. We inject the faults at arbitrary instances during the execution of the application and observe the propagation of the resulting error through the rest of the program execution and the impact on the outcome of the execution. The application execution runs are subjected to accelerated errors rates. For each application code, we use five fault injection rates: 1 fault/15 minutes, 1 fault/10 minutes, 1 fault/5 minutes, 1 fault/2 minutes and 1 fault/1 minute. By adjusting the input problem sizes, the execution time of each is adjusted to be greater than 20 minutes; this ensures that the application execution experiences at most 1, 2, 4, 10 and 20 faults per run. The nature of the fault injected, i.e., whether it results in a detected memory error or a silent data corruption, depends on the type of Rolex extension being evaluated. 78 The fault rates that we selected for these experiments are extremely high. For the selected rates, we have effectively set the mean-time-to-failure of the compute node to be 15, 10, 5, 2, and 1 minute(s). Although unrealistic, these accelerated fault rates allow us to validate the efficacy of our language extensions and the runtime inference system. The fault injection runs may also serve to provide application programmers with an intuition into what regions of their application are particularly vulnerable to errors, paving the way for further code annotations. 
Each application code is opportunistically annotated with the Rolex type qualifiers, directives and runtime library routines to suit the inherent resilience properties of the code. The code is compiled with our ROSE-based front-end compiler and then with the GCC compiler infrastructure to generate the executable binary. Each experiment involves 10,000 runs for each fault injection rate. The location of the errors is randomized for each run. Although the average fault rate for the application run is selected to be among those mentioned above, the interval between any two consecutive faults is also randomized.

Enabling Tolerance Using Rolex Extensions

To demonstrate error tolerance through Rolex, we select three codes:

• HPCC Random Access: The benchmark [121], which was originally designed to model a vectorized application on a Cray Y/MP, allowed the same address to appear twice in a gather/scatter operation and therefore failed to retain sequential consistency. Due to this property, the benchmark is explicitly tolerant to the presence of errors in its HPCC Table array. The computational kernel performs repeated pseudorandom updates. We allocate the HPCC Table array structure using the resilience_malloc_tolerant() runtime library routine.

• 3D Rendering Application: The application converts a 3D model of a scene into a 2D screen representation. The final rendered scene is written to a frame buffer, which is declared as a 2-D array in our test code. We qualify the declaration with the tolerant type qualifier. For these application runs, the measure of correct completion is an execution that completes and renders the scene with no discernible visual discontinuities.

• Molecular Dynamics Simulation: The floating-point array structures for the particle position, velocity and acceleration are qualified with the tolerant type qualifier to allow lower precision when their mantissa bits are perturbed.

The injected faults for these experiments lead to system memory errors that manifest themselves as ECC SECDED errors. Since the errors are detected but unrecoverable by the hardware-based ECC, their notification is passed into the runtime system. Based on the location of the error, there are only two possible outcomes: compensation for the presence of the error (by ignoring the error, masking the affected bits of the variables, or roll-forward/roll-back of the execution), or termination of the application to prevent further corruption.

Figure 4.7: Evaluation of tolerance Rolex extensions: Accelerated fault injection results

Figure 4.7 summarizes the results of these experiments. The HPCC Random Access is an inherently resilient benchmark. The memory footprint of the computational kernel that performs the pseudorandom updates is extremely small in comparison to the HPCC_Table array, which occupies 50% of the system memory and is allocated with the tolerant version of the malloc routine. Therefore, up to 99% of the execution runs converge, even for an error rate as high as 1 fault per minute. Similarly, a dominant portion of the active memory footprint of the 3D rendering application contains the integer array, which is also allocated with the tolerant version of the malloc routine. While as many as 85% of the application runs of the molecular dynamics code converge for a fault rate of 1 fault/5 minutes, the survival rate drops in the presence of higher fault rates.
Since the only inherent resilience property we expose with Rolex is the relaxed precision on the position, velocity and acceleration arrays, the application tends to terminate when other regions of the active memory footprint are affected at higher fault rates.

Enabling Robustness Using Rolex Extensions

We enhance the robustness of the following two codes using Rolex:

• Graph500: The benchmark is representative of a class of emerging supercomputing workloads that focus on data analytics. The computation and memory access patterns are radically different from 3D physics simulation applications since these applications are unstructured, integer-oriented and out-of-core. Irrespective of the representation of the graph structure, the code is rich in pointer references that represent the graph edges and vertices. In the Graph500 breadth-first search (Kernel 2) code, we qualify the pointer references for the graph edges using the robust qualifier.

• Algebraic Multigrid Solver: AMG is a linear solver widely used in scientific simulations for solving large linear systems. When used to solve a linear system Ax = b, it iteratively refines an initial approximate solution for x until the error falls below a certain bound. Each multigrid iteration, referred to as a "V-cycle," consists of smoothing, restriction and interpolation stages during which the algorithm starts with a fine grid, restricts to a coarser grid and then interpolates to a fine grid again. The intermediate solution grids can tolerate errors at the cost of needing additional V-cycles to converge to the correct solution. However, the algorithm is sensitive to pointer variable corruptions, and bit-corruption errors on these variables often lead to an application crash. We apply the robust qualifier to each pointer variable declaration in the code and allocate the intermediate solution grids using the resilience_malloc_tolerant() routine.

For these experiments, we consider four possible outcomes of a bit corruption injected in the application address space:

• Silent data corruptions that are detected using the redundancy injected into the application code.
• Benign faults that remain in the program state until the conclusion of the execution, but do not affect the correctness of the outcome.
• Undetected faults in the application state.
• Application crash, which occurs when the injected perturbation affects a part of the program state mapped to the computational environment.

Figure 4.8: Evaluation of robustness Rolex extensions: Accelerated fault injection results

Figure 4.8 summarizes the results of these experiments. The Graph500 breadth-first search algorithm contains several pointer-related computations to traverse the graph edges. Therefore, almost 50% of the execution runs can detect and correct the corruptions in the pointer arithmetic. However, since the other parts of the computational environment as well as the graph vertex data elements contain no error management knowledge, the application fails more often at very high fault rates. The AMG code is naturally resilient to errors since the memory allocated to the intermediate solution grids at each level in the V-cycle can ignore the presence of the errors. Therefore, in the presence of silent corruptions injected in the address space, a dominant portion are treated as benign (having no effect on the correctness of the outcome). The pointer arithmetic is the most sensitive to silent corruptions, and the robust qualifiers aid in the detection and correction for these pointer variables.
Enabling Amelioration Using Rolex Extensions

For the following codes, well-known algorithmic methods for error detection and correction may be applied through Rolex:

• Matrix-Matrix Multiplication: For the general DGEMM code, which performs the matrix multiplication A x B = C, we define functions that maintain the row and column checksums for the operand matrices A and B. The function reference is passed as an argument to the resilience_malloc_repairable() library routine.

• Conjugate Gradient Solver: We define a function that performs the checksum operation for the operand matrix, and we pass its reference to the resilience_malloc_repairable() library routine. Also, we include the CG iteration step in the #pragma resilience recover-rollforward ameliorate directive and include the checksum routine to validate the correctness of the operand matrix.

• Self-Stabilizing Conjugate Gradient: The CG iteration steps are included in the amelioration directive #pragma resilience recover-rollback. The self-stabilizing version of CG offers a correction step that restores the stability of the algorithm in the presence of errors. This correction step is included in a function whose reference is included in the ameliorate clause of the directive.

Figure 4.9: Evaluation of amelioration Rolex extensions: Accelerated fault injection results

Figure 4.9 summarizes the results of these experiments. For the DGEMM code, the checksum-based amelioration is applicable only to the static data, i.e., the operand matrices that are initialized at the beginning and whose values do not change throughout the execution. We have not applied any Rolex construct to the dynamic state, i.e., the result matrix. Therefore, 75% of all executions converge correctly for the fault rate that injects an error every 5 minutes, but only 27% complete correctly at the accelerated rate of 1 error per minute. For the CG computation, we leverage the iterative nature of the algorithm, which allows the execution to ignore errors on the solution vector. For any errors on the operand matrices, we rely on detection/correction based on the checksums. Since a larger fraction of the address space is protected using Rolex, this application demonstrates a better completion rate than DGEMM, even at higher fault rates. The SS-CG contains the correction step that is designed to restore the stability of the algorithm. This permits further relaxation in tolerating errors during the regular CG iterations. Therefore, the execution of SS-CG converges correctly more often than CG for similar fault rates.

4.6.3 Performance Evaluation

We evaluate the overhead of embedding the resilience knowledge using Rolex for each of the application codes. We compile each application code into two different binary versions: a binary with Rolex, compiled using our front-end source-to-source compiler followed by a regular GCC compiler; and a version using only the GCC compiler. The binary version without Rolex is executed in a fault-free environment to measure the baseline execution time. The version containing Rolex is subjected to fault injection, for which we measure the application's time to solution for runs that survive all the faults and reach correct completion. This allows examination of the overhead incurred by the compiler-based transformations as well as the overhead incurred by the runtime inference system. We calculate the workload efficiency as the ratio of the ideal time-to-solution on a fault-free system to the actual running time in the presence of faults.
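Expressed as a formula, with $T_{\mathrm{ideal}}$ denoting the fault-free time-to-solution and $T_{\mathrm{faulty}}$ the measured running time of a surviving run under fault injection, the workload efficiency is

\[ \eta \;=\; \frac{T_{\mathrm{ideal}}}{T_{\mathrm{faulty}}} \]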
The execution times for each fault rate are averaged over the fraction of the 10,000 application runs that complete correctly.

For the tolerance-based extensions, the workload efficiencies for HPCC Random Access for the rates of 1 fault per 15, 10, 5, 2 and 1 minute(s) are 98.2%, 97.5%, 96.3%, 90.1% and 83.3%, respectively. For the 3D rendering application, the workload efficiencies are 98.2%, 97.5%, 96.3%, 90.1% and 83.3%, while the molecular dynamics code has workload efficiencies of 96.5%, 95.1%, 92.1%, 89.7% and 81.4%, respectively. For the tolerance-based extensions, the compiler adds very few additional statements and the runtime inference must only determine whether the error location is mapped to an application construct that has been annotated as tolerant. Therefore, the workload efficiencies are in excess of 80% for fault rates as high as 1 fault per minute. For the robustness-based extensions, the workload efficiencies for the Graph500 and AMG are greater than 90% for all the above fault rates. For these application codes, the compiler inserts statement-level redundancy that performs the error detection/correction during the application execution. For the amelioration-based extensions, the workload efficiencies for DGEMM are 84.6%, 80.2%, 72.1%, 65.9% and 58.4%, and for the CG code they are 85.5%, 75.1%, 66.7%, 58.3% and 51.0% for the fault intervals of 15, 10, 5, 2 and 1 minute(s), respectively. Since these codes utilize checksum operations which operate on the complete problem matrix, the workload efficiencies are much lower due to the additional stress exerted on the memory hierarchy for operating on the full matrix. The workload efficiencies for the SS-CG are 94.2%, 88.4%, 83.1%, 75.9% and 68.2%, which are significantly better than those of the DGEMM and CG codes because the amelioration entails invocation of an iteration step that restores the stability of the algorithm. The main advantage of the application-level amelioration approaches over standard checkpoint and roll-back methods is that they do not incur additional overhead to the computational cost when no errors occur.

4.7 Summary

This chapter presented a set of Resiliency-Oriented Language Extensions (Rolex) for expressing the error resilience properties of scientific HPC application codes at the language level. These extensions are developed to succinctly capture a programmer's knowledge of the fault tolerance features of the application through type qualifiers, directives and library routines. The semantics of these language extensions enable application-level error detection, containment and masking. We have presented concrete examples of widely used scientific computational kernels in which encoding the resilience knowledge using Rolex enhances the application's error resilience. We described the compiler transformations that leverage the language extensions to incorporate further error resilience features in the application codes. These transformations are enabled by a front-end source-to-source compiler infrastructure. We described the compiler-runtime interface and the design and implementation of the runtime inference system. We demonstrated that the combination of the language-level programming model extensions, which are tightly integrated with the compiler infrastructure and runtime system, provides an execution environment that facilitates cross-layer efforts for error detection, masking and recovery.
For HPC applications, this translates to the survival of more errors and therefore a longer mean-time-to-failure. 88 Chapter 5 Application-Level Fault Detection and Correction Through Adaptive Redundant Multithreading 5.1 Overview In the presence of accelerated fault rates that are projected to be the norm on future exascale systems, it will become increasingly difficult for HPC applications to accomplish useful computation. HPC applications are fault-oblivious due to the limitations of current HPC programming paradigms and execution environ- ments. Among the existing fault-tolerance techniques, C/R is transparent to the application users but may not be viable when the interval to checkpoint or roll- back becomes proportional to the system’s mean-time-to-interrupt (MTTI). On the other hand, ABFT techniques are very application-specific and often require 89 significant refactoring of the application code to provide error detection and cor- rection capabilities. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. By maintaining repli- cas of the program state, or the computation, or both, the use of redundancy provides error detection and correction capabilities without relying heavily on the algorithmic features of the application. However, the use of redundancy incurs significant overhead to the application performance. Redundant multithreading (RMT) is a promising approach for HPC applications because it uses multiple thread contexts within a single application process, which offers lightweight redun- dant execution streams of program instructions. The output values produced by the redundant thread copies may be compared to detect the presence of errors in the computation. However, in the context of long-running scientific applica- tions that will seek to scale and harness the capabilities of millions of cores in future exascale-class systems, even thread-level redundant execution of complete programs will be prohibitively expensive in terms of application performance and energy costs. In this chapter we present an application-level fault detection and correction approach that is based on RMT, but applies the thread-level redundancy adap- tively. We extend the semantics of Rolex to support partial RMT on the regions of code delineated by the Rolex directives. The approach combines the use of a programming model extension with a source-to-source compiler infrastructure and runtime system to enable adaptive use of RMT. The runtime system permits the use of RMT to be modulated (dynamically enabled or disabled) in order to exploit the trade-off space between performance overhead and application resilience. 90 The remainder of this chapter is organized as follows: Section 5.2 describes the conceptsandterminologyusedinredundantcomputing, includingthoserelevantto RMT. Section 5.3 makes the case for leveraging the Rolex language-level directive to gain the application programmer’s insights on defining the scope of redundant computation. Section 5.3 also describes the syntax and semantics for the Rolex robust directive. Section 5.4 describes the compiler-based construction of RMT executedcodeblocks, andSection5.5detailsthenuancesoftheruntimeadaptation algorithm. Section 5.6 provides example codes through which we explain how the adaptive RMT enables application-level error detection/correction. In Section 5.7 we discuss optimization strategies for adaptive RMT. 
Section 5.8 presents the experimental infrastructure, evaluation methodology and the performance results.

5.2 The Redundancy Solution for Fault Detection/Correction

Redundancy is a well-studied canonical solution [24], which is widely used in mission-critical applications, where avoiding the human and economic costs and consequences of failure justifies the cost (in terms of additional hardware, power and performance) of replication. Triple modular redundancy (TMR) is a well-known approach that entails creation of three replicas and uses majority voting to detect and correct errors, while the more general concept is referred to as n-modular redundancy. Redundant multithreading (RMT) approaches create multiple thread contexts using which identical copies of partial or complete program code are executed. For error detection, the output values produced by duplicate threads are compared, and upon a mismatch, the checker flags an error. This notification can be used to initiate a hardware- or software-based recovery sequence. The TMR equivalent of RMT entails execution of the program code by three independent thread contexts. Detection and correction are made possible by selecting the result through majority voting on the results produced by each of the redundant thread copies, which filters out the incorrect results.

Hardware implementations leverage simultaneous multithreading (SMT) capabilities, where the processor architecture is designed to fetch instructions from multiple thread contexts and their execution is interleaved. The threads may even be mapped to independent cores in the context of chip multiprocessors (CMP); this is called chip-level redundant multithreading (CRT). The primary advantage of hardware-based RMT is that it is transparent to the operating system and the application software. However, hardware-based RMT approaches incur significant overhead due to redundant instructions flowing through complex processor pipelines, as well as increased contention for shared resources such as caches. Software-based approaches replicate the program using independent OS processes or utilize threading libraries to create redundant threads within a single OS process.

One of the key concepts in RMT is the sphere of replication [122], which represents a logical boundary that includes a unit of computation that may be replicated. Any fault that occurs within the sphere of replication propagates to its boundary. The concept is illustrated in Figure 5.1. The computation enclosed within the sphere of replication is executed by the redundant threads. The inputs may be replicated, or a single copy may be shared between the redundant threads if the inputs are "read-only." Faults are detected by comparing specific outputs produced by the redundant execution of the spheres of replication. Any faults that do not manifest themselves at the boundary of the sphere in the output values get masked and are treated as benign in terms of their impact on the application outcome.

Figure 5.1: Sphere of replication

The process of managing the size and extent of the sphere of replication is important because it affects the overhead to application performance and the extent of fault coverage offered. The design considerations are:

1. For which parts of the application will the redundant execution mechanism detect faults? The primary concern with using program-level RMT at a system-wide scale in HPC systems is the potentially massive overhead to application performance and energy.
Based on the fault coverage requirements of the users and the algorithmic features of the code, the sphere of replication may include the complete program execution, only the code for a specific function body, or a basic block of instructions.

2. What are the inputs to the sphere of replication, and do they need to be replicated? The selection of inputs has important implications for the application performance since it increases the number of memory accesses by the program. On the other hand, the failure to replicate any inputs may potentially lead the redundant threads on divergent execution paths. Therefore, inputs need to be selectively replicated, or they should be protected using other redundancy mechanisms such as checksums, parity, etc.

3. What are the output values from the sphere of replication that need to be compared in order to detect the presence of faults in the computation? In general, any values that are computed within the sphere of replication must be compared in order to check for a mismatch. The failure to compare meaningful application values potentially compromises fault coverage and the usefulness of the redundant execution. However, unnecessary comparisons of output values increase the overhead on the application performance without improving fault coverage.

5.3 Programmer Managed Scoping of Redundancy

Scientific applications often operate on large problem sizes, and therefore the size of the sphere of replication is a critical consideration. The use of complete replication for fault detection/correction is often not a feasible solution due to the large overheads to the application performance. The selection of inputs and outputs for the spheres of replication also impacts the application performance because their replication exerts pressure on the memory hierarchy, leading to longer access latencies. The fault coverage, however, need only be extended to specific code regions whose execution outcome will indicate the presence of errors in the program state of the application.

Our approach extends the philosophy of Rolex, which empowers scientific application programmers and library developers to leverage their domain knowledge and understanding of their codes to affect the application resiliency. The Rolex-based programming model extensions serve as mechanisms that permit application programmers to manage the scope of the spheres of replication. The basis of our approach is the notion that even if an error affects the architectural program state, the application correctness is impacted only when it manifests in the user-visible program state. Therefore, using the Rolex directives, the programmers may selectively apply redundancy to specific code regions to detect/correct errors in the computation of the corresponding application phases. Our approach also enables the application programmers to identify application-level variables that should be treated as inputs to the programmer-defined spheres of replication, as well as the output variables that need to be compared for error detection and correction. The syntax of the Rolex robust directive to specify spheres of replication is as follows:

#pragma resilience robust detect share( variable_list ) private( variable_list ) compare( variable_list )
{ /* statement list */ }
#pragma resilience robust correct share( variable_list ) private( variable_list ) compare( variable_list )
{ /* statement list */ }

The directive starts with #pragma resilience and is followed by the robust keyword.
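For instance, a dot-product kernel might be scoped as a sphere of replication as shown in the following sketch; the loop and the variable names are assumed for this illustration.

#pragma resilience robust detect share( x, y, n ) private( i ) compare( sum )
{
    sum = 0.0;
    for ( i = 0; i < n; i++ )
        sum += x[i] * y[i];   /* executed by each redundant thread           */
}   /* implied barrier: the redundant copies of sum are compared for errors  */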
The directive follows the conventions of the C and C++ standards for compiler directives. The directive applies to at most one succeeding statement, which must be a structured block that is enclosed in a pair of '{' and '}'. The structured block is defined as a C/C++ executable statement, which may be a compound statement, but has a single point of entry at the top and a single exit point at the bottom. The compound statement is enclosed by a pair of { and }. Its point of entry cannot be the target of a branch, and no branch is allowed from within the structured block, except for program exit. Instances of the structured block can be compound statements including iteration statements, selection statements, or try blocks. The robust directive also includes data scoping clauses which are used to impose rules on whether the data variables are replicated among the redundant executions of the structured block, as well as to select variables produced by the sphere of replication that need to be compared in order to detect the presence of errors.

The algorithms in scientific applications tend to demonstrate the persistence of computational patterns with respect to time [123]. The application code may be partitioned into phases, which may be mapped to spheres of replication. For these code regions, we can employ redundancy to detect/correct errors in the computation. These phases may be scoped using the Rolex robust directive, which offers syntactic structure to the notion of spheres of replication. The code contained within the structured block following the robust directive corresponds to a sphere of replication.

The compiler processes the directive and uses it to create explicitly multithreaded code. Each redundant thread executes the code statements contained in the structured block. The STRENGTH attribute specifies whether a detection or correction capability is required. This implicitly indicates the number of redundant thread copies of the structured block that the runtime needs to create. The directive uses the data management clauses share and private to specify the data sharing attributes for the variables listed in the respective clauses. The compare clause is used to specify the list of variables produced by the structured blocks that need to be compared or majority voted on to detect/correct an error in the computation. There is an implied barrier at the end of a robust region to synchronize the threads and compare the values produced by each thread copy. The threads are assigned identifiers that range from zero (which is the identifier for the primary thread) up to a value of one less than the value specified by the strength macro for the redundant threads. When the strength macro is set to DETECT, thread identifiers are 0 and 1 for dual-multithreaded execution (DMT), whereas thread identifiers are 0, 1 and 2 for triple-multithreaded execution (TMT).

5.4 Compiler Support for Adaptive Redundant Multithreading

The compiler plays the role of a key intermediary that allows propagating the knowledge expressed by the programmer in supporting the creation and adaptive execution of spheres of replication in concert with the runtime system. In this section we describe the compiler transformations.
5.4.1 Compiler Support for Application Programmer Scoped Spheres of Replication

The creation of spheres of replication through the language-level pragma directive permits the programmer to make strategic decisions on their scope, while the compiler processes the directive and uses it to create explicitly redundant multithreaded code with error checking. Therefore, the compiler shields the details of the creation of the redundant multithreading from the application developer. The compiler infrastructure is responsible for the redundant thread creation, execution and termination. It also manages the shared and private data variables and inserts code statements that compare the data variables listed in the compare clause. However, the compiler does not check for synchronization or correctness.

Figure 5.2: Compiler infrastructure for redundant multithreading

We have extended the compiler front-end for Rolex to support redundant multithreaded execution of the spheres of replication through source-to-source code transformations. The generated source code only uses base language constructs and calls to runtime library routines so that a native C/C++ compiler can be used to generate code for the target platform. The overview of the compilation process is illustrated in Figure 5.2. The front-end source-to-source translators are built using the ROSE compiler infrastructure [119].

Using such a two-stage compilation process enables programs to be compiled using a standard C/C++ compiler infrastructure and leverages various implementations of threading libraries. The program may also be compiled without the redundant multithreading-based error detection/correction capabilities, in which case the compiler front-end simply ignores the pragma directives and generates sequential code.

In order to implement the sphere of replication from the structured block specified after the directive, the compiler outlines the code in the structured block. Outlining entails extraction of the code segment, i.e., the statement list, from a host function and creation of a new function referred to as the outlined function. The original code segment in the host function is typically replaced with a call to the outlined function [120]. Figure 5.3 (a) and (b) show the transformation in which the pragma-scoped code block body is outlined from the main() function and replaced with a call to the function __robust_region_main(). To support redundant multithreading, the original code segment in the host function needs to be replaced by a call to a threading library routine to which the outlined function pointer is passed as an argument. The code transformation is illustrated in Figure 5.3 (c). Each redundant thread executes the outlined function, and the data variables specified in the share clause are passed as arguments to this procedure. The data variables specified in the private clause are stored on the thread stack. Outlining introduces some overhead to the application execution, but it makes the translation of the structured block into redundant multithreaded code and the management of the data scoping straightforward.

5.4.2 Compiler Support for Adaptive Execution of Spheres of Replication

In addition to outlining the structured block into spheres of replication, the compiler generates code in the host function that enables conditional execution of the outlined function, either serially or in redundant multithreaded mode.
The adaptive execution of the spheres of replication using RMT is based on internal control variables (ICV) that affect the application program behavior but are manipulated by a runtime system. Figure 5.4 shows an example of the code transformation that enables the conditional execution.

The ICVs are given values at various times during the execution of the program, and therefore their values control when the runtime enables/disables the redundant multithreading-based error detection/correction. The ICVs also control the number of redundant threads created. The ICVs are initialized by the runtime, and the application is signaled when their value is modified. When the execution encounters the outlined sphere of replication, whether the execution is performed sequentially or whether the code block is bound to dual threads (for error detection) or triple redundant threads (for correction via majority voting) depends on the current value of the ICV.

Figure 5.3: Code outlining for redundant multithreaded execution

Figure 5.4: Code outlining and transformations for adaptive redundant multithreaded execution

5.5 Runtime Adaptation

The role of the runtime system is to support the adaptive application of RMT to provide error detection/correction semantics for the Rolex directives. To this end, the runtime monitors the fault rates in the system and decides whether to enable/disable the execution of the spheres of replication in RMT mode. In this section we describe the compiler-runtime interface. We also provide the details of the system indicators and the algorithm that the runtime uses to determine when fault detection/correction through redundant computation is required. We also explain how the execution of an application code that includes the robust directive proceeds from initialization to solution.

5.5.1 Compiler-Runtime Interface

The runtime provides an interface to the compiler front-end. This interface includes runtime library routines that are visible only to the compiler framework and may not be invoked through simple function calls in the user code. The interface routines complement the front-end outlined code blocks so that the runtime may selectively apply RMT for their execution. These routines effectively instantiate calls to corresponding threading library routines to which the outlined function pointer is passed as an argument. When the robust directive is encountered, the compiler inserts calls to these routines along with the call to the outlined function. This shields the details of the replicated execution of the outlined functions using redundant threads and error checking from the programmer.

The runtime library includes routines (__rolex_initialize()) and (__rolex_finalize()) that initialize the threading environment and event monitoring and subsequently manage the shutdown and clean-up of the runtime system. The runtime library routines for thread creation, termination and synchronization include __rolex_fork() and __rolex_join(). These routines wrap the corresponding POSIX-compliant threading library routines. The routine __rolex_compare() synchronizes the redundant threads and compares the values produced by them. There are also routines available to get thread information (__rolex_thread_num()), as well as routines for setting the number of redundant threads based on the STRENGTH (__rolex_num_threads()) in the Rolex robust directive.
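A sketch of the host-side code that the front-end might emit for a robust detect region is shown below. The ICV name, the routine signatures and the exact control flow are illustrative assumptions; only the routine names are taken from the interface described above.

/* Assumed ICV: 0/1 selects serial execution, 2 selects DMT, 3 selects TMT. */
extern int __rolex_icv_strength;

if ( __rolex_icv_strength > 1 ) {
    __rolex_num_threads( __rolex_icv_strength );
    __rolex_fork( __robust_region_main, &shared_args );  /* redundant threads */
    __rolex_join();
    __rolex_compare( &shared_args );   /* any mismatch is flagged to runtime  */
} else {
    __robust_region_main( &shared_args );                /* serial execution  */
}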
5.5.2 System Fault Event Indicators for Runtime Adaptation

In order to evaluate the fault-tolerance state of the system, we rely on advanced error detection and reporting features available in modern architectural specifications, such as those defined by the Advanced Configuration and Power Interface (ACPI) [124], which includes the ACPI Platform Error Interface (APEI) specification. Architectures that are compliant with these standards transfer hardware platform error knowledge to the software stack for recovery or logging purposes. Broadly, the anomalous events in systems may be classified into three categories:

1. Uncorrected errors: These are unrecoverable errors from the hardware point of view. The system software may potentially recover from such errors through software-based correction or masking.

2. Corrected errors: These are potentially correctable in hardware, without support from the software layers. These error events can raise notifications to the system software for logging purposes.

3. Fatal errors: These events are not correctable by either hardware-based mechanisms or system software intervention. These errors are usually catastrophic for the system and the applications running on it.

Our runtime system monitors and logs events from the first two categories. Examples of such events are ECC errors on DRAM DIMMs, PCI bus parity errors, cache ECC errors, DRAM scrubbing notifications, TLB errors, memory controller errors, etc. The errors from all subsystems are communicated to the host firmware or to the operating system directly via the signaling interface. Based on the contents of the error registers, the severity of the anomalous event is analyzed. The firmware creates detailed error logging information and notifies the OS of the event's occurrence.

5.5.3 Redundancy Adaptation Algorithm

Since the basis of our opportunistic fault detection/correction strategy is the well-reasoned use of redundant computation, we require that the runtime continuously monitor and understand patterns in the anomalous events in the system. Based on the types of events and the intervals between them, the runtime system makes a quantitative assessment of the vulnerability of the execution environment. The emphasis of our runtime system is not on accurate fault prediction but rather on preemptively enabling application-level detection of errors in program state when events suggest that the fault-tolerance state of the underlying system is volatile. This knowledge is used to determine whether RMT should be enabled for the current program execution, and for how long.

Various studies that seek to statistically model the empirical distribution of the time between fault events have observed a strong correlation between error events, both spatially and temporally. This suggests that enabling the redundancy-based detection/correction when the fault rate rises above a certain threshold prevents further error events from impacting the correctness of the application. This approach provides detection/correction once the runtime activates the RMT. It requires support from other Rolex recovery techniques or from rollback recovery to deal with error states that were activated in the program state before the runtime enabled the RMT-based detection/correction.
In our adaptation algorithm, the runtime system defines a metric called the time-between-events (TBE), which is equivalent to the time-between-failures (TBF) metric except that we are interested in anomalous events rather than failures. The runtime also tracks the time-since-last-event (TSLE). For simplicity, we do not differentiate between the types of fault events and base the adaptation on all events logged by the operating system.

When application execution begins, the runtime initializes the TBE for the execution environment. Any instance of the pragma-defined structured code encountered is executed in serial mode. Upon the occurrence of the first event after execution begins, the runtime enables redundancy through the ICV. This enables duplicate threaded execution for all subsequent pragma-defined structured code blocks. The output values are compared to detect the presence of errors in the computation.

The TSLE for the system is initialized and is updated with the application execution time elapsed since the latest event. The TBE is also updated as subsequent events occur. When the TSLE exceeds the TBE, the redundant execution is turned off; further instances of the structured code blocks are executed serially, i.e., the fault detection/correction for the pragma-contained computation is disabled. This enables more effective management of the application performance overhead: fault detection is enabled over a limited interval following anomalous events, in anticipation of further events. When the system resources are deemed stable (no events are experienced over a period of time), the redundant execution is disabled. During periods of execution that experience intermittent events or bursts of errors, the fault detection remains enabled for extended durations.

Figure 5.5 illustrates a timeline view of the execution of a code that contains programmer-scoped spheres of replication. The code represents an iterative algorithm in which the loop body is contained within the structured block following the Rolex robust directive. Each trapezoidal structure in Figure 5.5 represents the execution of the structured code block by duplicate threads and the comparison of the output values to detect errors.

Figure 5.5: Timeline view of adaptive redundant multithreading

We illustrate two scenarios. Figure 5.5(a) represents an execution that experiences a single event. When the execution commences, the structured block instances, i.e., the loop body, are executed in serial mode by a single thread context. When the first event occurs, the interrupt to the OS causes a signal to be passed to the runtime system, which in turn causes the runtime to signal the application to enable RMT. The subsequent loop iterations execute the compiler-inserted code that forks duplicate threads and compares the values produced by each thread. When the runtime observes no events for an interval longer than the TBE (i.e., TSLE > TBE), it disables the RMT by passing a signal to the application.

Figure 5.5(b) illustrates an execution run during which the runtime observes a single transient event during the first iterations and a burst of several events later on. After the occurrence of the first fault event, a limited number of the following iterations are executed by duplicate redundant threads. The burst of events causes the observed TSLE to remain smaller than the TBE; therefore, the RMT execution is not disabled until the application completes.
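The enable/disable rule described above can be summarized by the following sketch. The state variables and function names are illustrative, the running-mean update of the TBE is an assumption (the text does not specify the exact update), and the actual runtime drives the decision through the ICV and signals rather than toggling a flag directly.

#include <stdbool.h>

struct adapt_state {
    double tbe;          /* running estimate of the time between events       */
    double last_event;   /* timestamp of the most recent logged event         */
    long   nintervals;   /* number of event intervals observed so far         */
    bool   seen_event;   /* has at least one event been logged?               */
    bool   rmt_enabled;  /* are spheres of replication executed redundantly?  */
};

/* Called when the OS reports an anomalous event to the runtime. */
void on_fault_event(struct adapt_state *s, double now)
{
    if (s->seen_event) {
        double interval = now - s->last_event;
        s->tbe = (s->tbe * s->nintervals + interval) / (s->nintervals + 1);
        s->nintervals++;
    }
    s->seen_event  = true;
    s->last_event  = now;
    s->rmt_enabled = true;      /* the first (and any later) event enables RMT */
}

/* Consulted before each instance of a pragma-defined structured block. */
bool use_rmt(struct adapt_state *s, double now)
{
    double tsle = now - s->last_event;           /* time since the last event  */
    if (s->rmt_enabled && s->nintervals > 0 && tsle > s->tbe)
        s->rmt_enabled = false;                  /* TSLE > TBE: back to serial */
    return s->rmt_enabled;
}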
If there is a mismatch in the output values during any loop iteration, the runtime is notified. When the RMT strength supports only error detection, the runtime may initiate recovery, migrate the task, or terminate and restart the application. When the triple-modular version of RMT is applied, it may be possible to select the correct value through majority voting.

5.6 Examples: Error Detection/Correction Through Adaptive Redundant Multithreading

In this section we demonstrate example codes that use the pragma directive to outline spheres of replication. We describe how application-level error detection/correction is possible in the computational kernels of double-precision general matrix-matrix multiplication (DGEMM) and sparse matrix vector multiplication (SpMV).

5.6.1 Double-Precision General Matrix-Matrix Multiplication (DGEMM)

Dense matrix operations are widely used in scientific and engineering simulation applications. The double-precision general matrix-matrix multiplication (DGEMM) is a key BLAS routine used in linear algebra problems. DGEMM operates on matrices A and B to produce matrix C, where C = β·C + α·op(A)·op(B), in which α and β are scalars and op(·) means that one can use the transpose, the conjugate transpose, or the matrix as it is.

The basic matrix-matrix multiplication is a straightforward algorithm that consists of three loop levels. The innermost loop steps through one row of matrix A and one column of matrix B over a loop variable k (the shaded row and column in Figure 5.6). The inner loop calculates the dot product of the row of A and the column of B and generates one element of the result matrix C (the shaded element of C in Figure 5.6).

Figure 5.6: Use of adaptive RMT for double-precision matrix-matrix multiplication

We define the pragma block to include the inner dot product of the matrix multiplication, i.e., the dot product computation resulting from the multiplication of a single row and a single column of the operand matrices. The code with the inclusion of the robust directive is shown in Figure 5.7. The computation included in the structured block in this illustration is not necessarily the only way to scope the redundant computation; the programmer may choose to scope the computation based on the problem size, structure and optimizations that affect cache performance.

Figure 5.7: Application of RMT directive for DGEMM

5.6.2 Sparse Matrix Vector Multiplication (SpMV)

Sparse matrix-vector multiplication (SpMV) is widely used in iterative solution methods that solve large-scale linear systems and eigenvalue problems, such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods. The SpMV computation consists of iteratively multiplying a constant sparse matrix by an input vector. We assume that the sparse matrix is represented in the compressed sparse row (CSR) format, which enables better memory access latencies and memory bandwidth and lower cache interference.

In the CSR format, each row of the sparse matrix A is packed into a dense array in consecutive locations (array val). The column indices of the stored elements are held by another integer array, col_index. Therefore, in a naive implementation of SpMV that uses the CSR format (illustrated in Figure 5.8), the inner iteration uses col_index and ptr to access the sparse matrix elements such that elements of the result vector y are accessed only once. The code for this implementation is shown in Figure 5.9.

Figure 5.8: Use of adaptive RMT for sparse matrix vector multiplication
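Figure 5.9 itself is not reproduced in this transcript. As a rough sketch, a naive CSR kernel of the kind described might look like the following; the array names val, col_index and ptr follow the text, while the input-vector name x and the row-pointer convention (ptr[i]..ptr[i+1]) are assumptions.

/* Hedged sketch of a naive CSR SpMV kernel (cf. Figures 5.8 and 5.9). */
void spmv_csr(int nrows,
              const double *val, const int *col_index, const int *ptr,
              const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {              /* one outer iteration per row */
        double sum = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)  /* nonzeros of row i           */
            sum += val[j] * x[col_index[j]];
        y[i] = sum;                                /* result element written once */
    }
}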
We place the body of the outer for loop, i.e., the inner product, in the #pragma robust structured code block. The reduced vector element y[i] for every row is the output value generated at the boundary of the sphere of replication that can be compared when the iteration of the outer loop is executed by duplicate threads (therefore included in the compare clause). Given the memory-bound nature of SpMV, we only include the variables in the private clause while the remaining data variables are shared between the redundant threads. 110 Figure 5.9: Application of RMT directive for SpMV 5.7 Optimization Strategies for Adaptive RMT The benefit of a software-based RMT approach is that the threads tend to be loosely coupled since their intermediate state need not be synchronized at every clock cycle. This enables flexibility in the scheduling of the redundant threads. Our runtime system supports scheduling policies for the redundant threads which enablefurtheropportunitiestominimizetheperformanceoverheadofRMT.Inthis section we explore two scheduling policies: one that relies on lazy fault detection and another that creates groupings of threads assigned to each processor core in the system. 5.7.1 Lazy Fault Detection In the adaptive RMT approach, the semantics of the Rolex directive place an implicit barrier at the end of the structured code block. Therefore, when the code block is executed by redundant thread copies, the threads must synchronize 111 and compare the values produced by each thread before the execution proceeds. However, theamountofcomputationthatmaybeenclosedwithintheprogrammer- defined sphere of replication is highly application dependent. For applications in which long phases are enclosed in the pragma scoped code block, or in the case of compute-bound applications, the imposition that the output values be compared after each instance of the structured blocks by duplicate threads places an unnecessary constraint that incurs nontrivial overhead to the overall application performance. To address this issue, we developed a lazy fault detection approach in which we relax the requirement that the redundant threads synchronize immediately after theexecutionofthestructuredcodeblock. Thelazy strategyisbasedontheinsight that the error detection need not necessarily lie on the application’s critical path of execution. The runtime system provides buffer space into which the duplicate threads can write their respective output values as they execute each block. A separate lightweight fault detection thread performs the value comparison. Figure 5.10 illustrates the execution timeline for an iterative code using the lazy detection approach. The occurrence of an anomalous event initiates the RMT execution. However, in this approach, the duplicate threads can execute subsequent iterations independently, without the need to synchronize the threads and perform output value comparison. The key benefit of removing the implicit synchronization barrier is that the threads may be assigned scheduling priorities. An elevated priority assignment to one of the redundant threads helps mitigate part of the overhead of RMT. By elevating the scheduling priority of the primary thread, all subsequent structured block instances are allowed to run without stalling on the redundant thread to complete their execution of the same code blocks and at higher priority than the 112 Figure 5.10: Timeline view of adaptive redundant multithreading with lazy evalu- ation redundant thread. 
The redundant thread performs the duplicate computation for the same code structured block instances, but is assigned a lower scheduling priority. The output value comparison is done by a dedicated thread which is also assigned a lower scheduling priority than the primary thread. The compiler inserts code for the allocation of buffer space into which the duplicate threads can commit their respective outputs. Each buffer entry contains a pair of elements. The buffer is implemented as a cyclic FIFO, and the size of each buffer entry is determined based on the data types of the variable list specified in the compare scoping clause. Every circular buffer entry is protected by a lock since it is accessed by redundant threads as well as the detection thread that is tasked with the value comparison. For each individual element in the buffer entry there is a flag to set its validity. When the detection thread is scheduled, it performs a value comparison for all buffer entries for which both elements in an entry pair are in valid state. Also associated with each buffer entry is an active flag that tracks whether either element in the buffer entry is currently in use. This helps the runtime reclaim buffer space when the value comparison is complete. We have also implemented a lock-free variant in which the code to compare the values is 113 included within the outlined function body. The comparison is only performed when the thread is the last to update the buffer entry. This guarantees that the comparison is performed only when both elements of the buffer entry are valid. This variant does not require a separate detection thread. 5.7.2 Thread Clustering In our environment, we have much more control over which threads, and how many, are co-scheduled on the processor cores shared memory multiprocessor (SMP) system. The thread scheduling algorithms for traditional multiprocessors seek to evenly distribute the application workload over the available processors in order to maximize resource utilization. The relaxation of the implicit synchro- nization barrier for every code block enables clustering the primary and redundant threads to specific processor cores. This enables minimizing the negative interfer- ence between the primary and redundant threads. Our runtime system creates unbalanced schedules that supports the lazy fault detection model. However, the core grouping is performed statically by the com- piler by setting the CPU affinity for the primary and redundant threads. All the redundant threads, as well as the detection thread, are clumped together and assigned to a single processor core. The primary threads are evenly balanced among the remaining processor cores on the SMP. The runtime system sets the CPU affinity of the primary, redundant and detection threads based on these static assignments. The implication of such an unbalanced thread-to-core mapping strat- egy is that it reduces the processor core resources available to the primary com- putation. Also, the core which is assigned the redundant threads tends to be oversubscribed. However, when RMT is necessary for fault detection/correction, 114 this strategy helps reduce the interference between the primary and redundant threads, which amortizes some the overhead incurred by the redundancy. 5.8 Experimental Evaluation In this section we describe our fault injection infrastructure and evaluation methodology. 
We discuss our experiences with applying the robust directive on a range of common computational kernels based on their respective algorithmic properties. We present the results of our fault injection experiments that evaluate the performance implication of embedding the RMT-based detection/correction for each of these codes. 5.8.1 Fault Injection Framework We have developed a software-based fault injection framework that runs inde- pendently of the application program process. The faults are delivered to the application process via signals. The advantage of this stand-alone fault injec- tion methodology is that it avoids intrusive injection which involves modifying the application program code, or compiler-inserted instructions that perform the fault injection. The fault injection framework is based on the ptrace library and there- fore does not interfere with the application execution between injection intervals. The type of fault events that the runtime logs and uses to quantify the vulnera- bility of the execution environment includes error notifications. Such notifications are raised due to corrected machine check interrupts and errors that are recovered by the OS. Since the fault events are mere notifications, the fault injection frame- work does not perturb any aspect of the application program state. The framework generates fault events by sending USR1 signals to the application process. The 115 runtime library, which is linked to the application process, contains the interrupt handler for this signal. The fault injection framework can generate the fault noti- fication signals at arbitrary instances and intervals during the execution of the application process. The runtime-contained signal handler is a blocking routine which catches and logs the fault event notification. 5.8.2 Application Codes Double-Precision Matrix-Matrix Multiplication (DGEMM) For the double-precision matrix-matrix multiplication (DGEMM) kernel, we use the naïve implementation that is described in Section 5.6. We define the scope of the pragma block to include the inner dot product of the matrix multiplication, i.e., the dot product computation resulting from the multiplication of a single row and single column of the operand matrices. Sparse Matrix Vector Multiplication (SpMV) In the case of the sparse matrix vector multiplication (SpMV), we enclose the body of the outer for-loop, i.e., the inner product into the #pragma robust struc- tured code block. For these experiments, we use the implementation previously described in Section 5.6. The reduced vector element y[i] for every row is the out- put value that is compared to detect the presence of errors when an iteration of the outer loop is executed by duplicate threads. 116 Conjugate Gradient (CG) The conjugate gradient (CG) method is an iterative algorithm that solves a system of linear equations and is implemented such that the initial solution is iter- atively refined. The iterations provide a monotonically decreasing residual error, and with every iteration, produce an improving approximation to the exact solu- tion. By enclosing each iteration in the block contained by the #pragma robust directive, the error value is compared to detect the presence of errors in the solu- tion. Self-Stabilizing Conjugate Gradient (SSCG) The self-stabilizing approach to the conjugate gradient method [114] contains a correction step that ensures that the algorithm remains in valid state i.e., con- verges to a correct solution in a finite number of iterations. 
We only include the code for the correction step in the #pragma robust structured code block. No fault detection/correction is necessary for the remaining CG iterations, since the correction step accounts for any computational errors in those iterations and restores the stability of the algorithm.

Multigrid Solver

The algebraic multigrid solver is a hierarchical algorithm in which the solution is accelerated by solving a coarse approximation of the original problem, interpolating it back to a finer level, and then refining that solution until it is sufficiently precise. We use a V-cycle depth of 8, and the #pragma robust directive is used to provide fault detection coverage for the code blocks corresponding to the restriction, relaxation and interpolation phases of the V-cycle.

5.8.3 Performance Evaluation

We evaluate the performance impact of the fault detection/correction through adaptive RMT on the application's time to solution through experimental runs with accelerated fault event injections. For each application code described above, the #pragma robust directive is embedded to define the scope of the sphere of replication. These regions are outlined by our ROSE-based front-end compiler in order to support their multithreaded execution. Each experiment involves 10,000 runs for each fault injection rate. The interval between any two consecutive faults is randomized for each application run. Using a large number of execution runs allows us to observe the average application overhead for a variety of fault patterns.

Figure 5.11: Results: Comparison of performance overhead of adaptive redundant multithreading with process replication
For the low fault rates, the overhead to the time to solution is in the range of 40% to 74%. For a fault rate of up to 3 faults per execution run, the overhead incurred for triple adaptive RMT 119 is lower than or at least proportional to the overhead due to dual redundant full- processreplication, formostoftheapplicationcodesweevaluated. Forhigherfault rates, the overheads while less than the 3x overheads typical to full-process TMR, are substantial (in the range of 2.5x to 2.9x). These results suggest that dual RMT-based detection in collaboration with a different recovery or amelioration strategy might be more suitable for long-running computations. In the results summarized in Figure 5.14, the benefit of the fault detection based on lazy evaluation is demonstrated. Here, relaxation of the requirement that the redundant threads synchronize and compare values after every instance of the structured code block allows the primary thread to race ahead while the redundant thread lags behind. In comparison to the time to solutions for the codes in Figure 5.12, the lazy evaluation provides a further 10 to 18% better time-to-solution. Figure 5.15 provides a summary of the results that utilize the thread clustering optimization. These results show that assigning a higher priority value and dedi- cating core resources to the primary computation yields a better time-to-solution for all the test application codes. 120 Figure 5.12: Results: Fault detection with adaptive redundant multithreading 121 Figure 5.13: Results: Fault detection and correction with adaptive redundant multithreading 122 Figure 5.14: Results: Fault detection with aRMT with lazy evaluation 123 Figure 5.15: Results: Fault detection with aRMT with thread clustering 124 5.9 Summary As faults will become increasingly prevalent in future exascale-class HPC sys- tems, maintaining error resilience during the execution of long-running scientific applications is critical. Techniques that enable error detection and correction with- outburdeningtheapplicationperformancewillbeimportantfortheseapplications. This chapter presented a scheme for application-level fault detection and correc- tion based on redundant multithreading (RMT). We described the language-level directive for scoping the computation into spheres of replication by the application developers. This provides mechanisms to detect and potentially correct errors at the application-level for regions of computation and for program data variables that are deemed significant towards application correctness. The compiler front- end outlines these code blocks as well as provides a compiler-runtime interface that is designed to support opportunistic execution of the code blocks using RMT. We also described the design and implementation of the runtime system that contin- uously learns about the fault-tolerance state of the execution environment. This prepares the runtime system to dynamically enable/disable the redundant multi- threading to match the vulnerability of the execution environment. We also explored a lazy detection strategy in which the application’s primary computation is prioritized over the duplicate computation. This is accomplished by relaxing the requirement that the redundant threads synchronize and com- pare results following completion of each structured code block. We have pre- sented concrete examples of widely used scientific computational kernels where our opportunistic application-level error detection/correction strategy is effective. 
Our results demonstrate that the combination of the language-level directive, which is tightly integrated with the compiler infrastructure and runtime system in the implementation of redundant multithreading, provides error detection and correction with substantially lower overheads to application performance than the use of macroscale redundancy.

Chapter 6

An Introspective Runtime Framework for Resilience

6.1 Overview

Historically, the primary objective of designers and users of HPC systems has been to maximize the application's performance. However, the emerging HPC system architectures are likely to be quite different from those currently in use. The compute nodes in future exascale-class systems are predicted to evolve as they strive to satisfy performance, productivity, reliability, and energy efficiency in the face of divergent computational requirements. The compute nodes will be increasingly complex as the number of cores per socket continues to grow, the cores become more heterogeneous, employ deeper memory hierarchies, and are constructed from less reliable components [7]. Therefore, achieving application scalability and performance will be a multidimensional challenge. However, the conventional solutions that seek to address these issues often contradict each other: strategies that optimize the application performance might adversely affect resilience or energy efficiency, and vice-versa. For example, dynamic voltage and frequency scaling (DVFS) solutions modulate the power consumption based on the application's activity patterns. Placing processor cores in higher DVFS states increases the application performance for compute-bound phases of the application, but at the cost of increased power consumption; an increased frequency and reduced voltage have been known to increase the vulnerability of chips to transient errors [125]. Conversely, the price paid for reducing the power consumption (by scaling down the DVFS state) is a potential loss in application performance, but this also lowers the risk of transient faults. The use of redundant computation provides error detection/correction capabilities, but the additional computation increases the application's time-to-solution as well as the energy expended. Similarly, data redundancy schemes such as ECC and Chipkill protect the data integrity of DRAM memory lines, but incur overheads to data access latencies and increase the power consumption of the DRAM chips.

In order to achieve balance between performance optimizations and the issues of energy efficiency and resilience, we require agile solutions that can manage these often divergent objectives. The execution environments are critical intermediaries that can play a key role in managing these objectives. In this chapter we present an introspection-based runtime framework that observes and reflects upon hardware-based fault indicators. By understanding patterns in the fault events, the runtime reasons about and affects resource management decisions in the system; therefore, it enables the execution environment to exploit the trade-off space between application performance and resilience. We explore two introspection-based strategies: a resilience-aware DVFS approach and a resilience-aware thread assignment strategy.
Since the runtime system is well-positioned to understand the com- putational requirements of the HPC applications, we also use the Rolex extensions to integrate programmer-specified fault-resilience features of the application with the runtime’s introspection capability. The remainder of this chapter is organized as follows: Section 6.2 describes the goals and design trade-offs in the design of system software stacks on the compute nodes of HPC systems. Section 6.3 makes the case for a unified runtime system and explains the key features of our introspection-based runtime system. Sections 6.4 and 6.5 elaborate on the design and implementation of the resilience-driven introspection strategies. In Section 6.6 we detail the integration of the RoLex features from Chapter 4. Section 6.7 describes our experimental evaluation. 6.2 The State of the Art in HPC Execution Environments The execution environments on HPC compute nodes are constructed from lightweight system software stacks, which are designed with the bare minimum capabilities necessary to support the scientific application execution. They are constructed using lightweight operating system kernels (LWK) and complemen- tary runtime systems. Unlike full-featured OS distributions, the LWKs contain limited features: the compute nodes often contain no local disk, and therefore the kernel does not provide support for file system operations. Since the com- pute node is usually dedicated to a single application job, the OS is not required to support multiprogramming or context switching between multiple jobs. The OS does not provide capabilities for static memory allocation for RAM-disks and there is no provision for disk swap-space. The OS also does not contain daemons or 129 background monitoring processes. Additionally, the scheduling and resource allo- cation algorithms are deliberately simple. This permits short, predictable response times for system services and therefore maximizes the compute cycles and mem- ory resources made available for the application under execution. But since the lightweight kernels do not support monitoring services, the system requires out-of- band detection that is remotely managed. Due to the absence of native monitoring services, the OS services contain no feedback loop to actively respond to changes in the execution environment. The HPC execution environments are also supported by runtime systems that are built as libraries of routines. These routines provide the compiler with access to system-level management capabilities. The runtime systems support high-level language features for task and data parallelism, managing data namespaces, local- ity, etc. Therefore, the runtimes are tightly coupled to the language features and abstractions. The capabilities offeredby runtime systems include memory manage- ment (Jigsaw [126], TCMalloc [127]); support for creating parallel tasks and man- aging their scheduling, synchronization, and load-balancing (Charm [128], Scioto [129], DagUE [130]); communication in distributed memory systems (SHMEM [131], GASNet [132]); threading (qThreads [133]); and even power management (CPUMISER[134], GreenQueue[135]). However, theseruntimesystemsareinde- pendently designed with little interoperability among them. Therefore, rarely do theirimplementationsexplorethetrade-offspacebetweentheapplication’scompu- tational requirements and the performance, power and fault-tolerance implications of runtime decisions. 
The proponents of the lightweight system software stacks argue that the sim- plicity and reduced memory footprint due to the reduced number of software com- ponents reduces the likelihood of system software errors and increases the system 130 MTBF [136]. However, the implication of building a system software stack that is completely fault-oblivious is that all error states in the execution environment lead to catastrophic failure for the compute node and the application processes running on it. This will be problematic for future exascale-class systems, which will experience accelerated error rates from diverse sources. 6.3 Introspection-Based Runtime System Our introspection-based runtime system allows the execution environment to self-reflectandmakeobservationsonthevulnerabilityofthesystemresourcesusing hardware-based fault indicators. This enables the execution environment to make more reasoned decisions on resource management that balances the application’s resilience and performance goals. The runtime is also tightly integrated with the programming model to capture the computational requirements of the scientific applications. The key features of our introspection-based runtime system are: • Active monitoring: The runtime system actively logs fault events in the system and seeks to understand patterns to assess the vulnerability of the resources. • Integrated reasoning: The runtime examines the impact of the fault events in the system on the application resilience in relation to its performance and power consumption goals. • Tight integration and interoperatibility with Rolex language features: The application resilience knowledge and expectations of the programmer enable cooperation between the application and system management layers while making the resource management decisions. 131 We believe that an introspection-based runtime system that contains an active feedback loop will permit the dynamic reconfiguration of the execution environ- ment to enhance its error resilience and maximize the likelihood of the application converging to correct solution. In this section we describe the indicators that per- mit active monitoring of the fault vulnerability of the execution environment and the introspection algorithm. We also explain how these inferences can influence the resilience-oriented resource management. The discussion on the integration and interoperability with the Rolex extensions is deferred to Section 6.6. 6.3.1 Reflecting on Faults Studies on error characteristics in large-scale systems have noted that the dis- tribution patterns of error events tend to be skewed. At the node level, 5% of nodes account for more than 95% of all errors. Furthermore, many of the errors tend to be correlated rather than independent events, such that a single error raises the probability of future errors by more than 80%, and this probability rises to 95% after a handful of errors at the same location [137]. Factors such power supply variations, thermal variations caused by cooling mechanisms such as board-level fans, orvariationsinambienttemperaturesandpowermanagementfunctionscause errors in the same vicinity. Therefore, there is some spatial and temporal corre- lation between consecutive error events. Often the correlation between repeated error events is not immediately apparent unless carefully analyzed. The resource management strategies that are based on statistical models of failure mechanisms only use historical log analyses. 
Therefore, the resource management decisions tend to be based on probabilistic models. Because there is no feedback mechanism, these approaches cannot factor in exceptional error patterns, which may occur due to variations in operating conditions.

Our introspection-based runtime system relies on a variety of hardware-level fault indicators, including the error detection and reporting features available in the ACPI APEI [124] specification. Examples of such events are ECC errors on DRAM DIMMs, PCI bus parity errors, cache ECC errors, DRAM scrubbing notifications, TLB errors, memory controller errors, etc. The errors from all subsystems are communicated to the host firmware or to the operating system directly via the signaling interface. Based on the contents of the error registers, the severity of the anomalous event is analyzed. The firmware creates detailed error logging information and notifies the OS of the occurrence of the event. The notification is also logged by the introspection-based runtime system. The runtime also logs errors that are discovered by the Rolex mechanisms.

6.3.2 Algorithm for Self-Reflection

In order to understand the correlation between fault events on a compute node, we model the fault events as a time series. The objective of such a time-series analysis is not to create an accurate fault event forecasting/prediction model, but rather to create a feedback loop for the introspective runtime framework. The algorithm permits the runtime system to understand the vulnerability of resources. The observed trends in fault occurrences guide the resource management decisions in order to enhance the resilience of the execution environment and the application processes.

We apply an exponential moving average (EMA) to the time intervals between fault events to model the trend in fault intervals. The EMA assigns exponentially decreasing weights as the fault instances get older. The exponential smoothing is given by the formula:

x_t = α · f_t + (1 − α) · x_{t−1}    (6.1)

where α is the smoothing factor in the range (0, 1). The value x_t is a weighted average of the latest fault interval f_t and the previous smoothed statistic x_{t−1}. The moving-average model conceptually provides a linear regression over the fault intervals, and it filters the effects of noise introduced by random bursts of errors. The objective of this introspective analysis is to understand the correlation between the stream of error events in the system and to generate an appraisal of the reliability profile of individual resources. Through this real-time assessment, the runtime makes resource assignment decisions that seek to minimize the impact of vulnerable system resources on the application program state.

6.4 Resilience-Aware Scheduling

Thread assignment strategies seek to enable concurrent execution of several threads and to maximize the performance of the system. The various policies dynamically adapt the thread assignment to match the varying computational behavior of the threads over time. The policies rely on metrics such as instructions per cycle (IPC) to quantify phases of thread behavior [138]; these metrics drive the dynamic thread-to-core assignments.

Various reliability-driven policies have previously been studied. These techniques are designed and implemented at the architecture level to remain transparent to the system software.
Their primary objective is to assign threads so as to minimize the impact of errors due to process variations and manufacturing defects [139] [140] [141], or to mitigate the effects of wear-out-related stress-induced errors [142] [143]. These approaches rely on on-chip thermal sensors to determine the impact of process variations, or to estimate the impact of wear-out-related stresses.

6.4.1 Fault-Aware Thread Assignment Based on Runtime Introspection

Our proposed thread assignment strategy is based on the runtime introspection-based algorithm that models the fault events as a time series. When a processor core experiences a sequence of correlated errors, the application thread assignment is modified such that the threads are distributed among the remaining healthy cores. In the context of an SMP-based multiprocessor node, this results in an unbalanced thread schedule that creates a heterogeneous or asymmetric thread-to-core assignment. However, from the perspective of an HPC application, this strategy yields opportunistic fault avoidance that reduces the probability that the application's state experiences these faults.

Since the search space of multithreading schedules is very large, we rely on the runtime introspection algorithm to learn about the sources of and intervals between fault events. Our fault-aware thread assignment is therefore based on the vulnerability assessment of each processor core in the compute node. The runtime migrates to a schedule that minimizes the faults affecting the application program state. This is accomplished by effecting adjustments to the current schedule.

Both the performance and the fault resilience of the application are important considerations for the scheduler. Therefore, the runtime dynamically adapts during the application execution to apply the assignment that provides the maximum performance for the lowest predicted fault rate. The resulting unbalanced schedule causes an uneven distribution of threads among the processor cores, but potentially minimizes the faults experienced by the application threads.

While designing our resilience-based scheduling strategies, we assume that the thread scheduling is managed by an operating system scheduler that decides the assignments and dynamically balances them among the cores in the system. We also assume that all application threads are of equal weight in terms of priority, and that the cost of thread migration is amortized over the duration of the program execution, since the introspection algorithm does not initiate thread migration at every scheduling quantum.

Using the introspection algorithm, the vulnerability of the individual cores at any instant during the execution of the application is evaluated as a function of the intervals between previous fault events experienced by the individual cores. This quantifies an instantaneous MTTF based on the weighted average of the previous MTTF for the cores. When the instantaneous MTTF for a processor core exceeds the EMA, the runtime redefines the allocation bound to exclude the faulty processor core. This creates unbalanced thread assignment schedules that isolate the vulnerable processor cores that experience frequent fault events, while oversubscribing the remaining healthy cores. The faults we consider are transient; therefore, the thread mapping decisions are periodically reassessed and, based on the present state of the cores, the thread assignment is reconfigured. A small sketch of this per-core bookkeeping is given below.
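The sketch assumes per-core tracking of the intervals between logged events, smoothed with Equation (6.1); the structure, function names and smoothing factor are illustrative, and the comparison that tags a core as vulnerable (or returns it to service) follows the rule described in the surrounding text.

#define MAX_CORES 16

struct core_vuln {
    double ema;          /* smoothed interval between events on this core, x_t */
    double last_event;   /* timestamp of the most recent event on this core    */
    int    seen;         /* nonzero once at least one interval has been seen   */
};

static struct core_vuln cores[MAX_CORES];
static const double alpha = 0.3;     /* smoothing factor in (0, 1), illustrative */

/* Called for every fault notification that is attributed to a given core. */
void core_on_event(int core, double now)
{
    struct core_vuln *c = &cores[core];
    if (c->seen) {
        double f = now - c->last_event;                 /* latest interval f_t */
        c->ema = alpha * f + (1.0 - alpha) * c->ema;    /* Equation (6.1)      */
    } else {
        c->seen = 1;
    }
    c->last_event = now;
}

/* The runtime periodically compares a core's current observation against this
 * smoothed value to decide whether to exclude the core from, or return it to,
 * the allocation domain (Sections 6.4.1 and 6.4.2).                           */
double core_smoothed_interval(int core)
{
    return cores[core].ema;
}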
When the error interval of a previously tagged vulnerable core falls below the average, it is brought back into service and assigned program threads once again. 6.4.2 Implementation We use the Pthreads model to execute the program threads. Although the thread scheduling is managed by an OS-level scheduler, the Pthreads standard defines a thread-scheduling interface that allows programs to control how threads share the available processor cores in a SMP system. The API permits the priority and core affinity of individual threads to be controlled at the user level. 136 The introspective runtime system maintains the allocation domain of all threads, which at the time of application initialization includes all the CPU cores in the SMP. Based on the vulnerability assessment provided by the introspection algorithm, the runtime system grows or shrinks the allocation domain to include only the healthy cores. Theruntimelibrary interface provides routinestomodifytheallocation domain for the threads. The library routines modify the affinity mask for the application threads using the pthread_setaffinity_np() routine in the Pthread standard. ThisroutinesetstheaffinitymaskofthethreadstothegroupofCPUcorespointed to by cpuset. The call to the routine also migrates the threads that are currently mapped to CPU cores outside the allocation domain to a new core that falls within the newly modified domain. The pthread_getaffinity_np() routine is available to query the current CPU affinity mask for each thread. The kernel handler for the system-level fault indicators passes the signals into the user space to the runtime system which logs the fault events. When the fault introspection algorithm indicates that a specific core needs to be excluded from the current allocation domain, the runtime library routine is invoked and the runtime system also signals the user application process. The allocation domain is modified by the runtime through the use of internal control variables (ICV). 137 6.5 Resilience-Aware Dynamic Voltage and Frequency Scaling 6.5.1 The Dynamic Voltage and Frequency Scaling Solution DVFS is a technique that seeks to lower the power drawn by the system by placing the processor(s) in lower voltage/frequency states. The price paid for low- ering power by scaling down the clock frequency is a potential loss in application performance. Similarly, placing the processors in higher DVFS states increases the performance. In fact, there is a linear relationship between power consumption and apparent application performance [144] for the compute-bound phases of the application execution. DVFS is also used to reduce the energy cost of the applica- tion execution. The selection of the DVFS state that offers the lowest power-delay product permits reduction of the energy cost without compromising the applica- tion’s performance. The key challenge in the use of DVFS is the selection of phases during applica- tion execution when the voltage/frequency scaling should be applied. For example, placing the processors in lower DVFS states during application phases which are memory-bound, i.e., the CPU is idle and waiting for data transfer between levels of the memory hierarchy, reduces the power drawn. Therefore, the success of DVFS solutions is strongly dependent on matching the application characteristics to the DVFS states. For specific phases of the application execution, we can statically assign fre- quency states [145]. 
Such techniques are based on compiler analyses to identify code phases on which to apply DVFS [146] [147], while user-level APIs are also 138 available [148] to enable the application developers to define the policies for their applications. DVFS policies can also be inferred by automatic online identifi- cation of the phases of the application based on computational signatures [149] such as cache behavior, CPU activity, etc., that are obtained from representative microbenchmarks. The online learning is usually based on system-level indica- tors such as [150], [151]. Phases suitable for applying DVFS scaling can also be identified through binary analysis [135] and tracing [152]. 6.5.2 Reliability Impact of DVFS While the conventional objectives of DVFS are minimization of the energy consumed during the application execution by regulating the power-delay product to the application’s computation and memory access characteristics, the voltage- frequency scaling also impacts the resilience of the system to errors. Studies have indicated that the use of DVFS causes the fault rate to increase linearly with the frequency scaling (with fixed voltage), whereas reducing the supply voltage for lower frequency results in exponentially increased fault rates [125]. When the supply voltage is scaled down, it makes the device nodes more sensitive to soft errors [153]. Studies project a five-fold increase in system error rates when transistor devices are operated at near threshold voltage (NTV) [154]. Also, the reliabilityofthedevicesdegradesoverlongerperiodsoftime. Forexample, ahigher supply voltage at higher temperatures for short periods of time also accelerates the process of device aging and leads to gradual failures [155]. Analytical and simulation studies that characterize spatial and temporal varia- tions in the thermal features across chips caused by constant voltage and frequency changes show an increased vulnerability to transient errors [153]. This leads to a 139 rapid increase in the failure probability of a processor core during phases with higher DVFS states [156]. 6.5.3 Resilience-Driven Policies for DVFS We propose a methodology that applies the DVFS based on the expected impact on application resilience. The key idea is to tune the DVFS state based on monitoring the error rate for the processor core during application execution. From the application perspective, this regulates the probability that the applica- tion experiences errors. We use our runtime introspection-based algorithm to guide the transition between DVFS states. The introspection-based policy seeks to balance the per- formance and fault resilience of the application. Therefore, the runtime system dynamically scales the DVFS states during the application execution. We assume that there is a linear relationship between the DVFS state and the fault rate in the system. Accordingly, we present two DVFS policies: 1. Living dangerously: The runtime raises the DVFS states which effec- tively maximizes the performance of the application process; however, it also increases the probability that the application experiences a proportionally higher fault rate. 2. Risk aversion: In this mode, the CPU frequency is scaled down to a con- servative level which potentially lowers the performance of the application process but also minimizes the faults experienced by the application program state. 
When the runtime introspection algorithm observes application phases during which few faults are observed, it lives dangerously by raising DVFS states. When 140 the fault event rate is higher than the recent moving average over a period of time, the runtime places the processor cores in a more conservative risk aversion mode. 6.5.4 Implementation The runtime system uses the cpufreq-utils [157] package to manage the fre- quency scaling. The cpufreq-utils API permits modulation of the frequency but not the voltage levels when transitioning between DVFS states. This utility is available on several Linux distributions. Since DVFS is a privileged operation, there are kernel-level governors that handle the scaling requests. The cpufreq-utils provides various pre-defined DVFS policies: • User-space: The CPU frequency is controlled through user input. • On-demand: CPU frequency is scaled based on load. • Conservative: CPUfrequencyisscaledbasedontheloadinincrementalsteps up and down, but the CPU is allowed to stay longer at each frequency step. • Power-save: CPU frequency is set to run at minimum frequency state regard- less of load. • Performance: CPU only runs at maximum available frequency state regard- less of load. Since we would like more fine-grained control on the DVFS states, we have extended the runtime library interface to include routines that select the scaling policy. We have also developed a kernel module that receives the scaling requests from the runtime system and acts upon them by issuing frequency change requests to the CPU cores. 141 6.6 Integrating the Rolex Programming Model Figure 6.1: Integration of Rolex with introspective runtime system In Chapter 4 we demonstrated the use of Rolex to create a programming model in which the fault-resilience features may be embedded in the application source code. This permits scientific application developers to explicitly express their error management knowledge to the runtime system. The runtime system utilizes the programmer knowledge to provide error resilient operation for the application pro- cesses. This section explores whether the runtime introspection capability may be enhanced using the knowledge provided by the resilience features expressed through Rolex in order to provide better resource management policies. There- fore, the policies are informed by the vulnerability trends of resources and the algorithmic resilience features of the application code. This section also describes 142 the extensions to the compiler-runtime interface that are necessary to facilitate the knowledge transfer from the application code to the runtime system. 6.6.1 Leveraging Rolex Features The Rolex extensions provide mechanisms for error detection, containment and amelioration. These static language-based annotations provide the basis for man- agingtheimpactoferrorstatesontheapplicationoutcomeusingcompilerandrun- time techniques. The Rolex extensions enable the runtime to identify application phases and describe their reliability characteristics. This Rolex-based knowledge is maintained in the DRM in the runtime system. While the introspection algorithm provides various fault-aware resource man- agement policies based on hardware-based indicators, the knowledge base of the application’s resilience features may be leveraged to assert more fine-grained con- trol on the resource management policies. This enables cross-layer resilience man- agement in the execution environment. 
The overview of the introspection system that includes DRM knowledge base is illustrated in Figure 6.1. The tolerance-based extensions allow certain errors in the application data vari- ables and computations to be absorbed. For such errors, the runtime ignores their presence and allows the application execution to resume. The tolerance directives even support limited localized rollback recovery. When such tolerant phases of the application execution are explicitly communicated to the runtime system, the resource management policy decisions may be deferred. When a tolerant appli- cation phase is under execution, the modified DVFS policy allows the high risk living dangerously DVFS mode to persist. This is based on the insight that further errors will be absorbed by the tolerance-based extensions. The runtime may also initiate the living dangerously mode when the code region is entered and revert to 143 a conservative DVFS state upon exit. For application phases where such knowl- edge is not available, the runtime executes the application phase in a conservative DVFS state. The application phases identified by the tolerance extensions may also be used to affect the thread scheduling. When the application execution is in the midst of a tolerant phase, the runtime may resist modification of the thread allocation bound. By resisting thread migration, which potentially reduces the application performance, the runtime persists with the application threads, whose computation is algorithmically error tolerant, on vulnerable processor cores. The robustness-based extensions support error detection and correction at the application level. These capabilities are supported through the use of redundancy in the data variable state or through limited redundant execution. The policy for modification of the allocation bound of the application threads is based on whether the computation is redundantly executed, i.e., based on whether error detection/correction is available. When the redundant execution is performed by a trailing thread, the runtime redefines the allocation bounds to enable the computation to be performed on a healthy core. However, when the redundant execution streams are assigned to separate processor cores, the runtime persists withtheexistingthreadassignment; theruntimereliesontheRolexmechanismsto detect and correct any potential error states. The availability of Rolex-based error detection/correction, which is based on redundant execution, also guides the policy to modulate the DVFS states. The runtime chooses to accelerate the execution of one of the redundant computations by raising the DVFS state of the processor cores. The DVFS states are raised for all of the redundant computations while the comparison of the output values is performed at a conservative DVFS level. The amelioration-based extensions are supported by application-level methods to heal the impact of errors in the program state. The availability of explicit fault 144 amelioration knowledge allows the introspective runtime system to intervene in the decision to modify the allocation bounds for the application threads. The run- time system may initiate thread migration following roll-forward/roll-back. This allowsthehealroutineandthesubsequentcomputation(withamelioratedstate)to resume on a healthy core. The presence of amelioration knowledge for application constructs in the runtime DRM also guides the DVFS policies. 
The runtime may choose to live dangerously in terms of the DVFS state for application phases for which recovery routines are available. When recovery is performed, the recovery computation itself may be executed in a risk-averse DVFS state.

6.6.2 Extending the Compiler-Runtime Interface

In order to provide the introspective runtime system with the Rolex-based application-level knowledge, we also extend the Rolex compiler-runtime interface. This interface includes routines that are visible only to the compiler front-end framework and cannot be invoked through simple function calls in the user's code. The compiler front-end applies these routines to functions outlined from the pragma-structured blocks. The routines are invoked prior to entry into the outlined functions and communicate the resilience property of the application phase to the runtime system. Since the runtime uses fault introspection to quantify the vulnerability of the processor cores, the resilience properties of the application phase communicated by these routines serve as hints rather than assertions for the resource management policies.

The interface routines __rolex_dvfs_tolerant_start() and __rolex_dvfs_tolerant_end() inform the runtime system about the specific bounds of the tolerant regions. During the execution of these phases, the runtime may choose to raise the DVFS state to enable the application execution to live dangerously. Similarly, the compiler-runtime interface includes __rolex_dvfs_robust_start() and __rolex_dvfs_robust_end() to specify the scope of the application phases where redundancy-based error detection semantics are available. The routines __rolex_dvfs_robust_compare_start() and __rolex_dvfs_robust_compare_end() identify the phases in which the results of the redundant state are compared; the comparison and majority voting of the redundant outputs usually warrant more conservative, risk-averse DVFS states. We also provide __rolex_dvfs_healable_start() and __rolex_dvfs_healable_end() for regions for which amelioration methods are available, and __rolex_dvfs_repair_start() and __rolex_dvfs_repair_end() to communicate the start and end of the computation that heals any corruptions in the application program state.

Similarly, for the thread scheduling we provide the routines __rolex_sched_*_start() and __rolex_sched_*_end(), where * ranges over tolerant, robust, compare, healable and repair. These routines indicate phases for which the runtime may initiate or modify thread scheduling decisions based on the type of resilience knowledge available for the application phase.

6.7 Experimental Evaluation

6.7.1 Fault Injection Framework

The types of fault events that the runtime observes and uses to quantify the vulnerability of the execution environment include corrected error notifications and uncorrected errors that are recoverable by OS intervention. We have developed a software-based fault injection framework that runs as a process independent of the target application program's process. The faults are delivered to the application process via signals. The advantage of this standalone fault injection methodology is that it avoids intrusive injection, which would require modifying the application program code or inserting compiler-generated instructions that perform the injection. The fault injection framework does not interfere with the application execution between injection intervals.
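To make this methodology concrete, the following minimal sketch shows the general shape of such a standalone injector process and of the notification handler installed by the runtime; the random-interval logic and the function names are simplifications for illustration rather than the exact implementation.

```c
#include <sys/types.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Injector side: runs as an independent process and delivers fault
 * notifications to the target application at random intervals. */
void inject_faults(pid_t target, int count, unsigned max_interval_us) {
    for (int i = 0; i < count; i++) {
        usleep(rand() % max_interval_us);   /* random injection interval */
        kill(target, SIGUSR1);              /* fault event notification  */
    }
}

/* Runtime side: installed when the runtime library initializes; the
 * handler only counts and records the notification, so no aspect of
 * the application program state is perturbed. */
static volatile sig_atomic_t fault_events_seen = 0;

static void fault_notification_handler(int sig) {
    (void)sig;
    fault_events_seen++;
}

void install_fault_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fault_notification_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}
```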
It generates the fault events by sending a USR1 signal to the application process. The runtime system, which is linked to the application process, contains the interrupt handler for this signal; we thereby simulate the fault event notification interrupt and its handling. For the DVFS-based experiments, we modulate the fault rate based on the operating frequency. Since the fault events serve as mere notifications, they do not perturb any aspect of the application program state. The fault events are randomly generated during the application execution. The handler contained in the runtime is a blocking routine that catches and logs the fault event notification. The fault injection framework can generate the fault notification signals at arbitrary instants and intervals during the execution of the application process.

Figure 6.2: Results: Fault-aware thread scheduling

6.7.2 Evaluation of Resilience-Aware Thread Assignment

We perform a quantitative evaluation of the introspection-based thread scheduling using simulated fault injection. For these experiments, we use a dual-socket 8-core node. The fault injections are concentrated on a single core at a time, and we target at most two cores over the course of the application run, i.e., the compute node capacity is degraded by 25%. The results of our experiments with fault event injections demonstrate that the presence of faults can have a significant impact on the application's time-to-solution. Figure 6.2 compares the application performance overhead incurred with the introspection-based thread scheduling. The benchmark application for these experiments is the HPCC Random Access. The fault traces contain many instances of multiple fault events, and we inject up to 100 faults per execution run. When the algorithm operates at an accuracy of 10%, cores are repeatedly removed from and brought back into the allocation bound; the resulting thread migrations introduce significant overhead to the application's time-to-solution. At an accuracy of 90%, the cores that experience the faults are taken out early in the execution.

As more faults are introduced into the system, the performance drops significantly at two distinct points where the targeted core is taken out of the allocation bound for the application threads, as shown in Figure 6.2. Beyond a certain point, however, further faults in the trace have no impact on the application's time-to-solution, since the faulty cores have been removed from the affinity mask, application threads have been migrated away, and no newly created threads are assigned to these cores.

6.7.3 Evaluating Resilience-Aware DVFS

For the evaluation of the resilience-driven DVFS that is managed through the runtime introspection algorithm, we also use an Intel 8-core compute node. The DVFS states available through cpufreq-utils are in the range of 800 MHz to 2.4 GHz. We therefore select 1.6 GHz as the standard operating frequency and assign it a default fault rate. The fault rate is modulated linearly: it is raised when the experiment raises the DVFS state to 2.4 GHz for the living dangerously mode of operation and lowered when the DVFS state is lowered to 800 MHz for the conservative mode. We evaluate the effect of CPU speed variation by placing the cores in the system in the highest available DVFS state, i.e., the application execution lives dangerously upon initialization.
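A minimal sketch of one way this linear modulation could be realized by the injector is shown below; the proportionality to the operating frequency and the 1.6 GHz baseline are our assumptions about how the modulation may be expressed, not the exact formula used.

```c
/* Illustrative linear model for the injected fault-event rate as a
 * function of core frequency; the baseline (the default rate at the
 * 1.6 GHz standard operating frequency) and the proportionality are
 * assumptions for illustration. */
double injected_fault_rate(double freq_mhz, double default_rate_per_sec) {
    const double default_freq_mhz = 1600.0;
    return default_rate_per_sec * (freq_mhz / default_freq_mhz);
}
```

Under this model, the 2.4 GHz living dangerously state injects 1.5 times the default rate and the 800 MHz conservative state injects half of it.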
The introspection algorithm is allowed to modulate the DVFS states based on the observed fault rate, and we measure the impact on the application's time-to-solution.

Figure 6.3: Results: Fault-aware dynamic voltage frequency scaling

We observe that at a low fault rate the application is allowed to execute at a riskier DVFS state and converges to the solution faster than at the moderate DVFS state, despite the elevated fault rate.

6.7.4 Evaluation of Rolex-Based Runtime Introspection

For the HPCC Random Access benchmark, we allocate the HPCC Table array using the tolerant version of malloc and place the loop that performs the pseudorandom updates on the table inside the tolerant directive. Accordingly, the runtime allows the application to run in the living dangerously DVFS mode for a majority of its execution time.

Figure 6.4: Results: Fault-aware thread scheduling with Rolex

Figure 6.4 compares the introspection algorithm at a 90% accuracy level with the effect of exposing the application's resilience phases to the introspective runtime through the Rolex extensions. Owing to the resilience features of HPCC Random Access, the runtime system resists reducing the allocation domain during the application execution despite the presence of a high fault rate. The application performance overhead is therefore much lower than with the use of the introspection algorithm alone.

The collaboration between the Rolex-identified application phases and the introspection on the fault indicators allows the DVFS state to be modulated more intelligently. Since the HPCC Random Access benchmark is inherently resilient, and since the Rolex feature explicitly places the pseudorandom update phase of the application inside the tolerant directive, the runtime system maintains the DVFS state at a riskier level, exposing the application program state to a higher fault count.

Figure 6.5: Results: Fault-aware dynamic voltage frequency scaling with Rolex

6.8 Summary

In this chapter we demonstrated an introspection-based runtime system whose objective is to seek a balance between the reliability and performance targets of the application. We demonstrated an algorithm that uses trend analyses of hardware-based fault indicators to quantitatively assess the vulnerability of the system resources. These analyses are used to guide resource management decisions, including strategies for resilience-aware thread scheduling and dynamic voltage and frequency scaling. The introspection-based thread scheduling approach isolates the processor cores that experience a sustained stream of fault events; the runtime system manages the thread assignment in the presence of this enforced heterogeneity between processor cores. We also demonstrated resilience-aware policies for DVFS that are informed by the introspection-based algorithm and permit the application to live dangerously. Finally, we leveraged the Rolex programming model to explicitly identify the resilience features of the application phases, which allows more fine-grained control over the introspection-based resource management decisions.

Chapter 7
Conclusions and Future Work

Advances in high-performance computing capabilities enable research and discovery in a variety of scientific and engineering disciplines through the modeling and simulation of complex systems and through data analysis. More elaborate, multiscale simulations and advanced data analyses can be accomplished by using next-generation exascale-class capability systems. The performance demands will require an ever-increasing number of compute and memory elements.
Unfortunately, the inherent unreliability of semiconductor devices due to transistor feature size scaling will be amplified by the sheer scale of exascale-class systems. Therefore, long-running scientific applications that use these systems will need to contend with accelerated rates of faults and errors. Yet the execution environments in current HPC systems are designed to be fault oblivious.

This dissertation seeks to address this resilience challenge and presents an approach that incorporates resilience capabilities in the execution environment. Our approach is based on resilience-oriented programming model extensions that are tightly coupled with a compiler infrastructure and an introspective runtime system. We summarize the contributions of this approach in Section 7.1. However, this work is an initial step in the creation of a fault-aware and fault-tolerant execution model for future exascale-class HPC systems. Based on the techniques developed in this work, there are several aspects of the programming model, the compiler infrastructure and the runtime framework that may be extended in order to investigate further optimizations in the trade-off space between HPC application performance, resilience, and power and energy efficiency, as well as to complement other resilience strategies. In Section 7.2 we briefly describe the most promising directions for future work.

7.1 Contributions

7.1.1 Resilience-Oriented Programming Model

HPC application developers understand the fault-resilience features of their application codes, but lack portable, precise and succinct mechanisms for describing these resilience properties. We present Rolex, a set of language extensions for C/C++ that facilitate the specification of error detection, containment and recovery properties as intrinsic features of scientific application codes. These extensions allow the runtime system to reason about the significance of each error instance to the application outcome. We demonstrated the viability of this approach for a range of scientific application codes under accelerated fault injection rates. For codes such as the HPCC Random Access, we demonstrated that as many as 98% of injected errors can be tolerated using Rolex. For matrix-based computations such as the DGEMM code, a simple checksum method affords the detection and correction of up to 75% of errors that would otherwise lead to a fatal application crash.

7.1.2 Adaptive Redundant Multithreading for Error Detection and Correction

We developed a replication-based strategy that provides application-level fault detection and correction. We extended the semantics of the robust Rolex directive to enable redundant multithreading on the structured blocks in the application code. We implemented the source-to-source compiler transformations that enable transparent redundant computation for these programmer-scoped code blocks. This strategy is supported by a runtime system that signals the application to dynamically enable or disable the RMT; the runtime system makes this decision on the basis of a recent history of fault intervals. We presented concrete examples of scientific kernels where this adaptive strategy allows error detection and correction and examined the associated overhead cost.
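As a rough, purely illustrative sketch of that transformation (shown sequentially for brevity; in the implementation the trailing computation runs as a separate thread), a robust block protecting a kernel call could be lowered into something of the following shape, with all names and the pragma spelling hypothetical.

```c
/* Illustrative only: a programmer-scoped block such as
 *   #pragma rolex robust
 *   { y = kernel(x); }
 * is outlined and its computation duplicated, with the redundant
 * results compared before the value is committed. */
static double kernel(double x) { return x * x; /* stand-in computation */ }

int robust_kernel(double x, double *out) {
    double leading  = kernel(x);   /* leading execution stream    */
    double trailing = kernel(x);   /* trailing (redundant) stream */

    if (leading == trailing) {     /* a bitwise compare is also possible */
        *out = leading;            /* results agree: commit the value    */
        return 0;
    }
    /* Mismatch detected: a third evaluation provides simple majority
     * voting; re-execution of the whole block is another option. */
    double retry = kernel(x);
    *out = (retry == leading) ? leading : trailing;
    return 1;                      /* an error was detected */
}
```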
When the application-level detection/correction is applied so as to match the fault event rate in the execution environment, we achieve as much as 25% to 70% savings in performance overhead cost in comparison to complete redundancy-based approaches.

7.1.3 Introspective Runtime Framework for a Resilience-Aware Execution Model

The execution environments of HPC systems are neither fault-aware nor fault-tolerant. Since they do not contain a feedback loop to reason about correlations between error events, the execution environment cannot adapt itself to take proactive action for future fault avoidance. We designed an introspective runtime system that monitors system-level fault indicators and generates a vulnerability assessment of the system resources. Based on these evaluations, we demonstrated fault-aware dynamic resource management capabilities that include thread scheduling and dynamic voltage and frequency scaling. We also demonstrated how the Rolex features may be leveraged to enable the introspective runtime system to make better resource management decisions by exploiting the application's resilience features.

7.2 Recommendations for Future Work

7.2.1 Enhanced Programming Model Features

In Chapter 4 we presented our resilience-oriented language extensions, which include static type qualifiers and directives. These language extensions could be complemented by constructs that express the notion of a reliability lifetime, since the significance of various program-level constructs for application correctness is also a function of time. Language-level extensions that succinctly express error tolerance, containment and amelioration as a function of time would allow the runtime system to enforce resilient behavior for the constructs over specific intervals of time, rather than over the entire lifetime of the application process. Additionally, the language extensions could include predicates for data variables that detect and express the extent of data corruption. Language-level constructs that apply predicate conditions to data variables, and to the regions of computation that operate on them, would prevent wasteful computation and possibly initiate early recovery before the error propagates to other data variables in the program state.

7.2.2 Expansion of Runtime Management Capabilities

The integration of accurate energy metrics into the introspective runtime system presents an interesting path toward more intelligent runtime introspection. Since energy consumption is another significant concern for future exascale-class systems, the use of additional system-level indicators would enable the runtime system to explore resource management strategies that balance energy consumption against the need for resilient operation. For example, DVFS states are presently selected for application phases to maintain the power-delay product within certain bounds. In Chapter 6 we demonstrated a strategy for DVFS that seeks an optimal trade-off within the resilience-performance continuum. This strategy may leverage a variety of additional system-level indicators to expand the search space for DVFS settings that meet the energy and resilience goals of the application phases.

7.2.3 Integration with Checkpoint and Roll-back Libraries

While the combination of Rolex and the introspective runtime system can manage the application's resilience for certain fault models, there will always be scenarios where process failures are unavoidable.
The primary advantage of the lossy algorithmic amelioration approaches available through Rolex over checkpoint/roll-back recovery is that the former do not increase the computational cost when no failure occurs. However, occasional error states may require restoration of the program state through roll-back recovery. The frequency of checkpoint and recovery operations, which are expensive, may be modulated by the introspective runtime system. The incorporation of C/R library routines would provide the introspective runtime system with a richer choice of actions among techniques that ensure resilient operation for the HPC application. The runtime system may then select between recovery from disk and algorithmic amelioration based on the computational costs involved and the resilience requirements of the application code.

Reference List

[1] The Opportunities and Challenges of Exascale Computing. Technical report, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, 2010.
[2] Eugene Brooks. The Attack of the Killer Micros, Teraflop Computing Panel. 1989.
[3] MPI Forum. MPI: A Message-Passing Interface Standard. Technical report, Knoxville, TN, USA, 1994.
[4] Top500. Top500 supercomputer sites.
[5] M. Bohr. A 30 Year Retrospective on Dennard's MOSFET Scaling Paper. IEEE Solid-State Circuits Society Newsletter, 12(1):11–13, Winter 2007.
[6] Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney Maccabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. The International Exascale Software Project Roadmap. International Journal on High Performance Computing Applications, pages 3–60, February 2011.
[7] Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams, and Katherine Yelick. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical report, DARPA, September 2008.
[8] R. Baumann. The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction. In International Electron Devices Meeting (IEDM), pages 329–332, December 2002.
[9] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers. The impact of technology scaling on lifetime reliability. In International Conference on Dependable Systems and Networks, pages 177–186, June 2004.
[10] Shekhar Borkar. Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation.
IEEE Micro, 25(6):10–16, November 2005. [11] E.N.Elnozahy, Ricardo Bianchini, Tarek El-Ghazawi, Armando Fox, Forest Godfrey, Adolfy Hoisie, Kathryn McKinley, Rami Melhem, James Plank, Partha Ranganathan, and Josh Simons. System Resilience at Extreme Scale. Technical report, DARPA, 2008. [12] Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward exascale resilience. International Journal on High Per- formance Computing Applications, 23(4):374–388, November 2009. [13] Saurabh Hukerikar, Pedro C Diniz, and Robert F Lucas. Rolex: Resilience Oriented Language Extensions for Exascale Computing. Journal (under review), 2015. [14] Saurabh Hukerikar, Pedro C Diniz, and Robert F Lucas. A programming model for resilience in extreme scale computing. In Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on, pages 1–6, June 2012. [15] SaurabhHukerikar,PedroCDiniz,andRobertFLucas. Programmingmodel extensions for resilience in extreme scale computing. In Euro-Par 2012: Parallel Processing Workshops, volume 7640 of Lecture Notes in Computer Science, pages 496–498. Springer Berlin Heidelberg, 2013. [16] Saurabh Hukerikar, Pedro C Diniz, and Robert F Lucas. Robust graph traversal: Resiliency techniques for data intensive supercomputing. In High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pages 1–6, Sept 2013. 161 [17] Saurabh Hukerikar, Keita Teranishi, Pedro C Diniz, and Robert F Lucas. Application Level Fault Detection and Correction through Adaptive Redun- dant Multithreading. Journal (under review), 2015. [18] SaurabhHukerikar, PedroCDiniz, andRobertFLucas. ACaseforAdaptive Redundancy for HPC Resilience. In Euro-Par 2013: Parallel Processing Workshops, volume 8374 of Lecture Notes in Computer Science, pages 690– 697. Springer Berlin Heidelberg, 2014. [19] Saurabh Hukerikar, Keita Teranishi, Pedro C Diniz, and Robert F Lucas. Opportunistic application-level fault detection through adaptive redundant multithreading. In International Conference on High Performance Comput- ing Simulation (HPCS), pages 243–250, July 2014. [20] Saurabh. Hukerikar, Keita. Teranishi, Pedro C. Diniz, and Robert F. Lucas. An evaluation of lazy fault detection based on adaptive redundant mul- tithreading. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, Sept 2014. [21] Saurabh Hukerikar. Introspective Resilience for Exascale High Performance Computing Systems. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC14), November 2014. [22] Vivek Sarkar, Saman Amarasinghe, Dan Campbell, William Carlson, Andrew Chien, William Dally, Elmootazbellah Elnohazy, Mary Hall, Robert Harrison, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Charles Koelbel, David Koester, Peter Kogge, John Levesque, Daniel Reed, Robert Schreiber, Mark Richards, Al Scarpelli, John Shalf, Allan Snavely, and Thomas Sterling. Exascale Software Study: Software Challenges in Extreme Scale Systems. Technical report, DARPA, September 2009. [23] Daniel J. Sorin. Fault tolerant computer architecture. Synthesis Lectures on Computer Architecture, 4(1):1–104, 2009. [24] JvonNeumann. ProbabilisticLogicsandtheSynthesisofReliableOrganisms from Unreliable Components. Automata Studies, pages 43–98, 1956. [25] Adam Oliner and Jon Stearley. What supercomputers say: A study of five system logs. 
In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 575–584, 2007. [26] B. Schroeder and G.A. Gibson. A Large-Scale Study of Failures in High- Performance Computing Systems. IEEE Transactions on Dependable and Secure Computing, 7(4):337–350, 2010. 162 [27] John Daly, Bill Harrod, Thuc Hoang, Lucy Nowell, Bob Adolf, Shekhar Borkar, Nathan DeBardeleben, Mootaz Elnozahy, Mike Heroux, David Rogers, Rob Ross, Vivek Sarkar, Martin Schulz, Marc Snir, Paul Woodward, Rob Aulwes, Marti Bancroft, Greg Bronevetsky, Bill Carlson, Al Geist, Mary Hall, Jeff Hollingsworth, Bob Lucas, Andrew Lumsdaine, Tina Macaluso, Dan Quinlan, Sonia Sachs, John Shalf, Tom Smith, Jon Stearley, Bert Still, and Jon Wu. Report: Inter-agency workshop on hpc resilience at extreme scale. Technical report, Advanced Computing Systems, National Security Agency, February 2012. [28] N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, and B Harrod. High-EndComputingResilience: Analysisofissuesfacingtheheccommunity andpath-forwardforresearchanddevelopment. Whitepaper, December2009. [29] M.Snir,R.W.Wisniewski,J.A.Abraham,S.V.Adve,S.Bagchi,PavanBal- aji, J. Belak, P. Bose, F. Cappello, B. Carlson, Andrew A. Chien, P. Coteus, N. A. Debardeleben, P. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, Sriram Krishnamoorthy, Sven Leyffer, D. Liberty, S. Mitra, T. S. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. International Journal of High Performance Computing, 2013. [30] Fundamental concepts of fault tolerance. Proceedings of the 12th IEEE Inter- national Symposium on Fault-Tolerant Computing (FTCS-12), pages 3–38, June 1982. [31] Laprie J.C. Dependable computing and fault tolerance: Concepts and ter- minology. Proceedings of the 15th IEEE International Symposium on Fault- Tolerant Computing (FTCS-15), pages 2–11, June 1985. [32] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable Secure Computing, pages 11–33, January 2004. [33] William H. Pierce, editor. Academic Press, 1965. [34] A Avižienis, H. Kopetz, and J. C. Laparie. The Evolution of Fault-tolerant Computing. Springer-Verlag New York, Inc., New York, NY, USA, 1987. [35] Dhiraj K. Pradhan, editor. Fault-tolerant Computer System Design. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996. [36] Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems (3rd Ed.): Design and Evaluation. A. K. Peters, Ltd., Natick, MA, USA, 1998. 163 [37] Jerome H. Saltzer and M. Frans Kaashoek. Principles of Computer System Design: An Introduction. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2009. [38] O Serlin. Fault-tolerant systems in commercial applications. IEEE Com- puter, page 19âĂŞ30, August 1984. [39] Jean-Claude Laprie. Dependable computing: Concepts, limits, challenges. In Proceedings of the Twenty-Fifth International Conference on Fault-tolerant Computing, FTCS’95, pages 42–54, 1995. [40] Algirdas Avižienis. Toward systematic design of fault-tolerant systems. Com- puter, 30(4):51–58, April 1997. [41] Ifip 10.4 working group on dependable computing and fault tolerance. http: //www.dependability.org/wg10.4/. [42] Jon Stearley. Defining and measuring supercomputer reliability, availabil- ity, and serviceability (ras. In Proceedings of the Linux Clusters Institute Conference, 2005. [43] Andy A. Hwang, Ioan A. 
Stefanovici, and Bianca Schroeder. Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In Proceedings of the Seventeenth Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, pages 111–122, 2012. [44] Al Geist. What is the Monster in the Closet? Talk at Workshop on Archi- tectures I: Exascale and Beyond: Gaps in Research, Gaps in our Thinking, August 2011. [45] Vilas Sridharan and Dean Liberty. A Study of DRAM Failures in the Field. In Proceedings of the International Conference on High Performance Com- puting, Networking, Storage and Analysis, SC ’12, pages 76:1–76:11, 2012. [46] T. Austin, V. Bertacco, S. Mahlke, and Yu Cao. Reliable systems on unreli- able fabrics. IEEE Design Test of Computers, 25(4):322–332, 2008. [47] E.F. Moore and C.E. Shannon. Reliable circuits using less reliable relays. Journal of the Franklin Institute, 262(3):191–208, 1956. [48] Michael Litzkow and Miron Livny. Supporting checkpointing and process migration outside the UNIX kernel. In Proceedings of the Winter 1992 USENIX Conference, pages 283–290, San Francisco, CA, January 1992. 164 [49] J. Duell, P. Hargrove, and E. Roman. The design and implementation of berkeley lab’s linux checkpoint/restart. Technical report, Lawrence Berkeley National Lab (LBNL), December 2002. [50] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent check- pointingunder Unix. InUsenix Winter Technical Conference, pages213–223, January 1995. [51] A.M. Agbaria and R. Friedman. Starfish: Fault-tolerant Dynamic MPI Pro- grams on Clusters of Workstations. In Proceedings of The Eighth Interna- tional Symposium on High Performance Distributed Computing, pages 167– 176, 1999. [52] Jeremy Casas, Dan Clark, Phil Galbiati, Ravi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole. MIST: PVM with Transparent Migration and Checkpointing. In In 3rd Annual PVM Users’ Group Meeting, 1995. [53] Juan Leon, Allan L. Fisher, and Peter Steenkiste. Fail-safe pvm: A portable package for distributed programming with transparent recovery. Technical report, 1993. [54] Nitin H. Vaidya. A case for two-level distributed recovery schemes. In Pro- ceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 64–73, 1995. [55] Kathryn Mohror, Adam Moody, Greg Bronevetsky, and Bronis R. de Supin- ski. Detailed modeling and evaluation of a scalable multilevel checkpointing system. IEEE Transactions on Parallel and Distributed Systems, 99:1, 2013. [56] AdamJ.Oliner, LarryRudolph, andRamendraKSahoo. Cooperativecheck- pointing: a robust approach to large-scale systems reliability. In Proceedings of the 20th Annual International Conference on Supercomputing, pages 14– 23, 2006. [57] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Moreira. Adap- tive incremental checkpointing for massively parallel systems. In Proceedings of the 18th annual International Conference on Supercomputing, ICS ’04, pages 277–286, 2004. [58] Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bro- nisR.deSupinski, andRudolfEigenmann. Mcrengine: ascalablecheckpoint- ing system using data-aware aggregation and compression. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 17:1–17:11, 2012. 165 [59] J.S. Plank, K. Li, and M.A. Puening. Diskless checkpointing. 
IEEE Trans- actions on Parallel and Distributed Systems, 9(10):972–986, 1998. [60] Wei-Jih Li and Jyh-Jong Tsay. Checkpointing Message-Passing Interface (MPI) Parallel Programs. In Proceedings of Pacific Rim International Sym- posium on Fault-Tolerant Systems, pages 147–152, 1997. [61] G.Stellner. CoCheck: CheckpointingandProcessmigrationforMPI. InPro- ceedings of The Tenth International Parallel Processing Symposium, pages 526–531, 1996. [62] Graham E. Fagg and Jack Dongarra. FT-MPI: Fault Tolerant MPI, Sup- porting Dynamic Applications in a Dynamic World. In Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 346–353, 2000. [63] J. Hursey, J.M. Squyres, T.I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for open mpi. In IEEE International Symposium on Parallel and Distributed Processing, pages 1–8, 2007. [64] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Supercomputing, ACM/IEEE 2002 Conference, pages 29–29, 2002. [65] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lums- daine. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. In In Proceedings of LACSI Symposium, Sante Fe, pages 479–493, 2003. [66] E.N. Elnozahy and J.S. Plank. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing, 1(2):97–108, 2004. [67] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. John- son. A survey of rollback-recovery protocols in message-passing systems. ACM Computer Survey, 34(3):375–408, September 2002. [68] DennisMcEvoy. Thearchitectureoftandem’snonstopsystem. InProceedings of the ACM ’81 conference, New York, NY, USA, 1981. ACM. [69] D. Bernick, B. Bruckert, P.D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen. NonStop Advanced Architecture. In International Conference on Dependable Systems and Networks, pages 12–21, 2005. 166 [70] T.J. Slegel, III Averill, R.M., M.A. Check, and et. al. IBM’s S/390 G5 Microprocessor Design. IEEE Micro, pages 12–23, 1999. [71] C. Engelmann, H. H. Ong, and S. L. Scott. The Case for Modular Redun- dancyinLarge-scaleHighPerformanceComputingSystems. In International Conference on Parallel and Distributed Computing and Networks, pages 189– 194, February 2009. [72] J. Stearley, K. Ferreira, D. Robinson, and et al. Does Partial Replication Pay off? In IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), 2012. [73] R. Naseer and J. Draper. Parallel double error correcting code design to mitigate multi-bit upsets in srams. In 34th European Solid-State Circuits Conference, pages 222–225, 2008. [74] T.J. Dell. A white paper on the benefits of chipkill-correct ecc for pc server main memory. Technical report, IBM Microelectronics Division Whitepaper, November 1997. [75] A. Shye, J. Blomstedt, T. Moseley, V.J. Reddi, and D.A. Connors. Plr: A software approach to transient fault tolerance for multicore architectures. IEEE Transactions on Dependable and Secure Computing, pages 135–148, 2009. [76] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August. SWIFT: SoftwareImplementedFaultTolerance. 
InInternational Symposium on Code Generation and Optimization, 2005, pages 243–254, 2005. [77] Yun Zhang, Jae W. Lee, Nick P. Johnson, and David I. August. DAFT: Decoupled Acyclic Fault Tolerance. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10, pages 87–98, 2010. [78] Cheng Wang, H. Kim, Y. Wu, and V. Ying. Compiler-Managed Software- basedRedundantMulti-ThreadingforTransientFaultDetection. InInterna- tional Symposium on Code Generation and Optimization, 2007, pages 244– 258, 2007. [79] Kurt Ferreira, Jon Stearley, James H. Laros, III, and et al. Evaluating the viability of process replication reliability for exascale systems. In Pro- ceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2011. 167 [80] David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Fer- reira, and Ron Brightwell. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the Inter- national Conference on High Performance Computing, Networking, Storage and Analysis, pages 78:1–78:12, 2012. [81] Kuang-Hua Huang and J.A. Abraham. Algorithm-based fault tolerance for matrixoperations. IEEE Transactions on Computers, C-33(6):518–528, june 1984. [82] Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High performance linpack benchmark: a fault tolerant implementation with- out checkpointing. In Proceedings of the international conference on Super- computing, ICS ’11, pages 162–171, 2011. [83] D. Hakkarinen and Zizhong Chen. Algorithmic cholesky factorization fault recovery. In IEEE International Symposium on Parallel Distributed Process- ing, pages 1–10, 2010. [84] Jing-Yang Jou and Jacob A. Abraham. Fault-tolerant matrix operations on multiple processor systems using weighted checksums. pages 94–101, 1984. [85] J. Rexford and N.K. Jha. Algorithm-based fault tolerance for floating-point operations in massively parallel systems. In Proceedings of IEEE Interna- tional Symposium on Circuits and Systems, volume 2, pages 649–652 vol.2, 1992. [86] J.S. Plank, Youngbae Kim, and J.J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, pages 351–360, 1995. [87] A. Roy-Chowdhury and P. Banerjee. Algorithm-based fault location and recovery for matrix computations on multiprocessor systems. IEEE Trans- actions on Computers, 45(11):1239–1247, 1996. [88] Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault, and Jack Dongarra. Algorithm-basedFaultToleranceforDenseMatrixFactorizations. In Proc. of the 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 225–234, 2012. [89] J. Sloan, R. Kumar, and G. Bronevetsky. Algorithmic approaches to low overheadfaultdetectionforsparselinearalgebra. In Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on, pages 1–12, 2012. 168 [90] J. Sloan, R. Kumar, and G. Bronevetsky. An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance. In Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12, 2013. [91] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing. 
In Proceedings of the 20th international symposium on High performance distributed computing, pages 73–84, 2011. [92] GregBronevetskyandBronisdeSupinski. Softerrorvulnerabilityofiterative linear algebra methods. In Proceedings of the 22Nd Annual International Conference on Supercomputing, pages 155–164, 2008. [93] A. Mishra and P. Banerjee. An algorithm-based error detection scheme for the multigrid method. IEEE Transactions on Computers, 52(9):1089–1099, 2003. [94] A.L.N. Reddy and P. Banerjee. Algorithm-based fault detection for signal processing applications. IEEE Transactions on Computers, 39(10):1304– 1308, 1990. [95] Sying-Jyan Wang and N.K. Jha. Algorithm-based fault tolerance for fft net- works. InIEEE International Symposium on Circuits and Systems, volume1, pages 141–144 vol.1, 1992. [96] J.-Y. Jou and J.A. Abraham. Fault-tolerant fft networks. IEEE Transactions on Computers, 37(5):548–561, 1988. [97] S.YajnikandN.K.Jha. Synthesisoffaulttolerantarchitecturesformolecular dynamics. In Proceedings of the IEEE International Symposium on Circuits and Systems, volume 4, pages 247–250 vol.4, 1994. [98] Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong. A case for soft error detection and correction in computational chemistry. Journal of Chemical Theory and Computation, 9(9):3995–4005, 2013. [99] Marco Vassura, Luciano Margara, Pietro Di Lena, Filippo Medri, Piero Fariselli, and Rita Casadio. Ft-comar: Fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10):1313–1315, 2008. [100] Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, and Jack Dongarra. Post-failure recovery of mpi communication capability: Design and rationale. International Journal of High Performance Computing Applications, 27(3):244–254, 2013. 169 [101] Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon, Larry Kaplan, and Mattan Erez. Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems. In Pro- ceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 58:1–58:11, 2012. [102] Marc A. de Kruijf, Karthikeyan Sankaralingam, and Somesh Jha. Static analysis and compiler design for idempotent processing. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation, PLDI ’12, pages 475–486, 2012. [103] Marc de Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. Relax: an architectural framework for software recovery of hardware faults. In Proceed- ings of the 37th annual international symposium on Computer architecture, ISCA ’10, pages 497–508, 2010. [104] Gulay Yalcin, Osman Unsal, Ibrahim Hur, Adrian Cristal, and Mateo Valero. FaulTM: Fault-Tolerance Using Hardware Transactional Memory. In Work- shop on Parallel Execution of Sequential Programs on Multi-core Architec- ture, Saint Malo, France, 2010. [105] Hajime Fujita, Robert Schreiber, and Andrew A. Chien. It’s time for new programming models for unreliable hardware, provocative ideas session. In International Conference on Architectural Support for Programming Lan- guages and Operating Systems, 2013. [106] Patrick G. Bridges, Mark Hoemmen, Kurt B. Ferreira, Michael A. Heroux, Philip Soltero, and Ron Brightwell. Cooperative application/os dram fault recovery. In 4th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Bordeaux, France, September 2011. [107] J. Dinan, A. Singri, P. Sadayappan, and S. 
Krishnamoorthy. Selective recov- ery from failures in a task parallel programming model. In IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), pages 709–714, 2010. [108] J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iter- ative methods in a parallel unstable environment. SIAM Journal Scientific Computing, 30:102–116, November 2007. [109] Mark Hoemmen and Michael A. Heroux. Fault-tolerant iterative methods via selective reliability. Technical report, 2011. [110] Marc Casas, Bronis R. de Supinski, Greg Bronevetsky, and Martin Schulz. Fault resilience of the algebraic multi-grid solver. In Proceedings of the 26th 170 ACM International Conference on Supercomputing, ICS ’12, pages 91–100, 2012. [111] Anqi Zou, Tyson J. Lipscomb, and Samuel S. Cho. Single vs. double preci- sion in md simulations: Correlation depends on system length-scale. GPU Technology Conference, 2012. [112] Y. Aumann and M. A. Bender. Fault tolerant data structures. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, pages 580–589, Washington, DC, USA, 1996. IEEE Computer Society. [113] S. Yajnik and N.K. Jha. Synthesis of fault tolerant architectures for molec- ular dynamics. In IEEE International Symposium on Circuits and Systems, volume 4, pages 247–250, May 1994. [114] Piyush Sao and Richard Vuduc. Self-stabilizing iterative solvers. In Proceed- ings of the Workshop on Latest Advances in Scalable Algorithms for Large- Scale Systems, ScalA ’13, pages 4:1–4:8, 2013. [115] Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, Mawussi Zounon, Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, Mawussi Zounon, and Bordeaux Sud-ouest. Towards resilient parallel linear krylov solvers: recover-restart strategies. Technical report, INRIA, 2013. [116] Joseph Sloan, Rakesh Kumar, and Greg Bronevetsky. An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance. In Proceedings of the 2013 43rd Annual IEEE/IFIP Interna- tional Conference on Dependable Systems and Networks (DSN), pages 1–12, 2013. [117] Joseph Sloan, Rakesh Kumar, and Greg Bronevetsky. Algorithmic approaches to low overhead fault detection for sparse linear algebra. In Proceedings of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), DSN ’12, pages 1–12, 2012. [118] Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong. A case for soft error detection and correction in computational chemistry. Journal of Chemical Theory and Computation, 9:3995–4005, 2013. [119] Dan Quinlan et al. Rose Compiler. [120] Chunhua Liao, Daniel J Quinlan, Richard Vuduc, and Thomas Panas. Effec- tive source-to-source outlining to support whole program empirical optimiza- tion. pages 308–322, 2010. 171 [121] HPC Challenge Random Access. [122] Shubhendu S. Mukherjee, Michael Kontz, and Steven K. Reinhardt. Detailed Design and Evaluation of Redundant Multithreading Alternatives. SIGARCH Computer Architecture News, pages 99–110, May 2002. [123] L. V. Kale. The virtualization model of parallel programming: Runtime optimizations and the state of art. In LACSI, October 2002. [124] Advanced configuration and power interface (ACPI). http://www.uefi. org/acpi/specs, 2013. [125] D. Zhu, R. Melhem, and D. MosseÌĄ. The effects of energy management on reliability in real-time embedded systems. In IEEE/ACM International Conference on Computer Aided Design, pages 35–40, 2004. 
[126] Nathan Beckmann and Daniel Sanchez. Jigsaw: Scalable software-defined caches. In Proceedings of the International Conference on Parallel Architec- tures and Compilation Techniques, PACT ’13, pages 213–224, 2013. [127] S. Ghemawat and P. Menage. TCmalloc: Thread-Caching malloc. http: //goog-perftools.sourceforge.net/doc/tcmalloc.html. [128] B Ramkumar, AB Sinha, and VA Saletore. The charm parallel programming language and system: Part ii-the runtime system. 1994. [129] J. Dinan, S. Krishnamoorthy, D.B. Larkins, Jarek Nieplocha, and P. Sadayappan. Scioto: A framework for global-view task parallelism. In International Conference on Parallel Processing, pages 586–593, Sept 2008. [130] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, and Jack Dongarra. Dague: A generic distributed {DAG} engine for high performance computing. Parallel Computing, 38(1âĂŞ2):37 – 51, 2012. [131] Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. Introducing openshmem: Shmem for the pgas community. In Proceedings of the Fourth Conference on Parti- tioned Global Address Space Programming Model, PGAS ’10, pages 2:1–2:3, 2010. [132] Dan Bonachea. Gasnet specification, v1.1. Technical report, Berkeley, CA, USA, 2002. 172 [133] K.B. Wheeler, R.C. Murphy, and D. Thain. Qthreads: An api for program- ming with millions of lightweight threads. In IEEE International Symposium on Parallel and Distributed Processing,, pages 1–8, April 2008. [134] Rong Ge, Xizhou Feng, Wu-chun Feng, and Kirk W. Cameron. Cpu miser: A performance-directed, run-time system for power-aware clusters. In Pro- ceedings of the International Conference on Parallel Processing, pages 18–, 2007. [135] Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, and Allan Snavely. PMaC’s Green Queue: A Framework for selecting energy optimal DVFS configurations in large scale MPI Applications. Concurrency and Computation: Practice and Experience, 2013. [136] James H Laros III, Cynthia A Segura, and Nathan Dauchy. A minimal linux environment for high performance computing systems. 2006. [137] Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design. In Proceedings of the Seventeenth Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, pages 111–122, 2012. [138] Michela Becchi and Patrick Crowley. Dynamic thread assignment on hetero- geneous multiprocessor architectures. In Proceedings of the 3rd Conference on Computing Frontiers, CF ’06, pages 29–40, 2006. [139] J.DonaldandM.Martonosi. Powerefficiencyforvariation-tolerantmulticore processors. In Proceedings of the 2006 International Symposium on Low Power Electronics and Design (ISLPED), pages 304–309, October 2006. [140] Eric Humenay, David Tarjan, and Kevin Skadron. Impact of process varia- tions on multicore performance symmetry. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’07, pages 1653–1658, 2007. [141] R. Teodorescu and J. Torrellas. Variation-aware application scheduling and power management for chip multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 363–374, June 2008. [142] Abhishek Tiwari and Josep Torrellas. Facelift: Hiding and slowing down aging in multicores. 
In Proceedings of the 41st Annual IEEE/ACM Interna- tional Symposium on Microarchitecture, pages 129–140, 2008. 173 [143] Mehmet Basoglu, M. Orshansky, and M. Erez. Nbti-aware dvfs: A new approach to saving energy and increasing processor lifetime. In ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), pages 253–258, Aug 2010. [144] Marc Fleischmann. Longrun Power Management. 2001. [145] Rong Ge, Xizhou Feng, and Kirk W. Cameron. Improvement of power- performance efficiency for high-end computing. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005. [146] Chung-Hsing Hsu and Ulrich Kremer. Single region vs. multiple regions: A comparison of different compiler-directed dynamic voltage scheduling approaches. In Power-Aware Computer Systems, volume 2325 of Lecture Notes in Computer Science, pages197–211.SpringerBerlinHeidelberg, 2003. [147] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau. Profile-based dynamic voltage scheduling using program check- points. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’02, pages 168–, 2002. [148] Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, and K.W. Cameron. Powerpack: Energy profiling and analysis of high- performance systems and applications. Parallel and Distributed Systems, IEEE Transactions on, pages 658–671, May 2010. [149] Michael A. Laurenzano, Mitesh Meswani, Laura Carrington, Allan Snavely, Mustafa M. Tikir, and Stephen Poole. Reducing energy usage with memory and computation-aware dynamic frequency scaling. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part I, Euro- Par’11, pages 79–90, 2011. [150] Kihwan Choi, R. Soma, and M. Pedram. Dynamic voltage and frequency scaling based on workload decomposition. In Proceedings of the 2004 Inter- national Symposium on Low Power Electronics and Design, pages 174–179, Aug 2004. [151] G. Dhiman and T.S. Rosing. Dynamic voltage frequency scaling for multi- tasking systems using online learning. In ACM/IEEE International Sympo- sium on Low Power Electronics and Design (ISLPED), pages 207–212, Aug 2007. [152] Laura Carrington, Michael Laurenzano, and Ananta Tiwari. Characteriz- ing Large-Scale HPC Applications through Trace Extrapolation. Parallel Processing Letters, 2013. 174 [153] Shengqi Yang, Wenping Wang, Tiehan Lu, W. Wolf, Vijaykrishnan N., and Yuan Xie. Case study of reliability-aware and low-power design. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 16:861–873, July 2008. [154] R.G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near-threshold computing: Reclaiming moore’s law through energy efficient integrated circuits. Proceedings of the IEEE, 98(2):253–266, Feb 2010. [155] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (ntv) design: Opportunities and challenges. In ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1149– 1154, June 2012. [156] William Song, Saibal Mukhopadhyay, and Sudhakar Yalamanchili. Reliabil- ity implications of power and thermal constrained operations in asymmetric multicore processors. In Dark Silicon Workshop, 2012. [157] CPUFreqUtils. https://wiki.archlinux.org/index.php/CPU_ frequency_scaling, 2015. 175
Abstract
Future exascale high-performance computing (HPC) systems will be constructed using VLSI devices with smaller feature sizes that will be far less reliable than those used today. Furthermore, in the pursuit of higher floating point operations per second (FLOPS), these systems are projected to exponentially increase the number of processor cores and memory chips used. Unfortunately, the mean time to failure (MTTF) of the system scales inversely in relation to the number of components. Therefore, faults and resultant system-level failures will become the norm, not the exception. This will pose significant problems for system designers and programmers who, for half a century, have enjoyed an execution model that assumed correct behavior by the underlying computing system. However, not every error detected needs to result in catastrophic failure. Many HPC applications are inherently fault resilient, but lack convenient mechanisms to express their resilience features to the execution environments that are designed to be fault oblivious.

In this dissertation work, we propose an execution model based on the notion of introspection. We develop a set of resilience-oriented language extensions that facilitate the incorporation of fault resilience as an intrinsic property of scientific application codes. These extensions are supported by a simple compiler infrastructure and a runtime system that reasons about the context and significance of faults to the outcome of an application's execution. We extend the compiler infrastructure to provide an application-level methodology for fault detection and correction that is based on redundant multithreading (RMT). We also propose an introspective runtime framework that continuously observes and reflects upon system-level fault indicators to assess the vulnerability of the system's resources. The introspective runtime system provides a unified execution environment that reasons about the implications of resource management actions for the resilience and performance of the application processes. Our results, which cover several high-performance computing applications and different fault types and distributions, demonstrate that a resilience-aware execution environment is important in order to solve the most demanding computational challenges using future extreme scale HPC systems.