RESILIENCY-AWARE SCHEDULING

by

Jeremy Abramson

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2013

Copyright 2013 Jeremy Abramson

For Dr. Morris Abramson. It is an honor to follow in your footsteps.

Acknowledgments

As much as it may feel like it, an undertaking as large as a dissertation is never accomplished alone. A number of people were essential to the completion of my degree, and deserve recognition. First and foremost is my advisor, Dr. Pedro C. Diniz, for the untold hours spent meeting, discussing, and otherwise helping me through this process. Words cannot express my gratitude for the mentorship and guidance he's provided. None of this would have been possible without him.

I also want to thank the members of the Computational Systems division at Information Sciences Institute (ISI). The division director, Dr. Robert Lucas, was especially instrumental, and was always available with advice, technical expertise, and a good story. Dr. Jeff Draper and Dr. Jacqueline Chame also provided much wisdom and assistance.

I am forever indebted to a number of people in the Computer Science Department at the University of Southern California, especially Lizsl Deleon and Steve Schrader. That they still found time to advise other students after dealing with me continues to amaze. I also thank Dr. Michael Crowley, for continually selecting me as his teaching assistant. I wouldn't have made it without the support.

I want to recognize Dr. Dipak Ghosal and Dr. Norman Matloff at the University of California, Davis. Their faith and confidence in me and my abilities as a scientist (if not a student) started me on my graduate school journey.

Lastly, I want to thank those closest to me. My girlfriend, Lisa Mataczynski, my parents, Alina Abramson and Dr. Edward Abramson, and my sister, Anne Madden. Even when I didn't, they always knew I could do it. I am blessed to be able to prove them right.

Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract
1 Introduction
  1.1 Soft Errors
  1.2 Current Soft Error Mitigation Techniques
    1.2.1 Software Approaches
    1.2.2 Hardware Approaches
    1.2.3 The RaS Infrastructures
  1.3 Contributions
  1.4 This Thesis
  1.5 Organization
2 Resiliency-aware Scheduling
  2.1 Intrinsic Resiliency and the ILP Gap
  2.2 Hybrid TMR
  2.3 Critical Path Analysis
  2.4 Example RaS Application: FPGA Configuration
  2.5 Error Model
  2.6 RaS Discussion
  2.7 Chapter Summary
3 Applications of RaS
  3.1 Open64 Compiler Infrastructure
    3.1.1 Open64 Front End
    3.1.2 Data Flow Graph Construction
    3.1.3 RaS Scheduling
    3.1.4 FPGA Device Area Modeling
    3.1.5 Hardware Support for hTMR
    3.1.6 Code Generation for FPGAs
  3.2 Vex VLIW Infrastructure
    3.2.1 The Vex Toolchain
    3.2.2 Assembly-Level Hardening
    3.2.3 RaS for VLIW Assembly
  3.3 RaS for VLIW Hardening Variants
    3.3.1 Replication Schemes
    3.3.2 Voting Schemes
    3.3.3 Dynamic Scheduling
  3.4 Chapter Summary
4 Experimental Results
  4.1 Open64 Infrastructure: FPGA Target
    4.1.1 Experimental Methodology
    4.1.2 Test Suite
    4.1.3 FPGA Design Results
    4.1.4 Design Space Exploration
    4.1.5 Expected Latency Analysis
  4.2 VEX Infrastructure: VLIW Target
    4.2.1 Experimental Methodology
    4.2.2 Case Study: UMT
    4.2.3 Expected Latency Analysis
  4.3 Chapter Summary
5 Related Literature
  5.1 Resiliency and SEU Mitigation
    5.1.1 Software Approaches
    5.1.2 Hybrid Hardware/Software Approaches
    5.1.3 Selected Hardware Approaches
  5.2 Relationship to RaS
  5.3 Chapter Summary
6 Conclusion
  6.1 Contributions
    6.1.1 Resiliency-aware Scheduling
    6.1.2 Intrinsic Resiliency
    6.1.3 Hybrid TMR
    6.1.4 The RaS Infrastructures
  6.2 Future Work
Reference List

List of Figures

1.1 A taxonomy of hard and soft errors from single effects. (source [43])
1.2 Different approaches to resilient computing
1.3 Scalability concerns for checkpoint and restart [64].
1.4 Fine grain source-level code transformation for resiliency [55].
1.5 A simplified version of Triple Modular Redundancy (TMR).
1.6 A simplified version of Temporal Redundancy (TR).
2.1 An example schedule for a multiply-accumulate kernel (shown in Fig. 3.3). The shaded cells are replica operations inserted into ILP gaps by RaS.
2.2 Two schedules for the multiply-accumulate kernel (shown in Figure 3.3). The top schedule depicts TMR hardening for adder and multiplier units. The bottom schedule depicts the equivalent Hybrid TMR configuration.
2.3 Pseudo-code for determining the critical path of a DFG.
2.4 Operational coverage, execution latency and relative area consumption of various architecture configurations for the MAC (2) kernel code given in Figure 3.2.
3.1 Basic block diagram of the RaS Open64 infrastructure toolchain.
3.2 Very High level WHIRL Abstract Syntax Tree (a) and originating FORTRAN source (b).
3.3 An example dataflow graph, obtained from the WHIRL given in Figure 3.2. The critical path, indicated by the shaded nodes, is 6 execution cycles.
3.4 Pseudo-code for the list scheduling algorithm.
3.5 Discrete implementations of a 2-to-1 1-bit multiplexer (left) and 4-to-1 1-bit multiplexer (right).
3.6 Traditional TMR output, with in-line, bit-wise voting logic (left). Triplicated registers and checker logic supporting hTMR (right).
3.7 Pseudo-code for the data path code generation algorithm.
3.8 Generated data-path for a multiply-accumulate computation supporting RaS.
3.9 A block diagram outlining the main components of the RaS Vex target infrastructure.
3.10 A sample of the resultant Vex assembly from a simple division expression, a = b/c.
3.11 A 4-wide VLIW instruction stream over 8 cycles [23]. The shaded squares in (a) correspond to NOPs, where no operations are being executed. These "holes" in the schedule are used by RaS to schedule resiliency operations. This example has 17 Vex operations (b).
3.12 Pseudo-code for the assembly-level hardening algorithm for the lazy voter scheme.
3.13 Example VLIW instruction stream hardened via RaS, using the triplication (a) and pairing (b) schemes. The shaded boxes indicate operation slots that were NOPs, and were replaced with replica or voting operations.
3.14 The effect of XOR placement on execution latency. The "high impact" error on the left is not detected for 6 cycles, whereas the "low impact" error on the right is detected within 1 cycle.
4.1 The "long and thin" dataflow graph for the GMPS code kernel. The shaded nodes are on the critical path.
4.2 The CGM (1) kernel code dataflow graph. The shaded nodes are on the critical path.
4.3 The "short and wide" dataflow graph of CGM (4). Loop unrolling causes four equivalent critical paths.
4.4 Area consumption for RaS-OPT and CP-TMR design variants for the codes in the test suite.
4.5 Device area consumption for the UMT kernel code.
4.6 Device area consumption for the GMPS kernel code.
4.7 Device area consumption for the MAC (2) kernel code.
4.8 The CGM (1) kernel code schedules for the RaS-OPT (top) and CP-TMR (bottom) design variants. The TMR units of the CP-TMR schedule have been expanded for clarity. The node numbers correspond to the DFG in Figure 4.2.
4.9 Device area consumption for the four unrolled variants of the CGM kernel code.
4.10 The effect of loop unrolling on the CGM kernel code for the six design variants under test.
4.11 Full design space for UMT kernel code.
4.12 Performance and estimated area results for 10,502 design configurations for the UMT code kernel. The dash symbols (−) are potential TMR configurations, and cross symbols (+) are potential RaS configurations.
4.13 Expected latency for the RaS-OPT and AE-TMR designs shown in Table 4.7 as the per-slice error probability increases.
4.14 Latency, in cycles, of UMT2k kernel as issue width (and accompanying functional units) are increased.
4.15 Expected performance of equivalent TMR as compared to RaS and source-level schemes.
4.16 Expected latency in the face of errors for configurations #1, #5 and #6 from Table 4.8.

List of Tables

2.1 Latency, coverage and area estimates for MAC (2x) kernel code given in Figure 3.2 for various architecture configurations.
3.1 Area and latency metrics for various operators implemented on a Xilinx Virtex-5 FPGA.
4.1 Critical path latency and operation breakdown for the computational kernels under test.
4.2 Summary of design implementation results for the Open64-based FPGA target.
4.3 Design configurations and area consumption for the UMT kernel code.
4.4 Design configurations and area consumption for the GMPS kernel code.
4.5 Design configurations and area consumption for the MAC (2) kernel code.
4.6 Design configurations and area consumption for the unrolled variants of the CGM kernel code (an * indicates the number of nominal functional units. For TMR units, divide by 3).
4.7 Latency, area consumption and design configurations for the RaS-OPT and area-equivalent TMR designs of the UMT kernel code.
4.8 Performance for selected configurations of the UMT Kernel.

Abstract

Hostile environments, shrinking feature sizes and processor aging elicit a need for resilient computing. Traditional coarse-grained approaches, such as software Checkpoint and Restart (C/R) and hardware Triple Modular Redundancy (TMR), while exhibiting acceptable levels of fault coverage, are often wasteful of resources such as time, device/chip area or power. In order to mitigate these shortcomings, Resiliency-aware Scheduling (RaS), a source-level approach, is introduced and described.
Resiliency-aware Scheduling combines traditional compiler techniques, such as critical path and dependency analysis, with the ability to potentially modify the target architecture's resource configuration. This new approach can, in many cases, offer operational coverage similar to traditional schemes, while enjoying a performance advantage and area savings.

To support Resiliency-aware Scheduling, several novel concepts and contributions are introduced. First, this thesis introduces the concept of Intrinsic Resiliency (IR), a measure of available Temporal Redundancy (TR) due to a computation's lack of instruction-level parallelism for a fixed set of resources. Second, Hybrid TMR (hTMR), a flexible hardware design scheme that allows for trade-offs between performance and resiliency, is presented. By implementing hTMR units, an RaS-hardened design maintains operational coverage while providing more scheduling flexibility, and thereby potentially better performance, than designs using traditional monolithic TMR units. Lastly, two distinct Resiliency-aware Scheduling infrastructures are described: the first based on the Open64 compiler, targeting Field Programmable Gate Arrays (FPGA), and the second based on the Vex compiler toolchain, targeting reconfigurable softcore Very Long Instruction Word (VLIW) architectures.

This thesis also presents promising experimental results from the use of these two infrastructures. For the FPGA target, when using slice Look Up Tables (LUT) as the design metric, the RaS designs synthesized from a test suite of realistic kernel codes were, on average, 19% smaller than the TMR designs with equivalent performance and operational coverage. For the softcore VLIW target, the RaS-hardened executables developed for a case study perform between 15% and 40% faster than a competing software hardening approach. The RaS-derived VLIW executables are also shown to scale better than TMR, with up to a 40% performance improvement over an equivalently sized TMR computation.

While mainstream compilation and synthesis tools focus exclusively on raw execution time or silicon area usage, the increasing rates of soft errors will likely prompt tool designers to treat error mitigation as an integral part of their design methodologies. The ability to handle resiliency in the same automated fashion as other design attributes, such as area or power consumption, is greatly needed as architectures evolve. The work described here is a first step in this direction.

Chapter 1

Introduction

Many application domains, such as military, avionics and medical, require programs with high reliability and availability. Application codes for these domains must deliver correct computation within specific (and often constrained) time bounds while tolerating faults or unavailability of the underlying computing systems. The codes that support these applications must therefore exhibit high levels of correctness, in addition to terminating in the face of internal errors, in order to minimize the number and frequency of externally visible faults. In this context, programs that can detect, if not correct, internal errors are often described as resilient¹.

The resiliency of computations is an increasingly important problem. A number of different factors can create an environment where computations are subjected to conditions that can cause erroneous results.
Shrinking transistor feature sizes make transistors – and subsequent logic structures such as registers and simple combinatorial logic – more susceptible to electrically charged particle strikes in terrestrial computations. High altitude or spaceborne applications only increase this susceptibility, as the lack of a protective atmosphere potentially leads to more damaging particle strikes. In addition, material aging leads to the decay of electrical properties and to increasing susceptibility to errors in both processor logic as well as memory cells. In other settings, such as high-performance and massively distributed collaborative computations, implementations must tolerate even minute error rates so as not to hamper global computation progress. The need for efficient and automated program resiliency has never been higher.

¹ This document will use the term "hardened" to mean "made resilient to errors".

The next section details the class of errors that are considered in this thesis, and Section 1.2 presents a number of approaches to program resiliency, highlighting specific illustrative techniques. These approaches range from high-level software-only techniques to low-level VLSI device and process development schemes. This description will help us to appropriately place the research presented in this thesis within the context of established solutions, and illustrate how Resiliency-aware Scheduling helps address some of their limitations.

1.1 Soft Errors

This work considers a class of events called Single Event Upsets (SEU). SEU are a particular concern, as they are a common cause of soft errors. This work does not focus on permanent hardware faults, such as single event latch-ups, or hard errors, such as gate ruptures. For these classes of errors, there are a number of techniques, such as built-in self-testing (BIST), which can identify, and in some cases correct, faults by permanently disabling the faulty hardware component [2, 18].

A taxonomy of single event effects is depicted in Figure 1.1. A Single Event Transient (SET) is the impact of energetic particles (such as alpha particles) on an electrical component. When this impact changes the state of an electronic component, it is described as an SEU (or bit-flip). Then, if this change of state affects program output and correctness, the SEU is classified as a soft error. It's important to note that SEU are themselves ephemeral. The affected circuit is not permanently damaged, and the component can potentially function as desired upon its next usage.
Figure 1.1: A taxonomy of hard and soft errors from single effects. (source [43])

1.2 Current Soft Error Mitigation Techniques

Soft errors are traditionally mitigated in either hardware or software. With few exceptions², most approaches share a number of fundamental properties: replication of relevant data or computations, and comparison or voting to detect (and perhaps correct) an error. While the implementation details of these various approaches may differ widely, the most fundamental difference lies at the level of granularity and application of the replication. Coarse grain software approaches may replicate the entire program, whereas finer grained approaches may replicate operations at the source level. In hardware, coarse grain approaches are implemented at the circuit (or RTL) level, with finer grained approaches implemented at the transistor level. These distinctions are displayed in Figure 1.2.

² Hardening by process – manufacturing circuit components in such a way as to intrinsically prevent errors – is an example of a method that does not follow this approach.

Figure 1.2: Different approaches to resilient computing

At the intersection of fine-grain software approaches and coarse grained hardware approaches lies Resiliency-aware Scheduling (RaS). It is important to note that this approach – using compiler insights and knowledge/modification of the underlying architectural infrastructure – is not specific to resiliency. That is to say, it is common to have "Performance-aware Scheduling" or "Power-aware Scheduling". The lack of techniques (and their accompanying tools) that implement such an approach in the context of resiliency is a motivating drive behind this work. The next two sections describe common software and hardware approaches to soft error mitigation.
Figure 1.3: Scalability concerns for checkpoint and restart [64].

1.2.1 Software Approaches

A common software approach for mitigating soft errors is checkpoint and restart (C/R) [14], which is currently the de-facto standard in distributed and parallel high-performance computing. With C/R, an entire application state is saved at specific execution points. When an error is detected (often by hardware or system exceptions), the computation can be restarted from the last known safe configuration or state.

This approach, however, comes at a steep cost. Saving global application state to reliable storage has a significant performance overhead in terms of execution time and energy, as it often exposes I/O bottlenecks. More critical are scenarios where the checkpointing interval will exceed the mean time between failures (MTBF), which means no effective work or progress is accomplished by the system. Figure 1.3 depicts the workload breakdown of a C/R scheme as errors mount [64]. As can be seen, even for a computation involving 50,000 nodes, more than 50% of the elapsed time is spent not doing any useful computing work. Beyond that, the ratio of productive work per unit of time becomes increasingly small.
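To make the checkpointing trade-off concrete, the following sketch models the useful-work fraction of a simple C/R scheme. It is an illustration only: it uses Young's classic first-order approximation for the checkpoint interval rather than the simulation methodology of [64], and all parameter values (per-node MTBF, checkpoint and restart costs) are hypothetical.

```python
import math

def useful_fraction(mtbf_h, ckpt_cost_h, restart_cost_h=0.5):
    """First-order model of a checkpoint/restart scheme.

    Uses Young's approximation for the checkpoint interval,
    tau = sqrt(2 * C * MTBF), then estimates the share of wall-clock
    time spent on real work vs. checkpointing, rework and restarts.
    """
    tau = math.sqrt(2.0 * ckpt_cost_h * mtbf_h)    # checkpoint interval
    overhead = ckpt_cost_h / tau                   # time writing checkpoints
    rework = (tau / 2.0 + restart_cost_h) / mtbf_h # average loss per failure
    return 1.0 / (1.0 + overhead + rework)

# Hypothetical numbers: per-node MTBF of 5 years, 10-minute checkpoints.
node_mtbf_h = 5 * 365 * 24
for nodes in (1_000, 10_000, 50_000, 200_000):
    system_mtbf = node_mtbf_h / nodes              # failures scale with node count
    print(f"{nodes:>7} nodes: {useful_fraction(system_mtbf, 10/60):5.1%} useful work")
```

Even this toy model reproduces the qualitative picture of Figure 1.3: at 50,000 nodes less than half the elapsed time is useful work, and the fraction keeps shrinking as the node count grows.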
In order to mitigate these shortcomings, finer-grained source-level approaches have been developed (e.g. [11, 49, 53, 55]). Rather than relying on an application-level approach such as C/R, these fine-grain approaches analyze the computation at a source level, replicating memory references and arithmetic operations, and explicitly insert voting instructions to ensure the replicated operations produce output consistent with the originally expected results. Figure 1.4 displays an example of such a transformation.

Figure 1.4: Fine grain source-level code transformation for resiliency [55].

While scalability is not a primary concern, source-level schemes often exhibit performance issues similar to those of C/R. In most cases, source-level approaches statically and completely apply their replication transformations. These schemes offload the responsibility of efficiently scheduling these additional operations to the compiler or the underlying architecture³. Because, in general, all memory references and arithmetic operations are at least duplicated – if not triplicated, if error correction is desired – this often creates codes that are [at least] twice as large and can take over twice as long to run. The impact of these effects is often nonlinear, as the insertion of duplicate load/store instructions at the source level can lead to rapid increases in register pressure, which in turn necessitates even more memory references (register spills) at the machine level [49]. As the results reported in this thesis will suggest, however, these impacts can be limited by intelligent placement of replicated instructions. This is detailed in Chapter 2.

³ A multi-issue superscalar architecture, in most cases.
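Before turning to hardware approaches, it helps to see the duplicate-and-check rules of Figure 1.4 applied concretely. The sketch below transcribes the transformation of a single sum statement into Python for illustration; the variable names and the error() handler are hypothetical, following the style of the rules in [55] rather than its actual tooling.

```python
def error():
    """Hypothetical recovery hook; a real scheme might trap or roll back."""
    raise RuntimeError("soft error detected: replica mismatch")

# Original statement:  a = b + c
# Hardened per Figure 1.4: every variable is duplicated, the two copies
# of each source operand are checked on every read, and both copies of
# the destination are updated.
def hardened_sum(b0, b1, c0, c1):
    if b0 != b1 or c0 != c1:   # operands must agree before use
        error()
    a0 = b0 + c0               # update copy 0
    a1 = b1 + c1               # update copy 1
    return a0, a1              # both copies flow onward

a0, a1 = hardened_sum(2, 2, 3, 3)   # returns (5, 5)
```

The doubling of statements and live variables visible even in this four-line fragment is exactly the code-size and register-pressure cost discussed above.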
1.2.2 Hardware Approaches

As with software approaches, hardware approaches vary by granularity level. At the lowest granularity is radiation hardening by process (RHBP). This approach attempts to provide protection from errors by altering the materials manufacturing process itself [22]. This technique is economically very expensive, as few manufacturing facilities are equipped to produce process-hardened circuitry, and the additional complexity required for such circuitry leads to lower yields. Due to high cost, hardening by process has fallen out of favor, often replaced by radiation hardening by design (RHBD) [46]. RHBD is another fine-grained approach wherein the layout, architecture and circuit design themselves are engineered to provide resiliency. An example of such an approach is the dual interlocked storage cell, or DICE [15]. Naturally, these approaches are beyond the scope of this work.

Often, when fine-grained approaches are not economically feasible, coarse-grained approaches are used. Two common coarse-grained approaches are Triple Modular Redundancy (TMR) [65] and Temporal Redundancy (TR) [47]. TMR, as the name implies, triplicates an operation such as an addition or multiplication. The operation is then executed three times on three independent pieces of hardware. The results are then fed to a voter, which selects the appropriate output by majority vote. If there is no more than one error in the system, the output will be correct. A simplified version of TMR is shown in Figure 1.5.

Figure 1.5: A simplified version of Triple Modular Redundancy (TMR).

The foremost advantage of TMR is that it can be implemented on any commercial off-the-shelf (COTS) hardware, such as an application-specific integrated circuit (ASIC) or a reconfigurable device such as an FPGA. Applying TMR to a design allows for resiliency without the added cost of implementing techniques like RHBP and RHBD. Current implementations of TMR have their disadvantages, however. TMR-hardened designs are fixed, in that regardless of the prevailing error rate, they consume the same amount of area and power, which can be over three times that of their un-hardened counterparts. TMR, then, is a good, low-cost alternative to RHBP and RHBD, but its global application to circuits can be wasteful of resources.
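The replicate-and-vote pattern at the heart of TMR can be summarized in a few lines. The sketch below is a behavioral illustration, not the hardware voter itself: op stands in for any functional unit, and the bitwise expression is the classic two-out-of-three majority.

```python
def tmr(op, *args):
    """Behavioral model of TMR: run an operation on three (notionally
    independent) units and take a bitwise two-out-of-three majority."""
    r1, r2, r3 = op(*args), op(*args), op(*args)
    return (r1 & r2) | (r1 & r3) | (r2 & r3)   # majority of each bit

# A single upset in one replica is outvoted: if r2 flipped a bit,
# the (r1 & r3) term still carries the correct value to the output.
add = lambda a, b: a + b
assert tmr(add, 2, 3) == 5
```

Note what the model makes explicit: the three units are consumed for every operation, whether or not an error ever occurs, which is precisely the rigidity discussed next.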
Temporal Redundancy exhibits the same "replicate and test" properties of TMR, but does not necessarily require triplicated hardware, nor does it require the executions of the operation to be in parallel. As the name implies, temporal redundancy replicates operations in time: a TR-hardened operation executes its three replica operations at different times, potentially on the same hardware. A simplified example of TR is given in Figure 1.6. In this example, the operation Op1 is executed on three separate occasions during the computation (in this instance, on two different functional units). When the third execution is completed, the outputs of all the executions are sent to the voter, similar to TMR.

Figure 1.6: A simplified version of Temporal Redundancy (TR).

The main advantage of TR over TMR is area overhead. Because operations are replicated in time, not in space, a TR-hardened program does not require the extra replicated functional units of TMR. This benefit comes with a potential performance penalty, however, as the redundant operations of a TR-hardened computation are not executed concurrently. Lastly, in contrast to TMR, because the output of a TR-hardened operation might be used before all of the replica operations have executed (and the proper result voted upon), a TR-hardened operation typically cannot correct errors, only detect them⁴. Evaluating the space/time trade-off between the replication of functional units and the scheduling flexibility of TR is a key contribution of Resiliency-aware Scheduling and is addressed in Section 4.1.4.

⁴ It is possible for a TR-hardened operation to be error-correcting if the output is not used until all executions of the operation are completed, although this greatly increases design complexity.
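A minimal sketch of the temporal variant follows, assuming a single shared unit and the detect-only check described above. The cycle bookkeeping is illustrative (the cycle numbers loosely echo Figure 1.6) and is not taken from the thesis infrastructure.

```python
def tr_execute(op, args, unit_free_cycles):
    """Run the same operation three times on whatever cycles a unit is
    free, then compare: a mismatch means an upset occurred somewhere.
    The first result is returned for immediate use, so errors can only
    be detected, not corrected."""
    results, used = [], []
    for cycle in unit_free_cycles[:3]:   # three executions, spread in time
        results.append(op(*args))
        used.append(cycle)
    detected = len(set(results)) > 1     # detect-only: no majority repair
    return results[0], detected, used

mul = lambda a, b: a * b
value, err, cycles = tr_execute(mul, (6, 7), unit_free_cycles=[3, 10, 24])
print(value, err, cycles)   # 42 False [3, 10, 24]
```

The idle cycles consumed here are exactly the schedule "gaps" that Resiliency-aware Scheduling will later identify and exploit systematically.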
1.2.3 The RaS Infrastructures

In order to evaluate Resiliency-aware Scheduling, two distinct application infrastructures were developed. The first, a compiler analysis tool built on Open64 [28], targets reconfigurable devices, such as an FPGA. The Open64 infrastructure takes a high-level source such as C or FORTRAN, and an architecture description (containing the number, size and type of available functional units and operational latencies), and schedules the computation, exploiting the available Intrinsic Resiliency⁵ (IR). The system then outputs a performance estimate, an expected operational coverage percentage, and a hardware design configuration in a hardware description language targeting an FPGA device. In addition, a C-to-RTL system was developed, in order to validate and evaluate the derived hardware designs on an FPGA.

⁵ Intrinsic resiliency is discussed in detail in Section 2.1.

The second infrastructure is based on the Vex VLIW compiler [34], and targets (reconfigurable) VLIW machines. This system takes a high-level C source and, combined with a description of the target VLIW machine, uses the VEX compiler to generate the static VLIW schedule. It then inserts replica operations according to a specified replication and voting strategy, and shows the performance impact of hardening the computation for a given architecture configuration.

Both RaS infrastructures can very quickly evaluate a number of proposed designs, allowing the programmer to explore trade-offs between performance, area consumption and resiliency. These infrastructures are detailed in Chapter 3.

1.3 Contributions

The key contributions of this work are as follows:

1. Resiliency-aware Scheduling: By analyzing the properties of a high-level computation and possible hardware configurations, Resiliency-aware Scheduling determines the optimal resource configuration and operation assignment for a given level of operational coverage, area, or performance.

2. Intrinsic Resiliency: A computation's Intrinsic Resiliency (IR) is a measure of "slack" – or available Temporal Redundancy – in a schedule due to dependencies between operations, or a lack of ILP, for a given set of resources. Leveraging a computation's Intrinsic Resiliency shows which operations can be covered via TR – with no execution latency penalty – and which will require either dedicated hardware or extra compute cycles.

3. Hybrid TMR: By decoupling TMR resources and allowing the inputs of individual TMR functional units to be accessed independently, Hybrid TMR (hTMR) allows for scheduling flexibility, performance and conservation of resources versus traditional TMR schemes. A hardware implementation of hTMR was developed, as well as a code generation algorithm that can convert a high-level C or FORTRAN source into an HDL (Verilog) description of the computation's data path that utilizes these hTMR units.

4. The RaS infrastructures: Two distinct software systems, one based on the Open64 compiler and the other based on the Vex VLIW toolchain, were developed to explore Resiliency-aware Scheduling. The RaS infrastructures quickly determine the level of intrinsic resiliency of a computation for a specified architecture configuration, providing operational coverage, performance and area consumption estimates. If a reconfigurable architecture is specified, the RaS infrastructures can quickly bound the design space, and provide insights into the trade-offs between area, performance and resiliency.

1.4 This Thesis

Resilient computation will continue to be an important area of research as VLSI device feature sizes decrease and error rates increase. What stays constant, however, are the dependencies present in many computations. The many years of research undertaken to reveal and exploit ILP, as well as newer techniques like multi-core and thread-level parallelism, cannot remove the intrinsic properties that inhibit parallelism in some codes. In addition, these research efforts are usually made in the name of decreasing runtime (that is to say, performance), while ignoring the potential benefit of improving a computation's resiliency. As increasing transistor densities lead to idle resources that existing computations simply cannot exploit – due to the inherent ILP wall – a new approach to evaluating parallelism is needed, where a computation's resiliency is addressed in addition to its performance.

This is especially true in the context of power consumption, which is likely to be the compelling benchmark in future system design [17, 38]. In the future, simply adding more hardware may become undesirable, as that may lead to unwanted power consumption. Approaches that make better use of existing resources will likely be more attractive as energy concerns increase. A technique such as Resiliency-aware Scheduling, which can leverage existing resources for resiliency, allows the programmer to make informed decisions about the design and execution of their computation. As these design decisions become more complex, the need for automated software techniques will increase.
The results presented in this thesis show that RaS can help address these concerns. For an FPGA target, the RaS designs synthesized from the test suite are, on average, 19% smaller than the TMR designs with equivalent performance and operational coverage. Additionally, the RaS-hardened executables for a case study, targeting a reconfigurable VLIW platform, perform between 15% and 40% faster than a competing software hardening approach. These RaS-based VLIW executables are also shown to scale better than TMR, with up to a 40% performance improvement over an equivalently sized TMR computation.

1.5 Organization

The remainder of this thesis is organized as follows: Resiliency-aware Scheduling is discussed in Chapter 2. A motivating example is provided, and the fundamental concepts and algorithms of RaS are discussed. Chapter 3 details two potential applications of RaS, and the software infrastructures that were constructed to support them. First, an Open64-based infrastructure, targeting an FPGA, is discussed, followed by a Vex-based infrastructure that targets a reconfigurable VLIW softcore processor. Experimental results are presented in Chapter 4, for both infrastructures. For the Open64-based infrastructure, FPGA design results from a test suite of seven realistic computational kernels are presented, comparing RaS-hardened designs to traditional TMR. For the Vex-based infrastructure, the results of a case study using a code from the aforementioned test suite are detailed, comparing RaS to both traditional software and hardware hardening approaches. Related work is discussed in Chapter 5. Finally, concluding remarks and potential avenues for future work are presented in Chapter 6.

Chapter 2

Resiliency-aware Scheduling

By combining well known compiler techniques, such as dependency and critical path analysis, with an intimate knowledge of target architecture parameters, Resiliency-aware Scheduling can mitigate some of the inefficiencies incurred by the indiscriminate application of the techniques mentioned in Section 1.2, such as TMR or coarse-grain software transformations.

This chapter presents the Resiliency-aware Scheduling technique in detail. The concepts of application Intrinsic Resiliency (IR) and Hybrid TMR (hTMR) are presented in Sections 2.1 and 2.2 respectively. A discussion of the compiler-based critical path analysis that RaS uses is presented in Section 2.3. Finally, an example application is presented in Section 2.4. In this example, the Open64-based RaS infrastructure is used to determine the optimal resource configuration for a multiply-accumulate kernel, targeting an FPGA device.

2.1 Intrinsic Resiliency and the ILP Gap

Resiliency-aware Scheduling exploits a concept this thesis describes as Intrinsic Resiliency (IR). Intrinsic Resiliency is a measure of the number of replica operations that can be scheduled on a specific set of hardware resources without impacting the overall execution latency of a computation. This is in addition to the original operations of the computation; these replica operations are scheduled in the "gaps" of the schedule, as dependencies and resource contention allow.

Intrinsic Resiliency can be thought of as the logical inverse of instruction level parallelism (ILP), but is fixed for a given set of resources. Intuitively, a parallel computation, executed on an architecture that exploits that parallelism, would not have much intrinsic resiliency.
The resources of the architecture would likely (but not necessarily) exhibit high utilization, as these resources would be constantly executing the operations of this parallel computation, without having to stall while waiting for data or respecting a dependency. In short, the majority of the schedule is spent doing the "original" work of the computation.

Alternatively, if the same parallel computation were executed on a system with sparse resources that exhibits resource contention, the intrinsic resiliency of the computation would likely (but not necessarily) increase. The IR of the computation would increase if the resource contention (due to dependencies) enabled other resources to remain idle at certain points in the computation's execution. In this instance, the initial – pre-replica – utilization of the resources would likely be lower, allowing for the scheduling of replica operations used for resiliency. However, if the resource contention merely extended the execution time of the computation, the IR might not increase at all, and in fact might decrease. In this case, the decrease in resources might lead to an increase in utilization, which hampers a computation's IR. It's important to note that this relationship, between the parallel nature of a computation, its dependencies, and the underlying architecture on which the computation is executed, is almost impossible to determine a priori. RaS helps provide these insights.

The intrinsic resiliency of a computation is an important metric, because it allows for the scheduling of resiliency "check" operations – operations which verify the results of the computation, as in TR – without affecting the execution latency of the program. While research shows that theoretical limits of ILP are very high, with much work done on exposing ILP via compiler transformations such as loop unrolling, practical limits often fall between 4 and 10 [51, 66]. This fact, combined with the large number of available resources in today's architectures, leads to contexts wherein many codes exhibit high intrinsic resiliency.

Resiliency-aware Scheduling exploits this intrinsic resiliency by scheduling replica (or duplicated) operations in the ILP "gaps" of the computation, on resources that are idle, either due to instruction dependencies or resource contention. This approach is in contrast to the coarse-grain software approaches outlined in Section 1.2.1. As stated previously, these aforementioned approaches apply transformations at the source level, relying on the underlying compiler or architecture to schedule them on appropriate resources. Additionally, these systems are generally compiled for one particular architecture or resource configuration, making such schemes ill-suited for resiliency on reconfigurable platforms.

Figure 2.1 presents an illustrative example schedule, highlighting the potential for exploiting intrinsic resiliency. For an architecture configuration with 1 adder unit, 1 multiplier unit and 1 array-access unit (as indicated along the top of the figure), the computation has a latency of 8 clock (or computation) cycles, as represented on the left.

Cycle | Adder 1   | Array 1     | Multiplier 1
------+-----------+-------------+-------------
  1   | ADD (7)   | ARRAY (10)  |
  2   | ADD (15)  | ARRAY (8)   |
  3   | ADD (7)*  | ARRAY (17)  |
  4   | ADD (7)*  | ARRAY (12)  | MULT (19)
  5   |           | ARRAY (10)* | MULT (19)
  6   | ADD (21)  | ARRAY (10)* | MULT (14)
  7   |           | ARRAY (8)*  | MULT (14)
  8   | ADD (23)  | ARRAY (8)*  |

Figure 2.1: An example schedule for a multiply-accumulate kernel (shown in Fig. 3.3). Each cell is Opcode (Node #); the shaded cells of the original figure (marked * here) are replica operations inserted into ILP gaps by RaS.

Dependencies between operations force resources to be idle at different points in the execution. Additional resources would alleviate these constraints, decreasing the execution time of the computation, until the critical path – the lowest possible execution time for a computation due exclusively to instruction dependencies – is reached. Once all resource contention is alleviated, any further resources added can be used solely for resiliency. Instruction dependence and critical-path analysis are key aspects of Resiliency-aware Scheduling, as discussed in Section 2.3.

In this illustrative example, there are 3 time instants where replica instructions can be scheduled in the ILP "gaps", as indicated by the starred operations. The first corresponds to the operation numbered 7 – an addition operation with a 1-cycle latency – which has its replica instructions scheduled on cycles 3 and 4. The second is operation 10, an array computation¹ executed in cycle 1. Also taking 1 cycle, its replicas are scheduled during cycles 5 and 6. Similarly, the last instant that allows for the use of IR is operation 8, another array operation, whose replica instructions are scheduled on cycles 7 and 8. Overall, this resiliency-aware schedule exhibits an operational coverage of 30% (3 out of 10 original operations) for this specific set of hardware resources.

¹ Array operations are a high-level description for address computations, as discussed in Section 3.1.1.
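The schedule in Figure 2.1 can be reproduced in miniature with a greedy pass that fills idle unit slots with replicas. The following is only a sketch of that idea, not the RaS list scheduler of Figure 3.4; the three-operation DFG fragment, the unit names and the ready-time bookkeeping are all hypothetical.

```python
def fill_gaps(schedule, n_cycles, units, replicas_needed=2):
    """Greedy ILP-gap filling, in the spirit of Figure 2.1: walk each
    original operation and drop replica copies into cycles where a
    compatible unit is idle and the operands could already be ready.
    schedule maps (cycle, unit) -> op; an op is (name, unit_kind, ready)."""
    covered = []
    for op in sorted(set(schedule.values()), key=lambda o: (o[2], o[0])):
        name, kind, ready = op
        placed = 0
        for cycle in range(ready, n_cycles):
            for unit in units:
                if unit[1] == kind and (cycle, unit) not in schedule:
                    schedule[(cycle, unit)] = op   # replica lands in a gap
                    placed += 1
                    break                          # one replica per cycle
            if placed == replicas_needed:
                covered.append(name)
                break
    return covered

# Hypothetical 3-op fragment on one adder and one array unit.
units = [("Adder 1", "add"), ("Array 1", "array")]
sched = {(0, units[0]): ("op7", "add", 0),
         (0, units[1]): ("op10", "array", 0),
         (1, units[1]): ("op8", "array", 1)}
print(fill_gaps(sched, n_cycles=6, units=units))   # ['op10', 'op7', 'op8']
```

Every covered operation here costs zero extra cycles, which is exactly the sense in which the schedule's slack is "intrinsic" resiliency.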
2.2 Hybrid TMR

Triple Modular Redundancy (TMR) is a commonly used technique for hardware resiliency. While TMR is effective at detecting and correcting errors [44], a TMR-hardened computation can exhibit poor performance relative to a non-TMR hardware configuration with similar area. This is because the resources that are used to replicate functional units in parallel (in the case of TMR) can only execute one operation at a time. Conversely, in an equivalent non-TMR configuration, those same resources could execute three different operations concurrently, albeit with no resiliency coverage. In short, TMR is very rigid in its allocation of resources, using them only for resiliency.

In contrast, Hybrid TMR (hTMR) allows similar resources to be used to alleviate resource contention (and therefore decrease execution latency) as well as for resiliency. This flexibility is an important aspect of Resiliency-aware Scheduling. The impact on resource contention is shown in Figure 2.2. The top schedule in the figure employs TMR hardware for only the adder and the multiplier units, as the array computations are still hardened via TR. The bottom schedule requires roughly the same amount of hardware – in that it uses the same number of nominal functional units – and has the same 100% operational coverage. However, this hTMR schedule can be scheduled in 6 cycles rather than the 7 cycles required for the original schedule using TMR units.

Figure 2.2: Two schedules for the multiply-accumulate kernel (shown in Figure 3.3). The top schedule depicts TMR hardening for adder and multiplier units. The bottom schedule depicts the equivalent Hybrid TMR configuration.

The hTMR schedule alleviates the resource contention between the multipliers on cycle 3. Whereas the TMR-hardened variant is inflexible – it requires each operation to be executed in parallel – the hTMR variant can delay the third execution of the multiply operation 14² until cycles 5 and 6. This flexibility allows operation 19 to execute as soon as possible, in cycle 3, and thereby allows the hTMR-hardened schedule to be executed with latency consistent with the computation's critical path. Note that the final two addition operations (numbered 21 and 23) are actually covered via TMR, as all three operations are scheduled in parallel.

² For this example, it is assumed that a multiply operation has a latency of 2 cycles.

Resiliency-aware Scheduling exploits the flexibility of scheduling operations on resources that would otherwise be locked into monolithic TMR units. Hybrid TMR allows the integral triplicated functional units of a TMR block to be scheduled independently, which can decrease execution time. However, as the latency of a computation approaches the critical path, it's possible that the benefit of additional resources put toward resiliency will decrease. The Resiliency-aware Scheduling infrastructures described in this thesis were developed to determine when decoupling redundant units via Hybrid TMR is advantageous – if there is IR to exploit – and when traditional hardening schemes may be required in order to meet operational coverage requirements.
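The scheduling freedom that hTMR buys can be stated compactly: a monolithic TMR unit forces all three replicas of an operation into the same cycle, while hTMR lets each replica of the triplicated unit be claimed independently. The sketch below illustrates only that constraint difference, with a hypothetical busy slot; it is not the thesis scheduler and ignores multi-cycle latencies.

```python
def earliest_start(busy, n_units, cycle0, lockstep):
    """Find the first slots where an op's three replicas fit.
    lockstep=True models monolithic TMR: all three sub-units must be
    free in the same cycle. lockstep=False models hTMR: each replica
    may land in a different cycle."""
    if lockstep:
        c = cycle0
        while any((c, u) in busy for u in range(n_units)):
            c += 1                                   # whole block slips
        slots = [(c, u) for u in range(n_units)]
    else:
        slots, c = [], cycle0
        while len(slots) < 3:
            free = [u for u in range(n_units) if (c, u) not in busy]
            slots += [(c, u) for u in free[:3 - len(slots)]]
            c += 1                                   # stragglers come later
    busy.update(slots)
    return slots

# One multiplier sub-unit is already busy at cycle 3 (hypothetical).
for lockstep in (True, False):
    busy = {(3, 0)}
    print("TMR " if lockstep else "hTMR", earliest_start(busy, 3, 3, lockstep))
# TMR: all replicas pushed to cycle 4; hTMR: two at cycle 3, one at 4.
```

This one-cycle difference per contended operation is the mechanism behind the 6-cycle versus 7-cycle result of Figure 2.2.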
2.3 Critical Path Analysis

A common way of determining the shortest time to complete a number of interdependent tasks is the Critical Path Method (CPM) [37]. In an instruction scheduling context, the critical path is the longest path from the source node of a Dataflow Graph (DFG) to the sink. Dataflow graphs are a succinct way of representing the dependencies between operations in a basic block, and are explained in detail in Section 3.1.2. The CPM method gives a lower bound on the latency of a DFG³. In short, the critical path is a lower bound on execution time, regardless of the resources available to a computation.

³ "The latency of the DFG" is shorthand for the latency incurred when scheduling the DFG.

Pseudo-code for determining the critical path is presented in Figure 2.3. The algorithm assumes the graph is topologically sorted (line 6), and updates the longest path from each node as it traverses the DFG (line 12). The path (and path length) is reconstructed from the predecessor attribute in the second loop (line 20). The final value of criticalPath is the critical path length.

In instruction scheduling, resource contention causes a computation to take longer than its critical path. Resource contention occurs when a node in the DFG is ready to be scheduled but cannot be, because there are no available resources to serve it. This causes the node to stall, potentially increasing the overall latency of the DFG. The amount of time a node can be delayed without delaying (increasing) the latency of the overall computation is called slack.

For RaS, the critical path is a crucial piece of information, as it allows the designer to make sound decisions regarding the utility of a particular architecture configuration for a given computation. For traditional fixed architectures, the critical path is useful as it is an indicator of "peak" performance.
If, for a desired level of resiliency, execution latency is much longer than the critical path, a different hardware environment may be more desirable in order to provide the desired level of resiliency. Providing insight for determinations like this is an attribute of the RaS infrastructures.

In addition to traditional fixed architectures, this thesis explores the benefits of RaS in the context of FPGA resource configuration. Here the critical path is also valuable, as it allows the programmer to gauge the nominal utility of adding functional units. It is possible that these additional units can relieve resource contention, thereby increasing performance (up to the critical path), can provide more resiliency, both, or neither. By allowing for the rapid evaluation of many different architecture configurations, the RaS infrastructures allow the programmer to greatly constrain their design space exploration, and quickly choose a configuration that best suits the design parameters of a particular situation. An example of such a design space exploration is given in the next section.

1: G ← (V, E)
2: for each node N ∈ G do
3:   N.priority ← −1
4:   N.predecessor ← ∅
5: end for
6: vertexList ← topologicallySort(G)
7: Source.priority ← 0
8: while vertexList ≠ ∅ do
9:   V ← vertexList.pop()
10:  for each neighbor U of V do
11:    Weight ← Latency(V, U)
12:    if U.priority < V.priority + Weight then
13:      U.priority ← V.priority + Weight
14:      U.predecessor ← V
15:    end if
16:  end for
17: end while
18: N ← G.sink
19: criticalPath ← 0
20: while N ≠ ∅ do
21:   criticalPath ← criticalPath + N.latency
22:   if N.predecessor ≠ ∅ then
23:     N ← N.predecessor
24:   else
25:     break
26:   end if
27: end while
Figure 2.3: Pseudo-code for determining the critical path of a DFG.

2.4 Example RaS Application: FPGA Configuration

To better illustrate the key concepts of RaS, an example application is presented: FPGA configuration. While Resiliency-aware Scheduling is applicable to a number of domains, the dynamic nature of an FPGA architecture provides a rich design space to explore in terms of the number and diversity of its functional unit configuration. In this example, the Open64-based RaS infrastructure will be used to determine the optimal resource configuration for a sample computation: a single loop nest that performs a multiply-accumulate computation over all the elements of two vectors, commonly known as a SAXPY linear algebra operation. As slice counts on FPGAs continue to rise – the latest Virtex-7 FPGA can have over 30,000 [69] – the number of potential architecture configurations for a given computation has risen as well. As such, automated tools that provide insight into the benefits and weaknesses of a given configuration will become increasingly important.

The RaS infrastructures can provide performance, area and resiliency estimates for a given architecture configuration very quickly. This automation is important, as even for small examples, such as the aforementioned multiply-accumulate kernel shown in Figure 3.2 and Figure 3.3, the number of possible configurations can be overwhelming. For this example, the 3 operator types and 10 different operations generate 864 potential configurations 4. While some of these configurations are redundant (and perhaps nonsensical), it is clear that some automation is necessary to determine the impact of a particular architecture configuration for anything but the smallest of computations. As described in Section 3.1.4, functional unit parameters (latency and area) are derived from actual design implementations.

4 This number includes every possible configuration permutation, from a minimal configuration – one of each type of operator – to each of the 10 individual operations receiving three dedicated functional units.
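Concretely, one can picture the architecture configuration consumed by the toolchain as a small record of unit counts plus per-operator latency and area figures. The field names in this Python sketch are invented for illustration and do not reflect the tool's actual input format; the latency and area values follow Table 3.1.

    # Illustrative architecture configuration for the MAC kernel exploration.
    config = {
        "units": {"add": 2, "array": 2, "mult": 2},    # configuration #5
        "latency": {"add": 1, "array": 1, "mult": 2},  # cycles per operation
        "area_lut": {"add": 23, "array": 16, "mult": 349},
    }

    def functional_unit_area(cfg):
        # Area of the functional units alone; multiplexer overhead is
        # modeled separately (Section 3.1.4).
        return sum(cfg["units"][t] * cfg["area_lut"][t] for t in cfg["units"])

    print(functional_unit_area(config))  # -> 776 LUT before multiplexers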
For the multiply-accumulate example, these designs were generated targeting a Xilinx Virtex-5 FPGA [68]. A summary of the parameters is presented in Table 3.1. These parameters are inputs to the RaS infrastructures and are used to generate an area estimate for the resource configuration. As new hardware designs are generated, the new metrics can be fed to the RaS infrastructure, and the impact on resiliency and execution time can be evaluated.

The source code depicted in Figure 3.2 (b) performs an integer multiply and accumulate (MAC) computation over two data arrays. The code is structured as a single loop whose body has been unrolled by a factor of 2 (MAC 2x) to increase the overall number of operations for the computation and expose additional ILP.

#  Architecture Configuration             Latency (Cycles)  Operations Covered  Area (LUT)
1  1 Add, 1 Array, 1 Mult                 8                 3                   498
2  2 Add, 1 Array, 1 Mult                 8                 5                   626
3  1 Add, 2 Array, 1 Mult                 7                 5                   493
4  1 Add, 1 Array, 2 Mult                 7                 2                   871
5  2 Add, 2 Array, 2 Mult                 6                 8                   1058
6  3 Add, 2 Array, 3 Mult                 6                 10                  1424
7  1 TMR Add, 2 Array, 1 TMR Mult         7                 10                  1392
8  1 TMR Add, 1 TMR Array, 1 TMR Mult     8                 10                  1303
9  2 TMR Add, 2 TMR Array, 2 TMR Mult     6                 10                  2451
Table 2.1: Latency, coverage and area estimates for the MAC (2x) kernel code given in Figure 3.2 for various architecture configurations.

A summary of the metrics for a number of illustrative architecture configurations is shown in Table 2.1. Each row of the table displays one possible hardware configuration. For each configuration, the type and number of its functional units (i.e. adder, multiplier, etc.) are indicated. The following columns present the overall execution latency for the computation, the number of operations covered, and the overall FPGA area consumption, given in terms of slice Look Up Tables (LUT). This information is also presented graphically in Figure 2.4. In the figure, the area of the circles represents the relative area consumption. This depiction shows, at a glance, which configurations are preferable relative to each other.

The "ideal" configuration for this example is one in which latency is consistent with the critical path, all operations are covered, and area consumption is minimized. As the critical path of this computation is 6 cycles, and the number of potentially covered operations is 10, the optimal RaS configuration for this example is 3 Adders, 2 Array units and 3 Multipliers (configuration #6). This configuration provides the lowest area consumption, highest coverage, and lowest execution latency 5.

5 This optimal RaS configuration differs slightly from the one presented in Chapter 4 due to the modeling of multiplexers (see Section 3.1.4).

A few other configurations merit discussion. The minimal configuration (#1) is the smallest possible, but exhibits resource contention, hampering its execution time. Configurations #2, #3 and #4 take this minimal configuration and successively add one resource of each type. Intuitively, one would expect the resiliency coverage level to increase with the increase in resources. As configuration #4 shows, this is not always the case. In this instance, the added multiplier reduces resource contention, allowing for a faster execution time.
This decrease in execution cycles removed some of the potential ILP gaps previously available for replica operations. Configuration #3 also reduced execution time, by making more data available and exposing ILP, but unlike the previous example, it does so in such a way that there is still time for replica operations. The impact of a particular resource configuration on execution latency or resiliency coverage is not always intuitive. The RaS infrastructure quickly quantifies these trade-offs, allowing for more sound design decisions.

Lastly, consider configuration #5. This configuration uses the minimum number of resources needed to completely alleviate resource contention; in other words, the latency of the computation is the length of the critical path. If such performance were required, a traditional hardware resiliency approach would be the application of TMR to those preexisting resources. This configuration is shown as #9. Note that by using hTMR and exploiting IR, as shown in configuration #6, the required area is reduced by over 42% when compared to a traditional TMR hardening approach.

Figure 2.4: Operational coverage, execution latency and relative area consumption of various architecture configurations for the MAC (2x) kernel code given in Figure 3.2. (Scatter plot of operational coverage versus cycles, with circle area indicating relative area consumption, not reproduced.)

2.5 Error Model

In order to evaluate designs generated via RaS, a "roll back" error mitigation approach is considered. This approach rolls back the state of the computation and retries it from a known state, until either a predefined number of attempts is reached or no error is detected. Unlike TMR – which detects and corrects errors in-line with the execution of the operation – an error in an RaS-hardened computation can only be detected when all the replicas have been executed and the corresponding results compared. To support this error recovery approach, the underlying hardware only needs to support post-mortem error detection.

Given a computation, the hardware implementation executes the operations, optimistically forwarding the results of each operation as the correct inputs to subsequent operations. Each covered operation is executed either two times (pairing) or three times (triplication), with the values of each execution saved in registers (pairing and triplication are discussed in Section 3.3). The results of the operation replicas are compared to detect a mismatch, with triplication requiring three comparisons, and pairing requiring only one.

The span, or duration, of an error is not a consideration for all but the smallest of designs 6. If an error were to last longer than the clock time of an operation – that is, if a single error potentially impacts multiple operations – the computation will still restart when the first error is detected. In addition, this work considers a maximum of one error per operation – as more than this would overwhelm even a TMR-hardened scheme – and only considers transient, not permanent, faults. Lastly, it is assumed that the hardware blocks that implement majority voting are error free. These assumptions are consistent with related work in the literature [30, 31, 63].

6 If a design had only a single operator of a given type, an error with a long span could potentially affect multiple replicas of the same operation. In addition to violating the assumption of no more than one error per operation, it is also reasonable to assume any real-world computation would be large enough to have more than one operator of a given type, with replica operations scheduled on different physical operators.
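These assumptions lead to the simple latency accounting used in the evaluations of Chapter 4: every detected error aborts the current attempt, and the cycles spent up to the detection point are paid again. A minimal Python sketch of that accounting follows; the cycle values are illustrative.

    # Roll-back-and-retry latency accounting: the latency actually observed
    # is the sum of the cycles wasted on each aborted attempt plus one
    # clean, error-free pass. Values here are illustrative.
    def observed_latency(nominal_latency, abort_points):
        """abort_points: cycle at which each error was detected, one per retry."""
        return sum(abort_points) + nominal_latency

    # An error detected at cycle 5 of a 6-cycle schedule nearly doubles the
    # latency; one detected at cycle 1 costs almost nothing.
    print(observed_latency(6, [5]))  # -> 11
    print(observed_latency(6, [1]))  # -> 7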
Each covered operation has a roll back cost associated with it. This cost, in cycles, corresponds to the point in the program execution at which the error can be detected. For pairing, this occurs when the two replicas have been executed and compared. For triplication, this corresponds to the point at which the three replicas have been executed and the third comparison is complete. If an error is detected, the computation rolls back to its initial state – the beginning of the DFG, where none of the operations have been scheduled – and is re-executed. In the case of pairing, any error requires a roll back and retry, as there is no majority voting capability. Triplication, however, only requires a retry when an error occurs on the first execution of an operation, as this is the value that is optimistically forwarded to subsequent computations. These variations are discussed in more detail in Section 3.3.1.

Expected performance in the face of errors, then, is dictated by the distance between an operation's execution and the point at which the error can be detected. In a pathological case, any one error could effectively double the latency of the computation, if the first operation's comparisons were not executed until the very end of the computation's instruction stream. The other extreme, where the comparisons are executed directly after the original computation, is tantamount to TMR. The properties of the computation – namely, its concurrency – and the underlying architecture parameters dictate how much "slack" there is between an operation and the comparisons of the values generated by its replicas, and therefore the impact of an error. Finally, a computation's actual latency is modeled as the sum of the latencies of all the aborted iterations (due to errors), plus the nominal, error-free latency. Using this error model, an expected latency analysis is conducted in Section 4.1.5 for the Open64-based FPGA target infrastructure, and in Section 4.2.3 for the Vex-based VLIW infrastructure.

2.6 RaS Discussion

The RaS infrastructures (described in the next chapter) can quickly analyze and quantify the benefits of a number of design configurations for a given computation. This automated insight allows the designer to decide which metrics are important (area consumption, runtime latency, or resiliency), and to select an appropriate configuration for the given context. In addition to quantifying the benefits of a given architecture, the RaS infrastructures provide insights into the fundamental properties of the code itself. For example, if the addition of resources does not reduce execution latency, it can be reasoned that the computation does not exhibit resource contention. This insight is key for architecture designers, as it helps them focus their effort towards more efficient (faster) implementations of individual components versus the replication of standard implementations. Some of these design trade-offs are explored in the discussion of functional unit implementation choices in Section 3.1.4.
As error rates continue to rise, the need for automated tools that holistically consider not only traditional metrics such as performance, resource usage and power, but also resiliency, will increase. Resiliency-aware Scheduling addresses this need, quickly illustrating the interplay between a given computation and its underlying hardware resources, and providing insights into which configurations are desirable in terms of performance, resource usage and resiliency in an automated fashion.

2.7 Chapter Summary

In this chapter, the fundamental concepts of Resiliency-aware Scheduling (RaS) were discussed. Intrinsic Resiliency (IR) was presented, noting the intertwined relationship IR has with both the scheduling of operations and the process of resource allocation for reconfigurable architectures. To that end, Hybrid TMR (hTMR) was presented as a potential solution to the inflexibility of the traditional hardware approach to resiliency, Triple Modular Redundancy (TMR). An example application of RaS – scheduling a multiply-accumulate kernel on an FPGA – was given, illustrating the benefits of the RaS approach. These concepts were implemented via the RaS infrastructures, which are detailed in the next chapter.

Chapter 3
Applications of RaS

To implement RaS, two distinct infrastructures were implemented. The first consists of a compiler analysis toolchain based on the Open64 [28] compiler that targets Field Programmable Gate Array (FPGA) devices. The second is a Vex-based compiler [23] infrastructure, targeting (reconfigurable) VLIW machines. This chapter describes these two infrastructures in detail.

3.1 Open64 Compiler Infrastructure

The RaS Open64 compiler infrastructure is organized into three major components, as depicted in Figure 3.1. The Front End (FE) translates the input source program (either C or FORTRAN) to the Winning Hierarchical Intermediate Representation Language (WHIRL). Next, for each basic block, a data flow construction module traverses the abstract-syntax-tree representation of the statements in that block and generates a data flow graph (DFG) representation of the computation (see Section 3.1.2). In order to schedule the DFG representations, the toolchain uses an architecture configuration that consists of the number, type and size of available functional units, and the latencies of the various operations. In addition, scheduling parameters, such as the priority for the list scheduling algorithm (see Section 3.1.3), can be provided. The DFGs are then scheduled using the RaS scheduling algorithm. Based on this schedule, the toolchain generates performance estimates in the form of the number of clock
In 2001 Pro64 was renamed Open64, and hosted by the University of Delaware. Since 2001, major aspects of the Open Research Compiler developed by Intel and the QLogic Pathscale compiler have been merged into the Open64 codebase [28]. 30 I4STID 0 <2,2,ACC1> T<4,.predef_I4,4> {line: 9} I4ADD I4I4LDID 0 <2,1,ACC> T<4,.predef_I4,4> I4MPY I4I4ILOAD 0 T<4,.predef_I4,4> T<57,anon_ptr.,4> U4ARRAY 1 4 U4LDA 0 <2,3,B> T<56,anon_ptr.,4> I4INTCONST 51 (0x33) I4ADD I4I4LDID 0 <2,5,I> T<4,.predef_I4,4> I4INTCONST -‐1 (0xffffffffffffffff) I4I4ILOAD 0 T<4,.predef_I4,4> T<57,anon_ptr.,4> U4ARRAY 1 4 U4LDA 0 <2,4,C> T<56,anon_ptr.,4> I4INTCONST 51 (0x33) I4ADD I4I4LDID 0 <2,5,I> T<4,.predef_I4,4> I4INTCONST -‐1 (0xffffffffffffffff) I4STID 0 <2,1,ACC> T<4,.predef_I4,4> {line: 10} I4ADD I4I4LDID 0 <2,2,ACC1> T<4,.predef_I4,4> I4MPY I4I4ILOAD 0 T<4,.predef_I4,4> T<57,anon_ptr.,4> U4ARRAY 1 4 U4LDA 0 <2,3,B> T<56,anon_ptr.,4> I4INTCONST 51 (0x33) I4I4LDID 0 <2,5,I> T<4,.predef_I4,4> I4I4ILOAD 0 T<4,.predef_I4,4> T<57,anon_ptr.,4> U4ARRAY 1 4 U4LDA 0 <2,4,C> T<56,anon_ptr.,4> I4INTCONST 51 (0x33) I4I4LDID 0 <2,5,I> T<4,.predef_I4,4> INTEGER I INTEGER acc, acc1 INTEGER B(51), C(51) acc1 = acc + ( B(I)* C(I)) acc = acc1 + ( B(I+1) * C(I+1)) (a) (b) Figure 3.2: Very High level WHIRL Abstract Syntax Tree (a) and originating FOR- TRAN source (b). The Open64 FE generates an Abstract Syntax Tree (AST) based on the WHIRL intermediate representation. WHIRL was engineered to support C/C++, FORTRAN and Java and is comprised of five levels: Very High (VH), High (H), Mid (M), Low (L) and Very Low (VL). The Open64 FE outputs VH level WHIRL. At the highest level, VH WHIRL is structurally very similar to the original source. At successively lower levels, the IR more closely resembles the targeted machine code [27]. For example, at the VH level, array references are denoted by array nodes, not the multiply, add and shift operations normally used in address computations. Similarly, structured control-flow operations, such asif-then are explicitly represented as nodes in WHIRL, as opposed to being deconstructed as a sequence ofjump operations. A sample FORTRAN code and its corresponding VH WHIRL AST are given in Figure 3.2. 31 3.1.2 Data Flow Graph Construction For each function in a computation, the RaS infrastructure constructs a Control Flow Graph (CFG). Each node in the CFG is a basic block. A basic block is a segment of code absent control flow; no control flow can enter the basic block (with the exception of the first statement) and no control flow can exit the block (with the exception of the final statement). Each basic block is traversed to form a DFG, which is the fundamental analysis and scheduling unit of the RaS infrastructure. A DFG is a succinct way of representing dependencies between operations [39]. The nodes of a DFG represent operations, with arcs between nodes representing the dependencies between them. A DFG is a directed acyclic graph by construction. Nodes connected to the source node are inputs to the graph. This means their values are not dependent on any other computations in the graph. Conversely, nodes who’s values are available at the end of the graph are consid- ered outputs and are connected to a sink node. When scheduled, a node is not ready until all of its predecessors have completed. By definition, all inputs are ready. An example DFG, from the sample code in Figure 3.2 is given in Figure 3.3. 
3.1.3 RaS Scheduling

RaS builds on the basic list scheduling algorithm, in itself a simple and efficient way of scheduling tasks [20]. While there are a number of other possible scheduling algorithms (e.g. [5, 52]), in most cases the marginal performance benefit – if any – gained by implementing these schemes is outweighed by their added implementation complexity and overhead [3, 33, 61]. At its core, a list scheduling algorithm determines a set of ready operations, scheduling them as resources allow according to a static or dynamic scheduling priority.

Figure 3.3: An example dataflow graph, obtained from the WHIRL given in Figure 3.2. The critical path, indicated by the shaded nodes, is 6 execution cycles. (The graph rendering, with nodes such as OPC_U4ARRAY, OPC_I4MPY and OPC_I4ADD, is not reproduced here.)

Pseudo-code for a list scheduling algorithm is given in Figure 3.4. The algorithm has as input a DFG G and produces an assignment of the scheduled times for each node of G. A schedule is valid if, for each node, the completion times of all its predecessors are lower than the scheduling time of the node itself.

1: G ← (V, E)
2: clock ← 0
3: latency ← 0
4: while G ≠ ∅ do
5:   for each node N ∈ G do
6:     if predecessors(N) = completed then
7:       priority_queue ← N
8:     end if
9:   end for
10:  while priority_queue ≠ ∅ do
11:    N ← priority_queue.pop()
12:    for each F = functionalUnit(N.type) do
13:      if F.available() == true then
14:        N.setScheduled()
15:        F.busy(N.latency())
16:        if clock + N.latency > latency then
17:          latency ← clock + N.latency
18:        end if
19:        G ← G \ {N}
20:        break
21:      end if
22:    end for
23:  end while
24:  clock ← clock + 1
25: end while
Figure 3.4: Pseudo-code for the list scheduling algorithm.

As depicted in Figure 3.4, the algorithm begins with a time, or clock, set to 0 (lines 1 and 2) and uses a variable latency, which keeps track of the total running time of the DFG, as initialized in line 3. The algorithm proceeds as long as there are nodes left to process (line 4). The first step is to compute the ready queue, inserting ready nodes into a priority queue data structure. The priority function of the queue can be varied to produce different schedules. For the particular implementation of RaS described in this thesis, shortest distance to critical path was chosen as the priority function, as this choice leads to good average performance of the resulting schedules for a test suite of scientific code kernels (see Chapter 4).

When scheduling a node, an available functional unit of the appropriate type is required. For example, node 5 in Figure 3.3, which is an OPC_I4ADD node, requires an integer adder or ALU. To this effect, the algorithm checks if any functional units of the appropriate type are available 1 (lines 12 and 13). If a suitable functional unit is found, the node is marked as scheduled (line 14), the functional unit is marked as busy for the duration of the node's latency (line 15), and the node is logically removed from the graph (line 19). No further functional units need be considered.
Lines 16 and 17 update the total running time of the computation. When the ready queue is empty, the clock is incremented (line 24) and the process continues until all nodes are processed.

1 A functional unit is available if it is not currently busy servicing some other node.

Once a DFG is initially scheduled, the RaS infrastructure determines the computation's IR by scheduling it a second time, this time with the initial schedule acting as resource constraints. This process is shown in Figure 2.1 and Figure 2.2 (bottom), where the non-shaded operations correspond to the original schedule. The shaded operations represent the replica operations of RaS, and are added on this second scheduling pass, whenever idle resources allow.

3.1.4 FPGA Device Area Modeling

As part of the algorithmic design space exploration described in detail in Section 4.1.4, the RaS infrastructure makes use of an architecture configuration description where each functional unit and the ancillary resources are characterized in terms of device area consumption and execution latency in cycles.

As the current Open64-based RaS infrastructure implementation targets the Virtex-5 family of FPGAs [68], the area consumption model uses two metrics specific to this family of FPGAs, namely the number of slice registers (or flip-flops) and the number of LUTs.

In order to estimate the area of a complete design, two main components are modeled: functional unit blocks that implement the various high-level operators of the computation, and multiplexer blocks used to select inputs when operators are reused. There are other components that make up a design, such as the controller, but these were not included in the area consumption model for two reasons. First, such components are difficult to accurately model using the LUT metric given above, as the hardware design tools may enact a number of optimizations that make isolating the impact of these portions of the design hard to track. As an example, a high-level synthesis tool can choose to use a one-hot encoding of the underlying finite-state-machine (FSM) of the controller versus the more traditional combinatorial encoding. Secondly, empirical results show that these components are relatively small, and do not substantively contribute to overall area consumption.

Functional Unit Area Modeling

Table 3.1 presents the area consumption results for the various base operators used in the designs explored in this thesis when mapped onto a Xilinx Virtex-5 FPGA [68]. The table indicates the latency of each operator in terms of the number of clock cycles and the maximum frequency at which each operator can be safely clocked. The current set of designs is implemented using 16-bit integer arithmetic operators.

The choice of 16-bit arithmetic operators was made for a number of reasons. First, as a proof-of-concept, they are simpler to implement and debug. Secondly, while the fundamental concepts behind RaS would still be relevant using higher bit-widths (or moving to floating-point arithmetic implementations), doing so would require a much larger FPGA, as the size of the individual functional units would increase dramatically. This increase would impact the number of each type of unit that could fit on a single FPGA, which would greatly constrain the number of feasible design points for RaS to explore during the design-space exploration phase of the RaS infrastructure.
Lastly, as the experimental results in Chapter 4 show, the RaS approach succeeds by eliminating the need for some of the functional units a traditional hardening scheme would require, while maintaining a performance threshold. The larger the arithmetic blocks, the larger the area savings for RaS when they are removed. To present results in as fair a light as possible, smaller functional units were chosen.

Some specific implementation choices were made in order to reflect the relative sizes of the different arithmetic operators. For the addition operator, the Sklansky [62] look-ahead implementation was chosen, rather than the much simpler and slower ripple-carry design. Similarly, multiplication is implemented using the parallel carry-save-adder (CSA) tree design, resulting in operators that are faster than purely sequential designs, but at the expense of FPGA area resources. Lastly, the division and square-root operators use purely sequential implementations with much longer latencies. This choice – to conserve area while sacrificing latency – is justified by the reduced occurrence of these operators in many scientific computations, as it may not be prudent to allocate a significant portion of the FPGA's area to an operator that is used very sparingly.

Operator (16 bits)               Slice LUT   Slice Flip Flops   Latency (cycles)   Min. Clock Period (nsecs)
Adder (Sklansky)                 23          0                  1                  1.58
Adder (Sklansky-TMR)             32          0                  1                  1.58
Address Generator/Array          16          0                  1                  1.26
Address Generator/Array (TMR)    63          0                  1                  1.26
Multiplier (CSA)                 349         0                  2                  16.2
Multiplier (CSA-TMR)             1061        0                  2                  15.4
Division (iterative)             40          40                 17                 6.12
Division (iterative-TMR)         165         201                17                 7.25
Square root (iterative)          834         81                 51                 7.80
Square root (iterative-TMR)      2497        210                51                 8.87
Table 3.1: Area and latency metrics for various operators implemented on a Xilinx Virtex-5 FPGA.

Multiplexer Block Area Modeling

RaS is fundamentally built on the idea of operator reuse. In order for a functional unit to compute results for more than one operation in the DFG, the inputs for the various operations must be time-multiplexed. To accomplish this, the designs explored in this work use multiplexer hardware blocks, as depicted in Figure 3.5 (left). As the truth table indicates, when the sel input assumes the logic value 0, the output of the multiplexer f assumes the value of the a input. When sel has the logic value 1, the output assumes the value of the b input. Multiplexers with a larger number of inputs can be composed from this base 2-to-1 multiplexer, as shown in Figure 3.5 (right).

Figure 3.5: Discrete implementations of a 2-to-1 1-bit multiplexer (left) and a 4-to-1 1-bit multiplexer composed from three 2-to-1 multiplexers (right). (Schematics and truth table not reproduced.)

The area, in terms of the required LUTs, of a generic n-to-1 multiplexer for bitWidth-wide operators is modeled by determining the number of base 2-to-1 1-bit multiplexer units that are needed to form the wider multiplexer. Figure 3.5 illustrates the composition of base multiplexers to form a 4-to-1 multiplexer. Given an operator that requires N values to be multiplexed to a single input, k = ⌈log2 N⌉ binary selection signals are required to uniquely identify which of the N inputs is to be selected. For this k, it then requires 2^k − 1 of such 2-to-1 1-bit-width multiplexers to construct an n-to-1 multiplexer.
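This composition count can be checked with a short helper. The following minimal Python sketch mirrors the terms of Equation 3.1 below; the function and parameter names are illustrative.

    import math

    # LUT count needed to multiplex num_distinct_inputs values into one
    # operator of width bit_width with num_operator_inputs inputs. Each
    # 2-to-1 1-bit multiplexer is modeled as 1 LUT.
    def mux_lut_area(num_distinct_inputs, bit_width, num_operator_inputs):
        k = math.ceil(math.log2(num_distinct_inputs))  # selection signals
        base_muxes = 2 ** k - 1                        # 2-to-1 muxes per bit
        return base_muxes * bit_width * num_operator_inputs

    # A binary (2-input) operator shared by 4 distinct operations, 16 bits:
    print(mux_lut_area(4, 16, 2))  # -> (2**2 - 1) * 16 * 2 = 96 LUT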
Equation 3.1 represents this in terms of FPGA area modeling, where A_LUT is the total number of FPGA LUTs required for multiplexing NumDistinctInputs values for operands of size bitWidth (typically 16, as mentioned above). In the equation, NumOperatorInputs is the number of operator inputs, typically 2 in the case of adders or multipliers, but 1 for the square-root or any other unary operator. Lastly, as the size of a 2-to-1 multiplexer is modeled as 1 LUT per bit, the number of these canonical multiplexers is multiplied by bitWidth.

A_LUT = (2^⌈log2(NumDistinctInputs)⌉ − 1) × bitWidth × NumOperatorInputs    (3.1)

Area Modeling for Architecture Exploration

It is important to note that the Open64-based RaS infrastructure is not bound to any particular architecture. The number and type of available functional units, as well as the latencies of the operations executed on them, are inputs to the RaS infrastructure as run-time parameters of the tool. By varying these parameters, the impact of different functional unit profiles can be studied. For example, there are many different techniques for implementing multipliers on an FPGA [4]. The RaS infrastructure can quickly determine whether it is more beneficial to implement a number of slower multipliers that each require minimal area, fewer faster multipliers that consume a larger area, or some combination thereof. In addition to known architectures and reconfigurable architectures (like an FPGA), future architectures which do not yet exist can be evaluated, in order to determine the possible benefits of their development.

3.1.5 Hardware Support for hTMR

At the hardware level, hTMR execution is supported by independent functional units connected to a triple register block. Value matching of these register values is then performed by a checker circuit, as depicted in Figure 3.6 (right). The main output of this checker logic is labeled Check and provides a direct comparison between register REG 0 and the two other registers, REG 1 and REG 2. Should there be a mismatch between either of the two pairs of registers, the computation is deemed erroneous, which leads to a roll back and retry procedure 2. Note that the pairwise testing is required as the implementation readily uses the value stored in register REG 0 for a subsequent operation in the DFG. As a result, the controller must be able to check, post-mortem, if indeed this optimistic value was the "correct" one. Figure 3.6 (left) also depicts, for comparison purposes, the logic used for TMR. Here, the bit-wise (in-line) majority voting logic is inserted between the three operator replicas and the register that stores the corresponding operation's value.

2 An alternative design would consist of a single comparator with its inputs multiplexed, as the comparisons are not required to execute simultaneously.

Figure 3.6: Traditional TMR output, with in-line, bit-wise voting logic (left). Triplicated registers and checker logic supporting hTMR (right). (Schematics not reproduced.)

The checking logic used in hTMR also includes a secondary output labelled optCheck. This output provides a direct comparison between the registers labelled REG 0 and REG 1. A dynamic scheduler could potentially check this output and, if valid, forgo the execution of the third replica and comparison, potentially shortening the overall schedule length. This scheduling option is, however, not explored in this work.
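The checker's decision rule can nevertheless be summarized compactly. This is a minimal Python sketch of the semantics of the two outputs (not the gate-level circuit), assuming the value in REG 0 is the one optimistically forwarded.

    # Semantics of the hTMR checker of Figure 3.6 (right): REG 0 is
    # forwarded optimistically, so 'Check' flags any disagreement with it,
    # while 'optCheck' compares only the first two replicas.
    def check(reg0, reg1, reg2):
        ok = (reg0 == reg1) and (reg0 == reg2)  # the 'Check' output
        opt_ok = (reg0 == reg1)                 # the 'optCheck' output
        return ok, opt_ok

    print(check(5, 5, 5))  # -> (True, True): clean execution
    print(check(5, 5, 9))  # -> (False, True): a replica disagrees, retry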
3.1.6 Code Generation for FPGAs

This section describes the hardware code generation algorithm, which outputs code in a hardware description language (HDL). This HDL code can then be synthesized on an FPGA using vendor-specific hardware tools. The generated HDL is based on the execution schedule of the computation – which itself was generated using the RaS algorithm as described in Section 3.1.3 – and the corresponding target hardware configuration parameters.

The algorithm makes use of a set of auxiliary functions defined over the set of functional units of the target architecture configuration and over the operations in the schedule. It is assumed that the schedule has a total length of L clock cycles. Each of the operations, op_i, is scheduled in a specific clock cycle, scheduleSlot(op_i), and completes its execution in the clock cycle retiredSlot(op_i). Note that retiredSlot(op_i) > scheduleSlot(op_i) if the operation has an execution latency greater than 1 clock cycle. In addition, each operation has a type, type(op_i), and replica identifiers replica(op_i), where i = 0 for the original operation, and i = 1, 2 for the two replicas of the operation.

For each functional unit FU_i, the numbers of inputs and outputs are defined as numInputs(FU_i) and numOutputs(FU_i), respectively. For simplicity, it is assumed that all inputs have the same bit width, defined as width(input(FU_i)), and that each functional unit has a single output, with a bit width of width(output(FU_i)). Lastly, functions on sets of operations, and selection functions over functional units or schedule slots, are defined. The function assignedOp(FU_i) defines a list of tuples containing pairs of operations and the corresponding time slot where the operation is scheduled to execute on the functional unit FU_i. The function distinctOp(list) returns the set of distinct operations in the list, irrespective of replica identifier. Given these auxiliary functions, the code generation algorithm for the definition of a custom data path that supports the execution of a given schedule is fairly straightforward.

for each operation Op_i do
  emit HDL for register REG_i
  if replica(Op_i) > 0 then
    emit HDL for triple register for REG_i
    emit HDL for checker logic for REG_i
  end if
end for
for each DFG input j do
  emit HDL for register REG_j
end for
for each functional unit FU_i do
  emit HDL code for FU_i
  NInputs ← numInputs(FU_i)
  NOutputs ← numOutputs(FU_i)
  OpList ← assignedOp(FU_i)
  N ← distinctOp(OpList)
  k ← ⌈log2(|N|)⌉
  emit HDL for NInputs N-to-1 multiplexers with width(inputs(FU_i)) bits
  connect k selection signals across multiplexers
  for each n-th op ∈ OpList do
    connect inputs of op to n-th input of multiplexer
    r ← replica(op)
    connect (unique) output of FU_i to register REG_i(r)
  end for
end for
Figure 3.7: Pseudo-code for the data path code generation algorithm.

For simplicity, the focus here is on the data path generation, as the definition of the corresponding scheduler controller is trivially derived from the schedule definition itself. The core blocks used in the definition of the data path rely on blocks that implement the functional units, registers that hold the values resulting from the execution of the various operations, and multiplexers to direct the values in registers to the appropriate functional unit inputs.
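As a flavor of the emission step for one such building block, a load-enabled register might be emitted as follows. The Verilog text and port names in this Python sketch are illustrative, loosely following the REG_<id> convention of Figure 3.8, and are not the generator's literal output.

    # Illustrative emitter for one data-path building block: a bit_width-
    # wide register with a load enable, instantiated per operation result.
    def emit_register(reg_id, bit_width=16):
        return (
            f"always @(posedge clk)\n"
            f"  if (REG_{reg_id}_load)\n"
            f"    REG_{reg_id} <= REG_{reg_id}_in[{bit_width - 1}:0];\n"
        )

    print(emit_register(7))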
Each register is designated either by an identifier directly defined as the input node of the DFG, or by the identifier of the operation whose result value the register stores. The code generation algorithm depicted in Figure 3.7 generates a data path where values are saved in discrete registers, sometimes using triple registers to hold the values of each operation and its two additional replicas. This data path connects the inputs of the functional units via multiplexers. The outputs of the operations are then connected to various output registers, relying on the controller to select the appropriate load wire signal to store the output of the functional units in the register at the appropriate clock cycle throughout the schedule's execution.

Figure 3.8: Generated data path for a multiply-accumulate computation supporting RaS. (Schematic, with one adder, one array-addressing unit and one multiplier fed by input multiplexers and writing to discrete and triplicated registers with checker logic, not reproduced.)

The algorithm begins by emitting code that implements the various functional units, followed by code for the registers. A register needs to be triplicated if the corresponding operation is executed three times. After this first basic step, the algorithm determines the number of multiplexers to include at the input of each functional unit, computing the dimensions of each multiplexer by determining the number of distinct operations assigned to the corresponding functional unit 3. The association of each multiplexer selector configuration with each operation is arbitrary, as a single sequential numbering will suffice. The generated code simply connects the output of the registers to the corresponding multiplexer inputs.

3 Chaining of operations via discrete registers, replicated temporal redundancy, requires that each functional unit always use the first replica register as its input. Other arrangements are possible, but are not pursued at this time.

As a validation vehicle, an RTL Verilog HDL datapath and controller specification, geared towards an FPGA implementation, is derived. Clearly, the code generation algorithm could also target other configurable computing platforms or even a custom ASIC.

The RaS schedule illustrated in Figure 2.1 for the example multiply-accumulate computation is depicted via a schematic of the generated data path hardware in Figure 3.8. This example schedule corresponds to a configuration with a single functional unit of each type (1 adder, 1 array-addressing unit and 1 multiplier, in this case) and exploits the IR of the original computation. As only some of the operations are covered, only a subset of the registers is triplicated, and only these are connected to individual checker logic circuits.

3.2 Vex VLIW Infrastructure

This section describes the RaS Vex-based VLIW target infrastructure. This VLIW infrastructure is based on the Vex compiler [23] and is outlined in Figure 3.9. The system flow of the RaS Vex infrastructure is as follows: a C source 4 is passed through the Vex compiler FE, using the -S compiler switch to produce a Vex assembly file.
A parser then processes this Vex assembly file, with the Vex statements translated into appropriate data structures for hardening. These data structures are hardened via the Replica/Voting Instruction Injector. The Instruction Injector requires, in addition to the assembly file produced by the compiler, a number of command-line parameters. These parameters include a replication scheme, a voting scheme, and an architecture configuration, which outlines the available hardware resources of the underlying system for which the computation will be compiled. Once the hardening has taken place, via the RaS algorithm, the Instruction Injector finally re-translates the aforementioned data structures, and outputs a hardened Vex assembly file. The assembly file can then be compiled by the Vex compiler and run by the resultant compiled simulator.

4 As explained in Section 3.2.1, the compiler bundled with the Vex toolchain is based on the Lx/ST200 C compiler. Therefore, C is the only language supported.

Figure 3.9: A block diagram outlining the main components of the RaS Vex target infrastructure. (Blocks: C source, VEX compiler, VEX assembly, Vex parser, Replica/Voting Instruction Injector with replication scheme, voting scheme and architecture parameters as inputs, and VEX binary, VEX simulation data and ρ-Vex synthesis data as outputs.)

While the infrastructure differs greatly from the Open64-based tool, it implements many of the same RaS techniques. There is one main difference, however: the Vex-based VLIW target applies RaS at the assembly level, after compiler transformations have occurred. This difference has a number of implications, which are explained in Section 3.2.2. The next sections describe the RaS Vex VLIW target infrastructure in greater detail.

3.2.1 The Vex Toolchain

The Vex VLIW infrastructure [23] was developed by Hewlett-Packard and STMicroelectronics and is a 32-bit clustered VLIW instruction set architecture (ISA). The Vex toolchain [34] includes an ISO/C89 trace-scheduling compiler, derived from the Lx/ST200 C compiler, and a simulator. The ISA is flexible, allowing a number of architecture parameters to be input as a machine model to the compiler. This machine model specifies the number and type of functional units (ALUs, multipliers, etc.), the register file size, and the number of clusters, and can be changed at compile time to observe the impact different architectural parameters have on execution performance. The availability of a robust, customizable toolchain makes it the ideal choice for work of this nature. In addition, an advanced VLIW softcore processor based on the Vex ISA has been developed [67]. These VLIW architectures are becoming common in embedded applications due to increased FPGA densities, which make larger and custom-width configurable VLIW softcore processors more practical [24, 54, 67].

3.2.2 Assembly-Level Hardening

As previously mentioned, the RaS VLIW target infrastructure implements RaS at the assembly level. This has a number of advantages (and disadvantages) when compared to hardening at the compiler level. The most fundamental difference is the trade-off between code optimization and code clarity. Hardening at the assembly level happens after compiler optimizations. This allows compiler transformations such as loop unrolling, loop tiling, and other optimizations to take place. Such transformations are often engineered at a level of complexity and efficiency that surpasses high-level approaches.
It is unlikely that code transformations done at a high level would produce code as efficient and optimized as a native compiler's. These optimizations come at the cost of code clarity, however, as the majority of the structure of high-level source code is no longer present at the assembly level. As such, tasks like dependency analysis, or determining which source-level variable corresponds to a particular register, are often difficult or impossible. In addition, the link between an operation at the source level and the assembly instructions that correspond to it can be difficult to ascertain. For example, the code in Figure 3.10 is only a partial listing of the Vex assembly produced by a single line of C, the expression a = b / c. This disconnect between the source and the resultant assembly code makes the task of replicating operations, with respect to the source code, difficult. This process could be assisted by mapping annotations at the compiler level; however, at present, the Vex compiler does not allow this.

  c0 cmplt $r0.5 = $r0.2, $r0.0
  c0 sub $r0.6 = $r0.0, $r0.2
;; ## 0
  c0 cmplt $r0.8 = $r0.7, $r0.0
;; ## 1
  c0 sub $r0.9 = $r0.0, $r0.7
  c0 cmpeq $b0.3 = $r0.8, $r0.5
;; ## 2
  c0 slct $r0.6 = $b0.0, $r0.6, $r0.2
;; ## 3
  c0 slct $r0.9 = $b0.2, $r0.9, $r0.7
  c0 addcg $r0.2, $b0.0 = $r0.6, $r0.6, $b0.1
;; ## 4
  c0 divs $r0.2, $b0.0 = $r0.0, $r0.9, $b0.0
  c0 addcg $r0.6, $b0.2 = $r0.2, $r0.2, $b0.1
  ...
  c0 addcg $r0.6, $b0.2 = $r0.5, $r0.5, $b0.2
  c0 cmpge $r0.2 = $r0.2, $r0.0
;; ## 37
  c0 orc $r0.6 = $r0.6, $r0.0
;; ## 38
  c0 sh1add $r0.6 = $r0.6, $r0.2
;; ## 39
  c0 sub $r0.2 = $r0.0, $r0.6
;; ## 40
  c0 slct $r0.4 = $b0.3, $r0.6, $r0.2
;; ## 41
Figure 3.10: A sample of the resultant Vex assembly from a simple division expression, a = b/c.

Another notable difference between the compiler-level approach and assembly-level hardening is the handling of dependencies for scheduling. Instead of instructions being deemed "ready" based on dependencies in a dataflow graph, at the assembly level the Vex operations have already been scheduled and ordered by the compiler. Dependencies are trivially satisfied by the ordering of the instructions.
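The parser stage can be pictured as a line-oriented pass over assembly of the form shown in Figure 3.10. The following minimal Python sketch splits one operation into cluster, opcode, destinations and sources; the regex and field names are illustrative, and real Vex syntax (memory forms, immediates, bundle delimiters) is richer than this sketch admits.

    import re

    # Minimal parse of one Vex operation such as:
    #   "c0 sub $r0.6 = $r0.0, $r0.2"
    OP_RE = re.compile(r"^\s*c(\d+)\s+(\S+)\s+(.*?)\s*=\s*(.*)$")

    def parse_op(line):
        m = OP_RE.match(line)
        if not m:
            return None  # e.g. ';;' bundle delimiters or comments
        cluster, opcode, dests, srcs = m.groups()
        return {
            "cluster": int(cluster),
            "opcode": opcode,
            "dests": [d.strip() for d in dests.split(",")],
            "srcs": [s.strip() for s in srcs.split(",")],
        }

    print(parse_op("c0 sub $r0.6 = $r0.0, $r0.2"))
    # -> {'cluster': 0, 'opcode': 'sub',
    #     'dests': ['$r0.6'], 'srcs': ['$r0.0', '$r0.2']}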
The main advantage of this approachisitssimplicity.Themaindisadvantageisthefactthatshorterformatsneedto paytheprice(intermsofmaskbits)forthelongestpossibleVLIWinstruction.Figure3.5 shows an example of mask-based encoding. Distributed Encoding Distributed encoding is a variable-overhead method of encoding VLIW instructions by explicitly inserting a stop bit to delimit the end of the current instruction (or a start bit for the start of the next instruction). Compared against mask encoding, variable-size encoding has the advantage of having a distributed and incremental encoding cost that does not penalize the shortest instructions. In addition, there is no need to explicitly encode the “next PC,” as it is obvious from the presence of the start or stop bit. On the negative side, it does require a more complex decoding scheme and a more complex sequencing scheme (for example, it is necessary to “look ahead” for the next stop bit to findthenextPC).Anequivalentalternativetotheuseofstart/stopbitsusestheconcept Figure 3.11: A 4-wide VLIW instruction stream over 8 cycles [23]. The shaded squares in (a) correspond to NOPS, where no operations are being executed. These ”holes” in the schedule are used by RaS to schedule resiliency operations. This example has 17 Vex operations (b). Lastly, it’s possible that replicating instructions in assembly (post-compiler opti- mizations) may increase the live-ranges of variables, as one instance of an instruction may be executed very ”far away” (in time) from another. Fortunately, VLIW architec- tures traditionally have many replicated functional units and large register files. 3.2.3 RaS for VLIW Assembly RaS operation hardening for VLIW assembly is similar to the compiler-level replication technique presented in Section 1.2.1. As functional unit availability and issue width allow, replica operations 5 are created and inserted in ILP bubbles, replacing NOPs, if possible. If there are noNOPs available – which is often the case for a code that exhibits a large degree of parallelism – then the original, unhardened schedule must be extended to accommodate the replica and voting operations, which increases the computation’s 5 This work adopts the definitions of ”operation” and ”instruction” similar to that of [23]. An instruc- tion is the full VLIW unit of execution. Operations are the individual entities that are executed on the functional units, represented by the individual squares in Figure 3.11 (a). This example has 8 instructions and 17 operations. 48 runtime. Unlike the Open64-based infrastructure, which handled voting via dedicated hardware embedded in the data path, the VLIW target inserts additional exclusive OR (XOR) operations explicitly, in the same manner in which it inserts replica operations. TheseXOR operations perform a bitwise compare on two values, with the system raising an exception on a mismatch. An example of a VLIW instruction stream is shown in Figure 3.11. This example has 8 instructions over 8 cycles, with 17 operations and 15 operation slots wherein no operation is executed (NOPs). Figure 3.12 provides pseudocode for the RaS VLIW assembly-level hardening algo- rithm. The algorithm starts by determining the replication scheme, either pairing or triplication which replicates the original instruction 2 or 3 times, respectively (lines 1 through 5). A list of operations which to replicate is then constructed via a linear scan of the instructions (lines 6 through 13). 
When an instruction to be replicated is encountered, the number of replicas for that instruction is initialized to 0 (line 10).

The main hardening loop starts on line 14. For each operation to harden, the algorithm starts at that operation's instruction, and searches for open issue slots and functional units with which to schedule the replica operation. The algorithm keeps track of the number and type of functional units that are unused for each instruction. Once a suitable operation slot – that is, an available functional unit in an open issue slot in an instruction – is found, the instruction is replicated (line 19), the corresponding functional unit is marked as busy (line 20), and the number of replicas is incremented (line 21). This process continues until all relevant instructions have been hardened. Note that this may extend the schedule past the duration of the original computation that was input to the hardening algorithm.

1: if replication scheme == pairing then
2:   replicaScheme ← 2
3: else if replication scheme == triplication then
4:   replicaScheme ← 3
5: end if
6: for each instruction Ins_i do
7:   for each operation Op_ij ∈ Ins_i do
8:     if Op_ij is a replicatable operation then
9:       RepOps ← Op_ij
10:      numberOfReplicas[RepOp_ij] ← 0
11:    end if
12:  end for
13: end for
14: for each operation RepOp_ij ∈ RepOps do
15:   while numberOfReplicas[RepOp_ij] < replicaScheme do
16:     for instruction Ins ← Ins_i to Ins_end do
17:       if numberOfOperations(Ins) < issueWidth then
18:         if functionalUnit ← findAvailableFunctionUnit(RepOp_ij) == true then
19:           Ins ← replicateOperation(RepOp_ij)
20:           functionalUnit ← busy
21:           numberOfReplicas[RepOp_ij]++
22:         end if
23:       end if
24:     end for
25:   end while
26: end for
27: for each operation RepOp_ij ∈ RepOps do
28:   for instruction Ins ← Ins_i to Ins_end do
29:     if numberOfOperations(Ins) < issueWidth then
30:       if functionalUnit ← findAvailableFunctionUnit(RepOp_ij) == true then
31:         Ins ← replicateOperation(RepOp_ij)
32:         functionalUnit ← busy
33:         numberOfReplicas[RepOp_ij]++
34:       end if
35:     end if
36:   end for
37: end for
Figure 3.12: Pseudo-code for the assembly-level hardening algorithm for the lazy voter scheme.

3.3 RaS for VLIW Hardening Variants

When hardening computations at the granularity of individual assembly operations, there are two main variations that affect how the code is constructed. The first is the replication scheme, which determines how many replica operations are injected into the original program. The second is the voting scheme, which decides where the voting operations are placed relative to the replica operations. As noted in Figure 3.12, the replication scheme can be either pairing or triplication. In addition, the RaS VLIW infrastructure supports two different voting schemes: lazy voter and greedy voter.
The triplication example, shown in Figure 3.13(a), requires 5 voting and replica operations per original operation to be hardened. Conversely, the pairing example given in Fig- ure 3.13(b) requires the addition of only2 operations for hardening. Intuitively, codes that do not exhibit a large amount of parallelism – thereby hav- ing free resources during most of the computation’s execution – would be amenable to triplication, without much impact on performance. Similarly, machines with large instruction issue-widths and many functional units would also be a good match for trip- lication. Conversely, smaller machines, with less resources, and highly parallel codes could potentially see a significant increase in nominal latency with the extra operations required for triplication. In these instances, pairing could potentially be a better choice. These trade-offs are explored in Section 4.2.2. In addition to affecting the overall number of executed operations, the choice between pairing and triplication also impacts a computation’s expected performance in the face of errors. This impact is detailed in Section 4.2.3. 51 3.5 VLIW Encoding 115 1 2 3 4 5 6 7 8 Time Memory address Issue width A0 C0 D0 E0 F0 B0 A0 B0 C0 D0 E0 F0 Schedule (a) (b) (c) Instruction stream Memory image FIGURE 3.4 HowaVLIWschedulecanbeencodedcompactlyusinghorizontalandvertical nops. This example shows a schedule for a 4-wide VLIW machine over eight cycles. In (a), gray boxes indicate nops in the original schedule. (b) shows the temporally equivalent instruction stream (made up of 0 to 4 operations per instruction) with variable-width instructions. Instruction length might be encoded using start bits, stop bits, or templates in each case. Darkened vertical bars show the end of each instruction. (c) shows the corresponding memory image for this instruction stream. Cycles 6 and 7 are nop cycles, and thus are replaced by a vertical nop instruction, represented by the gray box in (c) between operation EO and operation FO. Instruction boundaries are still portrayed in the memory image, although they are of course encoded within the constituent instructions and operations. Fixed-overhead Encoding Fixed-overheadencodingisafixed-overheadcompressedformatforVLIWinstructions. Itwasusedinfirst-generationVLIWmachines,suchastheMultiflowTRACE.Itinvolves prependingasyllable-sizedbitmasktoallinstructions,tospecifythemappingbetween parts of the instructions and slots in the instruction buffer. The main advantage of this approachisitssimplicity.Themaindisadvantageisthefactthatshorterformatsneedto paytheprice(intermsofmaskbits)forthelongestpossibleVLIWinstruction.Figure3.5 shows an example of mask-based encoding. Distributed Encoding Distributed encoding is a variable-overhead method of encoding VLIW instructions by explicitly inserting a stop bit to delimit the end of the current instruction (or a start bit for the start of the next instruction). Compared against mask encoding, variable-size encoding has the advantage of having a distributed and incremental encoding cost that does not penalize the shortest instructions. In addition, there is no need to explicitly encode the “next PC,” as it is obvious from the presence of the start or stop bit. 
On the negative side, it does require a more complex decoding scheme and a more complex sequencing scheme (for example, it is necessary to “look ahead” for the next stop bit to findthenextPC).Anequivalentalternativetotheuseofstart/stopbitsusestheconcept Add 2 $ Add 3 $ Add 1 $ Add$ XOR 1*2 $ Add$ XOR 1*3 $ Add$ XOR 2*3 $ Mult 1 $ Mult 2 $ Mult$ XOR 1*2 $ Mult 3 $ Mult$ XOR 1*3 $ Mult$ XOR 2*3 $ Sub 1 $ Sub 2 % Sub$ XOR 1*2 $ Sub 3 % Sub$ XOR 1*3 $ Sub$ XOR 2*3 $ 3.5 VLIW Encoding 115 1 2 3 4 5 6 7 8 Time Memory address Issue width A0 C0 D0 E0 F0 B0 A0 B0 C0 D0 E0 F0 Schedule (a) (b) (c) Instruction stream Memory image FIGURE 3.4 HowaVLIWschedulecanbeencodedcompactlyusinghorizontalandvertical nops. This example shows a schedule for a 4-wide VLIW machine over eight cycles. In (a), gray boxes indicate nops in the original schedule. (b) shows the temporally equivalent instruction stream (made up of 0 to 4 operations per instruction) with variable-width instructions. Instruction length might be encoded using start bits, stop bits, or templates in each case. Darkened vertical bars show the end of each instruction. (c) shows the corresponding memory image for this instruction stream. Cycles 6 and 7 are nop cycles, and thus are replaced by a vertical nop instruction, represented by the gray box in (c) between operation EO and operation FO. Instruction boundaries are still portrayed in the memory image, although they are of course encoded within the constituent instructions and operations. Fixed-overhead Encoding Fixed-overheadencodingisafixed-overheadcompressedformatforVLIWinstructions. Itwasusedinfirst-generationVLIWmachines,suchastheMultiflowTRACE.Itinvolves prependingasyllable-sizedbitmasktoallinstructions,tospecifythemappingbetween parts of the instructions and slots in the instruction buffer. The main advantage of this approachisitssimplicity.Themaindisadvantageisthefactthatshorterformatsneedto paytheprice(intermsofmaskbits)forthelongestpossibleVLIWinstruction.Figure3.5 shows an example of mask-based encoding. Distributed Encoding Distributed encoding is a variable-overhead method of encoding VLIW instructions by explicitly inserting a stop bit to delimit the end of the current instruction (or a start bit for the start of the next instruction). Compared against mask encoding, variable-size encoding has the advantage of having a distributed and incremental encoding cost that does not penalize the shortest instructions. In addition, there is no need to explicitly encode the “next PC,” as it is obvious from the presence of the start or stop bit. On the negative side, it does require a more complex decoding scheme and a more complex sequencing scheme (for example, it is necessary to “look ahead” for the next stop bit to findthenextPC).Anequivalentalternativetotheuseofstart/stopbitsusestheconcept Add 2 $ Add 3 $ Add 1 $ Add$ XOR 1*2 $ Add$ XOR 1*3 $ Add$ XOR 2*3 $ Mult 1 $ Mult 2 $ Mult$ XOR 1*2 $ Mult 3 $ Mult$ XOR 1*3 $ Mult$ XOR 2*3 $ Sub 1 $ Sub 2 % Sub$ XOR 1*2 $ Sub 3 % Sub$ XOR 1*3 $ Sub$ XOR 2*3 $ Triplica.on%with%lazy%vo.ng% Triplica.on%with%greedy%vo.ng% (a) RaS triplication example 3.5 VLIW Encoding 115 1 2 3 4 5 6 7 8 Time Memory address Issue width A0 C0 D0 E0 F0 B0 A0 B0 C0 D0 E0 F0 Schedule (a) (b) (c) Instruction stream Memory image FIGURE 3.4 HowaVLIWschedulecanbeencodedcompactlyusinghorizontalandvertical nops. This example shows a schedule for a 4-wide VLIW machine over eight cycles. In (a), gray boxes indicate nops in the original schedule. 
(b) shows the temporally equivalent instruction stream (made up of 0 to 4 operations per instruction) with variable-width instructions. Instruction length might be encoded using start bits, stop bits, or templates in each case. Darkened vertical bars show the end of each instruction. (c) shows the corresponding memory image for this instruction stream. Cycles 6 and 7 are nop cycles, and thus are replaced by a vertical nop instruction, represented by the gray box in (c) between operation EO and operation FO. Instruction boundaries are still portrayed in the memory image, although they are of course encoded within the constituent instructions and operations. Fixed-overhead Encoding Fixed-overheadencodingisafixed-overheadcompressedformatforVLIWinstructions. Itwasusedinfirst-generationVLIWmachines,suchastheMultiflowTRACE.Itinvolves prependingasyllable-sizedbitmasktoallinstructions,tospecifythemappingbetween parts of the instructions and slots in the instruction buffer. The main advantage of this approachisitssimplicity.Themaindisadvantageisthefactthatshorterformatsneedto paytheprice(intermsofmaskbits)forthelongestpossibleVLIWinstruction.Figure3.5 shows an example of mask-based encoding. Distributed Encoding Distributed encoding is a variable-overhead method of encoding VLIW instructions by explicitly inserting a stop bit to delimit the end of the current instruction (or a start bit for the start of the next instruction). Compared against mask encoding, variable-size encoding has the advantage of having a distributed and incremental encoding cost that does not penalize the shortest instructions. In addition, there is no need to explicitly encode the “next PC,” as it is obvious from the presence of the start or stop bit. On the negative side, it does require a more complex decoding scheme and a more complex sequencing scheme (for example, it is necessary to “look ahead” for the next stop bit to findthenextPC).Anequivalentalternativetotheuseofstart/stopbitsusestheconcept Add 2 Add 1 Add XOR 1-‐2 Mult 1 Mult 2 Mult XOR 1-‐2 Sub 1 Sub 2 Sub XOR 1-‐2 3.5 VLIW Encoding 115 1 2 3 4 5 6 7 8 Time Memory address Issue width A0 C0 D0 E0 F0 B0 A0 B0 C0 D0 E0 F0 Schedule (a) (b) (c) Instruction stream Memory image FIGURE 3.4 HowaVLIWschedulecanbeencodedcompactlyusinghorizontalandvertical nops. This example shows a schedule for a 4-wide VLIW machine over eight cycles. In (a), gray boxes indicate nops in the original schedule. (b) shows the temporally equivalent instruction stream (made up of 0 to 4 operations per instruction) with variable-width instructions. Instruction length might be encoded using start bits, stop bits, or templates in each case. Darkened vertical bars show the end of each instruction. (c) shows the corresponding memory image for this instruction stream. Cycles 6 and 7 are nop cycles, and thus are replaced by a vertical nop instruction, represented by the gray box in (c) between operation EO and operation FO. Instruction boundaries are still portrayed in the memory image, although they are of course encoded within the constituent instructions and operations. Fixed-overhead Encoding Fixed-overheadencodingisafixed-overheadcompressedformatforVLIWinstructions. Itwasusedinfirst-generationVLIWmachines,suchastheMultiflowTRACE.Itinvolves prependingasyllable-sizedbitmasktoallinstructions,tospecifythemappingbetween parts of the instructions and slots in the instruction buffer. 
The main advantage of this approachisitssimplicity.Themaindisadvantageisthefactthatshorterformatsneedto paytheprice(intermsofmaskbits)forthelongestpossibleVLIWinstruction.Figure3.5 shows an example of mask-based encoding. Distributed Encoding Distributed encoding is a variable-overhead method of encoding VLIW instructions by explicitly inserting a stop bit to delimit the end of the current instruction (or a start bit for the start of the next instruction). Compared against mask encoding, variable-size encoding has the advantage of having a distributed and incremental encoding cost that does not penalize the shortest instructions. In addition, there is no need to explicitly encode the “next PC,” as it is obvious from the presence of the start or stop bit. On the negative side, it does require a more complex decoding scheme and a more complex sequencing scheme (for example, it is necessary to “look ahead” for the next stop bit to findthenextPC).Anequivalentalternativetotheuseofstart/stopbitsusestheconcept Add 2 Add 1 Add XOR 1-‐2 Mult 1 Mult 2 Mult XOR 1-‐2 Sub 1 Sub 2 Sub XOR 1-‐2 Div XOR 1-‐2 Pairing with lazy vo4ng Pairing with greedy vo4ng Div 1 Div 2 Div 1 Div 2 Div XOR 1-‐2 (b) RaS pairing example Figure 3.13: Example VLIW instruction stream hardened via RaS, using the triplication (a) and pairing (b) schemes. The shaded boxes indicate operation slots that wereNOPs, and were replaced with replica or voting operations. 3.3.2 Voting Schemes The current VLIW target infrastructure supports two voting schemes: lazy and greedy. These schemes dictate the order in which the XOR instructions used for majority vot- ing are inserted in the instruction stream, relative to the replica operations used for resiliency. 52 In lazy voting, voting operations are inserted only after all ready replica operations. This is shown in Figure 3.13(a). In the second operation of instruction 2 there is an available issue slot 6 . There is an outstanding replica operation – the third execution of the add operation – so the lazy voting scheme cannot insert a vote operation in that issue slot. Only when all ready replicas have been inserted are voting operations inserted, which happens in the fourth operation of instruction2 in this example. Conversely, greedy voting, as its name implies, inserts the voting operations as soon as possible. Once a replica operation is added to the instruction stream – as the XOR operations used for voting are pairwise – a voting operation is generated and inserted in the next available issue slot. This is shown in Figure 3.13(b). After the replicaadd operation is added in the third slot of instruction 1, a voting operation is generated. ThisXOR operation is inserted in the second slot of instruction 2. Note that this voting operation is inserted before the replica operation of thediv operation in instruction 1, even though the replica is ready. While producing the same number of overall operations – and thereby the same nominal performance – these two voting schemes can potentially impact a number of aspects of the computation’s execution. In most cases, greedy voting will be the supe- rior scheme. This is because the greedy scheme seeks to reduce the distance between a replica and its voting operations. Doing so is beneficial in two ways. First, it can relieve register pressure, as the live ranges of the registers used to hold replica values are poten- tially reduced. 
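The essential difference between the two schemes is when each pairwise XOR becomes eligible to claim a slot. The sketch below, reusing the toy schedule model from the earlier example, is one hedged way to express it; the group representation and the place helper are assumptions, not the infrastructure's API, and processing groups in program order only approximates the full interleaving of replicas and votes.

    ISSUE_WIDTH = 4

    def place(schedule, op, earliest):
        """Put op in the first open slot at or after cycle `earliest`."""
        i = earliest
        while True:
            if i == len(schedule):
                schedule.append([None] * ISSUE_WIDTH)
            if None in schedule[i]:
                schedule[i][schedule[i].index(None)] = op
                return i
            i += 1

    def harden_with_votes(schedule, groups, scheme):
        """groups: one list per hardened operation, of (earliest_cycle, name)
        pairs for its replicas, in program order. Under greedy voting each
        pairwise XOR is placed as soon as both of its operands are in the
        schedule, so it can claim a slot ahead of a later (even ready)
        replica; under lazy voting every XOR waits until all replicas have
        been placed."""
        deferred = []
        for g, group in enumerate(groups):
            already = []
            for earliest, name in group:
                cycle = place(schedule, name, earliest)
                for prev_cycle, prev in already:      # this replica completes
                    vote = ("xor", prev, name)        # a pair -> one vote op
                    ready = max(prev_cycle, cycle) + 1
                    if scheme == "greedy":
                        place(schedule, vote, ready)
                    else:
                        deferred.append((ready, vote))
                already.append((cycle, name))
        for ready, vote in sorted(deferred):          # lazy: votes go in last
            place(schedule, vote, ready)

    # Example: pairing two copies of an add available from cycle 0.
    sched = [[None] * ISSUE_WIDTH for _ in range(2)]
    harden_with_votes(sched, [[(0, "add_1"), (0, "add_2")]], scheme="greedy")

For pairing each group yields one XOR; for triplication, three pairwise XORs, matching the counts noted above.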
Second, because the distance between a replica and its voting operations is reduced, the potential impact on latency of an error is reduced (see Section 2.5).

There is one instance where lazy voting may be beneficial: when the overall number of voting operations can be reduced. For example, if the computation funnels to a few (or a single) result(s), a programmer might decide to forgo replicating intermediate operations. In this instance, faster execution is possible because the computation can do more "useful" work by executing fewer resiliency and voting operations. However, this greatly increases the impact of an error, as it would potentially not be detected until the end of the computation, regardless of where in the execution stream the error occurred.

[Figure 3.14: The effect of XOR placement on execution latency. In the schedule on the left, Add 1 and Add 2 issue in cycles 1 and 2 but the XOR 1-2 vote is delayed until cycle 8; in the schedule on the right, the XOR 1-2 vote issues in cycle 3, immediately after the replicas. The "high impact" error on the left is not detected for 6 cycles, whereas the "low impact" error on the right is detected within 1 cycle.]

3.3.3 Dynamic Scheduling

As detailed in Section 2.5, the performance impact of an error is dictated by the distance – in terms of execution cycles – between where the error occurs and when it can be detected. This relationship is depicted in Figure 3.14. The execution stream on the left delays the XOR operation until cycle 8, which causes 6 cycles to elapse between when the error occurs and when it is detected. Conversely, the execution stream on the right has the XOR immediately following the execution of the last replica operation, allowing the error to be detected on the next cycle.

In an environment with high error rates, greedy voting may not be enough to provide acceptable performance. In these cases, a third voting scheme, forced voting, is possible. In forced voting, the XOR operations are inserted directly after the replicas on which they are voting, regardless of whether there are any issue slots available. The voting operations "force" their way into the computation, potentially shifting the rest of the operations down to later instructions. This can be thought of as software TMR, as the majority voter operations are inserted in the very next cycle after the final replica operation is executed. In this way, errors could potentially be corrected, as the voting takes place before the output of the replicated operation is consumed. This comes at a cost, however: the reason the XOR operations were not inserted into that position in the first place is that the computation exhibits enough ILP to be doing "useful" work at that point. When the voting operations are forced into the instruction stream, that useful work is delayed, which can lead to longer nominal execution times.

If architecture and compiler support is provided, a dynamic schedule is possible, wherein the voting scheme is changed depending on the prevailing error rate. In this scenario, as errors increase, the XOR operations would gravitate "upwards" toward the replica operations on which they are voting, effectively moving from the lazy voting scheme to the forced voting scheme. If the error rate were to decrease, the system would relax the constraints on the voting operations, allowing them to fall back to their original positions, which would improve nominal [error-free] performance.
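One could imagine a policy of roughly the following shape. This is a speculative sketch, not anything implemented in the infrastructure: the threshold values are invented for illustration, and the simplified cost model only echoes the Section 2.5 intuition that the expected penalty grows with the replica-to-vote distance.

    def expected_error_cost(p_op_error, detect_distance):
        """Simplified view of the Section 2.5 model: an error at an operation
        costs roughly the cycles elapsed until its vote detects it, so the
        expected penalty per protected operation grows with the distance
        between the last replica and its XOR."""
        return p_op_error * detect_distance

    def choose_voting_scheme(observed_error_rate, low=1e-7, high=1e-4):
        # Illustrative thresholds only: migrate votes "upwards" as errors rise.
        if observed_error_rate < low:
            return "lazy"      # votes can trail; detection distance is cheap
        if observed_error_rate < high:
            return "greedy"    # shrink the replica-to-vote distance
        return "forced"        # vote in the very next cycle (software TMR)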
However, because of the complexity of reordering operations while still maintaining dependencies, forced voting and this dynamic scheduling were not implemented in the current infrastructure, and remain an avenue for future work.

3.4 Chapter Summary

This chapter detailed the software toolchains that were built to implement the RaS concepts. Two distinct infrastructures were presented: an Open64-based infrastructure targeting FPGAs, and a Vex-based infrastructure targeting VLIW softcore architectures. The workflows of these toolchains were discussed, as well as the internal analysis processes that enable RaS. Key aspects include the FPGA area consumption model for the Open64-based infrastructure, and a description of the hardening variants for the Vex-based infrastructure. These toolchains are used to implement RaS on a test suite of scientific code kernels, the results of which are presented in the next chapter.

Chapter 4

Experimental Results

This chapter presents experimental results produced from the application of Resiliency-aware Scheduling concepts, detailed in Chapter 2 and carried out via the two distinct infrastructures discussed in Chapter 3. Section 4.1 presents results for the Open64-based FPGA target. FPGA design results from a test suite of seven realistic computational kernels are given, showing significant area reductions when compared to a traditional coarse-grained resiliency approach using Triple Modular Redundancy (TMR). Results for the Vex-based [reconfigurable] VLIW target are shown in Section 4.2, where a case study, using a kernel code from the test suite mentioned above, is presented. Overall, and for this set of experiments, the RaS approach more efficiently uses resources and exhibits higher scalability than software-level hardening or TMR, and can achieve lower latencies for a fixed area consumption target than traditional hardening approaches.

4.1 Open64 Infrastructure: FPGA Target

This section details the experimental results for the RaS Open64-based FPGA target infrastructure. The next subsection discusses the experimental methodology. Section 4.1.2 details the test suite of code kernels used in the experiments presented in this chapter. The six designs that were synthesized for each code in the test suite are
discussed in Section 4.1.3, as well as the experimental results and accompanying discussion. An explanation of the design space exploration methodology is given in Section 4.1.4. Finally, an expected latency analysis, simulating the performance of RaS designs in the face of errors, is given in Section 4.1.5.

4.1.1 Experimental Methodology

There are a number of factors that determine the relative merits of a particular resilient design. In order to focus the contributions of this work, three main metrics of performance were chosen: latency (in terms of execution cycles), operational coverage, and device area consumption [Footnote 1: Other design considerations, such as power consumption, are relevant but are beyond the scope of this work.].

Furthermore, to more concretely quantify the relative benefits of RaS, performance and coverage are fixed at "optimal" values. This means that only designs with 100% operational coverage and performance equal to the computation's critical path are considered. This experimental methodology was chosen for two reasons. First, these "optimal" designs, with critical path performance and maximal operational coverage, are likely the designs that would be most interesting to potential device engineers. Second, this methodology restricts the potential design space and more specifically defines the notion of "beneficial", as mentioned above; the number of potential design configurations for even a relatively small source code can number in the tens of thousands.

There are a number of other potential design points that may be of interest in specific cases. For example, for a fixed set of resources, it might be useful to know the percentage of operational coverage provided via RaS without affecting performance, or the fastest potential design, regardless of operational coverage. However,
Secondly, codes with significant control flow or that require user input are simply not a good match for RaS. In cases where hardening for these sorts of codes is required, a different approach, such as checkpoint-and-restart or hardening by process would likely be more beneficial. Code Critical Additions Multiplications Array Divisions Square Total kernel path operator root operations (cycles) MAC (2) 6 4 2 4 10 UMT 52 23 45 5 73 GMPS 96 6 6 3 2 17 CGM (1) 5 6 1 5 12 CGM (2) 5 9 2 10 21 CGM (4) 5 21 4 20 45 CGM (8) 5 45 8 40 93 Table 4.1: Critical path latency and operation breakdown for the computational kernels under test. In short, the chosen test suite represents a wide spectrum of potential use-cases for RaS. Table 4.1 gives a breakdown of operations, as well as the critical path length for the seven code variants under test. The remainder of this section explains the properties of these codes in more detail. 60 MAC (2) The MAC (2) code was described as a motivating example in Section 2.4. It contains a simple multiply-accumulate kernel, which has been unrolled twice, to expose more operations in the basic block of the loop nest. It is a common kernel used in many matrix manipulation routines. UMT The UMT kernel is directly derived from the UMT1.2/UMT2K benchmark suite [7]. This kernel is the core of a 3D, deterministic, multigroup, photon transport computation for unstructured meshes. As shown in Table 4.1, the UMT kernel has 73 operations, with a critical path of 52 clock cycles. The large number of operations relative to the critical path is indicative of a high level of ILP and creates a relatively dense DFG (that is too wide to display accurately at normal page widths.) The UMT kernel computation is dominated by five long-latency divide operations, two of which are on the critical path. It also has45 multiply operations, which are exe- cuted on relatively large hardware blocks, as shown in Table 3.1. These two properties – the long latency of the divide operations and the many multiplication operations that require ”expensive” hardware – make it an interesting code choice for exploration. GMPS The GMPS code kernel was taken from a routine that implements a matrix algorithm for converting an indefinite system to a positive-definite system [25]. The GMPS kernel is notable in that it has two different types of long-latency operations, containing both square root and divide operations. These long-latency operations create a long critical path of96 cycles, as shown in Table 4.1. In addition, the vast majority of the operations in the kernel are on the critical path. This can be seen by inspection of the computation’s 61 Figure 4.1: The ”long and thin” dataflow graph for the GMPS code kernel. The shaded nodes are on the critical path. DFG, given in Figure 4.1, where the shaded nodes are on the critical path. This ”long and thin” DFG is in contrast to the dense dataflow graph of UMT2K, or to a code that has been unrolled, such as the CGM variants below. CGM CGM, or conjugate-gradient method is the last code of the test suite. This kernel is derived from the CG code in the NAS Parallel Benchmark suite and is used to solve an unstructured sparse linear system by the conjugate gradient method [10]. 
62 (0) source / sink (1) OPC_I4I4LDID: XJ {2} 0 (2) OPC_I4INTCONST: -1 {4} 0 (3) OPC_U4LDA: A {3} 0 (4) OPC_I4INTCONST: -1 {0} 0 (5) OPC_I4INTCONST: -1 {0} 0 (6) OPC_U4LDA: ROWIDX {0} 0 (7) OPC_U4LDA: Y {0} 0 (8) OPC_I4INTCONST: -1 {2} 0 (9) OPC_I4INTCONST: -1 {4} 0 (10) OPC_I4I4LDID: K {0} 0 (23) OPC_U4LDA: ROWIDX {3} 0 (27) OPC_U4LDA: Y {1} 0 (30) source / sink 0 (20) OPC_I4MPY: {2} 0 (17) OPC_I4ADD: {4} 0 (18) OPC_U4ARRAY: {3} 0 (14) OPC_I4ADD: {0} 0 (11) OPC_I4ADD: {0} 0 (12) OPC_U4ARRAY: {0} 0 (15) OPC_U4ARRAY: {0} 0 (26) OPC_I4ADD: {2} 0 (22) OPC_I4ADD: {4} 0 0 0 0 0 (24) OPC_U4ARRAY: {3} 0 (28) OPC_U4ARRAY: {1} 0 (21) OPC_I4ADD: {0} 2 1 (19) OPC_I4I4ILOAD: {2} 1 1 1 (13) OPC_I4I4ILOAD: {0} 1 (16) OPC_I4I4ILOAD: {0} 1 1 1 0 0 (29) OPC_I4ISTORE: {0} 1 0 0 (25) OPC_I4I4ILOAD: {2} 1 0 1 Figure 4.2: The CGM (1) kernel code dataflow graph. The shaded nodes are on the critical path. The core of the CGM kernel is an indirect reference to the y array, represented in FORTRAN as y(rowidx(k)) =y(rowidx(k))+a(k)xj (4.1) As shown in Table 4.1, the CGM kernel has a very short critical path. As such, it was a prime candidate for loop unrolling [8]. Loop unrolling increases the amount of operations executed in a basic block, creating more instruction-level parallelism (ILP). Unrolling does not induce any new dependencies – other than on the loop increment variable – and does not increase the critical path of the computation 2 . This creates dataflow graphs that are ”short and wide”. Figure 4.3 shows the dataflow graph for the 2 CGM (2) indicates the original CGM computation has been unrolled twice, CGM (4) is unrolled four times, and so on. 63 Figure 4.3: The ”short and wide” dataflow graph of CGM (4). Loop unrolling causes four equivalent critical paths. CGM (4) variant, and is in stark contrast to the ”tall and thin” DFG shown for GMPS in Figure 4.1. Note that the DFG for the CGM (8) variant is approximately twice as wide – without being any taller – and is far too large to accurately display at typical page widths. Despite the relatively simple nature of the CGM core, the four CGM variants provide an interesting look at how RaS performs as ILP increases, while still maintaining a fixed and small critical path. 4.1.3 FPGA Design Results This section details the results of using the Open64-based RaS infrastructure on the test suite described above. For each code in the suite, six designs were generated and synthesized, each representing a significant data point in the potential design-space of the computation when exploring the trade-offs between performance, area consumption and operational resiliency coverage. The RaS approach is represented by the RaS-OPT 64 design, and the competing TMR approach is represented by the CP-TMR design. The six design variants are explained in more depth below. Design variants In addition to the RaS-OPT and CP-TMR designs, four other design points were consid- ered. The six design variants derived for each computation in the test suite are: 1. Baseline: This design includes a single hardware functional unit for each class of operation. There is no attempt to provide operational resiliency coverage. The performance is bound by resource contention and need not be optimal. This design is merely used as a reference point, to better observe the relative area penalties for the other design variants. 2. Baseline-TMR: The same as the Baseline design, but using hardware TMR opera- tors. 
This design will have100% operational coverage, and has the same execution performance as the Baseline design in terms of clock cycles. 3. Baseline-RaS: This design includes a single type of operator for each class of operation, like the Baseline design. However, 100% operational coverage is pro- vided via RaS. The performance need not be optimal. 4. CP-m: The minimal resource configuration required to alleviate all resource con- tention, thereby achieving performance consistent with the critical path of the computation. There is no operational coverage for this design. 5. CP-TMR: Uses the same schedule and performance as the CP-m design, but uses hardware TMR operators, therefore providing 100% operational coverage. This design is considered the reference point when comparing RaS against TMR designs. 65 0 2000 4000 6000 8000 10000 12000 14000 MAC (2x) GMPS UMT CGM (1x) CGM (2x) CGM (4X) CGM (8x) Area Consump-on (LUT) Code Kernel CP-‐TMR RaS OPT Figure 4.4: Area consumption for RaS-OPT and CP-TMR design variants for the codes in the test suite. 6. RaS-OPT, the RaS-determined ”optimal” design. This design is the smallest design feasible with critical path performance and 100% coverage, provided via RaS. This is the main design under test. In order to quantify the area consumption of each hardware configuration, the designs were simulated and verified, targeting a Virtex-5 (LX50-3FF324) FPGA device. The correctness of the designs was verified using ModelSim TM , and the area of the cor- responding implementation was quantified using the Xilinx ISE 13.3 synthesis tool, with area reduction as the design goal setting. A summary of all the designs derived in the test suite is provided in Table 4.2. The comparison between the CP-TMR and RaS-OPT designs – the main designs under test – is shown in Figure 4.4. 66 Area MAC UMT GMPS CGM CGM CGM CGM (LUT) (2) (1) (2) (4) (8) Baseline 498 1;451 1;500 548 639 1;137 1;475 Baseline 1;303 2;273 3;938 1;336 1;444 1;959 2;278 (TMR) Baseline 679 3;059 1;896 762 1;019 1;970 3;141 (RaS) CP-m 847 4;516 1;500 509 1;008 2;266 4;102 CP-TMR 2;451 12;142 3;938 1;369 2;641 5;789 11;452 RaS-OPT 11;65 7;322 3;419 1;616 2;023 4;321 11;706 RaS Improvement 52% 40% 13% 18% 24% 25% 2% over CP-TMR Table 4.2: Summary of design implementation results for the Open64-based FPGA tar- get. The following subsections detail the results of applying the experimental methodol- ogy presented in Section 4.1.1 to the test suite given above. UMT Figure 4.5 displays the design results for the UMT kernel code. This code exhibited the second largest area savings for an RaS-hardened design. The RaS-OPT design occu- pies 40% fewer LUT than the CP-TMR design, 7;322 to 12;142. This savings can be attributed to a two-thirds reduction in the overall number of large multiplier blocks in the RaS-OPT design, as shown in Table 4.3. The UMT kernel code is made up of a DFG such that many of these ”costly” multiplications can execute in parallel, in the absence of resource contention. In fact, of the 45 multiplication operations in the UMT kernel code,10 of them can be executed in parallel, if resources allow. This fact, which is difficult to determine from the high-level source code, is the key insight explaining the relatively large CP-TMR design. As the design methodology requires critical path performance, enough resources must be allocated for a design to completely alleviate resource contention. 
In order to execute all10 of the multiplication 67 1,451 3,059 2,273 4,516 12,142 7,322 3644 3644 4,490 6143 13,665 10846 0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000 Base Base-‐RaS Base-‐TMR CPm CP-‐TMR RaS-‐Opt Area Consump-on (LUT) Resource Configura-on UMT LUT (Actual) LUT (Mode) Figure 4.5: Device area consumption for the UMT kernel code. operations concurrently, the TMR-hardened design requires 10 individual TMR units, each of which includes three large multiplier blocks. The flexibility of hTMR, however, allows for a different approach. The RaS-OPT design allocates the 10 multipliers, allowing the execution to continue at its ”widest” point without resource contention. Resiliency is then provided later on during execution, when the computation’s IR allows for re-execution on the same units. If the computation did not exhibit this level of IR – that is, if there were no spare cycles in which to execute replica instructions – the RaS infrastructure would allocate more multiplier resources. Under this scenario, the RaS-OPT and CP-TMR designs would degenerate to the same configuration. Figure 4.5 also shows the area modeling estimates for the UMT kernel code. In this instance, the results from the modeling are not very accurate, with overestimates ranging from a low of 12:54% for the CP-TMR design to a high of 151:14% for the Baseline 68 Design LUT Adders Multipliers Dividers Base 1,451 1 1 1 Base-RaS 3,059 1 1 1 Base-TMR* 2,273 3 3 3 CP-m 4,516 3 10 3 CP-TMR* 12,142 9 30 9 RaS-OPT 7,322 4 10 8 *Nominal functional units. For actual number of TMR units, divide by 3. Table 4.3: Design configurations and area consumption for the UMT kernel code. design. On average, the model overestimates the LUT consumption by 60:75%. This discrepancy is attributed to the model not being able to predict optimizations that the FPGA design tools can enact on such a large design. GMPS Table 4.4 summarizes the design configuration and area consumption results for the GMPS kernel code. As mentioned in Section 4.1.2, the GMPS kernel contains long- latency operations, such as divides and square roots. As such, the computation exhibits a large amount of IR; while these slow operations are taking place, other operations can execute (or re-execute, in the case of RaS). This property leads to minimal hardware requirements. In fact, the Baseline configuration is all that is needed to achieve critical path performance. The operations are laid out in such a way that there is no resource contention, even with the minimal configuration. This property also leads to an interest- ing dichotomy: while the hardened designs (both TMR and RaS) for GMPS are average in terms of area consumption, the Baseline configuration is the largest, relative to the other codes in the test suite. This is because the layout of the DFG allows for signif- icant operator reuse, and the many inputs to each of the single operator types must be multiplexed 3 . 3 This process is discussed in detail in Section 3.1.4. 69 1,500 1,896 3,938 1,500 3,938 3,419 1806 1806 4,315 1806 4,315 4074 0 500 1,000 1,500 2,000 2,500 3,000 3,500 4,000 4,500 5,000 Base Base-‐RaS Base-‐TMR CPm CP-‐TMR RaS-‐Opt Area Consump-on (LUT) Resource Configura-on GMPS LUT (Actual) LUT (Model) Figure 4.6: Device area consumption for the GMPS kernel code. For this kernel code, the application of TMR incurs a 262% area penalty, which is less than the expected 3 penalty, but still substantial. 
Conversely, the Baseline-RaS design requires only 26% more LUT than a non-hardened minimal configuration. This savings comes at a cost, however, as the Baseline-TMR hardened design reaches critical-path performance (96 cycles), whereas the RaS variant requires 162.

Comparing the main designs under test, the RaS-OPT design uses 13% fewer LUT (3,419 versus 3,938) than the CP-TMR configuration, while maintaining equal critical-path performance. This is due to a reduction in the number of adders and square root units, as shown in Table 4.4.

It is reasonable to assume that codes with a similar "shape" would benefit from an RaS-hardened design. The long-latency operations of the GMPS kernel allow for significant reuse of the other operators, which creates an area savings by eliminating certain functional units. In this instance, the hardware blocks in question – adders and square root units – are not very large, when compared to the overall size of the FPGA. However, as these blocks increase in size – due to a transition to floating point operations, or an increase in bit-width, for example – the benefit of RaS will only increase.

Design      LUT    Adders  Multipliers  Dividers  Square root operators
Base        1,500     1         1           1              1
Base-RaS    1,896     1         1           1              1
Base-TMR*   3,938     3         3           3              3
CP-m        1,500     1         1           1              1
CP-TMR*     3,938     3         3           3              3
RaS-OPT     3,419     1         3           3              2
*Nominal functional units. For the actual number of TMR units, divide by 3.

Table 4.4: Design configurations and area consumption for the GMPS kernel code.

On average, the RaS infrastructure overestimates the area consumption of the 6 designs of GMPS by 13.98%, when compared to the designs produced on the Virtex-5 FPGA (see Figure 4.6). The GMPS kernel produced the most accurate area estimates, when comparing the average of the 6 designs of each code in the test suite. The design with the most accurate area estimate is the Baseline-RaS design, which the RaS infrastructure underestimated by 4.75%. The least accurate is the CP-m design – which is equivalent to the Baseline design in this instance – which was overestimated by approximately 20%. It is likely that the smaller overall size of the designs produced by the GMPS kernel code allows for more accurate area consumption prediction.

MAC

The MAC (2) is the smallest of the seven codes in the test suite. As shown in Table 4.5, the Baseline configuration consumes only 498 LUT. This is not surprising, as the MAC (2) code kernel is essentially only two lines of source code, as shown in Figure 3.2.

The small size of the code very likely hinders opportunities for optimization by the design tool. This is illustrated in the area penalties for TMR configurations. For example, the Baseline-TMR design consumes 1,303 LUT, which is 2.62x the total for the Baseline design, and the CP-TMR configuration requires 2,451 LUT, a 2.89x penalty over the unhardened CP-m design.

As shown in Table 4.2, the RaS-OPT design of the MAC (2) kernel code exhibits the best relative area savings over the competing CP-TMR configuration: the traditional TMR approach requires 2,451 LUT, while the RaS-hardened design uses 1,165, for a 52% improvement. Figure 4.7 details the area consumption for the six design variants under test.

[Figure 4.7: Device area consumption for the MAC (2) kernel code, showing actual and modeled LUT totals for the six design variants.]

The area modeling results for the MAC (2) kernel code are also shown in Figure 4.7.
The modeling results are relatively accurate in this instance. The CP-TMR design was the most accurately modeled design of the entire test suite, with the model overestimating the area consumption by less than 1%. The largest discrepancy between the modeled and actual areas is an overestimation of 46%, for the RaS-OPT design. On average, the model overestimated the area consumption of the six MAC (2) designs by 17%.

Design      LUT    Adders  Array operators  Multipliers
Base          498     1          1               1
Base-RaS      679     1          1               1
Base-TMR*   1,303     3          3               3
CP-m          847     2          2               2
CP-TMR*     2,451     6          6               6
RaS-OPT     1,165     3          3               3
*Nominal functional units. For the actual number of TMR units, divide by 3.

Table 4.5: Design configurations and area consumption for the MAC (2) kernel code.

For small computations, such as the MAC (2), that require [relatively] large hardware blocks, the RaS approach will likely be beneficial, due to the flexibility of hTMR. For cases like these, the decoupled nature of hTMR allows for a reduction in overall functional units. With a single multiplier block consuming almost 15% of the LUT of the entire CP-TMR design, decreasing the number of such blocks required can be significant in reducing overall design area consumption.

CGM

The fourth kernel code, CGM, has four distinct instances in the test suite: the original, unaltered computation, CGM (1), and variants that result from unrolling the main loop by a factor of 2, 4 and 8, respectively. The CGM kernel code is notable in that the CGM (1) and CGM (8) variants are the only instances in the test suite wherein the RaS-hardened design is actually larger than the competing TMR design. The design results for the four CGM variants are presented in Figure 4.9, and resource configurations for the designs are shown in Table 4.6.

As displayed in Table 4.2, the CGM (1) kernel code has the largest percentage area penalty of the test suite, when comparing RaS to TMR. In this instance, the RaS-OPT design is 18% – or 247 LUT – larger than the competing CP-TMR design. A justification for these results can be seen by observing the schedules produced by the RaS infrastructure for the two designs, shown in Figure 4.8, where the cells in the schedules correspond to node numbers from the DFG of the code [Footnote 4: The DFG for the CGM (1) kernel code is shown in Figure 4.2.].

[Figure 4.8: The CGM (1) kernel code schedules for the RaS-OPT (top) and CP-TMR (bottom) design variants. The TMR units of the CP-TMR schedule have been expanded for clarity. The node numbers correspond to the DFG in Figure 4.2.]

The figure shows that in this instance, the two schedules are quite similar. In fact, most of the operations in the RaS-OPT design are executed in a TMR fashion, with all three instances of the operation executing on different functional units in the same cycle [Footnote 5: In this instance, operations 12, 15, 18, 20, 21, 24 and 28 are executed in TMR fashion.].
For this particular computation, the RaS approach does not perform particularly well because the dependencies of the operations dictate that they are scheduled in such a way that there is simply very little IR for the RaS-OPT design to exploit. This is especially true with respect to the multiplier blocks, which are by far the largest of the functional units used in this computation. As the schedule shows, there simply are not enough cycles left before the end of the computation to re-execute the multiply (operation 20). In general, any operation whose latency is larger than the distance – in cycles – between its execution and the end of the computation cannot be replicated in time, and will need to be replicated in space, as with TMR. This is trivially true for the last operation of the computation, and can be seen in operation 21 in the schedules above.

In short, because RaS cannot appreciably reduce the number of functional units required, the extra multiplexers required for the RaS-OPT design lead to it not being beneficial in this context. As such, the CGM (1) kernel is still a useful data point, as it illustrates a particular scenario where RaS may not be beneficial: whenever large operators are executed at or near the end of the computation. As mentioned above, it is impossible to eliminate these operators if full operational resiliency coverage is desired. However, if the other operators of the computation were larger – which could easily be caused by a move to larger bit-widths or floating point operations – RaS might again prove to be a beneficial approach.

[Figure 4.9: Device area consumption for the four unrolled variants of the CGM kernel code, showing actual and modeled LUT totals for the six design variants of each.]

The CGM (1) variant has a number of other notable properties. It is the second smallest of the kernel codes in the test suite, with a Baseline configuration of 548 LUT. Interestingly, the CP-m configuration – which contains 2 additional adders and 1 additional array operator over the Baseline configuration – is actually smaller, requiring only 509 LUT. This apparent anomaly is likely due to the reduction in multiplexers required in the CP-m design. As shown in Section 3.1.4, multiplexers can be expensive in terms of area consumption. As the addition of an array addressing unit uses relatively few resources, it is often more beneficial to replicate these small units rather than multiplex multiple inputs into a smaller number of resources.
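The replicate-in-time rule above lends itself to a simple feasibility check. The sketch below is an illustrative Python predicate, not the infrastructure's actual code; the names issue_cycle and end_cycle are assumed inputs from the schedule.

    def must_replicate_in_space(op_latency, issue_cycle, end_cycle):
        """An operation can only be re-executed ("replicated in time") if a
        full replica still fits between its execution and the end of the
        computation; otherwise it must be replicated in space on extra
        units, as with TMR."""
        cycles_remaining = end_cycle - issue_cycle
        return op_latency > cycles_remaining

    # Example: a 3-cycle multiply issuing at cycle 3 of a 5-cycle schedule
    # cannot be re-executed in time (3 > 5 - 3), so it gets spatial TMR.
    print(must_replicate_in_space(3, 3, 5))   # True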
The other code in the test suite wherein the RaS approach is not beneficial is CGM (8). As Table 4.2 shows, the RaS-OPT design requires 252 more LUT (2%) than the competing CP-TMR design. In this instance, the reasons for the area increase of the RaS-configured design are not as clear. The increase is likely due to the overall size of the computation and the number of distinct computing elements – functional units and multiplexers – it contains. The CGM (8) design is the second largest of the test suite, at 11,452 LUT for the CP-TMR design. While this CGM design is only 5.7% smaller than the UMT design (which is the largest in the test suite), it executes 27% more operations, in 90% fewer cycles. This requires a total of 150 functional units for the CGM (8) variant, versus 48 for UMT. In contrast to the UMT design, it is likely that the large number of discrete functional units and the overall size of the design (relative to the capacity of the target FPGA) impacts the place-and-route mechanism in the design tool, and hampers its ability to enact optimizations. This impacts the efficiency of the design and leads to the increase in area consumption.

Unlike the CGM (1) and CGM (8) versions of the kernel code, the RaS approach reduces area consumption for the CGM (2) and CGM (4) variants when compared to a TMR approach, by 24% and 25% (or 612 and 1,468 LUT), respectively. The explanation for this can be seen by observing the functional unit totals for the design configurations given in Table 4.6. While the increase in unrolling factor is fixed across the CGM variants, with the size of the computation doubling each time, the subsequent increase in functional units required to achieve latency consistent with the critical path is not constant.

This is most notable in the number of large multiplier blocks required for each computation. As the computation grows – again, doubling in size each time – the number of multiplier blocks required for the TMR configuration also doubles, from 3 to 6, 12 and 24. The RaS-OPT designs do not follow such a pattern, however. For example, when compared to the CGM (1) design, the CGM (2) design has almost double [Footnote 6: The number is not exactly double, as loop unrolling introduces extra addition operations on the loop induction variable that are not part of the original computation.] the number of operations, but requires only a single additional multiplier for the RaS-hardened configuration. The CGM (2) and CGM (4) variants illustrate a fundamental strength of the RaS approach: by allowing the scheduling flexibility of individual hTMR units, an RaS-hardened design scales more gracefully as the relationship between the number of operations and the number of required functional units changes.

[Figure 4.10: The effect of loop unrolling on the CGM kernel code for the six design variants under test.]

This relationship, between the size of the computation and the functional units it requires for a given performance threshold, is not necessarily linear, nor is it easy to determine a priori. The RaS infrastructure can quickly perform a design-space exploration to help illustrate these relationships, and to assist in determining which design approach makes the best use of resources. An explanation of this process is given in Section 4.1.4.
(a) CGM (1)
Design      LUT    Adders  Array oprs.  Multipliers
Base          548     1         1            1
Base-RaS      762     1         1            1
Base-TMR*   1,336     3         3            3
CP-m          509     3         2            1
CP-TMR*     1,369     9         6            3
RaS-OPT     1,616     4         9            3

(b) CGM (2)
Design      LUT    Adders  Array oprs.  Multipliers
Base          639     1         1            1
Base-RaS    1,019     1         1            1
Base-TMR*   1,444     3         3            3
CP-m        1,008     3         3            2
CP-TMR*     2,641     9         9            6
RaS-OPT     2,023     7         9            4

(c) CGM (4)
Design      LUT    Adders  Array oprs.  Multipliers
Base        1,137     1         1            1
Base-RaS    1,970     1         1            1
Base-TMR*   1,959     3         3            3
CP-m        2,266     9         9            4
CP-TMR*     5,789    27        27           12
RaS-OPT     4,321    24        24            9

(d) CGM (8)
Design      LUT    Adders  Array oprs.  Multipliers
Base        1,475     1         1            1
Base-RaS    3,141     1         1            1
Base-TMR*   2,278     3         3            3
CP-m        4,102    21        21            8
CP-TMR*    11,452    63        63           24
RaS-OPT    11,706    48        60           21

Table 4.6: Design configurations and area consumption for the unrolled variants of the CGM kernel code (a * indicates the number of nominal functional units; for TMR units, divide by 3).

Finally, Figure 4.10 displays the CGM area consumption results for the six designs under test as the loop unrolling factor is increased. Notably, the CP-m and CP-TMR designs follow a predictable pattern: as the unrolling factor is doubled, the area consumption follows suit. The RaS-OPT designs do not follow such a pattern, as the RaS approach provides more scheduling flexibility by exploiting the computation's IR. However, as shown above, in many cases the dependencies of the operations are such that they cannot be scheduled concurrently. When such dependencies exist, the benefit of the RaS infrastructure is not a reduction in area consumption for a particular design – although that is certainly possible – but the ability to assist the programmer in determining which unrolling factor is best suited for the target architecture. Loop unrolling, as shown above, can have diminishing returns, and the RaS infrastructure can quickly assist in quantifying the benefit of such code transformations.

4.1.4 Design Space Exploration

While the results in the previous section are promising, there is a fundamental question of how to determine the resource configurations of the designs under test, especially the CP-TMR configuration. Simply put, how does one find the best resource configuration on which to apply TMR? Such a task is, in and of itself, not a trivial undertaking. In most cases, this would be done by hand: the designer would allocate hardware resources for the computation, lay out the data path and synthesize a design. If this design met appropriate timing and area constraints, they would then apply TMR to the existing design and have a hardened computation. On the other hand, if the design did not meet the needed parameters, the designer would typically manually alter the configuration – either adding or removing resources, as needed – and restart the design process.
For example, it is possible that fully-hardened designs with optimal performance are too large for a target device. In this case, the developer could choose from design configu- rations that fit within the area thresholds of the device, deciding to sacrifice operational coverage, latency, or both as needs dictate. The designs that are consistent with the design methodology given in Section 4.1.1 are clustered in the lower right corner of the graph, as thez-axis (operational coverage) moves to the right, and thex-axis (latency) moves down the page. Figure 4.12 displays the design points for the UMT kernel code with a fixed 100% operational coverage. Note that this graph is the same as Figure 4.11, if you rotated the latter on the y-axis clockwise, until the z-axis disappears. The cross symbols (+) represent RaS-hardened configurations, with modeled area consumption and latency estimates, whereas the dash symbols () are potential TMR hardened designs. This graph visually depicts what is likely the most relevant design trade-off, between device area consumption and latency. The designer can tailor the resource configuration and schedule to the particular parameters of the target device, while still maintaining full operational resiliency coverage. 80 Figure 4.11: Full design space for UMT kernel code. The data points for Figure 4.12 are clustered around a number of points along the x-axis. The horizontal ”jumps” from one cluster to another, right to left, correspond to the alleviation of resource contention. At each jump, the new resources can consume available data, exploiting the computation’s ILP. The extension of the clusters vertically, from low to high, indicates that additional resources were added to the configuration, but are wasteful, as they do not allow the computation to execute any faster. The difficulty is that a priori, there is no way to discern which resources will be beneficial and which will be wasteful. The RaS infrastructure can provide this insight quickly. In the experimental results presented here, only the designs with 100% operational coverage and critical path performance are considered. In this instance, those designs correspond to the cluster of points at the far left of Figure 4.12. The smallest design with critical path performance is desired, and these are the configurations chosen for the above test suite. 81 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 45 55 65 75 85 95 105 115 125 135 145 155 165 175 185 195 205 215 225 235 245 255 265 275 285 295 Area Consump-on Es-mate (LUT) Latency (Cycles) UMT TMR RaS Figure 4.12: Performance and estimated area results for 10;502 design configurations for the UMT code kernel. The dash symbols () are potential TMR configurations, and cross symbols (+) are potential RaS configurations. As the results in this section have shown, there are a number of factors that impact the area consumption or latency of a particular resource configuration. Many of these factors, such as the critical path, dependencies between operations and the number of functional units needed to relieve resource contention, are not easily gleaned from the high-level source code. The automated nature of the RaS infrastructures assist in clar- ifying the relationships between these factors and their real-world consequences, and allow programmers to make informed decisions when choosing a design. 
4.1.5 Expected Latency Analysis

While the results presented in Section 4.1.3 display the ability of RaS to determine a resource configuration that reduces functional unit requirements for a given level of expected performance, the resiliency provided allows only for error detection, not error correction. Based on the error model described in Section 2.5, this section presents an expected latency analysis that illustrates under which error-probability parameters the RaS approach is beneficial, and when another approach, such as TMR, may provide better expected performance. Additionally, it is important to note that RaS is not specifically a new error mitigation technique, but rather a novel way of applying the well-accepted and understood mitigation technique of temporal replication to determine a resource configuration. As such, fault injection, while useful for evaluating sensitivity profiles (as in [44]), is orthogonal to this approach, as it would simply verify that those bits related to the operations that are "covered" via RaS are resilient to errors, and those that are not covered are sensitive to errors.

To evaluate the benefits of RaS in the face of errors, the RaS-OPT configuration is compared to an equally sized TMR configuration, as given by the area modeling scheme detailed in Section 3.1.4 [Footnote 7: The choice to model the area of the AE-TMR design, rather than synthesize it, is justified by the accuracy of the model across all CP-TMR designs, which was, on average, within 8% of the actual value.]. The resource configurations of the RaS-OPT and Area-Equivalent TMR (AE-TMR) designs are shown in Table 4.7.

Design    Latency (cycles)  LUT           Adders  Multipliers  Dividers
AE-TMR*          66         7,390 (est.)     6        12           6
RaS-OPT          52         7,322            4        10           8
*Nominal functional units. For the actual number of TMR units, divide by 3.

Table 4.7: Latency, area consumption and design configurations for the RaS-OPT and area-equivalent TMR designs of the UMT kernel code.

As stated in Section 4.1.3, the CP-TMR configuration is the smallest configuration with critical path performance. As such, any design with fewer functional units than the CP-TMR configuration will exhibit some degree of resource contention, and therefore have lower performance. The AE-TMR design has a latency of 66 cycles, which is 14 cycles more than the critical path of the UMT kernel code.

Expected Performance Simulation Methodology

To evaluate the expected latency of competing designs in the face of errors, simulations were performed using the error model described in Section 2.5. In this model, a computation's actual latency is modeled as the sum of the latencies of all the aborted iterations (due to errors), plus the nominal, error-free execution latency. The overall error probability, per LUT, is an input to the simulation. This per-LUT probability is then scaled to each operator in the design, based on the size of the hardware block on which the operation is executed. These operator sizes are taken from the functional unit synthesis results given in Section 3.1.4 and shown in Table 3.1. For example, if the given per-LUT error probability was 0.01 and the operation was executed on a 5-LUT functional unit, the error probability for any operation executed on this unit would be 0.05.

The simulator iterates across the operations of the computation in the order they are scheduled. At each operation, a random variable determines whether there was an error, based on the functional unit probabilities mentioned above. If there is an error, the computation restarts, incrementing the latency for this iteration by the roll-back penalty for this operation. If not, the simulator continues to the next operation, until the computation terminates, at which point the nominal latency of the computation is added to the latency total, to emulate the "successful" iteration of the computation. In this fashion, the expected latency of various designs can be computed, as described in the next section.
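The following Python sketch captures the simulation loop just described. It is illustrative only: the schedule representation (a list of per-operation (LUT size, roll-back penalty) pairs), the trial count, and the example numbers are assumed stand-ins for the infrastructure's actual inputs.

    import random

    def simulate_latency(schedule, nominal_latency, p_per_lut, trials=10_000):
        """Monte Carlo estimate of expected latency under the Section 2.5
        model: total latency = latencies of all aborted (error) iterations
        plus one nominal, error-free pass."""
        total = 0.0
        for _ in range(trials):
            elapsed = 0
            done = False
            while not done:
                done = True
                for op_luts, rollback_penalty in schedule:
                    # per-operation error probability scales with unit size
                    if random.random() < p_per_lut * op_luts:
                        elapsed += rollback_penalty   # abort and restart
                        done = False
                        break
            total += elapsed + nominal_latency        # successful final pass
        return total / trials

    # Example (illustrative numbers): three ops on 5-, 20- and 40-LUT units,
    # detected and rolled back after 2, 6 and 10 cycles respectively.
    sched = [(5, 2), (20, 6), (40, 10)]
    print(simulate_latency(sched, nominal_latency=52, p_per_lut=1e-5))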
Figure 4.13: Expected latency for the RaS-OPT and AE-TMR designs shown in Table 4.7 as the per-slice error probability increases.

Expected Performance Simulation Results

Figure 4.13 displays the expected performance for the two area-equivalent designs shown in Table 4.7. Here, each of the cross symbols (+) is the average latency of 10,000 simulations, as described above, of the RaS-OPT design executing the UMT kernel code at a given error probability. The dash symbols (−) represent the performance of the AE-TMR design, which is constant, as the TMR operators can correct any single-bit error via majority voting, without affecting performance.

The error probability value at which the RaS-OPT design exhibits the same expected performance as the AE-TMR configuration is approximately 3.25 × 10⁻⁵ per LUT. This value is much greater than commonly observed values [29]. As such, for any application wherein the prevailing error probability is less than this value, the RaS design will achieve a lower expected latency, in terms of cycles, than an equivalently-sized TMR design, and thus better execution time performance.

4.2 VEX Infrastructure: VLIW Target

This section details the experimental results for the RaS VEX-based reconfigurable VLIW target infrastructure detailed in Section 3.2. The next section outlines the experimental methodology. Section 4.2.2 presents the results of a case study using the UMT kernel code. Finally, Section 4.2.3 explores the expected performance of an RaS-hardened VLIW computation in the face of errors, based on the error model presented previously in Section 2.5.

4.2.1 Experimental Methodology

In order to more precisely quantify the potential benefits of RaS, it is compared to two traditional hardening approaches: a hardware TMR scheme and a software-based, source-level replication approach, as described in Section 1.2.1.

Two experimental methodologies are considered. In the first two subsections of Section 4.2.2, the performance scalability of the three approaches (RaS, TMR and source-level replication) is explored. This is accomplished by expanding the instruction issue width and available functional units of the underlying VLIW architecture on which the computation is executed. For these experiments, operational resiliency coverage is fixed at 100%, as architecture resources are varied to study their impact on performance. The second experimental methodology is an architecture exploration, presented in the final subsection of Section 4.2.2. Resiliency coverage is again fixed at 100%, but for this experiment, performance is fixed as well. This approach illustrates the hardware resources required by each hardening technique at a given performance level. The following subsections outline the TMR and source-level replication approaches used in this case study.
Triple Modular Redundancy (TMR)

As mentioned in Section 1.2.2, hardware Triple Modular Redundancy (TMR) is the de facto standard technique for hardware error detection and correction. Unlike the competing RaS and source-level approaches, TMR operators do not require the insertion of explicit comparison instructions.

Source-level Replication

Source-level replication, as its name implies, replicates the application code at the source-code level [55], and explicitly inserts comparison instructions between the intermediate and/or final computation results. If the results saved in temporary variables differ, the computation internally raises an exception and possibly retries the execution, as discussed in Section 1.2.1.

Such transformations place the intelligence of scheduling for resiliency on the compiler, which, in most cases, was not engineered for such a purpose. In fact, most source-level replication schemes require techniques to "trick" the compiler into not optimizing out the duplicate instructions.⁸ The fact that most compilers are inherently ill-suited for tasks such as this underscores the need for approaches such as RaS.

⁸ In order to present results as fairly as possible, the additional overhead of the pointers required to maintain replicated operations throughout compiler optimization has been minimized.

In the implementation of source-level replication explored in this work, the source code is transformed by duplicating code at the granularity of assignment statements. For each such statement, an auxiliary temporary variable is created to save the replicated computation value using the same input variables. The results are then explicitly compared with logical exclusive or (XOR) operations.
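The following fragment illustrates the shape of this transformation on a single assignment statement. It is a hand-written illustration of the scheme just described, not output from the actual source-to-source tool, and raise_soft_error_exception() is a hypothetical placeholder for the recovery action.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical recovery action; a real scheme might retry the
 * computation instead of aborting (see Section 1.2.1).        */
static void raise_soft_error_exception(void)
{
    fprintf(stderr, "soft error detected\n");
    abort();
}

int hardened_axpb(int a, int x, int b)
{
    /* Original statement:  y = a * x + b;                     */
    int y  = a * x + b;   /* primary computation               */
    int y2 = a * x + b;   /* replica, using the same inputs    */

    /* Explicit XOR comparison: a nonzero result means the two
     * executions disagree, i.e. a soft error corrupted one.
     * NB: without the countermeasures mentioned in footnote 8,
     * an optimizing compiler would simply eliminate y2.        */
    if (y ^ y2)
        raise_soft_error_exception();
    return y;
}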
4.2.2 Case Study: UMT

In order to compare the RaS approach to the TMR and source-level techniques mentioned above, a case study is presented, using a computational kernel from the UMT2K benchmark suite [7]. As detailed in Section 4.1.2, the UMT kernel is the core of a 3D, deterministic, multigroup, photon transport computation for unstructured meshes. It includes various arithmetic operations, including long-latency divide operations. To implement these divide operations as efficiently as possible, the Vex compiler switch -fexpand-div was used, which instructs it to replace calls to library intrinsics with in-line assembly code.

RaS Versus Source-level Replication

To explore the scalability of RaS-hardened designs compared to source-level resiliency approaches, the UMT kernel was executed on architectures with increasing instruction issue widths. In this experiment, as the instruction issue width increases, so does the number of functional units of a particular configuration (e.g. a 9-wide RaS configuration has 9 ALUs and 9 multiplier units). Figure 4.14 displays the relative scalability of the competing approaches. Five variants are presented: source-level hardening with pairing and triplication, RaS with pairing and triplication, and the unhardened computation, included as the baseline.

Figure 4.14: Latency, in cycles, of the UMT2K kernel as issue width (and accompanying functional units) are increased.

As expected, with an issue width of 1, the source-level and RaS variants perform similarly, for both triplication and pairing. Both are still over 300% slower than the unhardened version of the computation, due to the extra operations required for resiliency. As instruction issue widths increase to more typical levels (e.g. greater than 4), the RaS-hardened computation is able to exploit the computation's IR, and exhibits performance that is appreciably better than its source-level hardened counterparts. At an instruction issue width of 4, the RaS variant is 15% faster than the equivalent computation hardened via the source-level technique. As the computation reaches its critical path (the point at which there is no resource contention), the RaS triplication scheme is over 40% faster. The pairing variants follow a similar pattern.

Due to the instruction scheduling of the Vex compiler, performance actually decreases for the source-level triplication variant between instruction issue widths of 10 and 14. The source-level pairing variant exhibited a similar, but less pronounced, effect. These anomalies further illustrate how traditional compilers are often ill-suited for the task of providing software resilience.

RaS Versus TMR

To accurately compare architectures in terms of device area used (as a proxy for total functional units required), the notion of "equivalent" TMR is introduced. This is simply the number of TMR-hardened functional units multiplied by three. While an RaS-hardened configuration will have the same number of ALU and multiplier units as its instruction issue width, a TMR-hardened configuration will have a third as many functional units as its issue width. For example, a 12-wide "equivalent" TMR configuration can issue up to 12 instructions per cycle, and contains 4 TMR ALU and 4 TMR multiplier units. This configuration is "equivalent", in terms of nominal functional units, to a 12-wide RaS configuration. In this manner, configurations across the various hardening schemes can be compared.

Figure 4.15: Expected performance of equivalent TMR as compared to RaS and source-level schemes.

Figure 4.15 shows performance scalability as instruction issue width – and corresponding hardware resources – is increased. The triplication variant of RaS is compared to a TMR configuration of equivalent size, with pairing at the source level included for reference.

Initially, for configurations with 3 functional units (either 3 unhardened functional units, or combined as 1 TMR unit), and an instruction issue width of 3, the TMR configuration is 8% faster than the RaS variant, i.e. 228 cycles to 247. This is due to the extra operations the RaS scheme has to execute, and the limited resources of such a narrow machine. As the instruction issue width increases, and the computation's IR increases, RaS quickly outperforms TMR. At an instruction issue width of 12, the RaS scheme is almost 40% faster, with a latency of 62 cycles, versus 86 for an equally-sized TMR configuration. Finally, for configurations with higher instruction issue widths, the performance of the computation reaches its critical path, and the approaches exhibit similar performance. At these issue width values, resource contention, even for the TMR approach, is minimal.
These results show that in most cases, an RaS approach is more efficient in its use of resources. The monolithic nature of TMR hampers its performance, disallowing its functional units from alleviating resource contention. Even while executing many more operations, the RaS scheme can provide resiliency at a software level that exhibits much better performance than TMR.

Clearly, RaS only detects errors, whereas TMR can correct them. Therefore, these results are a lower bound on performance for the RaS scheme, as the presence of errors will incur some penalty to ensure correct execution. Section 4.2.3 will cover expected performance in the face of errors, and provide a discussion of when the RaS scheme is preferable to hardware replication, and when it might not be.

Architecture Exploration and Design Variants

In addition to observing the effects of a particular design on performance, it is illustrative to look at the potential design space by fixing performance and varying the architectural configuration. Table 4.8 presents a number of potential design variants with roughly similar performance (source-level latencies are shown for reference). The XOR units column corresponds to the number of dedicated hardware comparators. Only the assembly-level comparison operations can be executed on these functional units, although if no such units are available, the XOR operations can still be scheduled on traditional ALUs.

ID #   Resiliency Scheme   Issue Width   XOR Units   Replication Scheme   Cycles
1      TMR                 4* (12)       0           TMR                  86
2      RaS                 7             4           Triplication         86
3      RaS                 8             1           Triplication         85
4      RaS                 5             2           Pairing              84
5      RaS                 9             0           Triplication         81
6      RaS                 6             0           Pairing              77
7      RaS                 6             1           Pairing              72
8      Source-level        16            0           Pairing              95
9      Source-level        16            0           Triplication         109

*4 TMR functional units is equivalent to 12 unhardened functional units.

Table 4.8: Performance for selected configurations of the UMT kernel.

The RaS infrastructure can quickly generate a number of architecture configurations, as shown in Table 4.8, and allow the programmer to determine which is the most preferable for their particular application. For example, it is quite likely that the dedicated comparators described above consume fewer chip resources than a full-fledged ALU. In that case, configuration #3 from Table 4.8 might be preferable to configuration #5. Additionally, the impact and utility of additional functional units can quickly be ascertained, as configurations #6 and #7 show. In this instance, it is possible the extra area consumed by an additional comparator is not worth a 7% performance improvement. The Vex-based RaS infrastructure allows design considerations for configurable VLIW softcores to be reasoned about quickly and easily, without the difficulty of actually implementing each design.

4.2.3 Expected Latency Analysis

While the above results show how RaS can determine a resource configuration that reduces functional unit requirements for a given level of expected performance, the resiliency provided allows only for error detection, not the error correction that TMR provides.

Figure 4.16: Expected latency in the face of errors for configurations #1, #5 and #6 from Table 4.8.

This section presents an expected latency analysis for the Vex-based RaS infrastructure, in the same fashion as given in Section 4.1.5.
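As a rough analytical companion to these simulations, assume each scheduled operation fails independently with its scaled probability $p_i$, and approximate every aborted iteration as costing an average rollback penalty $\bar{C}$. Under these simplifying assumptions, which are not part of the original analysis, the expected latency of a detect-and-restart design is approximately

\[
  \mathbb{E}[L] \;\approx\; L_{\mathrm{nom}} \;+\; \frac{1 - P_{\mathrm{succ}}}{P_{\mathrm{succ}}}\,\bar{C},
  \qquad
  P_{\mathrm{succ}} \;=\; \prod_{i=1}^{N} (1 - p_i),
\]

where $L_{\mathrm{nom}}$ is the error-free latency and $N$ the number of scheduled operations; the number of aborted iterations is geometrically distributed with mean $(1 - P_{\mathrm{succ}})/P_{\mathrm{succ}}$. This sketch glosses over the position-dependent rollback penalties the simulator actually charges, so it serves only as a sanity check on the simulated curves.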
Figure 4.16 shows expected latency simulation results for configurations #1, #5 and #6, as given in Table 4.8. These three configurations illustrate the impact on performance and functional unit requirements of a number of key architectural design considerations, and help illustrate the need for automated architecture exploration, as provided by the RaS infrastructure. Each point in the figure corresponds to an average latency over 10,000 executions of the kernel at the given error probability, using the error model described in Section 2.5.

The first configuration (#1) is the baseline 12-wide equivalent TMR implementation, whose 4 TMR units per type correspond to 12 nominal functional units. As expected, the latency of the computation is constant, even as the error probability increases, due to the immediate detection and correction properties of TMR. This configuration is inflexible, as the nominal functional units are "trapped" in monolithic TMR units that can only execute one operation at a time and cannot be used to exploit the computation's IR. Therefore, the TMR-hardened configuration exhibits worse performance than RaS configurations that are much smaller. In addition, if the prevailing error rate were to decrease – rendering hardware resiliency less necessary – static TMR units could prove wasteful of energy and device area.

Configuration #6 is a 6-wide RaS configuration, using pairing to provide resiliency. Because this configuration has more immediately accessible functional units, it can exploit the computation's IR, delivering an 11% performance gain over the TMR-hardened design, while using half of the nominal functional units. These benefits come at a price, however, as this configuration is highly sensitive to a rise in error probability, since the pairing scheme cannot withstand any errors. Simulation shows that for error probabilities less than 4.7 × 10⁻⁴ per operation, the RaS design has lower expected latency than the TMR variant. This probability is much higher than commonly observed values [29], so the expected latency of the RaS-hardened computation is lower than the TMR version, while requiring far fewer resources.

Between configurations #1 and #6, configuration #5 is a 9-wide RaS configuration, with all operations triplicated. The increase in overall operations executed – due to triplication – leads to a marginally slower latency than configuration #6 (81 cycles versus 77 cycles), despite the increase in instruction issue width and functional units. The use of triplication, however, leads to an expected decrease in error-probability sensitivity, as not all errors require a restart. An error probability of 8.3 × 10⁻⁴ per operation provides equivalent performance to the TMR design, while using 25% fewer functional units. For error probabilities below that threshold, the RaS approach delivers lower expected latency.

4.3 Chapter Summary

In this chapter, experimental results derived from the RaS infrastructures were presented. For the Open64-based toolchain, six design configurations were constructed for each computation in a test suite of seven real-world scientific code kernels. The properties of the codes used in the test suite were explained, and the resultant designs analyzed. Using slice Look Up Tables (LUT) as the design metric, the RaS designs were, on average, 19% smaller than the TMR designs.
This chapter also discussed the design-space exploration process that the RaS toolchain uses to determine the appropriate configurations for the designs under test. Results for the Vex-based toolchain were discussed as well. A case study, based on the UMT code kernel from the Open64-based test suite, was presented, comparing the RaS approach to both a source-level duplication approach and TMR. For this case study, the RaS approach produces code that performs between 15% and 40% faster than a source-level hardened approach using the same architecture resources. When compared to a TMR-hardened architecture, the RaS approach produces code that ranges from 8% slower, for an instruction issue width of 3, to 40% faster, for an instruction issue width of 12. Lastly, an expected performance analysis was presented, depicting the performance of RaS-hardened designs in the face of errors.

Chapter 5

Related Literature

Resiliency against errors is an increasingly important issue for programmers, and an area of significant research. This is especially true for reconfigurable devices that may have millions of potentially vulnerable bits throughout their configuration fabric. In addition to resiliency, there are a number of other research areas that are relevant to the Resiliency-aware Scheduling work presented here. Compiler analysis, instruction scheduling, and configurability are all important aspects of the RaS work, and are large areas of research in and of themselves. It is important to note that, often, distinctions between these categories are arbitrary, and many works – like the research presented here – cross the boundaries of many of these topics.

This chapter will present a subset of the current research in these areas. Section 5.1 discusses resiliency and Single Event Upset (SEU) mitigation across a number of potential hardware targets, including superscalar, VLIW and FPGA architectures. A survey of software-only compiler techniques similar to those used in this work is presented in Section 5.1.1. Hybrid approaches, which normally combine a compiler-based approach with specific dedicated hardware, are presented in Section 5.1.2. A subset of hardware-only approaches is presented in Section 5.1.3, with a focus on techniques geared toward FPGA targets. Finally, Section 5.2 compares and contrasts these approaches with RaS, and places the work presented here in the context of the existing literature.

5.1 Resiliency and SEU Mitigation

As mentioned in Chapter 1, a number of factors, such as increased deployment in hostile environments, shrinking feature sizes and processor aging, lead to an increase in design and transient error rates. This section presents a number of techniques designed to detect and potentially correct or mitigate these errors.

5.1.1 Software Approaches

In [26], the authors present two approaches, partial explicit redundancy (PER) and implicit redundancy through reuse (IRTR), that attempt to minimize the performance impact of reducing the soft-error rate. The authors specifically target computations that may not require full coverage. PER exploits low-ILP phases of the computation and L2 cache misses to introduce explicit redundancy, which provides operational coverage without impacting performance. For additional coverage during high-ILP phases, IRTR introduces a reuse buffer, which can add coverage for operations executing in the main processing thread that have the same operands and output.
These two methods can provide coverage for many, but not all, operations of a computation while having a minimal impact on performance.

The work presented in [12] and [13] enables resiliency by exploiting the explicit parallelism and duplicated resources of VLIW architectures. The proposed solution statically compiles an unhardened code on half of the target architecture's resources, using the other half for explicit operation duplication and comparison. In this way, transient and permanent errors can be detected, although no error correction is provided. The authors justify splitting the target architecture's resources in half by observing that, in many cases, the computations in the experimental test suite are not able to exploit the resources of the full-width architecture, due to a lack of ILP. Noting this, although the hardware resources available to the computations are halved, the latency of the hardened computations does not necessarily double.

A significantly different approach is taken in [35]. Here, in order to mitigate the performance and code-size penalties incurred by the replication methods proposed in many software approaches, the fundamental data representations of high-level source constructs are altered to provide protection against soft errors. For example, rather than replicating a loop induction variable, an inherently fault-tolerant data representation is chosen, such as a loop iterator that is a power of two. Under this scenario, any two iterator values have a Hamming distance of two and a checksum of one, and these values can be checked during loop iteration to ensure correct execution. However, this approach has the disadvantage of requiring dedicated hardware for the efficient execution of checksums. Additionally, this approach either constrains the maximum number of potential loop iterations to the target architecture's word size, or requires multiple nested loops, which potentially increases the computation's vulnerability to faults.

An evaluation of a software TMR approach for hardening a VLIW softcore processor is presented in [63]. Here, the authors triplicate the operations of a test suite at the source level, and inject faults into the memory of a ρ-VEX softcore VLIW processor, implemented on an FPGA. They evaluate two different strategies to implement majority voting, one using typical if-then control flow structures and the other using logical and and or operators. They determine that while the control flow structures are superior at suppressing the injected errors, neither approach provides sufficient protection. This is due to "cross-domain" errors that commonly occur in the registers used as offsets for memory accesses, which the above approach does not mitigate.

The authors in [49] present EDDI, a software-only approach to error detection. In their work, instructions of a computation are replicated and the values compared to detect errors. These instructions are duplicated at the assembly level, and are interleaved so as to attempt to maximize available ILP when executed on a superscalar architecture. While this approach provides error detection on any architecture, regardless of hardware support, the duplication of program flow incurs an average 89.3% performance penalty for the benchmarks tested. A related work, [48], proposes a slightly different approach. Here, in addition to the primary program flow, instructions and program data are transformed, according to a number of specific rules.
These rules create a second program flow, equivalent to the original, and these two execution flows are synchronized prior to write instructions and branches. Both program flows are then executed concurrently; if an error occurs, the two representations are no longer equivalent, allowing the error to be detected.

The work presented in [57] details SWIFT, which builds on the EDDI approach and reduces the impact of replicated instructions on performance. The authors modified the OpenIMPACT compiler for a VLIW Itanium 2 target [41], altering it to implement the SWIFT code duplication schemes at a very low level, immediately before register allocation and scheduling. Because the Itanium 2 architecture conforms to the explicitly parallel instruction computing (EPIC) paradigm, the compiled execution maintains a static data flow, without out-of-order execution. The main differences between SWIFT and EDDI are twofold. First, SWIFT assumes some form of error correcting codes (ECC) in order to protect the memory subsystem. This assumption greatly reduces the number of check instructions that the SWIFT approach executes, which improves performance. Second, this modification to the memory subsystem obviates the need for a number of replicated instructions used for control-flow error checking.

5.1.2 Hybrid Hardware/Software Approaches

Extending upon the SWIFT approach, the authors in [59] present CRAFT, a hybrid hardware/software approach to fault detection. Here, the authors augment the SWIFT compiler-based approach with hardware-only simultaneous and redundantly threaded (SRT) processors (see Section 5.1.3 below). The duplicated store instructions generated by the SWIFT approach are tagged with a version number, and sent to a dedicated hardware Check Store Buffer (CSB) for validation. An entry in the CSB is valid when both the original and the duplicate version of the store have the same address and value. Valid CSB entries are then written to main memory. While this approach requires a custom hardware CSB, it does not create additional write instructions to memory, as only one value from the CSB is written back.

In [58], the authors further extend the SWIFT (software) and CRAFT (hybrid) approaches described above. They introduce a software-controlled hybrid approach based on computational profiling, called PROFiT. In order to meet fine-grained performance and reliability constraints, the source code is analyzed to determine which areas of the computation are inherently vulnerable, and which are not. For example, a code section that computes a large amount of dead code or data that is logically masked (by boolean operations, for example) may be naturally resilient to faults and would therefore not require much protection. Conversely, areas containing many control flow structures are highly susceptible to errors, and would require more coverage. PROFiT establishes the vulnerability profile for the computation, based on a user-defined utility function. This profile then guides the application of the SWIFT and CRAFT methods, giving the programmer fine-grained control over the performance and coverage of the target application.

In [60], the author proposes hardware and software extensions to detect and correct both transient and permanent faults in static data paths. Software extensions in the compiler ensure every operation in the instruction stream is duplicated and executed on different hardware.
Hardware extensions to a notional VLIW architecture can detect a mismatch between the two executions of an operation, executing a third instance to determine the appropriate value and isolate faulty hardware.

Another approach to hardening static VLIW datapaths is presented in [31]. In this work, operational coverage is provided by compiler-driven instruction duplication and supported by small architecture modifications for efficient comparison operations. A hardware register value queue (RVQ) stores the outputs from operations executed on registers. When a duplicate instruction executes, its value is compared to the value stored in the RVQ, and if there is a mismatch, an error is detected. The programmer can specify an upper bound on the desired performance penalty, and the compiler will attempt to cover as many operations as possible, greedily duplicating instructions so long as doing so does not exceed the given performance threshold. For a test suite of computations, this scheme was able to duplicate approximately 40% of the operations without increasing the run-time of the original, unhardened computations. Providing full coverage incurs a 42.2% performance penalty across the test suite.

This work is expanded in [30], adding energy consumption models and constraints. As before, the programmer can specify a constraint – this time for energy consumption, not performance – and the compiler will attempt to schedule duplicate operations until it can no longer meet the given threshold. To meet these energy constraints, the compiler executes a two-phase algorithm. In the first phase, each functional unit is checked to see if it can be put in a low-power state, which, in addition to lowering overall energy consumption, also affects the frequency or clock rate of the functional unit, and can therefore affect the overall performance of the computation. After the first phase, a new schedule is formed that meets the given energy and performance constraints. The second phase then attempts to duplicate instructions for operational coverage, doing so only if the given constraints are maintained.

In [21] and [40], the authors propose an approach in which they explore the hardware/software co-design space of a PicoBlaze soft-microprocessor. They consider a small subset of architecture components, such as the program counter, register file, and stack pointer, for hardening against errors. To do so, they test a number of potential design configurations, hardening some of the components in hardware (via TMR) and others in software (via SWIFT-R [57]). To validate the approach, the authors present a case study based on a matrix multiplication kernel code. They emulate faults by flipping only a single bit during each execution. For their case study, the "optimal" design configuration was determined to be protecting four of the five registers used in the computation via the SWIFT mechanism, while hardening the program counter, flags, stack pointer and pipeline via TMR. This configuration was able to detect 98.86% of unnecessary-for-architecturally-correct-execution faults, with an 80% increase in code size and a 56% increase in execution latency.

5.1.3 Selected Hardware Approaches

The process of configuration scrubbing is discussed in [45]. This system-level approach avoids the complications of sophisticated error detection and correction altogether. Scrubbing periodically replaces the active portion of the configurable memory – often an SRAM-based FPGA – with a "golden" copy of the design.
This golden copy is assumed to be replicated in an error-free location. As many of the approaches listed in this chapter assume either only temporary faults (i.e. SEU) or no more than one fault per programming unit, it is assumed that these techniques will be implemented in conjunction with scrubbing in order to prevent the accumulation of errors in the system.

Intuitively, scrubbing should occur at a faster rate than incoming errors, to prevent the accumulation of multi-bit errors, which can defeat most fault-tolerance methods (including TMR). The authors in [16] conclude that a scrub rate one order of magnitude more frequent than the upset rate is acceptable. In other words, the system should scrub, on average, ten times between upsets.

A different approach to error detection and mitigation for FPGAs is presented in [6]. Here, the authors suggest using an inherently radiation-hardened (i.e. non-SRAM-based) "auxiliary" FPGA to compute CRC checks. The main FPGA reserves the last two bytes of each frame to store CRC checks of its user configuration bits. The auxiliary FPGA then stores these CRC checks as a "golden" copy, as well as storing copies of the flip-flops from the main FPGA. Note that when configuration bits are used as user bits (i.e. a LUT is used as a RAM), they are excluded from the checksum computation. To provide added protection, the CRC checker circuitry on the auxiliary FPGA is implemented using TMR. During execution, the two CRC copies are checked against each other to detect errors. If an error is detected, computation halts and the configuration of the main FPGA is rolled back to a safe state using the information stored in the auxiliary FPGA.

In [56], the authors propose simultaneous and redundantly threaded (SRT) processors. The SRT approach exploits the multiple hardware contexts of a symmetric multiprocessor architecture for fault detection, where redundant copies of a program's execution are run on independent threads. Performance overhead is reduced by loosely synchronizing the threads, and eliminating cache misses and branch mis-speculations in the checking thread.

In a work developed in [36] and expanded in [32], the authors propose an alternative to TMR called Duplication With Compare-Concurrent Error Detection (DWC-CED). Here the computation is duplicated in space via dedicated hardware for the detection of a transient or permanent error. Each of the duplicated combinatorial circuits has a concurrent error detection (CED) module. In a first execution cycle, the two outputs are compared. If an error occurs, a second execution is performed with encoded inputs. The corresponding outputs are then fed into a voter circuit for the determination of the correct output value. To test the efficacy of the DWC-CED approach, three small functional units were synthesized: an 8-bit ALU, an 8-bit multiplier and an 8-bit filter. Using a recomputing-with-shifted-operands (RESO) encoding scheme, the DWC-CED scheme detected 100% of the injected errors in both the multiplier and the filter, and 91.52% of the errors in the ALU. In addition, the DWC-CED multiplier requires 22% fewer LUTs than one implemented in traditional TMR (1,791 versus 2,285, respectively).

5.2 Relationship to RaS

While the research presented in this thesis is related to many of the above topics, it can be distinguished from previous works in a number of ways.
For example, software approaches such as those described in [13, 26, 35, 48, 49] only target traditional fixed architectures, and have no mechanism to be applied to reconfigurable targets such as an FPGA. While the RaS approach detailed here is perhaps most beneficial when targeting an FPGA, it is not bound to any particular hardware target, as evidenced by the two distinct infrastructures presented in Chapter 3.

Additionally, many of these software approaches place the onus of scheduling on the compiler, blindly applying transformations at the source level. In this instance, the extra source-level statements may unintentionally hamper compiler optimizations. Also, source-level approaches are often required to intentionally decrease performance in order to maintain replica statements that would otherwise be removed by compiler optimizations. This makes sense intuitively, as replica operations, by definition, compute no "new" values, and compilers are adept at removing such code. In many cases, the hardening scheme must intentionally "confuse" the compiler, by placing replica operations in code segments that the compiler cannot optimize out. This process often introduces new statements that are not part of the original computation, and can negatively impact performance.

Rather than control the scheduling of these operations themselves, these source-level approaches merely push this responsibility down to the processor, which often has no hardware support to achieve such tasks efficiently. In many cases, this gives rise to hybrid approaches, as the performance of such schemes is often unacceptable on normal commodity hardware.

In contrast, RaS handles the scheduling of the entire computation implicitly, and therefore has finer-grained control over how to appropriately handle replication. In addition, for reconfigurable architectures, the RaS infrastructures have control over the resource configuration of the targets, allowing for better resource allocation. This synergy between compiler techniques and resource configurations is a distinguishing feature of this work, and is addressed in greater detail in Section 6.1.

The aforementioned lack of hardware support for efficient resiliency gave rise to the hybrid approaches discussed above. While such approaches can be useful for proof-of-concept research, they are often impractical to implement, as traditional fixed architectures cannot make use of the custom hardware required by these hybrid schemes. In addition, much of the work described above, such as [21, 30, 31, 59, 58], presents results either in simulation only or targets simple academic softcore architectures such as the Trimaran [9] or PicoBlaze [1]. The lack of real implementations on an existing family of FPGAs hampers the usefulness of such approaches. The RaS Open64-based infrastructure, however, is a basic C-to-RTL toolchain, and can be implemented on the Virtex family of FPGAs. While the C-to-RTL capabilities of the RaS infrastructure are not as mature or full-featured as commercial products such as Catapult C [42] or SystemC [50], those systems were not developed with resiliency in mind, and lack the compiler-level support for it that RaS possesses.

Finally, the hardware approaches presented have no knowledge of the source computation to which they are being applied. While approaches such as those in [32, 36] may offer an area savings over traditional approaches such as TMR, it is left to the programmer to decide when and where to apply them.
This creates, in effect, the very same problem as blindly applying TMR, albeit in perhaps a slightly less expensive fashion.

5.3 Chapter Summary

This chapter presented a significant subset of the relevant research in the area of resiliency. Software, hardware and hybrid approaches were presented, across a number of targets, including superscalar, VLIW and FPGA architectures. These works were compared to the RaS approach presented in this thesis, and the benefits and key differences of RaS were detailed.

Chapter 6

Conclusion

As feature sizes decrease and error rates increase, resilient computation will become increasingly important. Developing techniques that can provide this resiliency without significantly increasing area consumption – and by extension, power consumption – then becomes an essential research goal. The Resiliency-aware Scheduling approach presented in this thesis is such a technique, as it leverages existing hardware resources in a more efficient manner when compared to traditional hardware or software hardening approaches.

This chapter summarizes the major contributions of this thesis, and describes potential avenues for refinements or extensions to this work.

6.1 Contributions

The goal of this research included the development and evaluation of an automated approach to resiliency that combines compiler techniques, such as critical-path and scheduling analysis, with coarse-grained control over the target architecture configuration. RaS can, in an automated fashion, produce designs and resource configurations that require much less hardware than those derived from traditional resiliency methods. The underlying objective is a reduction in the amount of resources – either area, power, or processing cycles – required to provide this resiliency. The following sections outline the major contributions of this work to that end.

6.1.1 Resiliency-aware Scheduling

A variant of priority list scheduling, Resiliency-aware Scheduling analyzes both the properties of a high-level computation and the possible hardware resources available in order to determine the optimal resource configuration and operation assignment for a given level of operational coverage, area, or performance.

6.1.2 Intrinsic Resiliency

Fundamental to the RaS approach is the concept of intrinsic resiliency (IR). A computation's IR is a fixed value for a specific set of resources, and can be measured by the number of operations that can be replicated due to the ILP stalls of the computation. In reconfigurable contexts, a computation's IR is increasingly important for resiliency, as the programmer has fine-grained control over the target architecture. A computation's IR can change as the underlying architecture parameters change, and this relationship is not necessarily intuitive. Exploring the relationship between a computation's IR and the underlying resource configuration is the primary purpose of the RaS infrastructures described in this thesis.

6.1.3 Hybrid TMR

In order to exploit a computation's intrinsic resiliency, the concept of Hybrid TMR (hTMR) was presented and a hardware hTMR implementation was shown. These flexible hTMR units decouple the resources of a TMR unit, allowing them to be scheduled independently, if necessary. The scheduling flexibility of RaS, and its ability to exploit hTMR units, allows for a reduction in area consumption for RaS-hardened configurations.
The resource sharing provided by a computation's IR increases the utilization of existing hTMR resources, allowing the same work to be done with fewer resources, in potentially less time. In addition to a hardware implementation of hTMR, a code-generation algorithm was given that converts a high-level C or FORTRAN code into a VHDL description of the computation's data path.

6.1.4 The RaS Infrastructures

The last major contribution is the construction of two distinct infrastructures, one based on the Open64 compiler and the other based on the Vex VLIW toolchain. These RaS infrastructures provide an automated way to determine the level of intrinsic resiliency of a computation for a specified architecture configuration. In addition, the infrastructures provide area, performance and operational coverage estimates, allowing the programmer to explore the trade-offs between these three design metrics.

These infrastructures were used to validate the efficacy of the RaS approach. For the Open64-based FPGA target, RaS-hardened designs were derived for a test suite of seven realistic kernel codes. For this test suite, the RaS-derived designs were, on average, 19% smaller than competing TMR designs, when using Look Up Tables as the design metric. The Vex-based VLIW infrastructure produced promising results as well. For the case study presented, the RaS approach produced codes that varied from 15% to 40% faster than a source-level hardening approach, as instruction issue width increases. When compared to a TMR approach, these RaS-hardened computations ranged from 8% slower to 40% faster, due to better use of available resources.

6.2 Future Work

While the RaS approach presented in this thesis is promising, there are a number of potential research avenues that can refine or complement this work. This section outlines a subset of these areas.

Scheduling: While, in general, process scheduling is NP-complete, it is possible that alternative and specific scheduling algorithms may be algorithmically very efficient. In addition, more complex scheduling rules could potentially lower area consumption for RaS designs, as a more intelligent assignment of operations to functional units could reduce the number of multiplexers in an FPGA-based hardware design.

Replica operation placement constraints: Currently, replica operations are scheduled within the framework of the original, unhardened schedule, potentially extending it if there is not enough intrinsic resiliency in the computation to allow for full replication. This has a potentially negative effect on expected latency in the face of errors, as the rollback costs of operations can be large if replica operations are placed "far away" from their original operations. Placing constraints on the longest allowable "distance" between an original operation and its replica could potentially reduce the overall expected latency.

Area modeling: As the design space exploration of RaS is based on an area model, any improvements to this model could lead to more accurate selection of resource configurations, possibly reducing the area consumption of the resultant designs. The use of slice Look Up Tables as a design metric was chosen as it allows for validation with synthesized designs, but an area model based on a finer-grained metric – such as raw transistor count – could lead to a more accurate area model.

A higher level of abstraction: The operation scheduling presented here is performed at the level of individual functional units (i.e.
adders, multipliers, etc.). However, the same concepts could be used at a higher level of abstraction, with computations scheduled on accelerators. These accelerators are custom pieces of hardware, designed to perform a specific computational operation – such as a matrix multiplication, or fast Fourier transform – very quickly. In [19], the authors present a scheme to compose a virtual accelerator from many of these smaller accelerators. Such composition is feasible because many complex mathematical computations can be broken down into such functions. The scheduling of such accelerators – within the context of the larger, virtual accelerator – is analogous to the scheduling of individual operations on functional units presented here. The RaS approach could be extended to this higher level of abstraction in order to provide resiliency for accelerator-enhanced computations.

It is likely that resiliency will be an increasingly important design consideration. The need for area- and power-efficient designs that can withstand errors is growing as error rates continue to rise. This is especially true in reconfigurable computing, as such platforms are becoming more popular in power-starved embedded applications, due to quicker design cycles and lower development costs. It is the author's opinion that an automated tool that provides programmers with resilient computations based on the schedule of the program itself, thereby avoiding the overhead required by traditional methods, will some day become as ubiquitous as optimizing compilers. The work presented in this thesis is a first step in that direction.

Reference List

[1] PicoBlaze 8-bit Embedded Microcontroller User Guide, February 2012.

[2] V.D. Agrawal, C.R. Kime, and K.K. Saluja. A Tutorial On Built-in Self-test. I. Principles. IEEE Design & Test of Computers, 10(1):73–82, March 1993.

[3] I. Ahmad, Y.-K. Kwok, and M.-Y. Wu. Analysis, Evaluation, and Comparison of Algorithms for Scheduling Task Graphs on Parallel Processors. In Second International Symp. on Parallel Architectures, Algorithms, and Networks, pages 207–213, June 1996.

[4] R. Andraka. A Survey of CORDIC Algorithms for FPGA Based Computers. In Proc. of the 1998 ACM/SIGDA Sixth Intl. Symp. on Field Programmable Gate Arrays, FPGA '98, pages 191–200, New York, NY, USA, 1998. ACM.

[5] K. Arzen, A. Cervin, J. Eker, and L. Sha. An Introduction to Control and Scheduling Co-Design. In Proc. of the 39th IEEE Conf. on Decision and Control (CDC), volume 5, pages 4865–4870, 2000.

[6] G. Asadi and M. B. Tahoori. Soft Error Rate Estimation and Mitigation for SRAM-based FPGAs. In Proc. of the 2005 ACM/SIGDA 13th Intl. Symp. on Field-Programmable Gate Arrays, FPGA '05, pages 149–160, New York, NY, USA, 2005. ACM.

[7] ASC. ASC Sequoia Benchmark Codes, 2012.

[8] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler Transformations for High-Performance Computing. ACM Computing Surveys (CSUR), 26(4):345–420, December 1994.

[9] M. Balakrishnan, A. Kumar, P. Ienne, A. Gangwar, and B. Middha. A Trimaran Based Framework for Exploring the Design Space of VLIW ASIPs with Coarse Grain Functional Units. Intl. Symp. on System Synthesis, 0:2–7, 2002.

[10] E. Barszcz, J. Barton, L. Dagum, P. Frederickson, T. Lasinski, R. Schreiber, V. Venkatakrishnan, S. Weeratunga, D. Bailey, D. Browning, R. Carter, S. Fineberg, H. Simon. The NAS Parallel Benchmarks.
Technical report, The International Journal of Supercomputer Applications, 1991.

[11] A. Benso, S. Di Carlo, G. Di Natale, L. Tagliaferri, and P. Prinetto. Validation of a Software Dependability Tool Via Fault Injection Experiments. In Proc. Seventh Intl. On-Line Testing Workshop, pages 3–8, July 2001.

[12] C. Bolchini. A Software Methodology For Detecting Hardware Faults in VLIW Data Paths. IEEE Trans. on Reliability, 52(4):458–468, December 2003.

[13] C. Bolchini and F. Salice. A Software Methodology for Detecting Hardware Faults in VLIW Data Paths. IEEE Intl. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 0:0170, 2001.

[14] N.S. Bowen and D.K. Pradhan. Processor- and Memory-based Checkpoint and Rollback Recovery. Computer, 26(2):22–31, February 1993.

[15] T. Calin, M. Nicolaidis, and R. Velazco. Upset Hardened Memory Design for Submicron CMOS Technology. IEEE Trans. on Nuclear Science, 43(6):2874–2878, December 1996.

[16] C. Carmichael, M. Caffrey, and A. Salazar. Correcting Single-Event Upsets Through Partial Configuration. Technical report, Xilinx, June 2000.

[17] K. J. Christensen. The Next Frontier for Communications Networks: Power Management. Performance and Control of Next-Generation Communications Networks, 5244(1):1–4, August 2003.

[18] J.B. Clary and R.A. Sacane. Self-testing Computers. Computer, 12(10):49–59, October 1979.

[19] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture Support for Accelerator-rich CMPs. In Proc. of the 49th Annual Design Automation Conference, DAC '12, pages 843–849, New York, NY, USA, June 2012. ACM.

[20] K. D. Cooper, P. J. Schielke, and D. Subramanian. An Experimental Evaluation of List Scheduling. Technical report, Dept. of Computer Science, Rice University, 1998.

[21] S. Cuenca-Asensi, A. Martinez-Alvarez, F. Restrepo-Calle, F.R. Palomo, H. Guzman-Miranda, and M.A. Aguirre. A Novel Co-Design Approach for Soft Errors Mitigation in Embedded Systems. IEEE Transactions on Nuclear Science, 58(3):1059–1065, June 2011.

[22] P.E. Dodd, A.R. Shaneyfelt, K.M. Horn, D.S. Walsh, G.L. Hash, T.A. Hill, B.L. Draper, J.R. Schwank, F.W. Sexton, and P.S. Winokur. SEU-sensitive Volumes in Bulk and SOI SRAMs From First-principles Calculations and Experiments. IEEE Transactions on Nuclear Science, 48(6):1893–1903, December 2001.

[23] J.A. Fisher, P. Faraboschi, and C. Young. Embedded Computing: A VLIW Approach To Architecture, Compilers And Tools. Electronics & Electrical. Morgan Kaufmann, 2004.

[24] J.A. Fisher, P. Faraboschi, and C. Young. VLIW Processors: Once Blue Sky, Now Commonplace. IEEE Solid-State Circuits Magazine, 1(2):10–17, Spring 2009.

[25] P. Gill, W. Murray, D. Ponceleon, and M. Saunders. Preconditioners for Indefinite Systems Arising in Optimization. SIAM Journal on Matrix Analysis and Applications, 13(1):292–311, 1992.

[26] M.A. Gomaa and T.N. Vijaykumar. Opportunistic Transient-fault Detection. In Proc. 32nd Intl. Symp. on Computer Architecture, ISCA, pages 172–183. IEEE, June 2005.

[27] Open64 Group. Open64 Compiler Whirl Intermediate Representation. Open64 Group, August 2007.

[28] Open64 Group. The Open64 Compiler, February 2012.

[29] T. Higuchi, M. Nakao, and E. Nakano. Radiation Tolerance of Readout Electronics for Belle II. Journal of Instrumentation, 7(02):C02022, February 2012.

[30] J. Hu, F. Li, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin.
Compiler-assisted Soft Error Detection Under Performance and Energy Constraints in Embedded Systems. ACM Trans. on Embedded Computing Systems, 8(4):27:1–27:30, July 2009.

[31] J.S. Hu, F. Li, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M.J. Irwin. Compiler-directed Instruction Duplication for Soft Error Detection. In Proc. Design, Automation and Test in Europe, DATE, pages 1056–1057 Vol. 2, March 2005.

[32] F. L. Kastensmidt, G. Neuberger, L. Carro, and R. Reis. Designing and Testing Fault-tolerant Techniques for SRAM-based FPGAs. In Proc. of the 1st Conf. on Computing Frontiers, CF '04, pages 419–432, New York, NY, USA, 2004. ACM.

[33] Y.K. Kwok and I. Ahmad. Benchmarking and Comparison of the Task Graph Scheduling Algorithms. Journal of Parallel and Distributed Computing, 59(3):381–422, December 1999.

[34] Hewlett-Packard Laboratories. Vex Toolchain, 2012.

[35] M.M. Latif, R. Ramaseshan, and F. Mueller. Soft Error Protection Via Fault-resilient Data Representations. In Workshop on Silicon Errors in Logic-System Effects, 2007.

[36] F. Lima, L. Carro, and R. Reis. Designing Fault Tolerant Systems into SRAM-based FPGAs. In Proc. of the 40th Annual Design Automation Conference, DAC '03, pages 650–655, New York, NY, USA, 2003. ACM.

[37] D. G. Malcolm, J. H. Roseboom, C. E. Clark, and W. Fazar. Application of a Technique for Research and Development Program Evaluation. Operations Research, 7(5):646–669, September 1959.

[38] S. Malik, R. Rozploch, and T. Austin. Addressing the Information-System Platform Design Challenges for the Late- and Post-Silicon Era. In Proc. of 2010 Design Automation Conference, June 2010.

[39] D. Martin and G. Estrin. Models of Computations and Systems — Evaluation of Vertex Probabilities in Graph Models of Computations. J. ACM, 14:281–299, April 1967.

[40] A. Martinez-Alvarez, S. Cuenca-Asensi, F. Restrepo-Calle, F.R.P. Pinto, H. Guzman-Miranda, and M.A. Aguirre. Compiler-Directed Soft Error Mitigation for Embedded Systems. IEEE Trans. on Dependable and Secure Computing, 9(2):159–172, March-April 2012.

[41] C. McNairy and D. Soltis. Itanium 2 Processor Microarchitecture. IEEE Micro, 23(2):44–55, March-April 2003.

[42] Mentor Graphics. Catapult C Synthesis, 2008.

[43] K. Morgan, M. Caffrey, J. Carroll, D. Gibelyou, P. Graham, W. Howes, J. Johnson, D. McMurtrey, P. Ostler, B. Pratt, H. Quinn, and M. Wirthlin. Fault Tolerance Techniques and Reliability Modeling for SRAM-Based FPGAs. In Radiation Effects in Semiconductors, Devices, Circuits, and Systems, chapter 10, pages 249–272. CRC Press, August 2010.

[44] K.S. Morgan, D.L. McMurtrey, B.H. Pratt, and M.J. Wirthlin. A Comparison of TMR With Alternative Fault-Tolerant Design Techniques for FPGAs. IEEE Trans. on Nuclear Science, 54(6):2065–2072, December 2007.

[45] S.S. Mukherjee, J. Emer, T. Fossum, and S.K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proc. 10th Pacific Rim Intl. Symp. on Dependable Computing, pages 37–42. IEEE, March 2004.

[46] R. Naseer and J. Draper. DF-DICE: A Scalable Solution for Soft Error Tolerant Circuit Design. In Proc. Intl. Symp. on Circuits and Systems, ISCAS, 4 pp. IEEE, May 2006.

[47] M. Nicolaidis. Time Redundancy Based Soft-error Tolerance to Rescue Nanometer Technologies. In Proc. 17th IEEE VLSI Test Symp., pages 86–94. IEEE, April 1999.

[48] N. Oh, S. Mitra, and E. McCluskey. ED4I: Error Detection by Diverse Data and Duplicated Instructions. IEEE Trans. on Computers, 51(2):180–199, February 2002.
[49] N. Oh, P. Shirvani, and E. McCluskey. Error Detection by Duplicated Instructions in Super-scalar Processors. IEEE Trans. on Reliability, 51(1):63–75, March 2002.

[50] P.R. Panda. SystemC - A Modeling Platform Supporting Multiple Design Abstractions. In Proc. The 14th Intl. Symp. on System Synthesis, pages 75–80, October 2001.

[51] David Patterson. Limits of Instruction-Level Parallelism. Lecture slides, April 2001.

[52] P.G. Paulin and J.P. Knight. Force-directed Scheduling for the Behavioral Synthesis of ASICs. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 8(6):661–679, June 1989.

[53] A. Piotrowski, D. Makowski, G. Jablonski, S. Tarnowski, and A. Napieralski. Hardware Fault Tolerance Implemented in Software at the Compiler Level With Special Emphasis on Array-variable Protection. In 15th Intl. Conf. on Mixed Design of Integrated Circuits and Systems, MIXDES, pages 115–119, June 2008.

[54] M. Purnaprajna and P. Ienne. Making Wide-issue VLIW Processors Viable on FPGAs. ACM Trans. on Architecture and Code Optimization, 8(4):33:1–33:16, January 2012.

[55] M. Rebaudengo, M. Reorda, M. Violante, and M. Torchiano. A Source-to-Source Compiler for Generating Dependable Software. In Proc. of the First IEEE Intl. Workshop on Source Code Analysis and Manipulation, pages 33–42, 2001.

[56] S. K. Reinhardt and S. S. Mukherjee. Transient Fault Detection Via Simultaneous Multithreading. SIGARCH Computer Architecture News, 28(2):25–36, May 2000.

[57] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. SWIFT: Software Implemented Fault Tolerance. In Proc. of the Intl. Symp. on Code Generation and Optimization, CGO '05, pages 243–254, Washington, DC, USA, March 2005. IEEE Computer Society.

[58] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee. Software-controlled Fault Tolerance. ACM Trans. on Architecture and Code Optimization, 2(4):366–396, December 2005.

[59] G.A. Reis, J. Chang, N. Vachharajani, S.S. Mukherjee, R. Rangan, and D.I. August. Design and Evaluation of Hybrid Fault-detection Systems. In Proc. 32nd Intl. Symp. on Computer Architecture, ISCA, pages 148–159. IEEE, June 2005.

[60] M. Scholzel. HW/SW Co-detection of Transient and Permanent Faults With Fast Recovery in Statically Scheduled Data Paths. In Design, Automation Test in Europe Conference Exhibition, DATE, pages 723–728, March 2010.

[61] B. Shirazi, M. Wang, and G. Pathak. Analysis and Evaluation of Heuristic Methods for Static Task Scheduling. Journal of Parallel and Distributed Computing, 10(3):222–232, November 1990.

[62] J. Sklansky. Conditional-Sum Addition Logic. IRE Trans. on Electronic Computers, EC-9(2):226–231, June 1960.

[63] L. Sterpone, D. Sabena, S. Campagna, and M.S. Reorda. Fault Injection Analysis of Transient Faults in Clustered VLIW Processors. In 14th Intl. Symp. on Design and Diagnostics of Electronic Circuits Systems, DDECS, pages 207–212. IEEE, April 2011.

[64] M.R. Varela, K.B. Ferreira, and R. Riesen. Fault-tolerance for Exascale Systems. In Intl. Conf. on Cluster Computing Workshops and Posters, pages 1–4. IEEE, September 2010.

[65] J. von Neumann. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Automata Studies, pages 43–98, 1956.

[66] D. W. Wall. Limits of Instruction-level Parallelism. In Proc. of the Fourth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS-IV, pages 176–188, New York, NY, USA, April 1991. ACM.
Wong, T. van As, and G. Brown. -VEX: A Reconfigurable and Extensible Softcore VLIW Processor. In Intl. Conf. on ICECE Technology, pages 369 –372, December 2008. [68] Xilinx Corp. Virtex-5 TM Platform FPGAs: Complete Data Sheet, 2007. [69] Xilinx Corp. Virtex-7 TM Series FPGAs: Overview, 2012. 117
Abstract
Hostile environments, shrinking feature sizes and processor aging elicit a need for resilient computing. Traditional coarse-grained approaches, such as software Checkpoint and Restart (C/R) and hardware Triple Modular Redundancy (TMR), while exhibiting acceptable levels of fault coverage, are often wasteful of resources such as time, device/chip area or power. To mitigate these shortcomings, this thesis introduces Resiliency-aware Scheduling (RaS), a source-level approach that combines traditional compiler techniques, such as critical path and dependency analysis, with the ability to modify the target architecture's resource configuration. In many cases, this new approach offers operational coverage similar to traditional schemes while enjoying a performance advantage and area savings.

To support Resiliency-aware Scheduling, several novel concepts and contributions are introduced. First, this thesis introduces the concept of Intrinsic Resiliency (IR), a measure of the Temporal Redundancy (TR) made available by a computation's lack of instruction-level parallelism on a fixed set of resources. Second, Hybrid TMR (hTMR), a flexible hardware design scheme that allows trade-offs between performance and resiliency, is presented. By implementing hTMR units, an RaS-hardened design maintains operational coverage while providing more scheduling flexibility, and thereby potentially better performance, than designs using traditional monolithic TMR units. Lastly, two distinct Resiliency-aware Scheduling infrastructures are described: the first based on the Open64 compiler and targeting Field Programmable Gate Arrays (FPGAs), and the second based on the VEX compiler toolchain and targeting reconfigurable softcore Very Long Instruction Word (VLIW) architectures.

This thesis also presents promising experimental results from the use of these two infrastructures. For the FPGA target, using slice Look-Up Tables (LUTs) as the design metric, the RaS designs synthesized from a test suite of realistic kernel codes were, on average, 19% smaller than TMR designs with equivalent performance and operational coverage. For the softcore VLIW target, the RaS-hardened executables developed for a case study run between 15% and 40% faster than a competing software hardening approach. The RaS-derived VLIW executables are also shown to scale better than TMR, with up to a 40% performance improvement over an equivalently sized TMR computation.

While mainstream compilation and synthesis tools focus exclusively on raw execution time or silicon area usage, the increasing rates of soft errors will likely prompt tool designers to treat error mitigation as an integral part of their design methodologies. The ability to handle resiliency in the same automated fashion as other design attributes, such as area or power consumption, is greatly needed as architectures evolve. The work described here is a first step in this direction.
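To make the notion of Intrinsic Resiliency more concrete, the sketch below is offered as an illustration; it is not part of the original dissertation, and the function names, the unit-latency assumption, and the toy dependence graph are all invented for this example. It performs a simple resource-constrained list scheduling of an operation DAG and reports the fraction of issue slots left idle by the computation's limited instruction-level parallelism. Under the RaS view, those idle slots are the temporal redundancy into which duplicated or triplicated operations can be scheduled at little or no performance cost.

    def schedule_length(deps, num_ops, num_units):
        # Greedy ASAP list scheduling of a unit-latency operation DAG.
        # deps maps an operation index to the set of operations it depends on.
        done = set()
        cycles = 0
        while len(done) < num_ops:
            # An operation is ready once all of its predecessors have completed.
            ready = [op for op in range(num_ops)
                     if op not in done and deps.get(op, set()) <= done]
            if not ready:
                raise ValueError("dependence graph contains a cycle")
            done.update(ready[:num_units])  # issue at most num_units ops per cycle
            cycles += 1
        return cycles

    def intrinsic_resiliency(deps, num_ops, num_units):
        # Fraction of issue slots the schedule leaves idle (the "ILP gap").
        cycles = schedule_length(deps, num_ops, num_units)
        total_slots = cycles * num_units
        return (total_slots - num_ops) / total_slots

    # Toy dependence graph: op 2 needs op 0; op 3 needs ops 1 and 2; and so on.
    deps = {2: {0}, 3: {1, 2}, 4: {3}, 5: {3}, 6: {4, 5}}
    print(intrinsic_resiliency(deps, num_ops=7, num_units=3))  # ~0.53

In this chain-heavy example, a 3-wide machine leaves roughly half of its issue slots empty; a resiliency-aware scheduler could fill those slots with redundant copies of critical operations rather than leaving the hardware idle.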
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Low cost fault handling mechanisms for multicore and many-core systems
Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers
Introspective resilience for exascale high-performance computing systems
Adaptive and resilient stream processing on cloud infrastructure
Improving efficiency to advance resilient computing
Multi-softcore architectures and algorithms for a class of sparse computations
Scalable exact inference in probabilistic graphical models on multi-core platforms
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
Data-driven methods for increasing real-time observability in smart distribution grids
Compiler and runtime support for hybrid arithmetic and logic processing of neural networks
Radiation hardened by design asynchronous framework
High level design for yield via redundancy in low yield environments
AI-enabled DDoS attack detection in IoT systems
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Dispersed computing in dynamic environments
Provenance management for dynamic, distributed and dataflow environments
Advancing distributed computing and graph representation learning with AI-enabled schemes
Accelerating reinforcement learning using heterogeneous platforms: co-designing hardware, algorithm, and system solutions
Autotuning, code generation and optimizing compiler technology for GPUs
Optimal defect-tolerant SRAM designs in terms of yield-per-area under constraints on soft-error resilience and performance
Asset Metadata
Creator
Abramson, Jeremy D. (author)
Core Title
Resiliency-aware scheduling
School
Andrew and Erna Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science (High Performance Computing and Simulations)
Publication Date
03/05/2013
Defense Date
01/10/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
area consumption, compilers, Errors, fault-tolerance, FPGA, OAI-PMH Harvest, reconfigurable, resiliency, scheduling, SEU, single-event upsets, VLIW
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Diniz, Pedro C. (committee chair), Nakano, Aiichiro (committee member), Raghavendra, Cauligi S. (committee member)
Creator Email
jdabrams@usc.edu, jeremy.d.abramson@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-223327
Unique identifier
UC11293872
Identifier
usctheses-c3-223327 (legacy record id)
Legacy Identifier
etd-AbramsonJe-1459.pdf
Dmrecord
223327
Document Type
Dissertation
Rights
Abramson, Jeremy D.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA