IMPROVING THE EFFICIENCY OF CONFLICT DETECTION AND CONTENTION MANAGEMENT IN HARDWARE TRANSACTIONAL MEMORY SYSTEMS

by

Woojin Choi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

December 2012

Copyright 2012 Woojin Choi

(No More Negative) Acknowledgements

A polite conflict resolution policy in Transactional Memory often shows terrible performance: the receiver constantly sends negative acknowledgements, and the requester politely waits for the receiver. My own policy was not that fancy either. I was always starved, often livelocked, and sometimes deadlocked. My greedy algorithm frequently kept all our resources busy, but you guys never complained, politely waiting for me and earnestly cheering me on. I cannot say anything to you except thanks.

My transaction is almost finished, so now I am sending acknowledgements with my whole warm heart – I hope it's not too late – to my peers, Lihang, Aditya, Fatemeh, and Gopi; to the committed members, TJ, Young, Zafar, and Mahta; and to the engineering staff, Tim Barrett, Spundun Bhatt, and especially our magician, Jeff Sondeen.

Dr. Aiichiro Nakano and Dr. Murali Annavaram have guided me and given me precious feedback throughout the dissertation process, and I appreciate their invaluable advice. I would also like to thank Dr. Viktor Prasanna and Dr. Massoud Pedram for serving as members of my guidance committee.

My acknowledgements should also be forwarded to my friends and family, who always encourage me and believe in me with patient love, especially to my mom. I still remember Feb 6th, 2008. Without your help, I could not have reached the here and now. Thanks, Jeff.
Woojin

Table of Contents

(No More Negative) Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
  1.1 Overview
  1.2 Research Contribution and Impact
    1.2.1 Adaptive Grain Signatures
    1.2.2 Unified Signatures
    1.2.3 Mileage-based Contention Management
    1.2.4 Research Impact
  1.3 Organization
Chapter 2 Background on Transactional Memory Systems
  2.1 Introduction
  2.2 Overview on Transactional Memory
    2.2.1 Comparison of Transactional Memory with Lock Synchronization
    2.2.2 Design Aspects of Transactional Memory Systems
  2.3 Overview on Conflict Detection
  2.4 Overview on Hardware Signatures
    2.4.1 Hardware Signatures with Transactional Memory Systems
    2.4.2 Analytical Model of False Positives
  2.5 Overview on Contention Management
    2.5.1 Reactive Contention Management
    2.5.2 Proactive Contention Management
  2.6 Related Work
    2.6.1 Conflict Detection
    2.6.2 Contention Management
Chapter 3 Evaluation Methodology
  3.1 Performance Evaluation
    3.1.1 Simulation Environment
    3.1.2 Benchmark Programs
  3.2 Hardware Estimation
Chapter 4 Adaptive Grain Signatures
  4.1 Overview
  4.2 Introduction
  4.3 Motivation
  4.4 Proposed Design
    4.4.1 Hash Functions
    4.4.2 Abort History Table
    4.4.3 Adaptive Grain Signatures with Transactional Memory Systems
  4.5 Experimental Results
    4.5.1 Positive Distribution
    4.5.2 Hit Rates of Abort History Table
    4.5.3 Performance
    4.5.4 Hardware Implementation
  4.6 Related Work
  4.7 Conclusions
Chapter 5 Unified Signatures
  5.1 Overview
  5.2 Introduction
  5.3 Motivation
  5.4 Proposed Design
    5.4.1 Blind Unified Signatures
    5.4.2 Unified Signature with Helper
  5.5 Experimental Results
    5.5.1 Performance of Unified Signatures
    5.5.2 Sensitivity to Helper Size
    5.5.3 False Positives in Unified Signatures
    5.5.4 Comparison with Asymmetric Signatures
    5.5.5 Hardware Implementation
  5.6 Related Work
  5.7 Conclusions
Chapter 6 Mileage-based Conflict Management
  6.1 Overview
  6.2 Introduction
  6.3 Motivation
  6.4 Proposed Design
    6.4.1 Mileage Instructions and a Mileage Unit
    6.4.2 Mileage-based Reactive Contention Management
    6.4.3 Mileage-based Proactive Contention Management
    6.4.4 Dynamic Mileage Allocation
  6.5 Experimental Results
    6.5.1 Simulation Results with Mileage-based RCM
    6.5.2 Simulation Results with Mileage-based PCM
    6.5.3 Putting It All Together
  6.6 Related Work
  6.7 Conclusions
Chapter 7 Conclusions
References

List of Figures

Figure 2-1 Synchronization with Locks and Transactions
Figure 2-2 Conflict Detection with Tag Augmentation
Figure 2-3 Conflict Detection with Hardware Signatures
Figure 2-4 Operations with Hardware Signatures
Figure 2-5 Hardware Signatures in Transactional Memory Systems
Figure 2-6 False Positives with Hardware Signatures
Figure 2-7 Distribution of False Positives
Figure 2-8 Contention Management
Figure 4-1 Positive Distribution with Different Granularities
Figure 4-2 Conflict Scenario in Eager Systems
Figure 4-3 Adaptive Grain Signatures (512-bit Signature)
Figure 4-4 Positive Distribution with Complex Adaptive Grain Signatures
Figure 4-5 Execution Time Normalized to Perfect Signature
Figure 4-6 Speedup of Adaptive Grain Signature over PBX Signature
Figure 5-1 Blind Unified Signatures
Figure 5-2 Example of Read-Read Dependency
Figure 5-3 Distribution of Read-Read Dependencies with Blind Unified Signature (total 2K bits)
Figure 5-4 Unified Signature with Helper Signature
Figure 5-5 Execution Time Normalized to a Perfect Signature
Figure 5-6 Speedup of Unified Signatures over Separate Read-/Write-Signatures
Figure 5-7 Average Speedup of Unified Signatures over Separate Signatures
Figure 5-8 The Impact of Helper Size on the Unified Signatures with Helper
Figure 5-9 Execution Time with Asymmetric Signatures
Figure 6-1 Performance-Criticality in Intruder
Figure 6-2 Mileage Unit
Figure 6-3 The Overview of Pseudo-Lock Insertion
Figure 6-4 Pseudocode for PLITxStart, PLITxCommit, and PLITxSchedule
Figure 6-5 Dynamic Mileage Unit
Figure 6-5 Execution Time with RCMs (T: Time, S: Size, and M: Mileage)
Figure 6-6 Breakdown of Wasteful Transaction Time (T: Time, S: Size, and M: Mileage)
Figure 6-7 Execution Time with PCMs (n: no PCM, a: ATS, c: CAS, and p: PLI)
Figure 6-8 Breakdown of Wasteful Transaction Time (n: no PCM, a: ATS, c: CAS, and p: PLI)
Figure 6-9 Speedup with Combination of RCMs (T: Time and M: Mileage) and PCMs (n: no PCM, a: ATS, c: CAS, and p: PLI)

List of Tables

Table 3-1 System Configuration
Table 3-2 Benchmark Characteristics
Table 3-3 Transaction Characteristics
Table 3-4 Hardware Overhead for Implementing Bit Arrays
Table 4-1 Hit Rate of Abort History Table
Table 4-2 Synthesis Results of Adaptive Grain Signatures and PBX Signatures
Table 4-3 Area Overhead of Adaptive Grain Signatures and PBX Signatures over Sun Rock Processor Core
Table 5-1 Data-Set Characteristics (Locality-Sensitive H3 Signatures with Total 2K bits)
Table 5-2 False Positive Rates (Total 2K bits)
Table 5-3 Synthesis Results of Unified Signatures
Table 6-1 Transaction Characteristics of Intruder
Table 6-2 The Overhead of Mileage Instructions
Table 6-3 Dynamic Priority Decision

Abstract

Chip Multiprocessors (CMPs) are becoming the mainstream due to the physical power limits of process technology. In this parallel era, software applications no longer automatically benefit from improvements in processor performance as they did in past decades. The benefit of CMPs can only be realized by environments that enable the efficient creation of parallel applications.

Transactional Memory (TM) is a promising paradigm that aims to simplify parallel programming by providing a programmer-friendly alternative to traditional lock-based synchronization. With TM, programmers focus only on the correctness of their parallel programs by composing applications in units of transactions, blocks of code that execute atomically and in isolation. The underlying TM system is responsible for enforcing atomicity and extracting performance. By decoupling correctness and performance, TM can make parallel programming much easier and enable better programmer productivity than lock primitives.

TM systems attempt to harvest high performance by executing multiple transactions in parallel. In TM systems, a conflict occurs when a memory block is accessed concurrently by two or more transactions and at least one of the accesses is a write. Detecting conflicts is critical to the correctness as well as the performance of TM systems. In this dissertation, we propose two conflict detection mechanisms, adaptive grain signatures and unified signatures, to improve the efficiency of conflict detection.

Observing that some false positives can be helpful to performance by triggering the early abort of a transaction that would encounter a true conflict later anyway, we propose the adaptive grain signature, which improves performance by dynamically changing the range of address keys based on abort history. With adaptive grain signatures, we can increase the number of performance-friendly false positives as well as decrease the number of performance-destructive false positives.

Instead of using separate read- and write-signatures, as is often done in TM systems, we implement a single signature, a unified signature, to track all read and write accesses. By merging read- and write-signatures, a unified signature can effectively enlarge the signature coverage without additional overhead.
Within the constraints of a given hardware budget, a TM system with a unified signature outperforms a baseline system with same-sized traditional signatures by reducing the number of falsely detected conflicts. Even though the unified signature scheme incurs read-read dependencies, we show that these false dependencies do not negate the benefit of unified signatures and can effectively be filtered out. A TM system with a 2K-bit unified signature with a helper signature scheme achieves speedups of 15% over the baseline TM with 33% less area and 49% less power.

How to resolve or prevent conflicts, that is, contention management, is another building block of TM systems that significantly impacts TM performance. Traditionally, critical sections or transactions have been treated as executable in any order, with no weights, as long as atomicity is maintained. We have observed that some transactions are more important than others with respect to performance, depending on the implemented algorithm. Based on this observation, we propose the mileage technique, a software/hardware cooperative approach with new instructions and a new functional unit that exploits performance-criticality among transactions. Our mileage-based contention management achieves average speedups of 15% over baseline contention management.

Chapter 1 Introduction

1.1 Overview

The computer industry has changed direction from scaling clock frequency to increasing the number of cores on a chip due to the limitations of single-core processors [7], [64]. As a result, multicore architectures are omnipresent in servers, desktops, and even embedded systems [5], [48], [94]. In this parallel era, software applications no longer automatically benefit from improvements in processor performance as they did in past decades. The benefit of CMPs can only be realized by writing efficient parallel applications [89], [35]. To pursue the advantages of parallelism, the hardware must help software writers compose parallel applications more easily, and application programmers should look for efficient methods to extract more performance from the parallel hardware.

To make efficient use of the increased number of cores on a chip, parallel programming is indispensable. Even with its long history, writing correct as well as efficient parallel programs remains difficult because of inherent complexities. The Transactional Memory (TM) programming model has attracted considerable attention as a promising paradigm for alleviating the difficulty of parallel programming [1], [35], [41].

TM is a concurrency control mechanism that provides atomic and isolated execution for regions of code [39], [42], [60], [73]. Atomicity means that all instructions in a transaction are either completed successfully or cancelled entirely. Isolation means that all operations by a transaction appear invisible to other concurrently-running transactions until the transaction completes. With TM, programmers focus only on the correctness of their parallel program by defining transactions, the code blocks that execute atomically and in isolation. The underlying TM system is responsible for enforcing atomicity and extracting performance. By decoupling correctness and performance, TM can make parallel programming much easier and enable better programmer productivity than lock-based synchronization [65], [66], [76], [77].
Motivated by this unrivaled merit, processor developers have begun providing their chips with TM support, such as AMD ASF [27], IBM Blue Gene/Q [40], Intel Haswell [44], and Sun Microsystems Rock [92].

TM systems attempt to harvest high performance by executing multiple transactions concurrently. To maintain correctness among concurrent transactions, only conflict-free transactions execute completely (commit). A conflict occurs when two or more transactions operate concurrently on the same data and at least one of them writes new data. If a conflict is detected, one of the conflicting transactions can commit, and the others either restart after discarding their updates (abort) or stall.

To detect conflicts, TM systems track the read-set and write-set of each transaction. The read-set denotes the set of locations that a transaction has read from, and the write-set denotes the set of locations that a transaction has written to. We refer to the union of a transaction's read-set and write-set as its data-set.

Conflict detection is critical to the correctness of TM systems. If conflicts are not detected thoroughly, the coherence necessary for correct operation of a TM system cannot be guaranteed. Hence, the conflict detection mechanism in TM systems must never miss any conflicts. Conflict detection is also important for extracting high performance: if conflicts are detected falsely, correctly-running transactions will be aborted or stalled, which can hurt TM performance.

The management of conflicts also significantly impacts TM performance [31]. There are two alternative approaches for managing conflicts: Reactive Contention Management (RCM) [86], [87] and Proactive Contention Management (PCM) [10], [98]. RCM, or conflict resolution, cures a conflict after it has been detected. PCM, or transaction scheduling, prevents conflicts before they occur.

1.2 Research Contribution and Impact

The previous section briefly reviewed the concept of TM systems and their basic building blocks. Even with its potential to ease the creation of parallel programs, TM will not be useful to mainstream markets if it does not meet performance requirements with an efficient implementation. Although prior research has addressed the issues of conflict detection and management, the corresponding solutions were not adequately efficient. Our proposed solutions, adaptive grain signatures, unified signatures, and mileage-based contention management, focus on improving the efficiency of conflict detection and contention management in TM systems so that HTM systems can be more readily adopted into architecture designs. This section presents a summary of the research contributions resulting from this work.

1.2.1 Adaptive Grain Signatures

Chapter 4 describes a novel hardware signature design, the adaptive grain signature. Based on the Bloom filter [11], a hardware signature is an area-efficient design for concisely representing a very large set of addresses. Even though signatures effectively overcome the problems of traditional conflict detection mechanisms, they introduce another problem: false positives. Due to its lossy encoding nature, a signature can produce a false positive, declaring a conflict even when no actual conflict exists. False positives can degrade performance by making correctly-running transactions abort or stall. Previous signature designs attempted to reduce the total number of false positives, assuming that false positives are detrimental to performance.
However, we observe that false positives are not always harmful to performance. False positives can be categorized into good false positives, i.e., false positives that are actually helpful to TM performance, and bad false positives. A good false positive occurs when the signature detects a conflict erroneously but the running transaction would incur a conflict anyway. This false but early conflict detection can improve performance by stopping the execution of a transaction that will eventually encounter a conflict.

Based on this observation, we propose the adaptive grain signature to improve performance by increasing the number of good false positives and decreasing the number of bad false positives. An adaptive grain signature maintains a history of transaction aborts and dynamically changes the range of the address bit-field used for calculating signature indices based on that history.

1.2.2 Unified Signatures

TM-supporting hardware such as hardware signatures is hardly helpful to the performance of single-threaded programs or of parallel programs written with other programming techniques such as the Message Passing Interface (MPI) [69]. Therefore, it is desirable to maximize TM performance within a given hardware budget.

Previous signature designs consist of dedicated read- and write-signatures that monitor the read-set and write-set, respectively. However, the disparity between read-set size and write-set size stemming from application characteristics often introduces asymmetric occupancy between the two signature types. This imbalance results in suboptimal utilization of a given hardware resource, causes more false positives, and therefore degrades TM performance.

Chapter 5 presents the unified signature, which tracks both read and write accesses with a single signature instead of separate read- and write-signatures to maximize TM performance within a limited hardware budget. By merging read- and write-signatures, a unified signature can effectively enlarge the signature size without additional hardware overhead and improve TM performance by reducing the number of false positives. A TM system with a unified signature whose size is the sum of the separate signatures outperforms a baseline system that uses separate signatures.

1.2.3 Mileage-based Contention Management

In Chapter 6, we show that even though critical sections are equally important for correctness, they are unequally important for performance. Among the critical sections in a parallel application, some are more important than others with respect to performance, depending on the implemented algorithm.

To better exploit application-inherent characteristics, we propose the mileage technique, a software/hardware cooperative mechanism that exploits performance-criticality among critical sections. New instructions, MILEAGE and MRSTCNT, and a new hardware unit, the mileage unit, are proposed to measure the progress of a running thread. The overhead of mileage is measured in terms of the number of additional instructions executed, as well as hardware area and power.

We present the efficiency of mileage-based atomic section ordering in the context of contention management in TM systems, evaluating both mileage-based RCM and mileage-based PCM. Previous RCM schemes treat all transactions with no weights and make decisions based on information provided by the running transaction instance.
The decision from mileage-based RCM is made with the relative importance of each transaction from the program flow as well as the dynamic flow, selecting the transaction with the smaller mileage value.

For PCM approaches, conflicts can be prevented by throttling the number of concurrently-running transactions. We propose Pseudo-Lock Insertion (PLI) as a mileage-based PCM. PLI consists of abort prediction and pseudo-locking. As an abort predictor, we use a 4-bit saturating counter that keeps track of the aborts between multiple instances of the same static transaction. The implemented predictor is simple but effective enough to capture the highly-predictable phased execution of transactional applications, and it does not severely hurt the performance of applications without phased execution. Pseudo-locking is used for serializing transactions: transactions predicted to cause aborts wait for a pseudo-lock release instead of starting execution. A pseudo-lock is accessed with ordinary reads and writes, so the overhead of atomic instructions is avoided. Mileage information is used to throttle the critical transactions in order to extract more performance.
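To make the flavor of such an abort predictor concrete, the following C sketch maintains one 4-bit saturating counter per static transaction. The counter width follows the text; the update rule, the prediction threshold, and all names here are illustrative assumptions, not the exact design evaluated in Chapter 6.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative sketch: one 4-bit saturating counter (0..15) per
       static transaction; the table size is an assumption. */
    #define NUM_STATIC_TXS 16
    static uint8_t abort_ctr[NUM_STATIC_TXS];

    /* Called when an instance of static transaction tx commits or aborts. */
    void predictor_update(int tx, bool aborted) {
        if (aborted) { if (abort_ctr[tx] < 15) abort_ctr[tx]++; }  /* saturate high */
        else         { if (abort_ctr[tx] > 0)  abort_ctr[tx]--; }  /* saturate low  */
    }

    /* Predict an abort when the counter is in its upper half; a predicted
       transaction would spin on the pseudo-lock (ordinary loads and
       stores) instead of starting speculative execution. */
    bool predict_abort(int tx) { return abort_ctr[tx] >= 8; }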
1.2.4 Research Impact

Transactional Memory (TM) has attracted considerable attention as a promising paradigm for alleviating the difficulty of parallel programming. In order to make TM more easily adoptable as a mainstream parallel programming model, our research has focused on improving its performance so as to enable the efficient creation of parallel applications. Adaptive grain signatures improve performance by increasing the number of performance-friendly false positives as well as decreasing the number of performance-destructive false positives. Unified signatures effectively enlarge the signature coverage without additional overhead: a TM system with a 2K-bit unified signature with a helper signature scheme achieves speedups of 15% over the baseline TM with 33% less area and 49% less power. Finally, mileage enforces the execution ordering of transactions based on the observation that some transactions are more performance-critical than others. Mileage-based contention management achieves average speedups of 15% over baseline contention management with marginal hardware overhead.

The proposed mechanisms are not only useful for improving TM performance but are also extendable to other computer architecture designs. For example, a signature, or Bloom filter, is a popular technique for maintaining set membership. Our unified signatures can replace ordinary signatures to extract more performance by better utilizing a given hardware budget. Mileage can also be extended to general shared-memory parallel applications, or even to message-passing applications, for managing threads, allocating shared resources, and reducing power consumption.

1.3 Organization

The rest of this dissertation is organized as follows. Chapter 2 presents foundational background for this research: the concept of Transactional Memory, conflict detection mechanisms, and the basic design of hardware signatures. Chapter 3 describes our evaluation methodology. Chapter 4 and Chapter 5 elaborate adaptive grain signatures and unified signatures, respectively. Chapter 6 presents mileage-based conflict management. Finally, Chapter 7 concludes the dissertation.

Chapter 2 Background on Transactional Memory Systems

2.1 Introduction

This chapter provides background information for this dissertation. Section 2.2 presents the basic concept of Transactional Memory (TM). Section 2.3 describes conflict detection, an element essential to the correctness as well as the performance of TM systems. Section 2.4 delves into the details of hardware signatures in TM systems. Section 2.5 presents contention management, a mechanism critical to TM performance. Finally, Section 2.6 reviews related work on conflict detection as well as contention management.

2.2 Overview on Transactional Memory

2.2.1 Comparison of Transactional Memory with Lock Synchronization

Traditional parallel programming with locks is too complicated for the average developer [89]. In lock-based synchronization, a programmer is responsible for both correctness and high performance. Figure 2-1 (a) presents an example of program execution with coarse-grain locks. The critical sections, code regions wrapped in a lock/unlock pair (A, B, and C in Figure 2-1 (a)), are protected by a single lock and are independent of each other. Even though there are no dependencies between them, all critical sections are serialized because they are protected with a single global lock. As a result, with coarse-grain locking it is easy to write correct parallel programs, at the cost of poor performance. To harvest more performance, coarse-grain locks can be replaced with fine-grain locks that overlap the execution of critical sections. Fine-grain locking typically yields better performance but is error-prone. Synchronization with locks also incurs several problems such as deadlock, convoying, priority inversion, and lack of composability [35], [42].

Figure 2-1 Synchronization with Locks and Transactions

Programming with transactions is as easy as programming with coarse-grain locking; however, TM systems can reap performance similar to that of fine-grain locking [1], [35], [41], [66], [77]. Figure 2-1 (b) illustrates an example of program execution with transactions. A transaction is similar to a critical section except that it is protected with TxStart and TxCommit (Footnote 1) instead of lock and unlock. With TM, programmers just focus on defining the transactions. The underlying TM system executes multiple transactions in parallel under the optimistic assumption that there are no dependencies between them. If a conflict between concurrently-running transactions is detected, the running transaction aborts or is stalled. In Figure 2-1 (b), transaction C from Thread 2 experiences a conflict and aborts. After the conflict has been resolved (i.e., transaction C from Thread 1 commits), the aborted transaction C from Thread 2 restarts.

Footnote 1: When a processor encounters a TxStart instruction, its register state (e.g., the program counter and architectural registers) is checkpointed to support rollback recovery on a transaction abort. When a TxCommit instruction executes, the processor releases the saved checkpoint.
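The contrast of Figure 2-1 can be made concrete with a small C sketch. TxStart and TxCommit stand for the TM primitives introduced above; the list type, list_insert, and global_lock are hypothetical helpers, not part of any particular TM system's API.

    #include <pthread.h>

    typedef struct list list_t;             /* hypothetical shared structure */
    extern void list_insert(list_t *l, int v);
    extern void TxStart(void);              /* checkpoint state, begin speculation */
    extern void TxCommit(void);             /* make updates globally visible */

    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Coarse-grain locking: every critical section acquires the same
       lock, so independent sections (A, B, C) are serialized. */
    void insert_coarse(list_t *l, int v) {
        pthread_mutex_lock(&global_lock);
        list_insert(l, v);
        pthread_mutex_unlock(&global_lock);
    }

    /* The same operation as a transaction: sections run concurrently,
       and the TM system intervenes only when a conflict is detected. */
    void insert_tx(list_t *l, int v) {
        TxStart();
        list_insert(l, v);
        TxCommit();
    }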
2.2.2 Design Aspects of Transactional Memory Systems

To maintain the atomicity and isolation properties, a TM system should provide three basic functionalities: version management, conflict detection, and conflict resolution.

Version management denotes the mechanism that maintains two versions of data: the value newly written to a memory location within a transaction and the original value of that location. The new version, produced by the execution of a transaction, becomes globally visible only after the transaction commits. The old version, the original value before the transaction started, is used when the running transaction aborts. Lazy version management stores old versions in memory and new versions elsewhere, for example in a redo log. Conversely, eager version management stores new versions in memory and old versions elsewhere, for example in an undo log.

Conflict detection denotes the mechanism that tracks the read- and write-sets of a transaction and commits only conflict-free transactions. With lazy conflict detection, conflict detection is postponed until the end of each transaction: before committing, a transaction checks that no other transaction is reading the data it wrote or writing the data it read. With eager conflict detection, a conflict is detected progressively as a transaction makes memory accesses.

Finally, conflict resolution denotes the actions that the TM system takes when a conflict is detected. Once a conflict is detected, one of the competing transactions can continue its execution while the others stall or abort to maintain atomicity, based on the conflict resolution policy. Stalling the requester, aborting the requester, and aborting the receiver are popular traditional resolution policies [12].

These key mechanisms of TM can be implemented in software for Software TM (STM) systems ([32], [43], [54], [62], [63], [78], [79]), in hardware for Hardware TM (HTM) systems ([3], [18], [39], [51], [60], [91], [100]), or with a hybrid approach ([30], [47], [85]). The overall performance of STM is significantly worse than that of HTM [17], so we focus on HTM systems due to their inherent performance advantages.

HTM systems can be categorized into lazy conflict detection and lazy version management systems (LL systems, e.g., TCC [39], [56], [20]), eager conflict detection and lazy version management systems (EL systems, e.g., Bulk [18], [19]), and eager conflict detection and eager version management systems (EE systems, e.g., LogTM [60], [61], [95]). We use an EE system as our baseline TM system because it is more likely to be adopted in commercial processors of the near future. LL systems represent a more radical approach than EE systems, as they fundamentally change how coherence and consistency are defined and implemented [30].

2.3 Overview on Conflict Detection

To maintain correctness in the face of concurrency, detecting conflicts among simultaneously running transactions is essential. In HTM systems, two methods are used for conflict detection: cache tag augmentation and hardware signatures. In this section, we briefly describe these two mechanisms.

Traditional HTM systems use cache tag augmentation for conflict detection [39], [60]. In this scheme, the cache tag array contains speculative-read and speculative-write bits per cache block. Figure 2-2 illustrates conflict detection with speculative bits. After a transaction begins, the speculative-read (write) bit corresponding to the block address of each load (store) is set. During transaction execution, the speculative bits are tested against incoming coherence requests to detect conflicts. If the speculative-write bit corresponding to an incoming read request (REQ) is set, a read-write dependency is detected, and a conflict is declared. If the speculative-write bit corresponding to an incoming read-exclusive request (REQX) is set, a write-write dependency is detected, and a conflict is declared. If the speculative-read bit corresponding to an incoming REQX is set, a read-write dependency is detected, and a conflict is declared. Upon a transaction commit or abort, all the speculative bits are gang-cleared.

Figure 2-2 Conflict Detection with Tag Augmentation
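The per-block test just described reduces to a few lines of logic, sketched here in C. The structure and names are illustrative; a real implementation keeps the speculative bits in the cache tag array and evaluates the test in hardware.

    #include <stdbool.h>

    typedef struct { bool sr, sw; } spec_bits_t;    /* speculative-read/-write bits */
    typedef enum { REQ, REQX } coherence_req_t;     /* read vs. read-exclusive */

    /* Returns true if the incoming request conflicts with this block. */
    bool tag_conflict(spec_bits_t tag, coherence_req_t req) {
        if (req == REQ)
            return tag.sw;            /* remote read vs. local speculative write */
        return tag.sw || tag.sr;      /* REQX vs. local write (W-W) or read (R-W) */
    }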
Tag augmentation, however, encounters several problems [95]. First, since the cache is an already well-optimized structure, it is hard to modify it to add new functionalities such as insert/test and gang-clear operations. Second, if a transactional cache block is evicted during transaction execution, the speculative bits are lost. To solve this problem, either only the transaction with the replaced cache block continues executing until commit while other concurrent transactions are stalled [39], or the transaction with the replaced cache block always sends a negative acknowledgement (NACK) when it receives any request that misses in its cache [60]. Finally, tag augmentation shows further limitations concerning thread switching.

Figure 2-3 Conflict Detection with Hardware Signatures

Recently, hardware signatures have been proposed as an efficient alternative that overcomes the limitations of tag augmentation [18], [19], [58], [95]. Figure 2-3 illustrates conflict detection with hardware signatures. A signature is a structure separate from the cache, so no cache modifications are required. Unlike tag augmentation, hardware signatures maintain their book-keeping information even after a transactional cache block is replaced in the middle of transaction execution. Hence, the problems with cache tag augmentation are easily solved with signatures. The details of hardware signatures are presented in the following section.

2.4 Overview on Hardware Signatures

A signature is a space-efficient structure for summarizing a large number of addresses. In this section, we describe the structure of hardware signatures and the behavior of our baseline system, LogTM [60], with hardware signatures.

2.4.1 Hardware Signatures with Transactional Memory Systems

Figure 2-4 Operations with Hardware Signatures ((a) Insert: the n-bit address key of a local access's block address is hashed by h0 and h1 into n-bit signature indices that set bits in the 2^n-bit arrays; (b) Test: the address of an incoming request is hashed the same way, and a conflict is declared if the indexed bits are 1)

Signatures track each transaction's access information. After a transaction begins, each memory address it touches is inserted into a signature. Figure 2-4 shows how a memory address is inserted into and tested against a signature. A signature consists of a set of hash functions and bit arrays. As shown in Figure 2-4 (a), the block address of a memory access is decoded by a given hash function (h0 or h1) to calculate a signature index, and the corresponding bit in the signature is set on an insert operation. Figure 2-4 illustrates a bit-extraction hash function: bit extraction selects a pre-defined bit-field (the address key) from a block address as the signature index (hash value). To spread frequently occurring patterns over all indices, more efficient hash functions such as bit-permutation hashing [18], XOR-based hashing [96], or H3 hashing [70], [80] can be used. Each hash function sets one signature bit per address. When a coherence request is received, the signature is tested to check whether a conflict occurs (Figure 2-4 (b) Test). The bit array is accessed with the decoded request address, and the signature declares a conflict if all the signature bits corresponding to the requesting address are set.
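As a software analogue of Figure 2-4, the sketch below implements the insert and test operations for a 2^n-bit signature with k hash functions. hash_fn is a placeholder for whichever hash family is used (bit extraction, PBX, H3); the 64B block size matches the system configuration in Chapter 3, and the array/hash-count parameters are example values.

    #include <stdbool.h>
    #include <stdint.h>

    #define K 4                              /* number of hash functions */
    #define N 10                             /* 2^10 = 1K-bit array */
    #define SIG_BITS (1u << N)

    static bool sig[SIG_BITS];
    extern uint32_t hash_fn(int h, uint64_t block_addr);   /* placeholder hash */

    /* Insert: set one bit per hash function for the accessed block. */
    void sig_insert(uint64_t addr) {
        uint64_t blk = addr >> 6;            /* 64B block address */
        for (int h = 0; h < K; h++)
            sig[hash_fn(h, blk) & (SIG_BITS - 1)] = true;
    }

    /* Test: declare a (possibly false) positive only if all K bits are set. */
    bool sig_test(uint64_t addr) {
        uint64_t blk = addr >> 6;
        for (int h = 0; h < K; h++)
            if (!sig[hash_fn(h, blk) & (SIG_BITS - 1)])
                return false;                /* definitely not inserted */
        return true;
    }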
Figure 2-5 Hardware Signatures in Transactional Memory Systems ((a) Insert: the block address of a local load or store is hashed by h0 and h1, and a RD/ST select steers the resulting n-bit indices to the read- or write-signature; (b) Test: an incoming REQ or REQX is hashed the same way and tested against the read- and/or write-signature)

Now we consider the behavior of a TM system with read- and write-signatures. Figure 2-5 illustrates read- and write-signatures used to detect conflicts between transactions (Footnote 2). After a transaction begins, the block addresses of loads are inserted into the read-signature, and the block addresses of stores are accumulated in the write-signature (Figure 2-5 (a) Insert). When a coherence request is received, the read- and/or write-signatures are tested to check whether a conflict occurs (Figure 2-5 (b) Test).

A REQ is tested against the write-signature to detect a read-write dependency; a conflict is declared if the write-signature signals a positive. A REQX is tested against both the read-signature and the write-signature to detect read-write and write-write dependencies; if either signature signals a positive, a conflict is declared. In eager conflict detection systems such as LogTM [60], the receiver sends a NACK to the requester after detecting a conflict. When receiving a NACK, the requester decides whether to stall the running transaction or abort it, based on its conflict resolution policy [12]. Upon a transaction commit or abort, all the signature bits are cleared.

Footnote 2: This is one of the possible implementations for detecting conflicts with read-/write-signatures. Several designs could be considered depending on frequency, complexity, size, and power consumption.
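Combining these rules with the sig_test() shape from the previous sketch gives the receiver-side check. As the footnote notes, this is only one possible implementation; rd_test and wr_test stand for per-array test functions over the read- and write-signatures.

    /* Receiver-side conflict check against separate read-/write-signatures,
       reusing the coherence_req_t type from the earlier sketch. */
    bool detect_conflict(coherence_req_t req, uint64_t addr,
                         bool (*rd_test)(uint64_t), bool (*wr_test)(uint64_t)) {
        if (req == REQ)
            return wr_test(addr);                 /* read-write dependency */
        return wr_test(addr) || rd_test(addr);    /* write-write or read-write */
    }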
A signature can concisely represent a very large set of addresses. During transaction execution, signature bits corresponding to accessed addresses are set. Once set, these bits are not reset until the transaction commits or aborts, so no conflicts are missed (no false negatives). However, because of its lossy encoding nature, a signature can declare conflicts even when none exists (false positives). The conflicts due to false positives can degrade performance by making correctly-running transactions abort or stall. We can reduce the number of false positives by enlarging the signature [11], but this introduces a non-negligible hardware overhead.

Figure 2-6 False Positives with Hardware Signatures

The causes of false positives can be classified into aliasing and occupancy [70]; Figure 2-6 illustrates both. When the addresses x and y are different but their address keys are the same, the test results in a false positive (Figure 2-6 (a) Aliasing). False positives due to aliasing occur because only a portion of the memory address is used for calculating signature indices. A false positive due to occupancy occurs if every signature bit for the tested address (y) has already been set by previously-inserted addresses (x1 and x2), even though the tested address was never inserted (Figure 2-6 (b) Occupancy). As more signature bits are set, the chances of an occupancy false positive increase.

False positives cause a running transaction to stall or abort. Stalling due to false conflicts can degrade performance by limiting concurrency. Even worse, aborting due to false conflicts can severely degrade the performance of an eager version management TM system such as LogTM [60]. In the LogTM system, old data overwritten by transactional stores are saved into an undo log, and new data are stored in the memory hierarchy. If the transaction commits, the stored old data are discarded. However, if the transaction aborts, the logged old data must be restored to their appropriate locations to maintain correctness. A software handler is responsible for this log unrolling, and the resulting performance penalty is non-negligible. To limit the number of transaction aborts, a requester that receives a NACK stalls for a while instead of immediately aborting, and sends the NACKed request again in the hope that the conflicting transaction has finished. If the conflict is persistent and a deadlock situation arises (Footnote 3), the younger of the conflicting transactions is aborted. A detailed description of the LogTM system can be found elsewhere [60], [61], [95].

Footnote 3: In LogTM systems, potential deadlock is detected by recognizing the situation in which one transaction is both waiting for a logically earlier transaction and causing another logically earlier transaction to wait by sending a NACK.

2.4.2 Analytical Model of False Positives

In this section, we present a formal evaluation of signatures in terms of the probability of false positives, as discussed in [80]. The development of a more advanced model remains future work.

Assume that we have k independent hash functions and an m-bit array. Also assume that the inserted addresses are uniformly distributed amongst all m bits in the array. On a single insert operation, the probability of a bit position x being set by one hash function is 1/m. Since the k hash functions are assumed to be independent, the probability that bit position x is not set by any of the k hash functions is

    P_x(k, 1) = \left(1 - \frac{1}{m}\right)^k    (Eq. 2.1)

After n insert operations, the probability that bit position x still remains zero is

    P_x(k, n) = \left(1 - \frac{1}{m}\right)^{kn}    (Eq. 2.2)

Therefore, the probability that bit position x is set after n insertions is

    \bar{P}_x(k, n) = 1 - P_x(k, n) = 1 - \left(1 - \frac{1}{m}\right)^{kn}    (Eq. 2.3)

The test operation declares a positive when all k bits corresponding to the tested address are asserted. Hence, the probability of a positive is

    \Pr[\text{positive}] = \bar{P}_{x_1}(k, n) \cdot \bar{P}_{x_2}(k, n) \cdots \bar{P}_{x_k}(k, n) = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k    (Eq. 2.4)

We are interested in the probability of false positives. The positives calculated in Eq. 2.4 contain true positives (n in total) as well as false positives. After n address insertions, the combination of bits set by the n addresses can produce N total positives. If N is much larger than n, then the probability that the address was not inserted, conditioned on a positive result, is

    \Pr[\text{not inserted} \mid \text{positive}] = \frac{N - n}{N} \approx 1    (Eq. 2.5)

Therefore, the probability of a false positive is

    \Pr[\text{false positive}] = \Pr[\text{not inserted and positive}] = \Pr[\text{not inserted} \mid \text{positive}] \cdot \Pr[\text{positive}] \approx \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k    (Eq. 2.6)
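Eq. 2.6 is easy to evaluate numerically, as in the short C program below. For example, a 1K-bit array with k = 4 hash functions and n = 100 inserted addresses yields roughly 0.011, matching the scale of Figure 2-7 (b).

    #include <math.h>
    #include <stdio.h>

    /* Eq. 2.6: Pr[false positive] ~= (1 - (1 - 1/m)^(k*n))^k */
    static double false_positive_prob(double m, double k, double n) {
        return pow(1.0 - pow(1.0 - 1.0 / m, k * n), k);
    }

    int main(void) {
        printf("%.6f\n", false_positive_prob(1024.0, 4.0, 100.0));
        return 0;
    }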
Based on our analytical model, Figure 2-7 illustrates the relationship between false positives and the signature configuration. Figure 2-7 (a) shows the probability of false positives with k = 1, 2, 4, 8 hash functions, an m = 1K-bit array, and up to 1000 inserted addresses. When the number of inserted addresses exceeds 600, the signature with one hash function produces the smallest number of false positives. Usually, however, the read-set and write-set sizes are smaller than 100 (details in Chapter 3), and Figure 2-7 (b) emphasizes this case by inserting up to 100 addresses. In this regime, signatures with more hash functions produce fewer false positives. Finally, Figure 2-7 (c) illustrates the probability of false positives with k = 4 hash functions, m = 1K, 2K, 4K, 8K bits, and up to 1000 inserted addresses. The number of false positives decreases as the signature size grows.

Figure 2-7 Distribution of False Positives ((a) probability of false positives with a 1K-bit Bloom filter, k = 1, 2, 4, 8, up to 1000 inserted addresses; (b) the same configurations with up to 100 inserted addresses; (c) probability of false positives with four hash functions and m = 1K, 2K, 4K, 8K bits, up to 1000 inserted addresses)

The theoretical model used in this section, however, has several limitations. First of all, we assume that the inserted addresses are randomly distributed so that the hash functions produce uniformly distributed hash values. Real applications, however, exhibit memory locality, and because of it, inserted addresses can show strong correlations that lead to high false positive rates. We also assume that the k hash functions are independent, but in reality it is almost impossible to implement independent hash functions because they may share some address bits as inputs. As a result, hardware signatures with realistic hash functions can introduce more false positives than the mathematical model predicts. Simulation results will be shown in Chapter 4 and Chapter 5.

2.5 Overview on Contention Management

TM systems attempt to harvest high performance by executing multiple transactions in parallel, and the concurrent execution of transactions can result in conflicts. The management of conflicts significantly impacts TM performance [31]. There are two alternative approaches for managing conflicts: Reactive Contention Management (RCM) [86], [87] and Proactive Contention Management (PCM) [10], [98].

2.5.1 Reactive Contention Management

RCM, or conflict resolution, resolves a conflict after it has been detected. With eager conflict detection, a conflict is detected when a requester transaction tries to make a memory access. The request is forwarded to the corresponding receiver transaction, and the receiver decides whether to reject the request by sending a NACK or to make a concession, based on its RCM policy. Traditional RCMs decide which transaction continues its execution based on information from the current instance. For example, Time-based RCM prioritizes the transaction that started earlier over younger ones, and Size-based RCM selects as the winner the transaction that has accessed more memory blocks [86].
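These two baseline policies can be phrased as a receiver-side predicate: should the receiver NACK the requester? The fields below are illustrative bookkeeping, not any specific system's state.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t start_time;        /* when the transaction began  */
        unsigned blocks_accessed;   /* size of its data-set so far */
    } tx_state_t;

    /* Time-based RCM: the older transaction wins, so the receiver
       NACKs a younger requester. */
    bool nack_time(const tx_state_t *recv, const tx_state_t *reqr) {
        return recv->start_time < reqr->start_time;
    }

    /* Size-based RCM: the transaction that has accessed more memory
       blocks wins. */
    bool nack_size(const tx_state_t *recv, const tx_state_t *reqr) {
        return recv->blocks_accessed > reqr->blocks_accessed;
    }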
2.5.2 Proactive Contention Management

In a parallel application, the atomic execution of a code segment is common for maintaining correctness. Two programming alternatives for this purpose are traditional locks and TM. A locking mechanism is a pessimistic concurrency control, a kind of prediction whose output is always-predicted-conflict. If mispredicted, concurrency is needlessly suppressed and the opportunity to gain more performance is lost. On the contrary, TM is an optimistic speculation scheme, predicting always-not-conflict. If transactions conflict and result in aborts, performance can be degraded due to the abort penalties. Instead of selecting one of these two static extremes, a dynamic prediction scheme, Proactive Contention Management (PCM) [53], [98], can be used.

PCM, or transaction scheduling, prevents conflicts before they occur. PCM can not only reduce the penalty of the aborted transaction but also speed up contending transactions. In EE HTM systems, an aborted transaction prevents the conflicting transactions from continuing execution until the rollback of its undo log finishes [60]. The conflicting transactions wait for the completion of the log undoing so that they can read the non-speculative values written back to memory during the abort process. As the abort penalty not only delays the aborted transaction but is also exposed to the contending transactions, the execution of the parallel application slows down dramatically, as shown in Figure 2-8 (a). Even though the transaction from Thread 1 is aborted, the contending transaction from Thread 2 cannot continue its execution during Thread 1's abort process.

Figure 2-8 Contention Management

With PCM, conflicts can be prevented by serializing the execution of transactions that appear likely to abort. Figure 2-8 (b) presents how PCM can avoid a conflict: instead of starting the predicted-aborted transaction, Thread 1 schedules it so as to prevent the conflict. The central concerns of PCM are how to efficiently predict aborts and how to serialize the predicted-aborted transactions. Execution history is used for predicting aborts, and transactions are serialized with transaction migration, thread switching, or traditional locking. We describe how each PCM scheme fulfills this purpose in Section 2.6.2.
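Abstractly, every PCM scheme wraps transaction start with a predict-then-serialize step, as in the sketch below. The predictor and the serialization primitive (a queue, a lock, a core migration) are exactly the points where the schemes surveyed in Section 2.6.2 differ; all names here are placeholders.

    #include <stdbool.h>

    extern bool predict_abort_for(int static_tx);   /* history-based predictor   */
    extern void serializer_enter(void);             /* queue, lock, migration... */
    extern void serializer_exit(void);
    extern void TxStart(void);
    extern void TxCommit(void);

    /* Generic PCM control flow: a transaction predicted to abort is
       serialized before it starts, preventing the conflict entirely. */
    void run_transaction(int static_tx, void (*body)(void)) {
        bool serialized = predict_abort_for(static_tx);
        if (serialized) serializer_enter();
        TxStart();
        body();
        TxCommit();
        if (serialized) serializer_exit();
    }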
2.6 Related Work

2.6.1 Conflict Detection

There are many applications of hardware signatures in computer design, such as cache hit/miss prediction [67], memory address disambiguation [82], and coherence directory design [99]. In this section, we focus on previous work that uses signatures to detect conflicts among concurrent transactions.

Signatures for conflict detection in TM systems were originally proposed by Bulk [18]: Ceze et al. [18] proposed performing conflict detection with signatures for thread-level speculation and TM systems. BulkSC [19] uses Bulk's disambiguation module to enforce a sequential consistency memory model. SigTM [58] implements hardware signatures to accelerate the performance of STM systems. LogTM-SE [95] adopts Bulk's hardware signatures to the LogTM system. These primary proposals focus on adapting signatures to each TM system and use simple bit-selection or bit-permutation hash functions.

Recently, several signature schemes have been proposed to design area-efficient signatures or to improve TM performance [70], [80], [96], [97]. Sanchez et al. [80] suggest a parallel Bloom filter, implementing a signature with multiple single-ported SRAMs instead of a multi-ported single SRAM for hardware efficiency. Also, noticing the importance of hash functions, they advocate using high-quality hash functions such as H3 [15] to distribute signature indices more uniformly. However, H3 requires many XOR gates to calculate a signature index and thus introduces greater area and power overheads.

[70], [96], and [97] pay attention to memory access locality. Yen et al. [96] propose Page-Block XOR (PBX) signatures with an XOR-based hash function. They observed that the randomness of higher-order address bits (page bits) is smaller than that of lower-order address bits (block bits) because of locality. To increase the randomness, they permute the page bits and block bits separately and calculate signature indices by XORing page bits with block bits. With a PBX signature, their TM system harvests performance similar to that of a TM system with H3 signatures while consuming less hardware.
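A sketch in the spirit of PBX hashing follows: an n-bit signature index formed by XORing a block-bit field with a page-bit field. The exact bit-fields and permutations used in [96] differ, so treat this only as the shape of the idea; it assumes 64B cache blocks, 4KB pages, and small n.

    #include <stdint.h>

    /* Illustrative PBX-style index: XOR an n-bit field of block bits
       with an n-bit field of page bits (n < 32 assumed). */
    uint32_t pbx_index(uint64_t addr, unsigned n) {
        uint64_t blk        = addr >> 6;                    /* drop 64B block offset */
        uint32_t block_bits = (uint32_t)(blk & ((1u << n) - 1));
        uint32_t page_bits  = (uint32_t)((blk >> 6) & ((1u << n) - 1)); /* bits at/above the 4KB page boundary */
        return block_bits ^ page_bits;                      /* n-bit signature index */
    }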
Quislant et al. [70] also focus on locality, aiming to reduce false positives due to occupancy. Occupancy false positives increase as more addresses are inserted and more signature bits are set. Their modified H3 hash function produces signature indices such that memory addresses with spatial locality share some signature bits. As a result, fewer signature bits are set than with the original H3, and false positives can be reduced.

Each application, and even each transaction within a program, possesses its own locality properties. The dynamic signature scheme in [97] prepares a group of signatures with different ranges of address bit-fields, and the most appropriate signature for each transaction is selected based on run-time information. However, because dynamic signatures require implementing multiple signatures, a high hardware overhead is incurred.

2.6.2 Contention Management

After detecting a conflict, a TM system resolves the conflict with RCM. RCM can have a profound impact on TM performance, and many design alternatives for RCM have been studied [36], [37], [83], [86]. However, no single policy has been acknowledged as universally best for all applications. Shriraman et al. [86] propose a mixed conflict resolution scheme, resolving write-write conflicts eagerly and read-write conflicts lazily.

If aborts between transactions are highly predictable, sequential execution of those transactions can be more efficient than concurrent execution. PCM prevents conflicts by throttling the number of concurrently-running transactions. Sequential execution of transactions can be achieved with transaction migration, which processes the contended transactions sequentially on the same core [4], [33]; thread switching, which executes non-conflicting transactions [9], [10], [53]; and traditional locking, which throttles concurrency with a global lock [34], [53], [75].

CAR-STM [33] uses per-core transaction queues; predicted-aborted transactions are moved to the transaction queue of the contending (enemy) transaction so that their execution is serialized. Steal-on-abort [4] predicts that an aborted transaction will be aborted again; the predicted-aborted transaction is queued behind the enemy transaction. Shrink [34] predicts the abort of a starting transaction based on the data-sets of previous transactions; if its data-set appears to overlap with the write-set of already-running transactions, it is serialized via a global lock. Maldonado et al. [53] develop several scheduling mechanisms, from locking to thread switching, and provide kernel-level scheduling support.

Most recently, HTM research efforts have begun considering PCM approaches [9], [10], [26], [75], [98]. TxLinux [75] implements transactional locks for processing I/O operations inside a transaction and for scheduling highly-conflicting transactions. In Adaptive Transaction Scheduling (ATS) [98], each thread estimates its abort rate based on the commits and aborts it has experienced; if the abort rate is higher than a predefined threshold, the starting transaction is serialized via a centralized queue. Conflict Avoidance Scheduling (CAS) [26] predicts an abort based on the abort history between threads; if a pair of threads has conflicted severely in the past, the transactions they execute are serialized. Finally, Blake et al. [9], [10] propose thread switching-based PCM with an over-subscribed system and a Bloom filter-based abort predictor [10].

Chapter 3 Evaluation Methodology

This chapter describes the simulation environment and the benchmarks used to evaluate the performance of TM systems in this research. It also explains the method used to estimate the hardware overhead of the proposed designs.

3.1 Performance Evaluation

3.1.1 Simulation Environment

We have implemented the proposed conflict detection and contention management mechanisms on the LogTM-SE system [95] provided by the GEMS toolset [55], which is driven by the Virtutech Simics simulator [52]. Table 3-1 summarizes the main system parameters modeled by this simulation environment. We model a 16-core Chip Multi-Processor (CMP) system. Each core has 32KB, 4-way, 64B-block private level-1 instruction and data caches. A 16-bank, 16MB, 16-way, 64B-block unified level-2 cache is shared by all cores. An on-chip directory holds a bit-vector of sharers and implements the MESI cache coherence protocol. The main memory size is 4GB, and a packet-switched on-chip network connects the cores and cache banks.

Table 3-1 System Configuration

Unit          Value
Core          In-order, single-issue, 16 cores
L1 Cache      Split, 32KB, 4-way, 64B block, 1-cycle hit latency
L2 Cache      Unified, 16MB, 16-way, 64B block, 34-cycle hit latency
Directory     Full-bit sharer list, 6-cycle latency
Memory        4GB, 200-cycle latency
Interconnect  2D mesh, 64B links, 3-cycle hop latency

Same-sized separate read-/write-signatures with a PBX hash function [96] are used for the baseline TM system, against which our adaptive grain signatures are evaluated in Chapter 4. The signature sizes range from 64 bits to 32K bits. Both the baseline and the proposed signature designs use four hash functions. We also show results for a hypothetical perfect signature. For evaluating unified signatures in Chapter 5, a locality-sensitive H3 hash function [70] is used. The presented data were obtained from multiple simulation runs to reduce the impact of simulation variability in multi-threaded benchmarks [2].

3.1.2 Benchmark Programs

We use the STAMP benchmarks [59] for performance evaluation. We adjusted the STAMP benchmarks to run correctly on the GEMS/Simics simulator, but we minimized the modifications so that the original characteristics of each benchmark were preserved. Tables 3-2 and 3-3 show the characteristics of each benchmark. All the characteristics presented in these tables were gathered on a TM system using a perfect signature.

Table 3-2 Benchmark Characteristics

Benchmark   Input                                    # static txs
bayes       -v32 -r1024 -n2 -p20 -i2 -e2             15
genome      -g512 -s32 -n32768                       5
intruder    -a10 -l4 -n2048 -s1                      3
kmeans      -m15 -n15 -t0.05 -i random-n2048-d16-c16 3
labyrinth   -i random-x32-y32-z3-n32                 3
ssca2       -s13 -i1.0 -u1.0 -l3 -p3                 3
vacation    -n4 -q60 -u90 -r16384 -t4096             3
yada        -a20 -i 633.2                            6
In Table 3-2, the second column gives the input set, and the third column presents the number of static transactions exposed in the source code of each benchmark. Table 3-3 presents the dynamic characteristics of each benchmark. Transaction time (TX Time) in the second column shows the percentage of time each benchmark executes inside transactions. The third column presents the percentage of aborts among the total instances of transactions. The last two columns give the average read-set size and write-set size, measured as a number of cache blocks.

Table 3-3 Transaction Characteristics
Benchmarks | TX Time (%) | Abort Rate (%) | Read-Set Size (max) | Write-Set Size (max)
bayes | 61.30% | 70.01% | 101.40 (2,220.20) | 56.31 (1,622.80)
genome | 72.60% | 1.03% | 25.46 (278.13) | 3.58 (61.00)
intruder | 87.81% | 69.37% | 8.22 (41.13) | 3.47 (23.20)
kmeans | 30.58% | 4.18% | 4.74 (6.00) | 1.75 (2.00)
labyrinth | 91.65% | 90.53% | 120.63 (435.33) | 77.91 (253.00)
ssca2 | 11.23% | 0.30% | 3.00 (3.00) | 2.00 (2.00)
vacation | 87.63% | 0.08% | 65.88 (113.00) | 10.23 (23.00)
yada | 98.08% | 31.48% | 48.98 (588.73) | 28.66 (373.27)

3.2 Hardware Estimation

The hash functions of adaptive grain signatures and PBX signatures were developed in Verilog HDL and synthesized using Synopsys Design Compiler, targeting a TSMC 65nm technology. CACTI 5.3 [90] was used to evaluate the bit arrays. The bit arrays were implemented as SRAMs with one read-port and one write-port, targeting 65nm technology. Because of limitations of CACTI, we evaluated signatures larger than or equal to 2K bits.

In Chapter 5, we use TSMC 45nm technology and an ARM memory compiler [6] to estimate the hardware overhead required by the hash functions and bit arrays. The bit arrays are implemented as SRAMs, targeting IBM 90nm technology and scaling to the 45nm technology. The smallest SRAM size is 256 bits due to limitations of the memory compiler. As a result, we evaluate signatures larger than or equal to a total of 2K bits (i.e., eight 256-bit SRAMs for the separate signatures). Table 3-4 shows the area, delay, and power requirements for signature memories implemented with this scheme. The 128-bit array for the helper signature is implemented as a register using Verilog HDL.

Table 3-4 Hardware Overhead for Implementing Bit Arrays
Array Size (bits) | Area (um²) 1-ported | Area (um²) 2-ported | Time (nsec) 1-ported | Time (nsec) 2-ported | Power (mW) 1-ported | Power (mW) 2-ported
256 | 1,404 | 1,997 | 0.281 | 0.328 | 18.432 | 19.822
512 | 1,636 | 2,460 | 0.290 | 0.356 | 18.576 | 20.834
1K | 2,087 | 3,448 | 0.308 | 0.412 | 18.855 | 22.836
2K | 3,037 | 5,429 | 0.392 | 0.525 | 19.404 | 26.840

Chapter 4 Adaptive Grain Signatures

4.1 Overview

In this chapter, we show that some false positives can be helpful to performance by triggering the early abort of a transaction that would encounter a true conflict later anyway. Based on this observation, we propose an adaptive grain signature to improve performance by dynamically changing the range of address keys based on the abort history. With the use of adaptive grain signatures, we can increase the number of performance-friendly false positives as well as decrease the number of performance-destructive false positives.

4.2 Introduction

Several signature designs have been proposed to improve TM performance, and all of them attempt to reduce the total number of false positives, assuming that false positives are detrimental to performance [70], [80], [96], [97]. However, we observe that false positives are not always harmful to performance.
In Section 4.3, we show that false positives can be categorized into good false positives, i.e., false positives which are actually helpful to TM performance, and bad false positives. A good false positive occurs when the signature detects a conflict erroneously but the running transaction would incur a conflict anyway. This false but early conflict detection can improve performance by stopping the execution of a transaction that will eventually encounter a conflict.

In Section 4.4, we propose a novel signature design, the adaptive grain signature, to improve performance by increasing the number of good false positives and decreasing the number of bad false positives. This adaptive grain signature scheme maintains a history of transaction aborts and dynamically changes the range of the address bit-field used for calculating signature indices based on the abort history.

Finally, we evaluate our design using the Wisconsin GEMS simulator [55] with the STAMP benchmarks [59] in Section 4.5. The proposed signatures are also implemented in Verilog HDL to estimate the hardware overhead of the proposed design, targeting TSMC 65nm technology. Section 4.5 contains the synthesis results of adaptive grain signatures and baseline signatures.

4.3 Motivation

Positives from hardware signatures are divided into true positives and false positives. We categorize false positives into bad positives and good positives. A false positive that is destructive to TM performance is called a bad positive. The case when concurrent transactions never conflict but a signature falsely declares a conflict is classified as a bad positive. However, false positives are not always harmful to performance. A good, or early, positive occurs when a signature declares a false positive in advance and the running transaction eventually incurs a true positive. Even though the signature detects a conflict falsely, this early conflict detection can improve performance by stopping the execution of an ultimately conflicting transaction. In an eager version management TM system, transaction aborts are expensive because the aborted transactions have to be recovered by writing old data back to their appropriate locations in the memory hierarchy [60]. By declaring conflicts early, we can reduce the amount of data that must be restored upon abort. Early declaration of conflicts can also resolve some deadlock situations.

Figure 4-1 Positive Distribution with Different Granularities
(One panel per benchmark: bayes, genome, kmeans, ssca2, vacation, yada. Each panel plots the percentage of negative, bad, good, and ugly positives, from 0% to 100%, for granularities Perfect and g = 1, 2, 3, 4, 8.)

Figure 4-1 shows the distribution of positives for the STAMP benchmarks [59] with several granularities. To measure the number of positives at each granularity, we run the benchmarks with hypothetical perfect signatures. At the same time, an assistant signature works concurrently to determine the class of each positive. Assistant signatures are also a kind of perfect signature, but with coarser granularities.
Granularity g means how many bits are truncated when constructing the address key for the assistant signature. For example, an assistant signature with g = 4 takes a block address with its least significant 4 bits truncated as its address key. Each positive is classified as: (i) a good positive if the perfect signature declares a positive and the assistant signature already declared a positive for the running transaction, (ii) a bad positive if the perfect signature does not declare any positives until completion but the assistant signature declared a positive, and (iii) an ugly positive if the perfect signature and the assistant signature declare a positive at the same time. We change the granularity of each assistant signature to show the relationship between good positives and address granularities. As shown in Figure 4-1, a large number of good positives can be harvested, especially from bayes, genome, vacation, and yada. As the granularity increases, some negatives become bad positives, and some ugly positives turn into good positives. The good positives in Figure 4-1 include only the good false positives due to aliasing. In reality, even occupancy false positives can be classified as good positives if a true positive would result later anyway.

Figure 4-2 Conflict Scenario in Eager Systems

The main reason that the notion of good positives exists is related to memory locality. Figure 4-2 shows the possible conflict scenarios in eager version management systems. In this figure, we assume that transactions TX1 and TX2 are older than TX3. Figure 4-2 (a) shows the occurrence of conflicts without false positives. After NACKing the request from TX2, TX3 keeps executing until it receives a NACK from the older transaction, TX1. Because a deadlock situation occurs (i.e., the younger transaction receives a NACK from an older transaction, but the younger one already sent a NACK to the other older one), TX3 should be aborted. In Figure 4-2 (b), a false but early positive makes TX3 stall instead. Because of this good positive, the execution of TX2 can proceed and the deadlock situation can be resolved. Even though TX3 should be aborted anyway, stopping the execution of TX3 early can speed up the abort process by reducing the amount of old data logged by TX3. If conflicting transactions exhibit spatial locality among their data accesses (memory accesses b and i in Figure 4-2), we can harvest good positives with coarser-grain signatures. In Section 4.4, we propose adaptive grain signatures to increase the number of good positives and decrease the number of bad positives. The relationship between good positives and TM performance can be found in Section 4.5.

4.4 Proposed Design

Figure 4-3 Adaptive Grain Signatures (512-bit Signature)

To improve TM performance by exploiting good and bad positives, we propose an Adaptive Grain Signature (AGSig) mechanism. AGSig adds a multiplexer, a small register (the SEL register), and an Abort History Table (AHT) to ordinary signatures. The AHT contains the starting PC (i.e., the address of the TxStart instruction) of each transaction which has aborted others. Based on the AHT information, the range of the address bit-field used as the address key is decided for each transaction, and the multiplexer selects the resulting range. During the execution of each transaction, the same range of the address field is used for every insert or test operation. Figure 4-3 (a) illustrates a simple version of the adaptive grain signature (AGSig-S).
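As a rough illustration of the datapath in Figure 4-3 (a), the following sketch models AGSig-S in software. It is a minimal model under stated assumptions rather than the hardware design: the hash functions are arbitrary stand-ins for the XOR-based and bit-selection hashes, the coarse granularity is fixed at g = 4, and all names are illustrative.

#include <array>
#include <bitset>
#include <cstdint>

// Software model of AGSig-S (Figure 4-3 (a)): a 512-bit parallel Bloom
// signature (four 128-bit arrays) whose address key is either the full
// block address (fine grain) or the block address with its low-order bits
// dropped (coarse grain), selected once per transaction by an AHT lookup.
class AGSigS {
 public:
  static constexpr unsigned kCoarseGrain = 4;  // assumed granularity g

  // At TxStart: latch SEL <- 1 on an AHT hit (coarse grain), else 0.
  void beginTransaction(uint64_t start_pc) {
    sel_ = ahtContains(start_pc);
    for (auto& f : filters_) f.reset();
  }
  // Record the starting PC of a transaction that aborted another
  // (four-entry FIFO, as in Section 4.4.2).
  void recordKiller(uint64_t start_pc) {
    if (!ahtContains(start_pc)) aht_[fifo_head_++ % aht_.size()] = start_pc;
  }
  void insert(uint64_t block_addr) {
    const uint64_t key = addressKey(block_addr);
    for (unsigned h = 0; h < 4; ++h) filters_[h].set(hash(key, h));
  }
  bool test(uint64_t block_addr) const {
    const uint64_t key = addressKey(block_addr);
    for (unsigned h = 0; h < 4; ++h)
      if (!filters_[h].test(hash(key, h))) return false;  // all must hit
    return true;
  }

 private:
  // The multiplexer: SEL picks the fine- or coarse-grain address bit-field;
  // the same key range is used for every insert/test of the transaction.
  uint64_t addressKey(uint64_t block_addr) const {
    return sel_ ? (block_addr >> kCoarseGrain) : block_addr;
  }
  // Stand-ins for three XOR-based hashes and one bit-selection hash.
  static size_t hash(uint64_t key, unsigned h) {
    if (h == 3) return key % 128;                 // bit selection
    return ((key >> (7 + 3 * h)) ^ key) % 128;    // XOR-based
  }
  bool ahtContains(uint64_t pc) const {
    for (uint64_t e : aht_)
      if (e != 0 && e == pc) return true;  // empty entries hold 0
    return false;
  }
  std::array<std::bitset<128>, 4> filters_;  // 4 x 128 bits = 512 bits
  std::array<uint64_t, 4> aht_{};            // starting PCs of killers
  unsigned fifo_head_ = 0;
  bool sel_ = false;                         // the SEL register
};

AGSig-C would differ only in replacing the single-bit AHT hit with a per-entry saturating counter that drives a wider multiplexer select.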
The AHT is accessed once per transaction, at the beginning of the transaction. The AHT is checked with the starting PC of that transaction to decide whether the PC is in the AHT. Initially, the AHT is empty (an AHT miss, and the output is zero), and all transactions start with fine-grain signatures (i.e., the least significant bit-field of the block address is used as the address key). If a transaction aborts other transactions, its starting PC is recorded in the AHT. When the processor meets that transaction again (an AHT hit, and the output is one), the output of the AHT is saved in the SEL register, and that value is used as the select signal for the multiplexer (i.e., the coarse-grain address field is selected) during the whole lifetime of that transaction.

Figure 4-3 (b) illustrates a more complex adaptive grain signature (AGSig-C) scheme. AGSig-C is similar to AGSig-S except that each AHT entry has a saturating counter. Initially, every counter value is zero. If a transaction aborts others, the starting PC of that transaction is recorded in an AHT entry, and its counter value is set to one. Whenever that transaction conflicts again, the corresponding counter is incremented. This counter value is used as the select signal of the multiplexer.

4.4.1 Hash Functions

The quality of the hash function directly influences the performance of a signature. Even though any kind of hash function can be adopted for AGSig, the hardware overhead must be considered. The proposed signature is implemented as a parallel Bloom filter with four hash functions. We implement an XOR-based hash function, similar to the PBX signature [96], for the first three hash functions to increase randomness. Because the randomness of lower-order address bits is sufficient [96], we use a simple bit-selection hash function for the last one to reduce hardware cost.

4.4.2 Abort History Table

The AHT contains the starting PCs of transactions that have aborted others. We can decide whether the running transaction should be recorded in the AHT by two means: (i) When a transaction aborts, it informs the killer transaction about its abort by sending an abort message. The receiver of this message records the starting PC of its running transaction in its AHT. (ii) Each transaction maintains a NACK counter and increments the counter whenever it receives a NACK from a younger transaction. When the NACK counter value exceeds a threshold, the transaction assumes that a deadlock situation might occur and records its starting PC in its AHT. We use abort messages to implement our system, which is similar to direct messaging [73].

The number of entries in the AHT is also an important design parameter that affects both TM performance and hardware overhead. Even though the number of dynamic aborts can be very large, the number of AHT entries can be small because the AHT tracks the PCs of static transactions (see the third column in Table 3-2). We use a four-entry FIFO queue as our AHT.

4.4.3 Adaptive Grain Signatures with Transactional Memory Systems

After aborting another transaction, the starting PC of the killer transaction is recorded in the AHT. AGSig increases its granularity when the killer transaction is encountered again. With AGSig, we can improve TM performance in two ways. First, by changing the granularity to a coarser level (a coarser bit-field), the chance of detecting good positives is increased due to locality.
By stalling the finally-aborted transactions early, we can reduce the overhead of saving old data into an undo log and restoring old data from the undo log. Second, by using different address bit-fields to access signatures after aborting other transactions, false conflicts upon even a first encounter (bad positives) can be removed. Therefore, the adaptive grain signature design increases the number of good positives and decreases the number of bad positives.

The LogTM system supports unbounded nesting by saving the current transaction's signature into an undo log [95]. When a nested transaction (child) begins, the signature of the current transaction (parent) is copied into the header of the undo log. To detect conflicts between incoming requests and the parent transaction, a child transaction keeps updating the hardware signature without clearing it. When a child transaction commits, the parent's signature is retrieved from the undo log into the hardware signature (open nesting), or the hardware signature is continuously updated to detect the child transaction's conflicts (closed nesting). If the granularities of parent and child transactions are different, a false negative could occur. By making nested transactions inherit the outermost transaction's multiplexer select signal, we can support transactional nesting and guarantee no false negatives.

AGSig can support context switching with a small modification to the LogTM system. To support context switching, LogTM copies the signature of the de-scheduled transaction into the undo log [95]. For AGSig, the value stored in the SEL register of the de-scheduled transaction is also saved in the undo log. However, the information in the AHT is cleared upon a context switch. The purpose of the AHT is only to improve performance, so removing the information in the AHT is harmless to correctness. By retrieving the signature as well as the select signal of the multiplexer, we can correctly detect conflicts after the de-scheduled transaction is re-loaded.

4.5 Experimental Results

4.5.1 Positive Distribution

In this section, we present the relationship between good positives and AGSig. Unfortunately, it is impossible to measure directly how many good positives are harvested with AGSig, because a requester stalls its execution until the conflict is resolved when it receives a NACK. So, we cannot determine the category of each positive at run-time. Instead, similar to Section 4.3, we indirectly measure how many good positives can be harvested with AGSig. We run the benchmarks on a TM system using a perfect signature. Concurrently, AGSig works as an assistant signature to determine the class of each positive.

Figure 4-4 gives the distribution of positives for AGSig-C with several signature sizes. Similar to Figure 4-1, bayes, genome, vacation, and yada exhibit a high ratio of good positives. These benchmarks have relatively large data-sets (see the fourth and fifth columns in Table 3-3) and are locality-intensive [70]. Based on Figure 4-4, we can infer that AGSig can effectively utilize memory locality to improve performance. As the signature size increases, the number of false positives decreases. So, the portion of true (ugly) positives increases and the portion of good or bad positives decreases. AGSig-S shows a similar positive distribution.
Figure 4-4 Positive Distribution with Complex Adaptive Grain Signatures
(One panel per benchmark: bayes, genome, kmeans, ssca2, vacation, yada. Each panel plots the percentage of negative, bad, good, and ugly positives across signature sizes.)

4.5.2 Hit Rates of Abort History Table

Table 4-1 Hit Rate of Abort History Table
Sig Size (bits) | bayes | genome | kmeans | ssca2 | vacation | yada
64 | 18.51% | 93.03% | 76.15% | 82.57% | 99.91% | 89.36%
128 | 32.61% | 90.12% | 77.14% | 83.87% | 99.69% | 88.06%
256 | 43.80% | 84.53% | 77.14% | 83.15% | 99.12% | 86.42%
512 | 44.39% | 83.17% | 75.19% | 82.98% | 98.03% | 80.68%
1K | 44.64% | 72.87% | 76.61% | 84.15% | 96.98% | 71.32%
2K | 37.01% | 72.38% | 77.48% | 82.64% | 97.53% | 67.25%
4K | 24.61% | 72.11% | 74.08% | 83.57% | 97.88% | 66.84%
8K | 23.36% | 71.25% | 76.14% | 84.19% | 97.69% | 67.08%
16K | 16.09% | 72.36% | 75.84% | 82.62% | 97.88% | 67.65%
32K | 16.58% | 71.86% | 77.91% | 85.84% | 97.69% | 67.64%

We have observed that the number of static transactions which experience aborts is relatively small (see the third column in Table 3-2). So, our AHT can efficiently maintain abort information even with its small number of entries. Table 4-1 gives the hit rate of the AHT with AGSig-C. The AHT hit rate of each benchmark is reasonably high except for bayes. Because bayes has a large number of static aborts compared to the number of AHT entries, our small AHT cannot effectively contain all the abort information for bayes. The hit rates of kmeans and ssca2 show little variation across different signature sizes, similar to their performance, which is insensitive to the signature size (details in Section 4.5.3). The hit rates of the other benchmarks (genome, vacation, and yada) decrease as the signature size increases. Because aborts occur more frequently with smaller signatures and the aborted transactions access the AHT again, the AHT hit ratio increases with smaller signatures. On the other hand, a larger signature decreases the number of aborts, and this small number of aborts introduces a relatively larger number of AHT misses, mostly cold misses. The AHT hit rate with AGSig-S shows little difference from that with AGSig-C.

4.5.3 Performance

Figure 4-5 shows the execution time of TM systems with the PBX signature and AGSig, normalized to that of the perfect signature. For a more detailed comparison of PBX and AGSig, Figure 4-6 shows the speedup of AGSig over PBX. In almost all cases, AGSig-C produces better performance than AGSig-S or PBX.

Some benchmarks show similar performance behavior. The performance of kmeans and ssca2 is not largely dependent on the signature design or even on the signature size. The fraction of time spent inside transactions for these benchmarks is very small (see the second column in Table 3-3). Also, the number of conflicts is relatively small (see the third column in Table 3-3). Because of these characteristics, kmeans and ssca2 present little variation with signature size. As shown in Figure 4-1, kmeans and ssca2 produce a small number of good positives. Therefore, our AGSig scheme hardly harvests more performance compared with the PBX scheme. Finally, ssca2 contains many global barriers. The execution of transactions enveloped in global barriers is serialized, and the influence of conflicts is therefore neutralized.
Figure 4-5 Execution Time Normalized to Perfect Signature
(One panel per benchmark; x-axis: signature size from 64 bits to 32K bits; y-axis: execution time of PBX, AGSig-S, and AGSig-C normalized to the perfect signature.)

Figure 4-6 Speedup of Adaptive Grain Signature over PBX Signature
(One panel per benchmark; x-axis: signature size from 64 bits to 32K bits; y-axis: speedup of AGSig-S and AGSig-C over PBX.)

The bayes, genome, vacation, and yada benchmarks show dramatic changes as the signature size increases, and they are also sensitive to the kind of signature. They spend large amounts of time in transactions and experience many conflicts and aborts. These characteristics make them sensitive to the quality of signatures. Figure 4-1 shows that a large number of good positives exists in these benchmarks. AGSig-C can effectively exploit good positives and frequently outperforms PBX signatures. Compared with the other benchmarks in this group, genome is less dependent on the signature design, but AGSig-C still shows better performance than the other designs.

4.5.4 Hardware Implementation

Table 4-2 Synthesis Results of Adaptive Grain Signatures and PBX Signatures

Delay (nsec):
Sig Size (bits) | Bit Arrays | PBX Hashing | PBX Total | AGSig-S Hashing | AGSig-S Total | AGSig-C Hashing | AGSig-C Total
2K | 0.473 | 0.040 | 0.513 | 0.100 | 0.573 | 0.140 | 0.613
4K | 0.479 | 0.040 | 0.519 | 0.100 | 0.579 | 0.150 | 0.629
8K | 0.521 | 0.040 | 0.561 | 0.070 | 0.591 | 0.160 | 0.681
16K | 0.538 | 0.040 | 0.578 | 0.070 | 0.608 | 0.160 | 0.698
32K | 0.559 | 0.040 | 0.599 | 0.080 | 0.639 | 0.170 | 0.729

Power (mW):
Sig Size (bits) | Bit Arrays | PBX Hashing | PBX Total | AGSig-S Hashing | AGSig-S Total | AGSig-C Hashing | AGSig-C Total
2K | 83.542 | 0.007 | 83.549 | 0.575 | 84.117 | 0.648 | 84.190
4K | 86.727 | 0.008 | 86.735 | 0.574 | 87.301 | 0.654 | 87.381
8K | 106.357 | 0.009 | 106.366 | 0.575 | 106.932 | 0.654 | 107.012
16K | 353.389 | 0.009 | 353.398 | 0.576 | 353.965 | 0.651 | 354.040
32K | 356.964 | 0.010 | 356.974 | 0.577 | 357.541 | 0.654 | 357.618

Area (um²):
Sig Size (bits) | Bit Arrays | PBX Hashing | PBX Total | AGSig-S Hashing | AGSig-S Total | AGSig-C Hashing | AGSig-C Total
2K | 12,153.845 | 67.680 | 12,221.525 | 1,738.080 | 13,891.925 | 1,924.560 | 14,078.405
4K | 25,156.105 | 73.440 | 25,229.545 | 1,745.280 | 26,901.385 | 1,932.840 | 27,088.945
8K | 41,375.186 | 80.280 | 41,455.466 | 1,747.440 | 43,122.626 | 1,940.040 | 43,315.226
16K | 73,970.924 | 86.400 | 74,057.324 | 1,753.560 | 75,724.484 | 1,951.560 | 75,922.484
32K | 124,460.846 | 92.160 | 124,553.006 | 1,759.680 | 126,220.526 | 1,960.200 | 126,421.046

Table 4-2 shows the delay, power consumption, and area requirements for AGSig and PBX signatures.
Due to its additional hardware, such as the AHT, AGSig exhibits a larger hardware overhead than the PBX signature in all cases. However, as the signature size increases, the difference between AGSig and PBX becomes negligible because the bit array, not the hash function, is the main hardware overhead. The area overheads of signatures should be considered in the context of the whole processor area. We compare signature areas relative to the core area of the Sun Rock [22], which is fabricated in Texas Instruments 65nm technology. The Rock's core area is 14mm², and it supports four hardware threads. We assume that each thread has read- and write-signatures. Table 4-3 shows the area overheads of the signatures. In the context of the processor area, the difference in area requirements between AGSig and PBX is largely insignificant.

Table 4-3 Area Overhead of Adaptive Grain Signatures and PBX Signatures over Sun Rock Processor Core
Sig Size (bits) | PBX Overhead | AGSig-S Overhead | AGSig-C Overhead
2K | 0.70% | 0.79% | 0.80%
4K | 1.44% | 1.54% | 1.55%
8K | 2.37% | 2.46% | 2.48%
16K | 4.23% | 4.33% | 4.34%
32K | 7.12% | 7.21% | 7.22%

4.6 Related Work

All the previous signature designs reviewed in Section 2.6.1 attempt to improve TM performance by reducing the total number of false positives, assuming that false positives are destructive to performance [70], [80], [96], [97]. On the contrary, adaptive grain signatures distinguish good false positives from bad false positives and try to increase the number of good positives as well as reduce the number of bad positives.

The dynamic signature scheme in [97] is the closest one to our proposed adaptive grain signature. Based on the observation that each application, and even each transaction from an application, possesses its own locality properties, the dynamic signature scheme prepares a set of signatures with different ranges of address bit-fields, and the most appropriate signature for each transaction is selected based on run-time information. However, it still tries to reduce the total number of false positives. Also, its design with multiple signatures requires a higher implementation cost compared to adaptive grain signatures.

4.7 Conclusions

Conflict detection is an essential element for maintaining correct concurrency in TM systems. Hardware signatures have been proposed as an area-efficient mechanism for conflict detection in TM systems. One of the problems with a hardware signature scheme is that it can declare false positives, detecting conflicts falsely even when no actual conflict exists. Previous signature designs attempt to reduce the total number of these false positives.

In this chapter, we showed that some false positives in TM systems can be beneficial to performance by prematurely stopping the execution of a transaction which will eventually encounter a conflict. Based on this observation, we presented an Adaptive Grain Signature scheme, AGSig, to improve performance by dynamically changing the range of the address key based on history. With the help of AGSig, we can increase the number of performance-friendly false positives as well as decrease the number of performance-destructive false positives.

Chapter 5 Unified Signatures

5.1 Overview

In this chapter, we propose a simple and effective signature design, the unified signature. Instead of using separate read- and write-signatures, as is often done in TM systems, we implement a single signature to track all read- and write-accesses.
By merging read- and write-signatures, a unified signature can effectively enlarge the signature coverage without additional overhead. Within the constraints of a given hardware budget, a TM system with a unified signature outperforms a baseline system with same-sized traditional signatures by reducing the number of falsely detected conflicts. A TM system with a 2K-bit unified signature with a helper signature achieves a speedup of 15% over the baseline TM system with 33% less area and 49% less power.

5.2 Introduction

The TM-supporting hardware in an HTM system (e.g., hardware signatures) is hardly helpful to the performance of single-threaded programs or parallel programs written with other programming techniques such as message passing [69]. Hence, it is desirable to maximize TM performance within a given hardware budget. Several signature designs have been proposed to improve TM performance with well-tuned hash functions [21], [70], [80], [96]. All these designs consist of dedicated read- and write-signatures, similar to Figure 2-5. However, the disparity between read-set size and write-set size stemming from application characteristics often introduces asymmetric occupancy between the two signature types (details in Section 5.3). This imbalance results in suboptimal utilization of a given hardware resource, causes more false positives, and therefore degrades TM performance.

To maximize TM performance with a limited hardware budget, we propose a unified signature, tracking both read- and write-accesses with a single signature instead of separate read- and write-signatures (details in Section 5.4). By merging read- and write-signatures, a unified signature can effectively enlarge the signature size without additional overhead and improve TM performance by reducing the number of false positives. A TM system with a unified signature, whose size is the sum of the separate signatures, can outperform the baseline system that uses separate signatures.

5.3 Motivation

Signatures efficiently resolve the problems with cache tag augmentation [95], but introduce another problem: false positives. Declaring a conflict when none exists, a false positive causes a transaction to unnecessarily abort or stall, frequently degrading performance. Several signature designs for TM systems have been proposed to reduce the number of false positives [21], [70], [80], [96], and all the proposed signatures use same-sized separate read- and write-signatures. Table 5-1 presents the data-set characteristics for the STAMP benchmarks on a 16-core TM system using locality-sensitive H3 signatures [70]. We use a total of 2K bits of signature (i.e., a 1K-bit read-signature and a 1K-bit write-signature). The second and third columns in Table 5-1 show the average read-set size and write-set size, measured as a number of cache blocks. The fourth and fifth columns give the average number of signature bits set when transactions finish, and the last two columns give the false positive rates of each signature.
Table 5-1 Data-Set Characteristics (Locality-Sensitive H3 Signatures with Total 2K bits)
Benchmarks | R-Set (max) [#blocks] | W-Set (max) [#blocks] | R-Sig (max) [#bits set] | W-Sig (max) [#bits set] | R-Sig FP Rate (%) | W-Sig FP Rate (%)
bayes | 96.95 (2224.67) | 53.10 (1619.33) | 135.93 (981.93) | 44.94 (797.27) | 65.33% | 84.39%
genome | 25.49 (264.73) | 3.59 (61.00) | 74.46 (544.53) | 11.99 (160.00) | 72.80% | 0.09%
intruder | 11.01 (69.33) | 3.12 (27.27) | 22.79 (201.07) | 4.51 (91.07) | 0.34% | 0.06%
kmeans | 6.23 (8.00) | 1.75 (2.00) | 21.18 (32.00) | 5.26 (8.00) | 0.51% | 0.00%
labyrinth | 38.49 (127.00) | 24.04 (83.00) | 147.24 (246.47) | 89.56 (163.40) | 1.71% | 89.11%
ssca2 | 3.00 (3.00) | 2.00 (2.00) | 11.96 (12.00) | 7.95 (8.00) | 0.00% | 0.00%
vacation | 65.88 (150.13) | 18.49 (71.67) | 137.41 (353.80) | 40.12 (165.53) | 20.63% | 0.56%
yada | 49.14 (574.60) | 28.57 (361.13) | 184.20 (739.87) | 94.42 (560.60) | 88.27% | 26.10%

As shown in Table 5-1, if more local accesses are inserted into the signature (i.e., if the data-set becomes larger), more signature bits are set. If every signature bit corresponding to a tested address has already been set by other, previously-inserted addresses, the signature declares a false positive even though that tested address was never inserted. Therefore, for a fixed signature size, the probability of false positives increases as more signature bits are set. We can reduce the number of false positives by enlarging signatures, as shown in Section 2.4.2. However, large signatures require additional hardware, which is not desirable if a limited area or power budget is targeted for implementing TM support.

Also as shown in Table 5-1, the characteristics of the read-set and write-set are clearly different. In most cases, the read-set size is bigger than the write-set size. The imbalance of the data-set sizes induces asymmetric occupancies between the signature types, and usually introduces more false positives for read-signatures. Thus, if same-sized separate read- and write-signatures are used, the signature space is not used efficiently. Even if asymmetric signatures (e.g., a large read-signature and a small write-signature) are used, the small write-signature can become a performance bottleneck for benchmarks which have large write-sets. As long as separate read-/write-signatures are implemented with a fixed hardware resource, it is difficult to satisfy all benchmarks, each of which has its own unique data-set characteristics.

5.4 Proposed Design

In this section, we present the proposed unified signature designs. Section 5.4.1 describes the blind unified signature, which maximizes TM performance within a limited hardware budget by effectively enlarging the signature size, and investigates the performance effect of the read-read dependencies resulting from blind unified signatures. Section 5.4.2 describes the unified signature with helper signature, which removes the read-read dependencies of the blind unified signature.

5.4.1 Blind Unified Signatures

Figure 5-1 Blind Unified Signatures

A simple and powerful way to use a given hardware budget for signatures is a blind unified signature (USIG-B). Instead of using separate read- and write-signatures, a single signature maintains all the conflict information. There is no distinction between loads and stores (REQ and REQX). Figure 5-1 shows the insert and test operations of USIG-B. All the memory addresses of local accesses are inserted into the unified signature, and each incoming request is tested against this single signature.
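The USIG-B bookkeeping is small enough to summarize in a short sketch. This is a minimal software model under assumed parameters (a 2K-bit filter, four hash functions), not the hardware design, and the hash shown is an arbitrary stand-in for the locality-sensitive H3 index logic:

#include <bitset>
#include <cstdint>

// Minimal model of a blind unified signature (USIG-B): one Bloom filter
// tracks the union of the read- and write-set, so insert and test do not
// distinguish loads from stores (REQ from REQX).
template <size_t kBits, size_t kHashes>
class BlindUnifiedSignature {
 public:
  // Insert the block address of any local transactional access (load or store).
  void insert(uint64_t block_addr) {
    for (size_t h = 0; h < kHashes; ++h) bits_.set(index(block_addr, h));
  }
  // Test an incoming request (REQ or REQX): a positive is declared only if
  // every hashed bit is set; a Bloom filter never yields false negatives.
  bool test(uint64_t block_addr) const {
    for (size_t h = 0; h < kHashes; ++h)
      if (!bits_.test(index(block_addr, h))) return false;
    return true;
  }
  void clear() { bits_.reset(); }  // on commit or abort

 private:
  // Arbitrary stand-in for the hardware hash; one variant per function h.
  static size_t index(uint64_t a, size_t h) {
    a ^= (a >> (13 + h)) * 0x9E3779B97F4A7C15ull;
    return static_cast<size_t>(a % kBits);
  }
  std::bitset<kBits> bits_;
};

// One 2K-bit USIG-B in place of a 1K-bit read- and a 1K-bit write-signature:
using Usig2K = BlindUnifiedSignature<2048, 4>;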
With these blind insert and test operations, some control logic, such as the multiplexer and gates shown in Figure 2-5, is no longer required. Therefore, the implementation of USIG-B is simpler than a design with dedicated read-/write-signatures. More importantly, the actual size of a unified signature becomes the sum of the separate signatures, which is larger than each separate signature assuming a fixed hardware budget (e.g., a 2K-bit unified signature instead of a 1K-bit read-signature and a 1K-bit write-signature). This increased signature size can reduce false positives by distributing the set signature bits over a larger space. Unfortunately, this scheme can declare another kind of false positive, the read-read dependency. A read-read dependency occurs when a REQ shares signature bits not with store addresses but with load addresses, since there is no distinction between them in a unified scheme. False positives due to read-read dependencies are unavoidable with USIG-B.

Figure 5-2 Example of Read-Read Dependency
(a) Code Example:
L1: sethi %hi(0x2d400), %o5
L2: ld [ %o5 + 0x380 ], %g1
L3: cmp %o0, %g1
L4: bcs,a 0x12cd8 <L6>
L5: mov %g1, %o0
L6: add %o0, 1, %g1
L7: retl
L8: st %g1, [ %o5 + 0x380 ]
(b) Conflict Scenario with Unified Signature: Thread2's REQ for the loaded block is NACKed by Thread1; Thread2 stalls and then retries the ld.
(c) Conflict Scenario with Separate Signatures: both threads load the block (the REQ is ACKed); when Thread1 later issues a REQX for its st, it is NACKed by Thread2 and stalls.

Read-read dependencies, however, rarely result in a harsh performance loss. Numerous shared variables are first read and then written back with a modified value [71], [81]. Figure 5-2 (a) presents one of the transactions from the ssca2 benchmark, compiled with the gcc compiler for the SPARC ISA. The memory address loaded at the beginning of the transaction (the ld at line 2) is written back at the end of the transaction (the st at line 8). Figures 5-2 (b) and 5-2 (c) show possible conflict scenarios when two threads, Thread1 and Thread2, run the transaction in Figure 5-2 (a) concurrently. With USIG-B, Thread1 falsely sends a NACK in response to the REQ from Thread2 due to a read-read dependency (Figure 5-2 (b)). However, even with ordinary read-/write-signatures, the transactions experience a conflict because a read-write dependency appears later (Figure 5-2 (c)): Thread1 tries to write back the shared data, but its REQX is NACKed since that address was already inserted in the read-signature of Thread2. As a result, the conflict will occur anyway, even without the read-read dependency. In this case, the read-read dependency hardly degrades TM performance. Sometimes, this false but early conflict detection can even improve performance by stopping the execution of a transaction that will eventually encounter a true positive [21]. In the LogTM system, transaction aborts are expensive because of log unrolling [60]. Early declaration of conflicts can reduce the amount of data that must be restored upon abort and can also resolve some deadlock situations.

To measure the impact of read-read dependencies, we classify them into constructive dependencies and destructive dependencies. If a transaction detects a true dependency after incurring a read-read dependency, we define this read-read dependency as a constructive read-read dependency. Otherwise, i.e., if a transaction that experienced read-read dependencies never incurs a true dependency until completion, we categorize it as a destructive read-read dependency.
Constructive read-read dependencies can be divided into constructive read-write dependencies and constructive true dependencies. If a transaction with a read-read dependency encounters an associated read-write dependency later, as in the case in Figure 5-2, we call it a constructive read-write dependency. After a transaction encounters a read-read dependency, it can also experience a true dependency which has no relation to the memory address of the read-read dependency. In this case, the read-read dependency also would not hurt TM performance because the transaction will incur a conflict anyway. We call such a case a constructive true dependency.

Figure 5-3 shows the distribution of constructive true dependencies (Constructive True), constructive read-write dependencies (Constructive R-W), and destructive read-read dependencies (Destructive R-R) among the total read-read dependencies for the STAMP benchmarks. The number in parentheses indicates the fraction of total positives that are read-read dependencies. To measure the number of read-read dependencies, we run the benchmarks with a hypothetical perfect signature, which never incurs false positives. At the same time, a 2K-bit USIG-B operates concurrently to determine the category of each read-read dependency. After USIG-B detects a read-read dependency, the dependency is classified as: (i) a destructive read-read dependency if the perfect signature does not declare any positives until completion, (ii) a constructive read-write dependency if the perfect signature detects a true positive whose memory address is the same as an address incurring a read-read dependency, and (iii) a constructive true dependency if the perfect signature detects a true positive whose memory address is different from that of any read-read dependency.

As shown in Figure 5-3, the occurrence of read-read dependencies is fairly low when using the 2K-bit USIG-B. Also, many read-read dependencies are classified as constructive dependencies. In the case of ssca2, all read-read dependencies end as constructive read-write dependencies. On the other hand, 51.35% of the read-read dependencies from genome remain destructive.

Figure 5-3 Distribution of Read-Read Dependencies with Blind Unified Signature (total 2K bits)

Separate read-/write-signatures never incur read-read dependencies but suffer from false positives due to their limited size. The imbalance between the occupancy of read- and write-signatures frequently introduces a higher false positive rate at the read-signature side. As long as we use dedicated signatures, it is impractical to assign a given hardware budget to each signature appropriately because each application, and even each transaction from an application, shows different data-set characteristics. On the contrary, unified signatures incur read-read dependencies but can reduce false positives by enlarging the effective signature size. As long as the gain from reducing false positives is larger than the loss due to read-read dependencies, the unified signature scheme is a good alternative to ordinary separate signatures. We examine the impact of read-read dependencies and the performance improvement with unified signatures in Section 5.5.

5.4.2 Unified Signature with Helper

USIG-B can improve TM performance by effectively enlarging the signature size within a given hardware budget. However, the false positives due to read-read dependencies can prevent USIG-B from achieving greater performance.
A unified signature with helper signature (USIG-H) can glean more performance by filtering out the read-read dependencies of USIG-B. As illustrated in Figure 5-4, the proposed design consists of a main signature and a helper signature. USIG-B is used as the main signature. All load and store addresses are inserted into the main signature, and each incoming REQ or REQX is tested against it. The helper signature is a small, special write-signature. Like an ordinary write-signature, store addresses are inserted into the helper; the write-enable signal for the helper is asserted only if the local access is a store (this is not shown in Figure 5-4 (a)). Differently from ordinary write-signatures, only a REQ is tested against the helper. We design the helper signature to share its hash function with the main signature to reduce the hardware overhead. For example, if we have a 2K-bit main signature with two hash functions and a 128-bit helper signature with one hash function, the least significant 7 bits of the 10-bit signature index from the last hash function of the main signature are used as the signature index for the helper.

Figure 5-4 Unified Signature with Helper Signature
((a) Insert: the block address of a local load or store is hashed into the 2^(n+1)-bit main unified signature; stores are additionally inserted into the 2^m-bit helper signature, which reuses m bits of one (n+1)-bit main-signature index. (b) Test: an incoming REQX is tested against the main signature only; a REQ is tested against both signatures, and a conflict is declared only if both hit.)

Upon a test operation for a REQX, only the main signature is tested for detecting a conflict. When receiving a REQ, the incoming request is tested against both the main and helper signatures, and a conflict is declared only when both signatures hit. If the main signature misses but the helper signature hits, a false positive occurs at the helper side, which is simply ignored. If the main signature hits but the helper misses, a read-read dependency has occurred at the main side, and this positive can be ignored. Regardless of their sizes, the main and helper signatures never incur false negatives because they are Bloom filters. So, we can safely ignore the positive from one signature when the other misses. With the assistance of the helper, the proposed scheme can effectively remove the read-read dependencies of USIG-B and harvest more TM performance at the cost of a small additional overhead. USIG-H re-introduces extra control logic and requires an additional signature, but as shown later, the performance improvement merits this small additional hardware cost.
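The REQ/REQX test rule can likewise be sketched in software. Again, this is an illustrative model rather than the hardware: each filter is collapsed to a single hash for brevity, the index logic is a placeholder, and the helper reuses the low 7 bits of the main index as in the 2K-bit/128-bit example above.

#include <bitset>
#include <cstdint>

// Sketch of USIG-H: a 2K-bit main unified signature backed by a 128-bit
// helper that tracks stores only (one hash per filter for brevity).
struct UnifiedWithHelper {
  std::bitset<2048> main_sig;  // tracks loads and stores (USIG-B role)
  std::bitset<128>  helper;    // tracks stores only

  // Placeholder index logic; the helper reuses the low 7 bits of the
  // main-signature index, mirroring the shared hardware hash.
  static size_t mainIndex(uint64_t a) { return (a ^ (a >> 11)) % 2048; }
  static size_t helperIndex(uint64_t a) { return mainIndex(a) % 128; }

  void insert(uint64_t block_addr, bool is_store) {
    main_sig.set(mainIndex(block_addr));
    if (is_store) helper.set(helperIndex(block_addr));  // write-enable on stores
  }

  // is_reqx is true for an incoming exclusive (write) request.
  bool conflicts(uint64_t block_addr, bool is_reqx) const {
    const bool main_hit = main_sig.test(mainIndex(block_addr));
    if (is_reqx) return main_hit;  // REQX: the main signature alone decides
    // REQ: a main hit without a helper hit can only be a read-read
    // dependency, so it is ignored; a helper hit without a main hit is a
    // helper-side false positive, also ignored. Neither filter can produce
    // a false negative, so dropping these positives is safe.
    return main_hit && helper.test(helperIndex(block_addr));
  }
};

Because the helper only vetoes REQ-side positives, always returning main_hit instead recovers USIG-B behavior, which corresponds to the profiling-driven mode switch discussed in Section 5.5.1.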
5.5 Experimental Results

In this section, we present and discuss the simulation results. Section 5.5.1 evaluates the performance of unified signatures compared to separate signatures. Section 5.5.2 studies the effect of changing the helper signature size on the performance of USIG-H. Section 5.5.3 shows why unified signatures can perform better than separate signatures, presenting the false positive rates of each design. We compare the unified signature designs to asymmetric signatures in Section 5.5.4. Finally, Section 5.5.5 measures the hardware overhead of each signature design in terms of area, delay, and power consumption.

5.5.1 Performance of Unified Signatures

To evaluate the unified signatures, we first present the execution time of the TM system with each signature design, varying the total signature size. Figure 5-5 shows the execution time of TM systems with separate read-/write-signatures (BASE), multiset signatures (USIG-M) [71], blind unified signatures (USIG-B), and unified signatures with helper (USIG-H), normalized to that of the perfect signature (lower is better). Figure 5-6 presents the speedup of unified signatures over the baseline signature for a more detailed comparison (higher is better).

Figure 5-5 Execution Time Normalized to a Perfect Signature
(One panel per benchmark; x-axis: total signature size; y-axis: execution time of BASE, USIG-M, USIG-B, and USIG-H normalized to the perfect signature.)

As the signature size is enlarged, TM performance usually improves because the number of false positives is reduced [11]. However, the performance of kmeans and ssca2 is not largely dependent on the signature design or even on the signature size. As we found in Table 3-3, the fraction of time spent inside transactions for these benchmarks is relatively small, and the number of conflicts is also small. Because of these characteristics of kmeans and ssca2, little variation occurs regardless of the signature configuration.

Figure 5-6 Speedup of Unified Signatures over Separate Read-/Write-Signatures
(One panel per benchmark; x-axis: total signature size; y-axis: speedup of USIG-M, USIG-B, and USIG-H over BASE.)

The bayes, genome, intruder, labyrinth, vacation, and yada benchmarks show dynamic changes as the signature size increases. These benchmarks spend large amounts of time inside transactions and experience many conflicts and aborts. These characteristics make them sensitive to the quality of signatures. As shown in Figure 5-5, the trend of unified signatures for these benchmarks is very similar to that of the baseline signature, except that the graphs for unified signatures are shifted to the left by one size unit. That means TM performance with a unified signature is almost the same as the performance with separate signatures whose total size is twice as large. Based on this observation, we can conclude that unified signatures utilize a given hardware budget more efficiently. As shown in Table 5-1, bayes and yada have relatively large data-sets. In consequence, the performance benefit of unified signatures persists until their size reaches 8K bits. On the contrary, genome, intruder, labyrinth, and vacation have relatively small data-sets, so the difference in execution time between baseline signatures and unified signatures fades away once the signature size is larger than 1K bits.

USIG-B shows impressive performance improvements even with its simple structure. For small to medium sizes, USIG-B usually outperforms baseline signatures. In the case of bayes, the execution time with the baseline signature is flat from 512 bits to 2K bits, so USIG-B hardly improves the performance of bayes from 512 bits to 1K bits because there is no gain from the graph shifting. Similarly, the advantage of unified signatures vanishes in other benchmarks once the performance with separate signatures reaches the performance with the perfect signature. As baseline signatures become large enough to remove almost all false positives, the adverse effect of the read-read dependencies of USIG-B starts dominating TM performance. For example, the performance of a TM system running genome with USIG-B starts degrading after 2K bits.
Since genome contains many destructive read-read dependencies, as shown in Figure 5-3, genome's performance degradation with USIG-B is not negligible. As expected, USIG-H can reap more performance than USIG-B by removing those read-read dependencies. USIG-H nicely fills in the gap between the performance of USIG-B and baseline signatures, for example, for genome and labyrinth. For an application such as bayes or yada, whose data-set size is relatively large, USIG-H harvests more performance than the baseline signature or USIG-B, because USIG-H effectively enlarges the signature size compared to the baseline signature and removes the read-read dependencies of USIG-B. Because the write-set size is usually small (see Table 5-1), a small helper signature efficiently filters out read-read dependencies in cooperation with the main signature.

USIG-H removes not only destructive read-read dependencies but also constructive read-read dependencies. Sometimes, a constructive read-read dependency can be helpful to TM performance, as it acts as a conflict predictor and reduces the rollback overhead if the running transaction is eventually aborted [21]. As a result, USIG-B occasionally performs better than USIG-H. In this case, programmers may decide which signature mode performs better after performance profiling. This information can be used to disable the helper signature in USIG-H and run in USIG-B mode by adding a multiplexer in Figure 5-4 (b).

USIG-M usually outperforms baseline signatures, showing the same trend as the other unified signatures. In this design, the basic structure is similar to USIG-B except that one of the bit arrays is implemented as a dual-ported SRAM and has two distinct hash functions. The selected dual-ported bit array plays the role of distinguishing the read-set from the write-set by asserting different bits. With this strategy, the dual-ported bit array can assert more bits if a certain address is read and then written. As observed in [71], about 30% of transactional accesses are both read and written. When more bits are asserted, there is a larger probability of incurring false positives. As a result, the performance of this scheme slightly lags behind the other unified signatures when the signature size is restricted (genome, labyrinth, and vacation) or the benchmark has a large data-set (bayes and yada). As the signature size becomes large enough, USIG-M marginally outperforms USIG-H. However, the performance difference is negligible with large signature budgets. Also, USIG-M requires more hardware overhead to implement the dual-ported SRAM, as presented in Section 5.5.5.

Figure 5-7 Average Speedup of Unified Signatures over Separate Signatures

Figure 5-7 illustrates the average speedup across all benchmarks at each signature size. TM systems with unified signatures harvest more performance than TM systems with separate read-/write-signatures except for the 8K-bit USIG-B case, the only instance in which read-read dependencies overwhelm the advantage of unified signatures. Among the three kinds of unified signatures, USIG-H generally shows the best performance. A TM system with a 2K-bit USIG-H achieves an average speedup of 15.60% over baseline TM systems. With this size, a TM system with USIG-H experiences an average slowdown of 15.49% compared with a TM system with a hypothetical perfect signature. A 4K-bit USIG-H shows a 10.18% speedup over baseline signatures and a 3.53% slowdown relative to the perfect signature.
Besides the performance gain, the advantages of unified signatures also include hardware savings. In Section 5.5.5, we measure the area, delay, and power overheads of baseline and unified signatures.

5.5.2 Sensitivity to Helper Size

In the previous section, we selected 128 bits as the size of the helper signature in USIG-H. In this section, we present the relationship between TM performance and the size of the helper signature. Figure 5-8 shows the speedup of TM systems with USIG-H compared to the baseline system. The helper size ranges from 128 bits (USIG-H128) to 1K bits (USIG-H1024).

Figure 5-8 The Impact of Helper Size on the Unified Signatures with Helper
(One panel per benchmark; speedup of USIG-H over the baseline for helper sizes from 128 bits to 1K bits.)

As shown in Figure 5-8, USIG-H with a larger helper does not always outperform USIG-H128. The main reasons are twofold. First, the write-set size of a transaction is usually small. For example, SigTM [58] recommends using a 128-bit write-signature in its asymmetrically-sized separate signatures. Similar to their observation, 128 bits are sufficient for the role of the helper signature, a special write-signature. Second, the purpose of the helper signature is not detecting conflicts, but detecting the read-read dependencies incurred by the main signature. The false positives generated by the small helper can be easily ignored unless the much larger main signature also incurs false positives at the same time. Since there is no distinct merit to a large helper signature, we fix 128 bits as our helper size to take advantage of its small hardware overhead.

5.5.3 False Positives in Unified Signatures

Table 5-2 False Positive Rates (Total 2K bits)
Benchmarks | BASE R-Sig | BASE W-Sig | BASE Total | USIG-M | USIG-B | USIG-H
bayes | 63.33% | 84.39% | 66.06% | 58.87% | 50.74% | 55.51%
genome | 72.80% | 0.09% | 41.58% | 18.57% | 31.64% | 8.07%
intruder | 0.34% | 0.06% | 0.31% | 0.31% | 0.43% | 0.05%
kmeans | 0.51% | 0.00% | 0.37% | 1.54% | 1.88% | 0.10%
labyrinth | 1.71% | 89.11% | 2.23% | 2.15% | 3.53% | 0.67%
ssca2 | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00%
vacation | 20.63% | 0.56% | 15.90% | 5.08% | 9.25% | 2.37%
yada | 88.27% | 26.10% | 72.89% | 47.49% | 41.70% | 38.54%

The false positive rate serves as an indicator of the quality of a signature (we should note, however, that false positive rates are not directly correlated with TM performance due to the good positive effect [21]). Table 5-2 presents the false positive rates of the baseline separate signatures (BASE), USIG-M, USIG-B, and USIG-H. The false positive rates of the read-/write-signatures from Table 5-1 are shown again in the second and third columns to facilitate the comparison. The fourth column gives the total false positive rate of the baseline signature. The fifth, sixth, and seventh columns present the false positive rates of USIG-M, USIG-B, and USIG-H, respectively.

USIG-B effectively decreases the false positive rates of BASE by better utilizing the signature space. However, USIG-B can suffer from another kind of false positive, the read-read dependency. With labyrinth, the false positive rate of USIG-B is higher than that of BASE. As expected, USIG-M and USIG-H further reduce the number of false positives of USIG-B by filtering out read-read dependencies.

5.5.4 Comparison with Asymmetric Signatures

The read-set size is usually larger than the write-set size for most applications. To efficiently utilize a given hardware budget, implementing asymmetric signatures that devote more bits to the read-signature is a better choice than using same-sized read-/write-signatures [58].
Figure 5-9 illustrates the execution time of TM systems with differently-sized read-/write-signatures, normalized to that of the perfect signature (lower is better). In this figure, the x-axis represents the read-signature size of BASE or the unified signature size of USIG-B and USIG-H. The size of the read-signature ranges from 1K bits to 8K bits, while the size of the write-signature is fixed for each configuration, from 128 bits (BASE-W128) to 1K bits (BASE-W1024). Thus, BASE-W128 and USIG-H consume the same hardware budget (e.g., a 1K-bit read-signature and a 128-bit write-signature for BASE-W128, and a 1K-bit main signature and a 128-bit helper signature for USIG-H).

For genome, intruder, kmeans, and ssca2, performance is rarely affected by the size of the read-signature or write-signature. Neither kmeans nor ssca2 shows any sensitivity to signature size. In the case of genome and intruder, the given signature size is already large enough to show performance similar to that with a perfect signature.

The performance of labyrinth and vacation hardly changes with the read-signature size as long as the read-signature is larger than or equal to 1K bits, but it is sensitive to the write-signature size. Due to their relatively large write-sets, the write-signature size should be at least 1K bits for labyrinth (512 bits for vacation) to harvest reasonable performance.

The performance of bayes and yada is dominated by both the read-signature size and the write-signature size. For bayes, asymmetric signatures show poor performance compared with USIG-B or USIG-H. Independently of the read-signature size, the performance of BASE-W128, BASE-W256, and BASE-W512 is flattened due to their small write-signature sizes. Because the write-set of bayes is very large, those small write-signatures incur many false positives. By increasing the write-signature size to 1K bits, BASE-W1024 harvests more performance than unified signatures at the expense of more hardware. The situation with yada is very similar, except that even there the performance of BASE-W1024 still lags behind that of unified signatures.

As observed in Figure 5-9, each benchmark has different data-set characteristics. Also, each transaction from a benchmark can show different characteristics compared to the other transactions from that benchmark. As long as we implement separate read-/write-signatures, it is impossible to find a sweet spot satisfying every benchmark. On the contrary, unified signatures can fully utilize the given hardware budget by merging read- and write-signatures.
Figure 5-9 Execution Time with Asymmetric Signatures

5.5.5 Hardware Implementation

Table 5-3 Synthesis Results of Unified Signatures

                        Area (um^2)               Time (nsec)           Power (mW)
Signature size (bits)   2K      4K      8K        2K     4K     8K      2K       4K       8K
BASE     Hash           212     254     275       0.330  0.370  0.390   0.089    0.101    0.121
         Bit Array      11,233  13,089  16,694    0.281  0.290  0.308   147.456  148.608  150.840
         Total          11,445  13,343  16,969    0.611  0.660  0.698   147.545  148.709  150.961
USIG-M   Hash           298     335     363       0.400  0.400  0.410   0.124    0.151    0.158
         Bit Array      7,368   9,708   14,540    0.356  0.412  0.525   76.562   79.401   85.052
         Total          7,666   10,043  14,903    0.756  0.812  0.935   76.686   79.552   85.210
USIG-B   Hash           250     272     287       0.380  0.400  0.380   0.100    0.121    0.124
         Bit Array      6,544   8,347   12,149    0.290  0.308  0.392   74.304   75.420   77.616
         Total          6,795   8,619   12,435    0.670  0.708  0.772   74.404   75.541   77.740
USIG-H   Hash           252     274     288       0.390  0.410  0.390   0.100    0.121    0.124
         Bit Array      7,356   9,158   12,960    0.320  0.320  0.392   74.561   75.677   77.873
         Total          7,608   9,432   13,248    0.710  0.730  0.782   74.662   75.798   77.997

We have analyzed the performance aspects of unified signatures. In this section, we estimate the hardware cost of implementing unified signatures [24]. Table 5-3 shows the hardware overheads of separate signatures (BASE) and unified signatures (USIG-M, USIG-B, and USIG-H) in terms of area, delay, and power consumption. As shown in Figure 2-5, Figure 5-1, and Figure 5-4, the hash function used by unified signatures requires one more signature index bit than the hash function used by a separate signature, even though their total sizes are the same. Hence, the hardware overhead of the hash functions for unified signatures is always slightly larger than that for separate signatures. Across all designs and sizes, however, the dominant overhead of a signature design is the bit array, not the hash function, so this difference is negligible.

A baseline signature uses twice as many SRAM arrays as a unified signature because it consists of separate read- and write-signatures. As a result, the bit-array overhead of each design differs even when the total sizes are the same. The overhead of the peripheral logic in SRAMs (e.g., decoders, sense amplifiers, and precharge logic) is amortized over the array size as SRAM sizes scale. Therefore, all unified signature schemes require a smaller hardware footprint and less power than separate signatures by implementing half as many SRAMs, each twice as large.

USIG-B yields the smallest hardware overhead due to its simplest structure. The additional overhead of USIG-M stems from its dual-ported bit array, while USIG-H requires more hardware resources than USIG-B because of its additional 128-bit helper. Since the area and power overhead of an SRAM increases with the number of ports (as shown in Table 3-4), the cost of the second port outweighs the extra 128-bit array, and the hardware requirement of USIG-H is therefore less than that of USIG-M. With a 2K-bit signature size, we can save 33.02% area and 48.02% power with USIG-M. USIG-B occupies 40.63% less area and consumes 49.57% less power. Finally, USIG-H exhibits 33.57% area savings and 49.40% power savings.

5.6 Related Work

All the previous signature designs in Section 2.6.1 have focused on improving the quality of hash functions using separate read- and write-signatures.
On the contrary, the unified signature scheme [23] improves TM performance by changing the structure of signatures. Instead of using traditional separate signatures, it merges them into a single signature that tracks both the read- and write-sets. As a result, this scheme enlarges the effective signature size and improves TM performance by reducing false positives. A parallel Bloom filter [80] also changes the structure of signatures, but its main purpose is to improve the area-efficiency of signatures by using multiple single-ported SRAMs instead of one multi-ported SRAM. Unified signatures can also be implemented with a parallel Bloom filter.

The disparity between read-set size and write-set size was noted in [58]. Because the read-set size is often bigger than the write-set size, Minh et al. recommend using differently-sized signatures, i.e., a large read-signature with a small write-signature. However, we have observed that small write-signatures can introduce many false positives for benchmarks with large write-sets. As long as separate read-/write-signatures are used within a fixed hardware budget, it is difficult to satisfy all benchmarks, each of which has its own unique data-set characteristics. In Section 5.5.4, we have shown the performance impact of asymmetrically-sized read-/write-signatures.

Concurrently with [23], Quislant et al. [71] propose multiset signatures, the signature design most similar to unified signatures. Also noting the asymmetric space usage of separate signatures, they join the read- and write-signatures into a single one. However, merging read- and write-signatures introduces another kind of false positive, read-read dependencies, and [23] and [71] differ in how they handle them. The multiset signature design implements the bit array as a dual-ported SRAM and provides dedicated hash functions for the read-set and write-set. As a result, the selected bit array can distinguish a read address from a write address by asserting a different bit for each, removing read-read dependencies. Differently from multiset signatures, [23] backs up the main unified signature with a special small write-signature, the helper signature, to filter out read-read dependencies. Store addresses are inserted into the helper signature as into a write-signature, but only REQ is tested against the helper. If the main signature hits but the helper misses during a test operation, a read-read dependency has occurred, and the positive can be safely ignored.

In multiset signatures, the dual-ported bit array uses different hash functions for loads and stores, so it can assert more bits when an address is first read and then written. As more bits are asserted, the probability of incurring false positives increases. Consequently, multiset signatures perform well when the signature size is large enough, but they do not harvest more performance when the signature size is restricted or the benchmarks have large data-sets. Also, due to the dual-ported SRAM, the hardware overhead of multiset signatures is higher than that of unified signatures. In Section 5.5, we have compared our unified signatures to multiset signatures in terms of performance and hardware overhead.

5.7 Conclusions

We observe that the occupancy of read- and write-signatures differs quite significantly for most applications.
This asymmetric occupancy of each signature type can introduce many false positives because separate signatures cannot efficiently utilize a given signature budget. Based on this observation, we have proposed a simple and efficient signature design, unified signatures. Instead of separating signatures per data-set, we build a single signature to track all memory accesses. By efficiently utilizing a given hardware budget, a TM system with a unified signature outperforms a baseline system that uses separate signatures. However, unified signatures introduce another problem: read-read dependencies. Fortunately, read-read dependencies do not negate the benefit of unified signatures for practical signature sizes. Even better, a small helper signature can effectively remove read-read dependencies from unified signatures.

In this chapter, we described and compared three kinds of unified signature designs: multiset unified signatures [71], blind unified signatures, and unified signatures with helper. All the proposed unified signatures are promising. They not only improve performance but also save hardware cost. A TM system with a 2K-bit unified signature with helper achieves an average speedup of 15% over TM systems with traditional dedicated read-/write-signatures. At the same time, a 2K-bit unified signature with helper can be implemented with 33% less area and 49% less power than separate signatures. Our results show that the unified signature approach is a step ahead in improving TM performance by efficiently utilizing a given signature budget.

Chapter 6 Mileage-based Conflict Management

6.1 Overview

Atomic sections, i.e., critical sections protected by locks in traditional parallel programs or transactions in TM systems, have conventionally been allowed to execute in any order, with no relative weights, as long as atomicity is maintained. We have observed that some atomic sections are more important than others with respect to performance, based on the implemented algorithm. In this chapter, we propose mileage-based atomic section ordering. The Mileage mechanism is a software/hardware cooperative approach to exploit performance-criticality among the atomic sections of a parallel application. We show that performance-criticality among atomic sections exists and demonstrate the efficiency of mileage in the context of contention management in TM systems. Mileage-based contention management achieves average speedups of 15% over baseline contention management.

6.2 Introduction

Parallel programming decomposes an application into threads that execute concurrently and cooperatively [28]. A parallel application must coordinate the activity of its threads with synchronization, such as global event synchronization with barriers and mutual exclusion with critical sections, to ensure that the dependencies within the program are enforced. We use the term atomic section for either a critical section protected by a lock for mutual exclusion in a traditional parallel program or a transaction in a TM system that maintains atomicity. Atomic sections have traditionally been allowed to execute in any order as long as atomicity is maintained [45]. In this chapter, we suggest that even though atomic sections are equally important for correctness, they are unequally important for performance.
Among the atomic sections in a parallel application, some are more important than others with respect to performance, based on the implemented algorithm; for example, in a producer-consumer dependency relationship, the producer is performance-critical compared to the consumer [68]. It is worthwhile to distinguish the performance-critical sections from the others in order to speed up the execution of a parallel application.

For this purpose, we propose mileage-based atomic section ordering. The Mileage mechanism is a software/hardware cooperative approach to exploit performance-criticality among the atomic sections of a parallel application. Mileage consists of newly-defined instructions (MILEAGE and MRSTCNT) and a new functional unit (the mileage unit). After programmers or compilers insert mileage instructions into the code to express the relative importance of each atomic section, mileage units track the mileage value of each thread. This information is used to extract more performance through resource allocation, thread scheduling, and conflict arbitration. Section 6.4.1 describes the Mileage mechanism.

We show performance-criticality among atomic sections and the efficiency of mileage in the context of contention management in TM systems [25]. Previous contention management schemes treat all transactions with no weights and make decisions based on the information provided by the running transaction instance. In this chapter, we propose new Mileage-based RCM (Section 6.4.2) and Mileage-based PCM (Section 6.4.3) schemes.

6.3 Motivation

Atomic sections have historically been treated equally, without weights, and executed in any order as long as atomicity is maintained by mutual exclusion or by mechanisms inherent in a TM system. However, we suggest that atomic sections are not equally important for performance, due to the dependency relationships between them, and therefore the execution order of atomic sections deserves attention.

Intruder clearly shows the performance-criticality of each atomic section. Intruder implements a network intrusion detection algorithm, scanning network packets for matches against a known set of intrusion signatures [38]. Intruder contains three static transactions (in this chapter, static transaction ids (txids) are assigned according to the order of appearance in the source code): txid0 (TMSTREAM_GETPACKET) captures packet fragments from a network stream and stores them in the fragment queue; txid1 (TMDECODER_PROCESS) reassembles fragments based on a map to construct a packet and puts it into the decoder queue; finally, txid2 (TMDECODER_GETCOMPLETE) moves the packet from the decoder queue to the intrusion detector module. In this benchmark, txid1 is a consumer with respect to txid0 and a producer with respect to txid2.

Table 6-1 Transaction Characteristics of Intruder

TXID    TX Time   R-Set Size   W-Set Size   Aborted (%)          Aborting (%)
txid0    7.20%     4.00         1.00         2,944.13 (11.53%)    1,774.40 (6.95%)
txid1   78.97%    17.59         7.89         9,115.47 (35.69%)   13,185.27 (51.62%)
txid2   13.83%     3.09         1.55        13,481.07 (52.78%)   10,581.00 (41.43%)

Table 6-1 details the dynamic characteristics of the three transactions of intruder. The second column gives the percentage of transaction time; the third and fourth columns present the read-set and write-set sizes in number of cache blocks accessed; the fifth column (Aborted) gives the number of aborted instances of the given txid; and the last column (Aborting) presents the number of aborts that the given txid incurs.
As shown in Table 6-1, most aborts are caused by txid1 or txid2 (93.05%). The producer is often, but not always, more performance-critical than the consumer in a dependency relationship. Figure 6-1 shows the speedup of the TXID RCMs over Time RCM (TIME). The TXID schemes are basically Time-based resolution approaches, except that the highest priority is given to the specified static transaction. For example, TXID0 always prioritizes txid0 over txid1 or txid2. If a conflict occurs between two instances of txid0, Time-based resolution is used. The other conflicts are also resolved with Time RCM.

Figure 6-1 Performance-Criticality in Intruder

Txid0 is a producer with respect to txid1, but txid0 generates hardly any conflicts. As a result, TXID0 yields results that differ only marginally from those of TIME. When the highest priority is assigned to txid2 (TXID2), txid2's aborts decrease at the cost of an increased number of txid1 aborts, because txid2 frequently conflicts with txid1. Unfortunately, this results in significant performance degradation. Txid1 acts as a producer with respect to txid2, and aborting txid1 impedes the progress of the parallel application. Indeed, txid1 has a long transactional time and a large data-set, so the abort penalty of txid1 is much larger than that of txid2. Based on this observation, we conclude that txid1 is the most performance-critical of the three transactions of intruder.

The performance gap is remarkable even with this simple policy: more than a 40% speedup of TXID1 over TXID2. All atomic sections are equally important for maintaining correctness, but they are not equally critical for system performance. To speed up parallel applications in this environment, a high priority must be given to the performance-critical atomic sections.

6.4 Proposed Design

6.4.1 Mileage Instructions and a Mileage Unit

In this section, we present a simple but effective technique, mileage, to assign a priority to each atomic section based on its performance-criticality. For this purpose, two mileage instructions and a mileage unit are proposed.

We define new instructions, MILEAGE and MRSTCNT, to express the performance-criticality of atomic sections in a parallel application. MILEAGE has one operand, the mileage identifier (mid). A mid indicates how far a thread has progressed in the parallel region, monotonically increasing during program execution. MRSTCNT is used to clear the current weight of a mid. MILEAGE and MRSTCNT were inserted manually based on source code analysis and performance profiling. We insert these instructions into the source code to give a high priority to the long-running producer atomic sections. Table 6-2 gives the number of mileage instructions executed. The overhead of mileage instructions is trivial compared to the total number of instructions. A more detailed explanation of where the mileage instructions are inserted in each benchmark is presented in Section 6.5.

Table 6-2 The Overhead of Mileage Instructions

Benchmarks   Total Insts    Total M-Insts   Overhead (%)
bayes        126,559,444    535             0.00%
genome       40,506,195     5,081           0.01%
intruder     38,356,828     11,289          0.03%
kmeans       11,463,554     1,396           0.01%
labyrinth    454,074,971    94              0.00%
ssca2        35,342,886     50,700          0.14%
vacation     12,025,261     4,527           0.04%
yada         75,383,251     5,864           0.01%

Figure 6-2 Mileage Unit

For measuring the mileage of a thread, we implement a mileage unit in each core, illustrated in Figure 6-2 and described below.
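As a minimal software model of the mileage unit's behavior, the following C++ sketch captures the register semantics detailed next. It is a sketch, not the synthesized Verilog; the 16-bit field widths match the synthesis configuration reported below.

    #include <cstdint>

    // Software model of a per-core mileage unit.
    struct MileageUnit {
        std::uint16_t mid = 0;   // current mileage identifier
        std::uint16_t mcnt = 0;  // executions of MILEAGE with the current mid

        void mileage(std::uint16_t new_mid) {   // MILEAGE instruction
            if (new_mid != mid) { mid = new_mid; mcnt = 0; }
            else { ++mcnt; }
        }
        void mrstcnt() { mcnt = 0; }            // MRSTCNT instruction

        // Mileage value: mid concatenated with mcnt. On a conflict, the
        // thread with the smaller value receives the higher priority.
        std::uint32_t value() const {
            return (std::uint32_t(mid) << 16) | mcnt;
        }
    };

On a conflict, comparing value() of the two contenders implements the priority rule directly: the smaller mileage value wins.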
A mileage unit maintains the current mid (the mid register) and a mileage counter (the mcnt register), which tracks the number of times that MILEAGE has been executed with the current mid as its operand. When MILEAGE is executed with a new mid, the new mid is stored in the mid register and the mcnt register is cleared. Every time the same mid appears again, the mcnt register is incremented. MRSTCNT is used to clear the mcnt register. When two threads contend with each other, the thread with the smaller mileage value (mid concatenated with mcnt) receives the higher priority.

To estimate the hardware overhead of the proposed mileage unit, we have developed the design in Verilog HDL and synthesized it using Synopsys Design Compiler, targeting IBM 32 nm technology. We assume a 16-bit mid register and a 16-bit mcnt register. The hardware overhead is negligible: the mileage unit occupies 242.89 μm^2 of area and consumes 473.20 μW of dynamic power and 321.14 nW of leakage power.

6.4.2 Mileage-based Reactive Contention Management

As a practical application of mileage, we first propose Mileage-based Reactive Contention Management (RCM) for HTM systems. When a conflict is detected, one of the competing transactions continues its execution while the others stall or abort to maintain atomicity [36], [37], [83], [87], [86]. Traditional RCMs decide which transaction continues based on information from the current instance. For example, Time-based RCM selects the transaction that started earlier as the winning transaction, and Size-based RCM gives priority to the transaction that has accessed more memory blocks.

The decision of Mileage-based RCM is made using the relative importance of each transaction from the program flow (mid) as well as the dynamic flow (mcnt). On a conflict, Mileage RCM chooses the transaction with the smaller mileage value, with time-based tie-breaking when transactions that have the same mileage value contend. In our experiments, Mileage-based RCM provides prominent performance improvements for benchmarks that have performance-critical transactions, such as bayes and intruder. In Section 6.5.1, we compare Mileage-based RCM to traditional ones and analyze the performance of each benchmark.

6.4.3 Mileage-based Proactive Contention Management

In a parallel application, atomic execution of a code segment is common for maintaining correctness. The two programming alternatives for this purpose are traditional locking and TM. A locking mechanism is pessimistic concurrency control; in effect, it is a predictor that always predicts a conflict. If mispredicted, concurrency is needlessly suppressed and the opportunity to gain more performance is lost. TM, on the contrary, is an optimistic speculation scheme that always predicts no conflict. If transactions conflict and abort, performance can be degraded by the abort penalties. Instead of selecting one of the two static extremes, a dynamic prediction scheme, Proactive Contention Management (PCM), can be used [4], [9], [10], [26], [33], [34], [53], [98].

We propose Pseudo-Lock Insertion (PLI) as a Mileage-based PCM, throttling the execution of less-critical transactions. A brief overview of PLI is shown in Figure 6-3. We provide a per-core abort predictor to decide whether a starting transaction should be delayed, and a system-wide pseudo-lock variable to serialize the execution of transactions. If a transaction is predicted-aborted, the global pseudo-lock variable (m_global) is set to valid when it starts execution.
If the transaction that set the pseudo-lock variable commits, it releases the pseudo-lock by invalidating it. Every time a thread encounters a transaction, it first checks the abort predictor. If the transaction is predicted-aborted and m_global is valid, the thread waits until m_global becomes invalid before starting the transaction. Otherwise, the transaction begins.

Figure 6-3 The Overview of Pseudo-Lock Insertion

We have observed that some applications show a phased execution of transactions [98]. The same static transactions usually access the same shared data, and threads usually enter the same static transaction at roughly the same time. For example, 90.31% of the aborts in labyrinth are caused by txid1, and 98.64% of these aborts are caused by another instance of txid1. In such cases, aborts between instances of the same txid can be easily predicted. Based on this observation, we propose txid-based abort prediction. The abort predictor is implemented as a 4-bit saturating counter. On every abort, the local txid (the txid being aborted) and the enemy txid (the txid that incurred the abort) are compared. If they are the same, the counter is incremented. When a transaction commits, the counter is decremented. The most-significant bit of the counter is used to predict an abort, similar to branch prediction [57]. For comparing txids between competing transactions, the requester embeds its txid in the request message and the receiver piggybacks its txid on the NACK message.

The implemented predictor is simple but effective enough to capture the highly-predictable phased execution of transactional applications. Also, it does not severely hurt the performance of applications that do not exhibit phased execution. Although more complex prediction mechanisms that consider the whole dependency relationship are possible [10], they incur expensive hardware costs. Our abort predictor has an extremely small hardware overhead (89.39 μm^2 of area, and 76.62 μW dynamic / 132.19 nW leakage power).

The global pseudo-lock variable m_global is accessed with ordinary LOAD and STORE instructions instead of atomic instructions. Such non-atomic accesses can observe stale values. As a result, a transaction that could run concurrently may be needlessly delayed, or a transaction that should have been delayed may start execution. However, correctness is not affected, because system coherence and consistency are still maintained by the underlying TM system: a delayed transaction eventually reads the invalidated global variable, and conflicting transactions simply abort and restart. Indeed, even after a transaction has set the global variable, a higher-priority transaction can ignore the pseudo-lock and start, thereby extracting more performance. Meanwhile, the overhead of PCM is reduced by not using expensive atomic instructions.

Now, we explain how Pseudo-Lock Insertion (PLI) employs the mileage concept. The detailed algorithm of PLI is presented in Figure 6-4. In this scheme, each core requires a single-bit owner flag (f_owner) that indicates whether the thread on this core has set the pseudo-lock variable, and an abort counter (n_aborts) that counts consecutive aborts. The abort counter is incremented on each abort and cleared at commit.
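Before walking through the PLI procedures in Figure 6-4, the txid-based abort predictor described above can be modeled as follows. This is a software sketch of the mechanism, not the synthesized logic; the 4-bit counter width follows the text.

    #include <cstdint>

    // Software model of the per-core txid-based abort predictor: a 4-bit
    // saturating counter whose most-significant bit predicts an abort.
    struct AbortPredictor {
        std::uint8_t counter = 0;  // 4-bit saturating counter, range 0..15

        // On every abort, compare the aborted transaction's txid with the
        // txid of the transaction that caused the abort (piggybacked on
        // the NACK message); same-txid conflicts indicate phased execution.
        void on_abort(int local_txid, int enemy_txid) {
            if (local_txid == enemy_txid && counter < 15) ++counter;
        }
        void on_commit() { if (counter > 0) --counter; }

        // MSB of the 4-bit counter (i.e., counter >= 8) predicts an abort.
        bool predict_abort() const { return (counter & 0x8) != 0; }
    };

The saturating design means a burst of same-txid aborts quickly drives the predictor toward predicting aborts, while a run of commits gradually deasserts it, matching the phased behavior described above.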
If a transaction is predicted-aborted and has experienced a number of consecutive aborts (n_aborts) greater than a threshold (n_threshold), it updates the global mileage variable (m_global) with its own mileage value from the mileage unit (m_local, the local mileage) upon restart and sets its owner flag (f_owner). The transaction invalidates the global mileage variable on commit if it is the original acquirer of m_global (its f_owner is set). Threads check this scheduling condition before starting transactions. If the scheduling condition is satisfied and m_global is valid, the thread waits until m_global becomes invalid. However, a transaction with a higher priority (smaller mileage value) ignores m_global so as to extract more performance by seizing the chance to compete with lower-priority transactions. The overwritten global variable is not restored to the original value, and the overwriter never sets its f_owner; the original acquirer remains responsible for invalidating m_global when it commits. After intensive simulations, we set n_threshold to three, as that value showed the best average performance. As a result, each core requires only one additional bit for f_owner and two bits for n_aborts to implement Mileage-based PCM.

procedure PLITxStart
    if predicted-abort then
        if n_aborts ≥ n_threshold and m_global is valid then
            if m_local < m_global then
                overwrite m_global to m_local
                TxStart
            else
                PLITxSchedule
        else
            validate m_global to m_local
            set f_owner
            TxStart
    else
        TxStart
end procedure

procedure PLITxCommit
    if f_owner is set then
        invalidate m_global
        reset f_owner
    reset n_aborts
    TxCommit
end procedure

procedure PLITxSchedule
    repeat
        backoff
        read m_global
    until m_global is invalid or m_local < m_global
    if m_global is invalid then
        validate m_global to m_local
        set f_owner
        TxStart
    else if m_local < m_global then
        overwrite m_global to m_local
        TxStart
end procedure

Figure 6-4 Pseudocode for PLITxStart, PLITxCommit, and PLITxSchedule

6.4.4 Dynamic Mileage Allocation

Figure 6-5 Dynamic Mileage Unit

So far, we have assumed that mileage instructions are inserted by compilers or programmers based on source code analysis and performance profiling. In this section, we briefly describe Dynamic Mileage, which automatically tracks the mileage value of each transaction at runtime. The main purpose of the mileage scheme is to find a highly-contending producer transaction and give it the highest priority, as well as to find a highly-contending consumer transaction and give it the lowest priority. To this end, each core must track the dependencies between transactions.

We propose a Dynamic Mileage Unit (DMU), illustrated in Figure 6-5. Each core has a DMU, and each entry in a DMU has a producer bit (P-bit), a consumer bit (C-bit), and a mileage-signature. The number of entries in a DMU equals the number of static transactions in the application. A mileage-signature is a Bloom filter and contains the previous write-signature of the corresponding static transaction. Every time the currently-running transaction (e.g., txid 2) reads or writes, the memory address is tested against all the mileage-signatures in that core. If there is a hit, the C-bit of txid 2 and the P-bit of the conflicting transaction (e.g., txid 1) are set.
At commit, the write-signature of the committing transaction is copied to its mileage-signature. When a transaction starts, its priority is decided based on the corresponding P-bit and C-bit in the DMU. After the priority is decided, the P-/C-bits are cleared. The corresponding actions are summarized in Table 6-3.

Table 6-3 Dynamic Priority Decision

P-bit  C-bit  Description                                    Action
0      0      This transaction has no relation with          Do nothing.
              contentions, so no priority is required.
1      0      This is a pure producer transaction, and is    Clear the mcnt register at the beginning,
              recommended to receive a high priority.        and increment the mcnt register after commit.
0      1      This is a pure consumer transaction, and is    Increment the mcnt register at the beginning.
              recommended to receive a low priority.
1      1      This is neither a pure producer nor a pure     Do nothing.
              consumer, and need not change its priority.

The proposed Dynamic Mileage has several limitations. First, the design requires additional hardware overhead to implement a DMU per core. Second, because the mileage-signature is a Bloom filter, it can introduce false positives, generating incorrect priorities. We leave the improvement and evaluation of Dynamic Mileage as future work.

6.5 Experimental Results

In this section, we present and discuss the simulation results. Section 6.5.1 shows the results of Mileage-based RCM, comparing it with other RCMs, and analyzes the behavior of the conflict-intensive benchmarks. Section 6.5.2 provides the results of Mileage-based PCM and shows when and why PCM is effective. Finally, we present results for a combination of RCM and PCM in Section 6.5.3.

6.5.1 Simulation Results with Mileage-based RCM

Figure 6-6 Execution Time with RCMs (T: Time, S: Size, and M: Mileage)

Figure 6-6 shows the execution time of each RCM normalized to that of Time-based RCM on a given benchmark (lower is better). When conflicts occur, Time-based RCM assigns the highest priority to the transaction that started earlier, and Size-based RCM assigns the highest priority to the transaction that has accessed more memory blocks. Mileage-based RCM is the scheme proposed in Section 6.4.2. The execution time is broken down into the time spent outside of transactions (Non-TX), the time consumed by committed transactions (Useful-TX), and the transactional overhead (Wasteful-TX). The transactional overhead is broken down further in Figure 6-7: Stall is the waiting time after a conflict is detected; Discard is the execution time of a transaction that was eventually aborted; Backoff corresponds to the backoff delay between an abort and the restart; and Abort is the time needed for the abort process (e.g., undoing the log).

Mileage RCM provides prominent performance improvements for benchmarks that have performance-critical transactions, such as bayes and intruder. To provide further insight, we focus on the contention-intensive benchmarks, bayes, intruder, and labyrinth, in this section.

Figure 6-7 Breakdown of Wasteful Transaction Time (T: Time, S: Size, and M: Mileage)

A description of intruder can be found in Section 6.3. Txid1 of intruder is the most performance-critical transaction: the highly-contending, longest, and largest transaction. We insert MRSTCNT before txid1 and MILEAGE after it so as to give txid1 the highest priority. As a result, 64.41% of txid1's aborts are removed with Mileage RCM compared to Time RCM.
By prioritizing txid1, Discard (which grows when a long transaction aborts) and Abort (which depends on the aborts of large transactions) decrease, as shown in Figure 6-7.

Labyrinth emulates Lee's routing algorithm, finding the shortest interconnection between two points [93]. In each iteration, a thread grabs a start and an end point from a work queue (txid0) and connects them with expansion and backtracking (txid1). Txid1 of labyrinth is a routing transaction that calculates the path and adds it to the global grid. It is the longest transaction (99.83% of transaction time) and has the largest write-set (226.00 cache blocks on average). To give txid1 the highest priority, we insert MRSTCNT before txid1 and MILEAGE after it. An interesting characteristic of labyrinth is that an instance of txid1 usually conflicts with another instance of the same static transaction. Because of this, Mileage RCM hardly improves the performance of labyrinth: all dynamic instances of txid1 have the same mileage value. The other RCMs show similar performance because all instances of txid1 are always serialized anyway, exhibiting phased behavior. As a result, RCMs do not help the performance of labyrinth. However, PCM can harvest more performance by exploiting the always-conflicting characteristic of txid1; in Section 6.5.2, we show the performance improvement of labyrinth with PCMs.

In Eager-Eager (EE) HTM systems, an aborted transaction prevents conflicting transactions from continuing execution until the undo log has been rolled back [60]. The conflicting transactions wait for the log undoing to complete so that they can read the non-speculative values written back to memory during the abort process. Because the abort penalty not only delays the aborted transaction but also affects the contending transactions, the execution of the parallel application can slow down significantly. It is therefore preferable to give the higher priority to the transaction whose abort penalty is large, so we usually select the longer transactions with larger data-sets as the performance-critical ones.

One exceptional case is bayes, an algorithm for learning the dependency structure of Bayesian networks [59]. In each iteration, a thread is given a variable to analyze and adds a dependency between variables to the network. Txid11 (TMfindBestInsertTask) of bayes scans a shared data structure to find the best task. It is the longest transaction and has the largest write-set among the 15 static transactions. Interestingly, every write by txid11 modifies local variables only [81], [101]. Txid11 incurs many conflicts, but all of them are due to its shared reads. Because txid11 is a logically read-only transaction (a pure consumer transaction), its abort penalty is not exposed to the rest of the system; it merely delays txid11 itself. Indeed, if we gave txid11 the higher priority, the conflicting transactions would starve because of its long running time (55.78% of transaction time). We therefore give txid11 a low priority in order to prioritize the others. The remaining transactions form a dependency chain, so we insert MILEAGE accordingly.

Another observation from bayes is that the learning speed increases when the threads are loosely synchronized per iteration. The performance of bayes is sensitive to the order in which dependencies are learned [59]. To exploit this property, we do not insert MRSTCNT in bayes. This results in reduced transactional overhead in all categories (Figure 6-7).
As shown in Figure 6-6, the non-transactional time of bayes is also reduced with Mileage RCM. The reason for the smaller non-transactional time is that Mileage RCM provides better load balancing; i.e., the waiting time of each thread at the end of a parallel region is reduced.

The other benchmarks do not show any significant difference across the RCMs because their conflicts are relatively rare and they spend little time inside transactions. Their behavior is therefore not described in this section but can be found elsewhere [59]. Mileage RCM increases the transactional overhead of ssca2 in Figure 6-7, but this does not translate into any significant performance change because the portion of wasteful transactional time is itself relatively small.

The simulation results show that Mileage RCM achieves average speedups of 11.58% over Time RCM (11.20% over Size RCM). We have also run simulations with other RCMs, such as Karma, Eruption, and Polite [Scherer05], but none shows better performance than the RCMs presented here; these results are omitted for conciseness.

6.5.2 Simulation Results with Mileage-based PCM

Figure 6-8 Execution Time with PCMs (n: no PCM, a: ATS, c: CAS, and p: PLI)

Figure 6-8 shows the execution time of each PCM normalized to that of Mileage-based RCM (no PCM). With Adaptive Transaction Scheduling (ATS) [98], each thread tracks the frequency of the commits and aborts it has executed; this information is used to predict whether an abort will occur for the starting transaction instance. Conflict Avoidance Scheduling (CAS) [26] predicts an abort based on the abort history between threads; if a pair of threads has conflicted severely in the past, the transactions they execute are serialized.

Figure 6-9 Breakdown of Wasteful Transaction Time (n: no PCM, a: ATS, c: CAS, and p: PLI)

The wasteful transactional time is broken down in Figure 6-9. Each category is the same as in Figure 6-7, except Schedule, which is the waiting time before starting a transaction that is predicted-aborted.

PCMs can improve the performance of benchmarks such as intruder and labyrinth, but they hurt the performance of bayes and yada. Intruder and labyrinth exhibit a phased execution of transactions. Txid1 in labyrinth is the longest and largest transaction and always conflicts with another instance of txid1. In such cases, PCM is a good design alternative for removing wasteful transaction time by preventing aborts. Each PCM decides whether to delay a transaction based on the output of the abort predictor. When aborts are either frequent or rare, the abort predictor is easy to train. The well-regulated behavior of intruder and labyrinth is promptly captured by the abort predictor, and the PCMs reap performance by reducing transactional overhead. PCMs reduce aborts (Discard, Backoff, and Abort) as well as conflicts (Stall) because they serialize the execution of competing transactions. ATS and PLI track aborts at a transaction granularity, but CAS treats aborts at a thread granularity; this coarse-grained abort prediction causes CAS to lag.

Bayes has many static transactions, and the complex interleaving of their instances prevents the transaction scheduler from predicting aborts correctly. ATS and CAS over-serialize the transactions of bayes, increasing scheduling overhead, as shown in Figure 6-9. Mileage-based PCM serializes transactions cautiously, but it still experiences marginal performance degradation with bayes.
Yada shows the same behavior as bayes. Yada executes a Delaunay mesh refinement algorithm [59]. In each iteration, a bad triangle is fetched from the work queue by txid0 (TMHEAP_REMOVE), its retriangulation is performed on the mesh by adding a new point, and the bad triangles generated during retriangulation are added to the work queue by txid4 (TMREGION_TRANSFERBAD). Txid2 (TMREGION_REFINE) calculates the retriangulation of a bad triangle. The abort rate of yada is relatively small because bad triangles that are far apart in the mesh hardly interfere with each other, and their refinement is completely independent [46]. This irregular, low abort rate hinders the training of the abort predictor.

PLI schedules transactions less aggressively than ATS and CAS, so the Schedule cycles of PLI are usually fewer, as shown in Figure 6-9. The careful scheduling of PLI harvests less performance for intruder and labyrinth, but it is harmless for the other benchmarks, which lack phased execution. Mileage-based PCM achieves average speedups of 10.67% over ATS (14.32% over CAS).

6.5.3 Putting It All Together

Figure 6-10 Speedup with Combination of RCMs (T: Time and M: Mileage) and PCMs (n: no PCM, a: ATS, c: CAS, and p: PLI)

In this section, we present the performance impact of combining RCMs and PCMs. Figure 6-10 illustrates the speedup of each combination, normalized to Time-based RCM without PCM (higher is better). Mileage RCM speeds up benchmarks that have performance-critical transactions, such as bayes and intruder. PCMs can improve the performance of benchmarks that show phased execution, such as intruder and labyrinth. However, PCMs can damage other cases: the performance of bayes and yada is severely degraded. Therefore, PCM should be implemented carefully. Our proposed scheme, Mileage-based RCM with PLI, achieves average speedups of 15.88% over baseline contention management.

6.6 Related Work

The topics most relevant to mileage are Accelerated Critical Sections (ACS) [88] and Bottleneck Identification and Scheduling (BIS) [45]. Observing that critical sections can serialize the execution of threads, ACS utilizes the fast core of an Asymmetric CMP (ACMP) to speed up the execution of critical sections. BIS extends ACS to accelerate all serializing bottlenecks in parallel applications, e.g., critical sections, barriers, and slow stages in a pipelined parallel application. In BIS, an identifier for each serializing bottleneck is passed to new BIS instructions by software and is used to keep track of the performance-critical bottlenecks; the critical bottleneck is executed on the fast core. The major difference between ACS/BIS and mileage is that they focus on the dynamic behavior of bottlenecks without regard to the software algorithm. Mileage concentrates on the implemented algorithm, controlling the execution order of atomic sections (critical sections or transactions). Mileage also uses new instructions with a mileage identifier, but its main purpose is to express the relative importance of bottlenecks exposed in the software. The limitation of mileage is that it targets only atomic sections. However, since mileage explores an approach complementary to ACS and BIS, its basic concept can cooperate with them to harvest more performance. ACS and BIS exploit an ACMP platform, so it is hard for them to gain more performance on Symmetric CMPs, the target system of mileage.
Thread-criticality is another topic relevant to mileage [49], [50], [14], [8]. The main purpose of thread-criticality work is to find the slowest (critical) threads of a parallel application so as to save energy by slowing down the faster threads or to improve performance by speeding up the slower threads. To measure the criticality of each thread, one can track the number of active cycles between barriers [49], the number of loop iterations [50], [14], or weighted cache miss latencies [8]. The major difference between the thread-criticality approaches and mileage is that the former seek performance-criticality at thread granularity and often target global event synchronization, whereas mileage targets performance-criticality at atomic-section granularity and lock-style synchronization.

We have shown the efficiency of mileage in the context of Transactional Memory (TM). After detecting a conflict, a TM system resolves it with Reactive Contention Management (RCM). RCM can have a profound impact on TM performance, and many design alternatives for RCM have been studied [36], [37], [83], [86]. However, no single policy has been acknowledged as universally best for all applications. As shown in Section 6.5.1, Mileage RCM provides impressive performance improvements for benchmarks that have performance-critical transactions. At the same time, Mileage RCM shows no severe slowdown on any of the other benchmarks we evaluated. Shriraman et al. [Shriraman08], [86] propose a mixed conflict resolution scheme, resolving write-write conflicts eagerly and read-write conflicts lazily. Our Mileage-based RCM is orthogonal to this concept, so it can be combined with such schemes as a priority mechanism. Spear et al. [87] also emphasize the importance of programmer-determined priority between transactions. However, their solution, supporting read visibility, is specific to Software TM (STM) and so is not easily extensible to critical sections in non-transactional parallel applications.

We propose PLI, which uses a txid-based abort predictor and serializes predicted-aborted transactions with pseudo-locking. Our abort predictor is simple but efficient enough to capture the phased behavior of benchmarks. Serializing transactions with a lock is not a new concept [72]: if a transaction acquires a global lock, the other transactions must wait until it releases the lock. In our scheme, however, a higher-priority transaction can start execution even without acquiring the global lock. Because PLI accesses the global variable with ordinary reads and writes, the overhead of atomic instructions is removed. In this chapter, we compared PLI to ATS and CAS. ATS and CAS require a central scheduling module (e.g., the centralized queue of ATS or the scheduling manager of CAS). These central modules require additional hardware overhead, and if they are implemented in software, access to the central module requires synchronization. Although PLI also uses a global variable, the overhead of accessing it is negligible. Finally, Blake et al. [9], [10] propose thread-switching-based PCM for an over-subscribed system. Unfortunately, we cannot directly compare PLI to their scheme because our simulation environment does not support over-subscription. However, based on inspection, their design is unlikely to work well with under-subscription, and it requires noticeable hardware overhead for its Bloom-filter-based abort predictor [10].
6.7 Conclusions

In this chapter, we have suggested that atomic sections are equally important for correctness but unequally critical for performance, depending on the software algorithm. We have shown that performance-criticality among atomic sections exists in some parallel applications and that the execution order of atomic sections can impact the performance of a parallel application. To better exploit application-inherent characteristics, we propose mileage, a software/hardware cooperative mechanism to exploit performance-criticality among atomic sections. New instructions and a new hardware unit are proposed, and the overhead of mileage is measured in terms of the number of additional instructions executed, hardware area, and power. The effectiveness of the mileage mechanism is shown in the context of contention management in Transactional Memory. Mileage introduces negligible overhead in terms of additional instructions, power consumption, and area. Even with its simple design, mileage harvests an average performance improvement of more than 15% on the STAMP benchmarks compared to a conventional system.

Chapter 7 Conclusions

Transactional Memory (TM) has attracted considerable attention as a promising paradigm for alleviating the difficulty of parallel programming. By decoupling correctness and performance, TM can make parallel programming much easier and enable better programmer productivity. Improving TM performance is very important for TM to be accepted as a mainstream programming model. In this thesis, we have reviewed the concept of TM, the requirements for implementing TM systems, and conflict detection and contention management, essential elements of TM systems.

Hardware signatures have been considered an area-efficient structure for detecting conflicts among concurrent transactions. However, signatures can degrade TM performance by producing false positives, i.e., falsely declaring conflicts. Hence, increasing the efficiency of signatures is a crucial issue for TM. We have evaluated our proposed signature designs: adaptive grain signatures and unified signatures.

Adaptive grain signatures focus on improving the quality of hash functions. Due to memory locality, some false positives can be helpful to TM performance, and adaptive grain signatures can improve TM performance by increasing the number of these performance-friendly false positives. Simulation results show that TM systems with adaptive grain signatures frequently outperform baseline TM systems.

Unified signatures harvest more performance by modifying the structure of ordinary signatures. Instead of using separate read-/write-signatures, a unified signature merges them and effectively enlarges the signature size without additional hardware overhead. The simulation results show that a TM system with a unified signature, whose size is the sum of the separate signatures, outperforms a baseline TM system that uses separate signatures.

Contention management is another dimension that significantly impacts TM performance. Observing that some transactions are more important than others with respect to performance, based on the implemented algorithm, we propose the mileage technique, a software/hardware cooperative approach with new instructions and a new functional unit to exploit performance-criticality among transactions.
We propose Mileage-based reactive contention management and proactive contention management and evaluate them against traditional contention management schemes, which treat all transactions without any algorithm knowledge. Mileage-based contention management performs better than the traditional schemes by exploiting the performance-criticality of each transaction.

The benefit of multi-core processors can only be realized by providing environments that enable the efficient development of parallel applications. To pursue this goal, the hardware must help programmers compose parallel applications more easily, and application programmers must then establish methods for extracting more performance from the parallel hardware. We have studied TM as a promising parallel programming alternative, and there are several aspects of TM to be investigated in future research.

First, TM can help programmers easily write parallel programs and has the opportunity to extract more performance by increasing concurrency, but the power consumption of TM systems has not yet been studied pervasively. Because the deluge of parallel computing was caused by the unavoidable power cost of traditional performance scaling with sequential programs, TM will not be a viable solution if it incurs too much power consumption compared to the alternatives. We should carefully investigate the power consumption of each building block of TM and how to reduce its impact on power.

Second, an effective solution for input/output (IO) processing with TM, especially HTM, has not yet been suggested. The absence of IO support restricts the employment of TM or makes TM programming cumbersome, undermining the potential of TM: writing parallel programs easily.

Third, the correlation between performance and transaction interleaving is not yet clear. As can be seen from yada with adaptive grain signatures or unified signatures, false conflicts affect the execution time of the whole program by changing the transaction interleaving. Revealing the relation between performance and conflict-induced transaction interleaving is a first step toward predictable TM performance.

Finally, the mileage technique can be extended to general shared-memory parallel applications for managing threads, allocating shared resources, and reducing power consumption. Since mileage already expresses the progress of each thread, it can be used not only for performance-criticality at the granularity of critical sections but also for performance-criticality at the granularity of threads. Automatic mileage profiling is also an important issue, so that the mileage technique can be easily adopted in conventional computing systems.

References

[1] A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha, "Unlocking Concurrency," in ACM Queue, December/January 2006-2007.
[2] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," in Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 2003.
[3] C. Ananian, K. Asanovic, B. Kuszmaul, C. Leiserson, and S. Lie, "Unbounded Transactional Memory," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.
[4] M. Ansari, M. Lujan, C. Kotselidis, K. Jarvis, C. Kirkham, and I. Watson, "Steal-on-abort: Improving Transactional Memory Performance through Dynamic Transaction Reordering," in Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers, 2009.
[5] "The ARM Cortex-A9 Processors," White Paper, ARM, September 2009.
[6] ARM, "Embedded Memory IP," [Online]. Available: http://www.arm.com/products/physical-ip/embedded-memory-ip/index.php
[7] L. Barroso, "The Price of Performance," in ACM Queue, September 2005.
[8] A. Bhattacharjee and M. Martonosi, "Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors," in Proceedings of the 36th International Symposium on Computer Architecture, 2009.
[9] G. Blake, R. G. Dreslinski, and T. Mudge, "Proactive Transaction Scheduling for Contention Management," in Proceedings of the 42nd International Symposium on Microarchitecture, 2009.
[10] G. Blake, R. G. Dreslinski, and T. Mudge, "Bloom Filter Guided Transaction Scheduling," in Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture, 2011.
[11] B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," in Communications of the ACM, July 1970.
[12] J. Bobba, K. Moore, H. Volos, L. Yen, M. Hill, M. Swift, and D. Wood, "Performance Pathologies in Hardware Transactional Memory," in Proceedings of the 34th International Symposium on Computer Architecture, 2007.
[13] S. Borkar and A. A. Chien, "The Future of Microprocessors," Communications of the ACM, 54(5), 2011.
[14] Q. Cai, J. Gonzalez, R. Rakvic, G. Magklis, P. Chaparro, and A. Gonzalez, "Meeting Points: Using Thread Criticality to Adapt Multicore Hardware to Parallel Regions," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008.
[15] J. Carter and M. Wegman, "Universal Classes of Hash Functions," in Proceedings of the 9th Annual Symposium on Theory of Computing, 1977.
[16] J. Casazza, "Intel Core i7-800 Processor Series and the Intel Core i5-700 Processor Series Based on Intel Microarchitecture (Nehalem)," White Paper, Intel, 2009.
[17] C. Cascaval, C. Blundell, M. Michael, H. Cain, P. Wu, S. Chiras, and S. Chatterjee, "Software Transactional Memory: Why Is It Only a Research Toy?," in ACM Queue, September 2008.
[18] L. Ceze, J. Tuck, C. Cascaval, and J. Torrellas, "Bulk Disambiguation of Speculative Threads in Multiprocessors," in Proceedings of the 33rd International Symposium on Computer Architecture, 2006.
[19] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, "BulkSC: Enforcement of Sequential Consistency," in Proceedings of the 34th International Symposium on Computer Architecture, 2007.
[20] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. C. Minh, W. Baek, C. Kozyrakis, and K. Olukotun, "A Scalable, Non-blocking Approach to Transactional Memory," in Proceedings of the 13th International Symposium on High-Performance Computer Architecture, 2007.
[21] W. Choi and J. Draper, "Locality-Aware Adaptive Grain Signatures for Transactional Memories," in Proceedings of the 24th International Parallel and Distributed Processing Symposium, 2010.
[22] W. Choi, Y. H. Kang, T.-J. Kwon, and J. Draper, "Implementation of Adaptive Grain Signatures for Transactional Memories," in Proceedings of the International Symposium on Circuits and Systems, 2010.
[23] W. Choi and J. Draper, "Unified Signatures for Improving Performance in Transactional Memory," in Proceedings of the 25th International Parallel and Distributed Processing Symposium, 2011.
[24] W. Choi and J. Draper, "Implementation of Unified Signatures for Transactional Memory Systems," in Proceedings of the 54th International Midwest Symposium on Circuits and Systems, 2011.
[25] W. Choi, L. Zhao, and J. Draper, "Mileage-based Contention Management in Transactional Memory," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, 2012.
[26] D. Choi, S. H. Kim, and W. W. Ro, "Conflict Avoidance Scheduling Using Grouping List for Transactional Memory," in Proceedings of the 26th International Parallel and Distributed Processing Symposium, 2012.
[27] J. Chung, L. Yen, S. Diestelhorst, M. Pohlack, M. Hohmuth, D. Christie, and D. Grossman, "ASF: AMD64 Extension for Lock-free Data Structures and Transactional Memory," in Proceedings of the 43rd International Symposium on Microarchitecture, 2010.
[28] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers.
[29] A. Darwiche, "Bayesian Networks," Communications of the ACM, 53(12), 2010.
[30] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum, "Hybrid Transactional Memory," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[31] B. Demsky, "Using Discrete Event Simulation to Analyze Contention Managers," International Journal of Parallel Programming, 39(6), 2011.
[32] D. Dice, O. Shalev, and N. Shavit, "Transactional Locking II," in Proceedings of the 20th International Symposium on Distributed Computing, 2006.
[33] S. Dolev, D. Hendler, and A. Suissa, "CAR-STM: Scheduling-Based Collision Avoidance and Resolution for Software Transactional Memory," in Proceedings of the 27th Symposium on Principles of Distributed Computing, 2008.
[34] A. Dragojevic, R. Guerraoui, A. V. Singh, and V. Singh, "Preventing versus Curing: Avoiding Conflicts in Transactional Memories," in Proceedings of the 28th Symposium on Principles of Distributed Computing, 2009.
[35] U. Drepper, "Parallel Programming with Transactional Memory," in ACM Queue, September 2008.
[36] R. Guerraoui, M. Herlihy, and B. Pochon, "Polymorphic Contention Management," in Proceedings of the 19th International Symposium on Distributed Computing, 2005.
[37] R. Guerraoui, M. Herlihy, and B. Pochon, "Toward a Theory of Transactional Contention Managers," in Proceedings of the 24th Symposium on Principles of Distributed Computing, 2005.
[38] B. Haagdorens, T. Vermeiren, and M. Goossens, "Improving the Performance of Signature-Based Network Intrusion Detection Sensors by Multi-threading," in Proceedings of the 5th International Workshop on Information Security Applications, 2004.
[39] L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, "Transactional Coherence and Consistency," in Proceedings of the 31st International Symposium on Computer Architecture, 2004.
[40] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, P. A. Boyle, N. H. Christ, C. Kim, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, and G. L. Chiu, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, 32(2), 2012.
[41] T. Harris, J. Larus, and R. Rajwar, Transactional Memory, Morgan & Claypool Publishers, 2010.
[42] M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," in Proceedings of the 20th International Symposium on Computer Architecture, 1993.
[43] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer, III, "Software Transactional Memory for Dynamic-Sized Data Structures," in Proceedings of the 22nd Symposium on Principles of Distributed Computing, 2003.
Scherer, III, “Software Transactional Memory for Dynamic-Sized Data Structures,” in Proceedings of the 22nd Symposium on Principles of Distributed Computing, 2003. [44] Intel. Intel Architecture Instruction Set Extensions Programming Reference, February 2012. [45] J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt, “Bottleneck Identification and Scheduling in Multithreaded Applications,” in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012. [46] M. Kulkarni, L. P. Chew, and K. Pingali, “Using Transactions in Delaunay Mesh Generation,” in Proceedings of the Workshop on Transactional Memory Workloads, 2006. [47] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen, “Hybrid Transactional Memory,” in Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming, 2006. [48] N. Kurd, S. Bhamidipati, C. Mozak, J. Miller, T. Wilson, M. Nemani, and M. Chowdhury, “Westmere: A Family of 32nm IA Processors,” in Proceedings of the International Solid-State Circuits Conference, 2010. [49] J. Li, J. F. Martinez, and M. C. Huang, “The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors,” in Proceedings of the 10th International Symposium on High-Performance Computer Architecture, 2004. 127 [50] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin, “Exploiting Barriers to Optimize Power Consumption of CMPs,” in Proceedings of the 19th International Parallel and Distributed Processing Symposium, 2005. [51] M. Lupon, G. Magklis, and A. Gonzalez, “FASTM: A Log-based Hardware Transactional Memory with Fast Abort Recovery,” in Proceedings of 18th International Conference on Parallel Architectures and Compilation Techniques, 2009. [52] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A Full System Simulation Platform,” in IEEE Computer, February 2002. [53] W. Maldonado, P. Marlier, P. Felber, A. Suissa, D. Hendler, A. Fedorova, J. L. Lawall, and G. Muller, “Scheduling Support for Transactional Memory Contention Management,” in Proceedings of the 15th Symposium on Principles and Practice of Parallel Programming, 2010. [54] V. J. Marathe, W. N. S. III, and M. L. Scott, “Adaptive Software Transactional Memory,” Technical Report 868, Computer Science Department, University of Rochester, 2005. [55] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood, “Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset,” in Computer Architecture News, September 2005. [56] A. McDonald, J. Chung, B. D. Carlstrom, C. C. Minh, H. Chafi, C. Kozyrakis and K. Olukotun, “Architectural Semantics for Practical Transactional Memory,” in Proceedings of the 33rd International Symposium on Computer Architecture, 2006. [57] S. McFarling, “Combining Branch Predictors,” WRL Technical Note TN-36, Digital Equipment Corporation, June, 1993. [58] C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun, “An Effective Hybrid Transactional Memory 128 System with Strong Isolation Guarantees,” in Proceedings of the 34th International Symposium on Computer Architecture, 2007. [59] C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun, “STAMP: Stanford Transactional Applications for Multi-Processing,” in Proceedings of the IEEE International Symposium on Workload Characterization, 2008. [60] K. Moore, J. Bobba, M. Moravan, M. Hill, and D. 
Wood, “LogTM: Log-based Transactional Memory,” in Proceedings of the 12th International Symposium on High-Performance Computer Architecture, 2006. [61] M. Moravan, J. Bobba, K. Moore, L. Yen, M. Hill, B. Liblit, M. Swift, and D. Wood, “Supporting Nested Transactional Memory in LogTM.” in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006. [62] A. Natarajan and N. Mittal, “False Conflict Reduction in the Swiss Transactional Memory (SwissTM) System,” in Proceedings of International Parallel and Distributed Processing Symposium, Workshops and PhD Forum (IPDPSW), 2010. [63] M. Olszewski, J. Cutler, and J. Steffan, “JudoSTM: A Dynamic Binary- Rewriting Approach to Software Transactional Memory,” in Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, 2007. [64] K. Olukotun and L. Harmmond, “The Future of Microprocessors,” in ACM Queue, September 2005. [65] V. Pankratius, A.-R. Adl-Tabatabai, and F. Otto, “Dose Transactional Memory Keep Its Promises? Results from an Empirical Study,” Technical Report#2009- 12, Institute for Program Structures and Data Organization, University of Karlsruhe, Germany, 2009. [66] V. Pankratius and A.-R. Adl-Tabatabai, “A Study of Transactional Memory vs. Locks in Practice,” in Proceedings of the 23rd symposium on Parallelism in algorithms and architectures, 2011. 129 [67] J.-K. Peir, S.-C. Lai, S.-L. Lu, J. Stark, and K. Lai, “Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching,” in Proceedings of the 16th International Conference on Supercomputing, 2002. [68] R. V. Polanczyk. Extending the Semantics of Scheduling Priorities. Communications of the ACM, 55(8), 2012. [69] M. Quinn, “Parallel Programming in C with MPI and OpenMP,” McGraw-Hill Companies, 2003. [70] R. Quislant, E. Gutierrez, O. Plata, and E. Zapata, “Improving Signatures by Locality Exploitation for Transactional Memory,” in Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques, 2009. [71] R. Quislant, E. Gutierrez, O. Plata, and E. Zapata, “Multiset Signatures for Transactional Memory,” in Proceedings of the 25th International Conference on Supercomputing, 2011. [72] R. Rajwar and J. R. Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” in Proceedings of the 36th International Symposium on Microarchitecture, 2001. [73] R. Rajwar and J. Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002. [74] H. E. Ramadan, C. J. Rossbach, D. E. Porter, O. S. Hofmann, A. Bhandari, and E. Witchel, “MetaTM/TxLinux: Transactional Memory for an Operating System,” in Proceedings of the 34th International Symposium on Computer Architecture, 2007. [75] C. J. Rossbach, H. E. Ramadan, O. S. Hofmann, D. E. Porter, A. Bhandari, and E. Witchel, “TxLinux and MetaTM: Transactional Memory and the Operating System,” Communications of the ACM, 51(9), 2008. 130 [76] C. Rossbach, O. Hofmann, and E. Witchel, “Is Transactional Memory Programming Actually Easier?” in Proceedings of the 8th Annual Workshop on Duplicating, Deconstructing, and Debunking, 2009. [77] C. J. Rossbach, O. S. Hofmann, and E. Witchel, “Is Transactional Programming Actually Easier?” in Proceedings of the 15th Symposium on Principles and Practice of Parallel Programming, 2010. [78] B. 
Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B. Hertzberg, “McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime,” in Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming, 2006. [79] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson, “Architectural Support for Software Transactional Memory,” in Proceedings of the 39th International Symposium on Microarchitecture, 2006. [80] D. Sanchez, L. Yen, M. Hill, and K. Sankaralingam, “Implementing Signatures for Transactional Memory,” in Proceedings of the 40th International Symposium on Microarchitecture, 2007. [81] S. Sanyal, S. Roy, A. Cristal, O. Unsal, and M. Valero, “Dynamically Filtering Thread-Local Variables in Lazy-Lazy Hardware Transactional Memory,” in Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, 2009. [82] S. Sethumadhavan, R. Desikan, D. Burger, C. Moore, and S. Keckler, “Scalable Hardware Memory Disambiguation for High ILP Processors,” in Proceedings of the 36th International Symposium on Microarchitecture, 2003. [83] W. N. Scherer and M. L. Scott, “Advanced Contention Management for Dynamic Software Transactional Memory,” in Proceedings of the 24th Symposium on Principles of Distributed Computing, 2005. [84] N. Shavit and D. Toutou, “Software Transactional Memory,” in Proceedings of the 14th Symposium on Principles of Distributed Computing, 1995. 131 [85] A. Shriraman, M. F. Spear, H. Hossain, V. J. Marathe, S. Dwarkadas, and M. L. Scott, “An Integrated Hardware-Software Approach to Flexible Transactional Memory,” in Proceedings of the 34th International Symposium on Computer Architecture, 2007. [86] A. Shriraman and S. Dwarkadas, “Refereeing Conflicts in Hardware Transactional Memory,” in Proceedings of the 23rd International Conference on Supercomputing, 2009. [87] M. F. Spear, L. Dalessandro, V. J. Marathe, and M. L. Scott, “A Comprehensive Strategy for Contention Management in Software Transactional Memory,” in Proceedings of the 14th Symposium on Principles and Practice of Parallel Programming, 2009. [88] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009. [89] H. Sutter and J. Larus, “Software and the Concurrency Revolution,” in ACM Queue, September 2005. [90] S. Thoziyoor, N. Muralimanohar, J. Ahn, and N. Jouppi, “CACTI 5.1,” Technical Report HPL-2008-20, Hewlett Packard Labs, April 2008. [91] S. Tomic, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Harris, and M. Valero, “EazyHTM: EAger-LaZY Hardware Transactional Memory,” in Proceedings of the 42nd International Symposium on Microarchitecture, 2009. [92] M. Tremblay and S. Chaudhry, “A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout Thread CMT SPARC Processor,” in Proceedings of the International Solid-State Circuits Conference, 2008. [93] I. Watson, C. Kirkham, and M. Lujan, “A Study of a Transactional Parallel Routing Algorithm,” in Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, 2007. 132 [94] D. Wendel, R. Kalla, R. Cargoni, J. Clables, J. Friedrich, R. Frech, J. Kahle, B. Sinharoy, W. Starke, S. Taylor, S. Weitzel, S. Chu, S. Islam, and V. 
Zyuban, “The Implementation of POWER7: A Highly Parallel and Scalable Multi-Core High-End Server Processor,” in Proceedings of the International Solid-State Circuits Conference, 2010. [95] L. Yen, J. Bobba, M. Marty, K. Moore, H. Volos, M. Hill, M. Swift, and D. Wood, “LogTM-SE: Decoupling Hardware Transactional Memory from Caches,” in Proceedings of the 13th International Symposium on High- Performance Computer Architecture, 2007. [96] L. Yen, S. Draper, and M. Hill, “Notary: Hardware Techniques to Enhance Signatures,” in Proceedings of the 41st International Symposium on Microarchitecture, 2008. [97] L. Yen, “Signatures in Transactional Memory Systems,” PhD thesis, University of Wisconsin, February 2009. [98] R. M. Yoo and H.-H. S. Lee, “Adaptive Transaction Scheduling for Transactional Memory Systems,” in Proceedings of the 20th Symposium on Parallelism in Algorithms and Architectures, 2008. [99] J. Zebchuk, V. Srinivasan, M. Qureshi, and A. Moshovos, “A Tagless Coherence Directory,” in Proceedings of the 42nd International Symposium on Microarchitecture, 2009. [100] C. Zilles and L. Baugh, “Extending Hardware Transactional Memory to Support Nonbusy Waiting and Nontransactional Actions,” in Proceedings of the 1st Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, Jun 2006. [101] F. Zyulkyarov, S. Stipic, T. Harris, O. S. Unsal, A. Cristal, I. Hur, and M. Valero, “Discovering and Understanding Performance Bottlenecks in Transactional Applications,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010.
Abstract
Chip Multiprocessors (CMPs) have become mainstream due to the physical power limits of process technology. In this parallel era, software applications no longer automatically benefit from improvements in processor performance as they did in past decades. The benefit of CMPs can be realized only by environments that enable the efficient creation of parallel applications.

Transactional Memory (TM) is a promising paradigm that aims to simplify parallel programming by providing a programmer-friendly alternative to traditional lock-based synchronization. With TM, programmers focus only on the correctness of their parallel programs, composing applications in units of transactions, blocks of code that execute atomically and in isolation. The underlying TM system is responsible for enforcing atomicity and extracting performance. By decoupling correctness from performance, TM can make parallel programming much easier and enable better programmer productivity than lock primitives.

TM systems attempt to harvest high performance by executing multiple transactions in parallel. In TM systems, a conflict occurs when a memory block is accessed concurrently by two or more transactions and at least one of the accesses is a write. Detecting conflicts is critical to the correctness as well as the performance of TM systems. In this dissertation, we propose two conflict detection mechanisms, adaptive grain signatures and unified signatures, to improve the efficiency of conflict detection.

Observing that some false positives can help performance by triggering the early abort of a transaction that would encounter a true conflict later anyway, we propose adaptive grain signatures, which improve performance by dynamically changing the range of address keys based on the abort history. With adaptive grain signatures, we increase the number of performance-friendly false positives while decreasing the number of performance-destructive ones.

Instead of using separate read- and write-signatures, as is often done in TM systems, we implement a single signature, a unified signature, to track all read and write accesses. By merging the read- and write-signatures, a unified signature effectively enlarges the signature coverage without additional overhead. Within the constraints of a given hardware budget, a TM system with a unified signature outperforms a baseline system with same-sized traditional signatures by reducing the number of falsely detected conflicts. Even though the unified signature scheme introduces read-read dependencies, we show that these false dependencies do not negate the benefit of unified signatures and can be filtered out effectively. A TM system with a 2K-bit unified signature and the helper signature scheme achieves a 15% speedup over the baseline TM with 33% less area and 49% less power.

Contention management, the question of how conflicts are resolved or prevented, is another building block of TM systems that significantly impacts TM performance. Traditionally, critical sections and transactions have been treated as equally weighted units that may execute in any order as long as atomicity is maintained. We observe that, depending on the implemented algorithm, some transactions matter more to performance than others. Based on this observation, we propose the mileage technique, a software/hardware cooperative approach with new instructions and a new functional unit to exploit performance-criticality among transactions.
We propose mileage-based contention management, which achieves an average speedup of 15% over the baseline contention management.
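To make the signature mechanics described above concrete, the following minimal Python sketch models a unified signature with a helper. All names here (BloomSignature, UnifiedSignature, record_read, and so on) are illustrative assumptions for exposition, not the hardware interface defined in the dissertation. Reads and writes share one large Bloom filter, so the full bit budget backs a single filter, while a small write-only helper filter screens out the read-read matches that the unified filter alone would falsely report as conflicts.

    class BloomSignature:
        """A simple Bloom filter over addresses (sketch, not the RTL design)."""
        def __init__(self, num_bits, hash_fns):
            self.num_bits = num_bits
            self.hash_fns = hash_fns  # each maps an address to an integer
            self.bits = 0

        def insert(self, addr):
            for h in self.hash_fns:
                self.bits |= 1 << (h(addr) % self.num_bits)

        def may_contain(self, addr):
            # True if every hashed bit is set; false positives are possible,
            # false negatives are not
            return all((self.bits >> (h(addr) % self.num_bits)) & 1
                       for h in self.hash_fns)

        def clear(self):
            # invoked on commit or abort
            self.bits = 0

    class UnifiedSignature:
        """One large unified filter for reads and writes, plus a small
        helper filter that records writes only (names are assumptions)."""
        def __init__(self, unified_bits, helper_bits, hash_fns):
            self.unified = BloomSignature(unified_bits, hash_fns)
            self.helper = BloomSignature(helper_bits, hash_fns)

        def record_read(self, addr):
            self.unified.insert(addr)

        def record_write(self, addr):
            self.unified.insert(addr)
            self.helper.insert(addr)

        def check_remote_write(self, addr):
            # a remote write conflicts with any local read or write
            return self.unified.may_contain(addr)

        def check_remote_read(self, addr):
            # a remote read conflicts only with a local write; the helper
            # filters the read-read dependencies the unified filter alone
            # would falsely flag
            return self.unified.may_contain(addr) and self.helper.may_contain(addr)

A similar sketch, again under assumed names, captures the intent of mileage-based conflict resolution: on a conflict, the transaction that has accumulated more mileage (performance-criticality credit earned through the proposed mileage instructions) wins, and otherwise the policy falls back to the usual requester-stalls behavior.

    def resolve_conflict(requester, holder):
        # Hypothetical mileage-aware arbitration layered on requester-stalls;
        # 'mileage', 'abort', and 'stall' are placeholder members, not the
        # dissertation's exact mileage unit interface.
        if requester.mileage > holder.mileage:
            holder.abort()      # holder restarts; the more critical requester proceeds
        else:
            requester.stall()   # default: requester waits for the holder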
Conceptually similar
Hardware techniques for efficient communication in transactional systems
Improving reliability, power and performance in hardware transactional memory
Demand based techniques to improve the energy efficiency of the execution units and the register file in general purpose graphics processing units
Architectural innovations for mitigating data movement cost on graphics processing units and storage systems
Efficient memory coherence and consistency support for enabling data sharing in GPUs
Introspective resilience for exascale high-performance computing systems
Proactive detection of higher-order software design conflicts
SLA-based, energy-efficient resource management in cloud computing systems
Efficient processing of streaming data in multi-user and multi-abstraction workflows
Reliable cache memories
Low cost fault handling mechanisms for multicore and many-core systems
Scalable exact inference in probabilistic graphical models on multi-core platforms
Efficient techniques for sharing on-chip resources in CMPs
Resource underutilization exploitation for power efficient and reliable throughput processor
Energy efficient design and provisioning of hardware resources in modern computing systems
Ensuring query integrity for spatial data in the cloud
Thermal analysis and multiobjective optimization for three dimensional integrated circuits
Improving efficiency to advance resilient computing
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Design of low-power and resource-efficient on-chip networks
Asset Metadata
Creator: Choi, Woojin (author)
Core Title: Improving the efficiency of conflict detection and contention management in hardware transactional memory systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Engineering
Publication Date: 11/12/2012
Defense Date: 08/29/2012
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: conflict detection, contention management, OAI-PMH Harvest, parallel processors, transactional memory
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Draper, Jeffrey (committee chair), Annavaram, Murali (committee member), Nakano, Aiichiro (committee member)
Creator Email: woojinch@usc.edu, woojinch1732@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-109829
Unique Identifier: UC11288364
Identifier: usctheses-c3-109829 (legacy record id)
Legacy Identifier: etd-ChoiWoojin-1286.pdf
Dmrecord: 109829
Document Type: Dissertation
Rights: Choi, Woojin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA