HARDWARE TECHNIQUES FOR EFFICIENT COMMUNICATION IN TRANSACTIONAL SYSTEMS

by

Lihang Zhao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2014

Copyright 2014 Lihang Zhao

Dedication

I dedicate this dissertation to my parents, Gu Zhao and Li Deng, and my wife, Tian Tian. Their unconditional love, support and encouragement have accompanied me through this long journey.

Acknowledgments

Embarking on the journey of a PhD was one of the improvisations in my life, prompted by a growing interest in research, deep dissatisfaction with my limited expertise, and frustration from job hunting amidst the 2008 economic downturn. Looking back on the way I have come, I am fortunate to have made such a rewarding decision. It was a bold one: without the generous support, help and advice of so many people, this thesis would not have been possible.

First of all, my gratitude goes to my advisor, Dr. Jeffrey Draper. His support and advisement are indispensable to this thesis and a blessing to me. He gave me precious freedom in pursuing novel ideas and enlightened me during the dark moments. Perhaps more importantly, I learned from him how to be a decent and upbeat person. I will always be grateful for having worked with him. My thanks also go to my committee members, Dr. Murali Annavaram, Dr. Sandeep Gupta, Dr. Aiichiro Nakano, and Dr. Timothy Pinkston, for their constructive shepherding and instrumental feedback.

I want to thank all my talented colleagues in our MARINA group. My special thanks go to Woojin Choi and Bilal Zafar, two knowledgeable and patient seniors who provided valuable suggestions and answered my endless questions. Woojin's work on the simulation infrastructure contributes significantly to the works in this thesis. I also would like to thank Lizhong Chen, my collaborator and friend. I have learned a lot from our collaboration and our chats.

I am fortunate to have Tracy Tam and Peter Lau, whose spiritual companionship is so important and cherished. My thanks are due to my amazing friends at ISI: Congxin Cai, Lixing Huang, Chengjie Zhang, Xun Fan, Lin Quan, Xue Cai, Hao Shi, Xiyue Deng, Zi Hu and many others. Of course, I wouldn't be here without the unconditional love and support from my parents throughout the course of my entire life. Lastly, thanks be to the God who listens to my prayers and looks after me through all the ups and downs.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problems and Motivation
  1.2 Thesis Contribution
    1.2.1 An Analytical Model for Communication Cost
    1.2.2 Selective Eager-Lazy HTM
    1.2.3 In-Network Traffic Filtering for HTM
    1.2.4 Predictive Unicast and Notification
    1.2.5 Consolidated Conflict Detection
  1.3 Dissertation Organization

2 Transactional Memory Background
  2.1 Parallel Programming: the Grand Challenge
  2.2 Transactional Memory Basic Concept
  2.3 Hardware Transactional Memory and Its Taxonomy
    2.3.1 Eager vs. Lazy Version Management
    2.3.2 Eager vs. Lazy Conflict Detection
    2.3.3 Reactive vs. Proactive Conflict Resolution
    2.3.4 Eager vs. Lazy HTM
  2.4 Contemporary HTM Designs
    2.4.1 Lazy HTM Designs
    2.4.2 Eager HTM Designs
    2.4.3 Hybrid Policy HTM Designs
  2.5 Summary

3 An Analytical Model for Communication Cost
  3.1 Categorization of On-Chip Communication
  3.2 Relation of Different Types of Communication Cost
  3.3 Quantification of Communication Cost
  3.4 Key Factors to Reduce Communication Cost

4 Selective Eager-Lazy HTM
  4.1 Motivation
  4.2 The Static SEL-TM Design
  4.3 The Dynamic SEL-TM Design
    4.3.1 Architectural Overview
    4.3.2 Conflict Counting
    4.3.3 Transaction Profiling
    4.3.4 Lazy Address Selector
  4.4 Evaluation
    4.4.1 Methodology
    4.4.2 Performance Analysis of Static SEL-TM
    4.4.3 Performance Analysis of Dynamic SEL-TM
    4.4.4 Impact on Network Traffic
    4.4.5 Sensitivity Analysis
  4.5 Summary

5 In-Network Traffic Regulation for HTM
  5.1 Motivation
    5.1.1 HTM-NOC Interplay
    5.1.2 Communication in Conflict Detection
    5.1.3 False Forwarding
  5.2 The TMNOC Design
    5.2.1 NOC-aware HTM
    5.2.2 Coordination between HTM and NOC
    5.2.3 In-Network Conflict Tracking
    5.2.4 The TMNOC Logic
    5.2.5 Operation Walk-through
    5.2.6 Discussion
  5.3 Evaluation
    5.3.1 Methodology
    5.3.2 Reduction in Directory Blocking
    5.3.3 Reduction in Network Energy
    5.3.4 Reduction of Network Traffic
    5.3.5 Impact on Performance
    5.3.6 Sensitivity Study
    5.3.7 Area Overhead
  5.4 Summary

6 Predictive Unicast and Notification
  6.1 Introduction
  6.2 Motivation
  6.3 Predictive Unicast and Notification
    6.3.1 Basic Idea
    6.3.2 Unicast Destination Prediction
    6.3.3 Handling Misprediction
    6.3.4 Notification
    6.3.5 Protocol Support for PUNO
    6.3.6 Operation Example
  6.4 Evaluation
    6.4.1 Methodology
    6.4.2 Reduction in Transaction Abort
    6.4.3 Reduction in Network Traffic
    6.4.4 Reduction in Directory Blocking
    6.4.5 Impact on Performance
    6.4.6 Transaction Execution Efficiency
    6.4.7 Hardware Overhead
  6.5 Summary

7 Consolidated Conflict Detection
  7.1 Introduction
  7.2 Background and Motivation
    7.2.1 Detecting Data Races in Cache Coherence
    7.2.2 Conflict Detection in HTM
    7.2.3 The Communication Overhead of Conflict Detection
  7.3 Consolidated Conflict Detection
    7.3.1 Morphable Ownership Tracker
    7.3.2 Transaction Status Table
    7.3.3 The C2D Logic
  7.4 Implementation Details
    7.4.1 Implementing MOS-Tracker
    7.4.2 Analysis of False Positive in C2D
  7.5 Evaluation
    7.5.1 Methodology
    7.5.2 Impact on Network Traffic
    7.5.3 Reduction in Network Energy
    7.5.4 Impact on Performance
    7.5.5 Reduction on Conflict Detection
    7.5.6 Sensitivity Study
    7.5.7 Hardware Cost Estimation
  7.6 Summary

8 Conclusion
  8.1 Summary
  8.2 Looking-forward
  8.3 Reflection
Bibliography

List of Tables

4.1 Summary of conflict resolution policy in SEL-TM
4.2 System configuration
4.3 Benchmark input parameters and characteristics
5.1 Benchmark input parameters and characteristics
5.2 System configuration
5.3 Result of area overhead estimation
6.1 Benchmark input parameters
6.2 System configuration
6.3 Area and power overhead estimation
7.1 Data sharing activities and conflict requests at different combinations of coherence state and transactional sharer vector (TSV)
7.2 Benchmark input parameters
7.3 Baseline system
7.4 Notification traffic in Eager-C2D

List of Figures

2.1 Examples of parallel programming using locks and transactions
2.2 Examples of eager conflict detection using a coherence protocol
2.3 Lazy HTM vs. Eager HTM on a multiple reader and single writer scenario
2.4 Lazy HTM vs. Eager HTM on a multiple writer scenario
2.5 Architecture of a processor with TCC support [HWC+04]
2.6 Operation of hardware signatures to support conflict detection
3.1 Categorization of communication in transactional systems
4.1 Average store position with respect to the transaction length
4.2 Normalized number of transaction aborts on the three most contended memory blocks
4.3 Architectural overview of SEL-TM (SEL-TM-specific modules in circle frames)
4.4 The control flow of processing transactional loads/stores in SEL-TM
4.5 Architectural overview of the SEL Manager
4.6 Algorithm to update HIT and TSR
4.7 Static SEL-TM execution time
4.8 Transaction abort count on the top-3 conflict hot spots in memory
4.9 Dynamic SEL-TM execution time
4.10 Speedup on a 16-core processor over sequential execution
4.11 Normalized transactional on-chip traffic
4.12 Impact of HIT configuration on SEL-TM performance ("LRU" stands for "Least Recently Used")
4.13 Normalized performance vs. number of addresses being lazily managed
5.1 On-chip network traffic categorization of STAMP benchmarks
5.2 HTM conflict detection. R: requester; DIR: home node directory; S: sharer
5.3 Breakdown of GETS/GETX coherence requests from transactions to the directory
5.4 Breakdown of on-chip transactional traffic
5.5 Format of CT-Register
5.6 Extended coherence protocol messages to support coordination between HTM and NOC
5.7 (a) Router microarchitecture (TMNOC-specific structures in bold rectangles); (b) CT-Buffer entry format; (c) Router pipeline organization (TO: TMNOC Operation)
5.8 Flowchart depicting transactional request filtering in TMNOC logic. REQTYPE and TXREQ are fields in the in-transit request message. DAS is from the matching conflict trace
5.9 Operation examples. (a) and (b): TMNOC-base. (c) and (d): TMNOC-aggressive. All the requests, responses and coherence states are with regard to the same cache block. Dir: directory
5.10 Simulated chip multiprocessor architecture. TMNOC augmentations are marked with bold rectangles
5.11 Normalized cycle count when the directory is busy serving transactional requests (B: baseline w/o TMNOC; T: TMNOC-base; T+: TMNOC-aggressive)
5.12 Normalized network energy
5.13 Normalized network traffic
5.14 Hop count distribution (measured in router traversals by flits)
5.15 Normalized execution time
5.16 Performance vs. number of CT-Buffer entries
5.17 Performance vs. timeout threshold
6.1 Comparison between the cache coherence scheme and transaction execution. DIR: home node directory. (a) coherence protocol handling for a GETX request; (b) contention management mechanism handling the GETX request. Explosion marks indicate transaction conflicts
6.2 Breakdown of the GETX requests from transactions
6.3 Distribution of the number of transactions being aborted unnecessarily due to false aborting
6.4 Comparison of transaction executions in the conventional scheme and PUNO
6.5 (a) Directory augmentation to support unicast destination prediction. Added hardware structures in bold rectangles. r-cnt: rollover counter; v-cnt: validity counter. (b) State transition of the validity counter
6.6 Structure of the transaction length buffer and computing logic
6.7 Protocol message extensions to support PUNO
6.8 PUNO operation examples. All the coherence messages and states are with regard to the same cacheline. DIR: directory. C: comparator. Key operations are highlighted
6.9 The baseline chip multiprocessor architecture. PUNO augmentations in bold rectangles
6.10 Normalized transaction abort count
6.11 Normalized on-chip network traffic
6.12 Normalized cycle count when the directory is busy servicing transactional GETX
6.13 Normalized execution time
6.14 Normalized transaction G/D ratio indicating the efficiency of transaction execution (the larger the better)
7.1 Examples comparing (a) data race detection, (b) distributive conflict detection and (c) consolidated conflict detection. Explosion marks indicate the operation of race condition detection in (a) or conflict detection in (b) and (c)
7.2 Breakdown of transactional traffic
7.3 Broadcast traffic as a percentage of the total transactional traffic
7.4 Architectural overview of a CMP tile. Bold rectangles indicate C2D-specific components
7.5 Procedure of the C2D logic to detect and resolve conflicts. AF: AbortFlag. TSV: Transactional Sharer Vector
7.6 A hardware implementation of the MOS-Tracker
7.7 Execution phases of the MOS-Tracker
7.8 Probability of a false positive as a function of the number of addresses being inserted
7.9 Normalized on-chip network traffic
7.10 Normalized coherence message count
7.11 Normalized network energy
7.12 Normalized execution time
7.13 Conflict detections performed at each tile
7.14 Sensitivity to MOS-Tracker entry size

Abstract

The architectural challenges on the way to extreme-scale computing necessitate major progress in designing high-performance and energy-efficient hardware building blocks, such as microprocessors. The chip multiprocessor (CMP) architecture has emerged as the preferred solution for exploiting the increasing transistor density for sustainable performance improvement. As the core count keeps scaling up, developing parallel applications that reap commensurate performance improvement becomes imperative. The Hardware Transactional Memory (HTM) approach promises increased productivity in the practice of parallel programming. Recent research in academia and industry suggests that the design space and tradeoffs of HTM are still far from well understood. To pave the way for more HTM-enabled processors, two crucial issues in HTM design must be addressed. The first is achieving high performance under frequent transaction conflicts. The second is designing energy-efficient HTM techniques. Invariably, both issues demand efficient communication during transaction execution. This dissertation contributes a set of hardware techniques to achieve efficient and scalable communication in such systems.

First, we contribute the Selective Eager-Lazy HTM system (SEL-TM) to leverage the concurrency and communication benefits of lazy version management while suppressing its corresponding complexity and overhead with eager management. The mixed-mode execution generates 22% less network traffic in high-contention workloads representative of upcoming TM applications, and improves performance by at least 14% over both a pure eager and a pure lazy HTM.
Second, we contribute Transactional Memory Network-on-Chip (TMNOC), an in-network filtering mechanism that proactively filters out pathological transactional requests that waste network-on-chip bandwidth. TMNOC is the first published HTM-network co-design. Experimental results show that TMNOC reduces network traffic by 20% averaged across the high-contention workloads, thereby reducing network energy consumption by 24%. The third proposal mitigates the disruptive coherence forwarding in transactional execution when the cache coherence protocol is reused for conflict detection. We address the problem with a Predictive Unicast and Notification (PUNO) mechanism. PUNO reduces transaction aborting by 43% on average and avoids 17% of the on-chip communication. Fourth, we propose Consolidated Conflict Detection (C2D), a holistic solution that addresses the communication overhead of conflict detection with cost-effective hardware designs. Evaluations show that the C2D technique, when used to implement eager conflict detection, can eliminate 39% of the on-chip communication, with a corresponding network energy saving of 27%.

Chapter 1
Introduction

The scientific and economic opportunities in advancing to exascale computing are so compelling that the numerous daunting challenges along the way must be addressed. According to a report compiled by the U.S. Department of Energy ASCAC subcommittee in 2010 [Sub10], achieving exascale systems requires a 500x boost in peak performance under the stringent constraint that the energy cost cannot go up by more than a factor of 3. It is therefore mission-critical to build high-performance and energy-efficient functional blocks (e.g., CPU, GPU, memory subsystem, interconnects) for next-generation computational systems. As the performance and power benefits of semiconductor technology scaling diminish, computer architects are presented with colossal challenges as well as abundant opportunities for radical innovation. In fact, the past decade has already witnessed a fundamental paradigm shift from uniprocessor to chip multiprocessor architectures. This emerging parallel architecture exposes extreme thread-level parallelism, a new performance opportunity to be exploited.

Leveraging thread-level parallelism entails the task of synchronizing concurrent accesses from threads to the shared memory, a well-known grand challenge in developing parallel applications. Traditionally, locks are used to enforce mutually exclusive execution of critical sections. However, programming with locks is challenging: coarse-grain locks limit performance while fine-grain locks increase complexity. Transactional Memory (TM) [Kni86, HEM93] is a programming paradigm that promises to simplify parallel programming by providing atomic and isolated execution of code blocks (i.e., transactions). Programmers are freed from the heavy burden of managing the correctness and forward progress of synchronization because the underlying TM system guarantees that transactions are executed correctly and obstruction-free. Furthermore, TM can offer higher performance due to its optimistic concurrency control, which allows non-conflicting threads to enter an atomic section simultaneously. TM systems can be implemented in hardware [TC08, HOF+12], in the software stack [SATH+06, HLMS03, HLM06] or with a hybrid approach [DFL+06, KCH+06, MTC+07].
Hardware TM (HTM) has a performance advantage over the other approaches and has been deployed in commercial microprocessors. My research focuses on HTM, as its tight integration with the evolving parallel architecture continues to produce ample research opportunities.

1.1 Problems and Motivation

The initial proposal of transactional memory in 1993 used custom hardware and an extended coherence protocol [HEM93]. However, the hardware approach was not immediately accepted; software approaches (i.e., STM) first gained considerable popularity. Since 2003, enabled by technology advances, HTM has gradually regained momentum in the research community. Extensive research on HTM has paved the way for commercial implementations: in 2012 and 2013, IBM and Intel debuted their respective processors with HTM features. In retrospect, the first landscape shift from STM to HTM was driven by reduced transistor cost, and the transition took a decade. It took another decade for HTM to be partially supported in products from major processor vendors. For the next decade, a question to be answered by the TM research community is: what will be the distinctive attributes of the next-generation HTM system?

We believe efficient on-chip communication will be a first-priority concern for next-generation HTM, because data movement will dominate the energy dissipation of future computational systems, as suggested by technology scaling trends [KDK+11]. Efficient on-chip communication sits at the juncture of high performance and energy efficiency. On the one hand, transactional systems depend on the communication fabric to fetch data and detect conflicts. As threads in transaction execution are more susceptible to serialization, memory-level parallelism is limited, making overall performance more sensitive to network latency. The criticality of TM traffic thus demands low-latency, high-bandwidth communication. On the other hand, HTM imposes an energy footprint on the network, since delivering inter-transaction communication dissipates energy in the routers as well as on the links. Because conflict detection requires frequent inter-transaction communication, HTM designs can have a huge impact on network energy consumption if not designed properly. Energy-efficient on-chip communication cannot be achieved without an in-depth understanding and optimization of the interaction between HTM and the network.

Besides performance and energy concerns, efficient communication in transactional systems also has non-negligible implications for the network's quality of service. Transactional applications inherently demand extensive core-to-core communication to exchange data and resolve conflicts between transactions. The bandwidth requirement grows rapidly with thread count and transaction granularity. Such aggressive bandwidth utilization can adversely affect the network's quality of service in use cases such as cloud computing and server consolidation. Mechanisms should be available to contain transactional communication when needed, so that the shared network resource is not monopolized.

Thus, we argue that HTM systems must achieve efficient on-chip communication to continuously improve the performance-per-joule metric of future microprocessors.

1.2 Thesis Contribution

This dissertation presents an analytical model for the communication cost in transactional systems and proposes a set of hardware techniques to reduce that cost.
While sharing this central theme, each technique addresses the problem from a unique aspect: the execution policy, HTM-network co-design, the coherence protocol, or the conflict detection mechanism. The contributions made in individual chapters are discussed below.

1.2.1 An Analytical Model for Communication Cost

In Chapter 3, we contribute a formal model to analyze the communication cost in a typical tiled CMP architecture. This model categorizes the cost into four types associated with different high-level behaviors in the system. For instance, the coherence cost is due to enforcing coherence in the cache hierarchy, while the conflict cost is incurred by request failures due to transaction conflicts. As this model identifies the key factors contributing to the cost, it provides insights into viable approaches for achieving efficient communication in transactional systems.

1.2.2 Selective Eager-Lazy HTM

In Chapter 4, we describe the Selective Eager-Lazy HTM (SEL-TM), which leverages the favorable characteristics of both eager and lazy HTM systems while mitigating their respective performance pathologies. By dynamically dividing the write set of a transaction into eagerly- and lazily-managed memory addresses, SEL-TM enables each transaction to manage highly-contended memory blocks lazily so as to improve concurrency and reduce conflict detection communication. As the rest of the memory blocks are managed eagerly, SEL-TM minimizes the impact of the slow commit in typical lazy systems and avoids the design complexity of handling transaction overflow. The mixed-mode execution generates 22% less network traffic in high-contention workloads representative of upcoming TM applications, and improves performance by at least 14% over both a pure eager and a pure lazy HTM.

With SEL-TM, we make four contributions. First, a novel HTM design is presented that supports simultaneous lazy and eager version management within a transaction. We demonstrate that this hybrid approach can increase concurrency and suppress inter-transaction communication with a small implementation overhead. Second, an efficient hardware scheme is described to profile dynamic transactions and discover critical conflict points at runtime. Third, an adaptive strategy is discussed and evaluated that selects the memory addresses in each dynamic transaction for lazy version management based on runtime profiling information. Fourth, a lightweight and generic RTL-level implementation of the hardware support for hybrid version management is proposed and synthesized, and a trade-off analysis is conducted based on the implementation.

1.2.3 In-Network Traffic Filtering for HTM

In Chapter 5, we introduce Transactional Memory Network-on-Chip (TMNOC), an HTM and Network-on-Chip (NOC) co-design that proactively filters out transactional requests that incur superfluous network traffic without contributing to continued transaction execution. First, a cost-effective communication mechanism is provided for the HTM and the on-chip network to exchange critical information about conflicts between transactions. Second, the on-chip routers are augmented to track the conflicts by communicating with the HTM. Enabled by these two mechanisms, the network filters out transactional requests that have a high probability of failing due to conflicts. Consequently, unnecessary on-chip communication initiated by those requests can be suppressed.
Experimental results show that TMNOC reduces network traffic by 20% averaged across the high-contention workloads, thereby reducing network energy consumption by 24%. The contributions of this work are threefold. First, we identify a largely unexplored design opportunity in the cross-layer optimization of HTMs and NOCs. To the best of our knowledge, our work is the first to investigate the interaction between HTMs and NOCs. Second, we describe TMNOC, a novel approach that exploits co-design of the HTM and NOC to regulate network traffic and streamline inter-transaction communication. Third, we evaluate TMNOC through extensive full-system simulations to demonstrate the ability of an HTM-NOC co-design to improve overall energy efficiency and performance.

1.2.4 Predictive Unicast and Notification

Chapter 6 contributes Predictive Unicast and Notification (PUNO), a lightweight hardware mechanism to mitigate the disruptive coherence forwarding that causes unnecessary transaction aborting. The disruptive forwarding is due to the tight coupling of the coherence protocol and the conflict detection mechanism: transactional write requests are always exhaustively multicast to all the nodes that read-share the requested data. The exhaustive multicast not only degrades network performance by consuming bandwidth but also causes unnecessary transaction aborting. PUNO mitigates this problem by replacing the multicast with a unicast to the high-priority sharer. Moreover, sharer transactions proactively notify any nacked requester of the time at which the requested cacheline will become available. The requesters can thus refrain from polling the sharers too frequently, further preventing unnecessary communication. PUNO reduces transaction aborting by 43% on average and avoids 17% of the on-chip communication.

With PUNO, our contributions are threefold. First, we identify an intrinsic mismatch between the coherence protocol and conflict detection that leads to sizable bandwidth waste and pathological transaction aborting behavior. Second, we propose PUNO, a novel hardware mechanism that can opportunistically replace the multicast of coherence requests from transactions with predictive unicast and notification. Third, we evaluate PUNO with full-system simulations to demonstrate its efficacy in improving network energy efficiency and overall performance.

1.2.5 Consolidated Conflict Detection

Chapter 7 contributes Consolidated Conflict Detection (C2D), a holistic microarchitecture proposal to minimize the bandwidth requirement of the conflict detection mechanism. Typical HTM designs (both research proposals and commodity implementations) require the home node to interrogate remote cores across the chip for conflict detection and resolution. This distributive nature of conflict detection is essentially the root cause of a variety of communication overheads that account for a considerably large fraction of network traffic. In C2D, we combat those overheads by imposing a logically consolidated conflict detection scheme: transactions send their requests to a logically central agent for conflict detection, removing the need to contact remote cores. Consequently, the root cause of numerous communication overheads is eliminated. The consolidation of conflict detection does not create a scalability bottleneck, because the logically central agent is physically distributed across the home nodes in a practical design.
Evaluations show that the C2D technique, when used to implement eager conflict detection, can eliminate 39% of the on-chip communication, with a corresponding network energy saving of 27%. With C2D, we make three contributions. First, we investigate the bandwidth utilization of conventional conflict detection mechanisms. Numerous inefficiencies are identified, which could become a fundamental limiting factor when deploying HTM on large CMPs. Second, we propose Consolidated Conflict Detection to minimize communication in conflict detection. To the best of our knowledge, this work is the first to address the traffic overhead of HTM conflict detection. Third, we evaluate the proposed technique using extensive full-system simulations to demonstrate its effectiveness.

1.3 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 presents foundational background on the design dimensions of HTM systems that are closely related to the subsequent research proposals. Chapter 3 describes an analytical approach to modeling the communication cost in typical HTM systems; the model provides high-level insights into viable avenues towards efficient communication in transactional systems. Chapter 4 develops the Selective Eager-Lazy HTM design. Chapter 5 develops the TMNOC design for proactively filtering transactional communication in the network. Chapter 6 develops the Predictive Unicast and Notification mechanism. Chapter 7 elaborates the Consolidated Conflict Detection scheme, which removes the communication overhead of conventional conflict detection mechanisms. Finally, Chapter 8 concludes this dissertation with a summary and reflection.

Chapter 2
Transactional Memory Background

Parallel programming with traditional synchronization mechanisms has long been a challenge due to the lack of abstraction and composition, two essential tools for managing the complexity of software engineering. In particular, the locking mechanism has been used for decades to guarantee mutually exclusive accesses to shared memory from more than one thread. However, the locking mechanism is problematic for the following reasons:

- Locking exposes the details of implementing a critical section (atomic section) to the programmer, which requires meticulous reasoning about the parallel program's behavior to avoid pitfalls such as deadlock and priority inversion, among others.
- Locking requires extreme care when used in composition. For instance, developers often need to understand the details of how a library is implemented with locks.

The concept of the transaction was introduced into the computer architecture community as a viable approach to tackling the programmability of the emerging chip multiprocessor architecture. In this chapter, the productivity problem of parallel programming is discussed first. Then, the basic concepts of transactional memory are introduced, and an extensive survey of hardware transactional memory designs is provided as the background of this research.

2.1 Parallel Programming: the Grand Challenge

The advent of multiprocessor architectures is regarded as an inflection point in mainstream software development because it forced developers to write parallel programs to fully exploit the underlying hardware. Writing parallel code is notoriously challenging, as developers are required to divide a monolithic job into concurrent tasks that must collaborate correctly as well as efficiently.
Exchanging data among parallel tasks/threads is crucial and, unfortunately, quite burdensome from the perspective of programmers. There are two primary models of data exchange: message passing and shared memory. The message passing model is considered the de facto standard for parallel computing in the high-performance computing community. In this model, the memory space is partitioned, and programmers control the data exchange explicitly by sending/receiving messages between concurrent tasks. In contrast, the shared memory model provides a logically unified global memory space, which is easier for a developer to reason about. Communication among tasks/threads occurs implicitly through memory load/store operations. It is expected that the shared memory multiprocessor architecture will be the mainstream in the next few years [MHS12].

Nonetheless, programming complexity is still a significant problem in writing shared-memory multithreaded programs. In particular, synchronizing accesses to shared data from concurrent threads is extremely burdensome and can easily introduce subtle errors. Mutual exclusion is sufficient for most forms of synchronization [MCS91]. Locks are typically used to construct mutually exclusive critical sections that ensure shared data is accessed by concurrent threads in a serialized fashion. However, lock-based synchronization has well-known problems. First, programmers face complex tradeoffs between programmability and performance: conservative coarse-grain locks ensure correctness with ease but lose performance by overly serializing the execution, whereas fine-grain locks allow more concurrency but are error-prone and require a non-trivial programming effort. Second, lock-based synchronization often incurs forward progress problems [RG02]. Programmers must reason about the locks carefully to avoid deadlock. Furthermore, locks can incur priority inversion and convoying, which also hinder a program's forward progress. These limitations of locks significantly reduce the productivity of programmers writing parallel code. New synchronization mechanisms are in great need to improve the programmability of chip multiprocessors while achieving a balance between productivity and efficiency (in terms of performance, energy, complexity, etc.).

2.2 Transactional Memory Basic Concept

A transaction is a sequence of memory accesses that either executes completely or has no effect at all [ATKS06]. With TM, programmers focus only on the question of where to enforce atomicity. Once the code blocks that require atomic execution have been identified, they are encapsulated into transactions by the programmer. That is all programmers need to do to write parallel code using the TM programming model; the underlying hardware or software TM system guarantees that data accesses from multiple transactions are synchronized correctly and efficiently. The transaction provides programmers with a proper abstraction by hiding the underlying details of how accesses to shared memory are synchronized. Although a transaction is similar to a critical section protected with locks, programmers are freed from the burden of reasoning about correctness. Moreover, programs using transactions are composable due to their non-blocking characteristic. The simple example in Figure 2.1 compares parallel programming using locks and transactions.
lock(timestamp);            TX_BEGIN {
lock(counter);                  *t = timestamp;
*t = timestamp;                 *r = counter++;
*r = counter++;             }
unlock(timestamp);
unlock(counter);

        (a)                         (b)

Figure 2.1: Examples of parallel programming using locks and transactions.

In Figure 2.1 (a), a thread obtains the locks on two shared variables before accessing them. Unfortunately, this simple piece of code can cause deadlock if two threads obtain the locks in reverse order. Using TM, as in Figure 2.1 (b), developers simply put the code that accesses the shared variables into a transaction and rely on the TM system to ensure the execution is deadlock-free and correct.

As noted in Herlihy and Moss's first introduction of TM as a new multiprocessor architecture [HEM93], transaction execution must satisfy two properties: serializability and atomicity. Serializability requires that transactions appear to execute in some serial order; the memory accesses of one transaction cannot be interleaved with those of another. Atomicity requires that a transaction either executes to completion or has no effect on the machine state whatsoever. In addition, the memory accesses of a transaction are not globally visible until the transaction commits. To support serializability and atomicity, TM systems require two key mechanisms: version management and conflict detection.

Version management maintains two data versions, namely the pre-transaction version and the speculative version. During the execution of a transaction, any data modified within the transaction has both versions co-existing in the system. The pre-transaction version is discarded when the transaction commits, while the speculative version is discarded when the transaction aborts. The two usual approaches to implementing version management are undo logs and buffered updates. In the first approach, the transaction propagates its modifications directly into memory while moving the corresponding pre-transaction versions into a log for potential rollback. In the second approach, the speculative data versions are buffered in a thread-private buffer and are not propagated into memory until the transaction commits. The first approach is referred to as eager version management, the second as lazy version management.

A conflict occurs when multiple transactions access the same data and at least one transaction attempts to modify it. Conflicts must be detected and resolved to ensure the semantic integrity of transactional execution. In order to detect a conflict, the TM system must keep track of the data accessed within each transaction. Every transaction is associated with a read set and a write set: the read set is the set of memory addresses from which the transaction has loaded data, and the write set is the set of memory addresses to which the transaction has stored data. A transactional load instruction causes the TM system to add the load address to the read set; a transactional store similarly adds the store address to the write set. In some TM implementations, the read and write sets may be populated with data identifiers other than memory addresses. When a conflict is detected, it must be resolved to maintain the "multiple-reader-single-writer" invariant. The detection and resolution can be performed either eagerly or lazily.
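To make the read/write-set bookkeeping concrete, the following minimal sketch shows how transactional loads and stores might populate the two sets at cacheline granularity. It is an illustration, not a mechanism from this dissertation: the types and helpers (addr_set_t, addr_set_insert, tx_load, tx_store) are hypothetical, and a real HTM would track the sets in cache tags or hardware signatures rather than in a software structure.

#include <stdint.h>

#define LINE_SHIFT 6   /* assume 64-byte cachelines */
#define SET_CAP    64  /* toy capacity; real designs spill or abort on overflow */

typedef struct { uintptr_t lines[SET_CAP]; int n; } addr_set_t;

typedef struct {
    addr_set_t read_set;   /* lines the transaction has loaded from */
    addr_set_t write_set;  /* lines the transaction has stored to */
} tx_t;

static void addr_set_insert(addr_set_t *s, uintptr_t line) {
    for (int i = 0; i < s->n; i++)
        if (s->lines[i] == line) return;   /* already tracked */
    s->lines[s->n++] = line;
}

/* A transactional load records the line in the read set before reading. */
uint64_t tx_load(tx_t *tx, const uint64_t *addr) {
    addr_set_insert(&tx->read_set, (uintptr_t)addr >> LINE_SHIFT);
    return *addr;
}

/* A transactional store records the line in the write set; where the new
   value is placed depends on the version-management policy (see 2.3.1). */
void tx_store(tx_t *tx, uint64_t *addr, uint64_t val) {
    addr_set_insert(&tx->write_set, (uintptr_t)addr >> LINE_SHIFT);
    *addr = val;   /* eager placement shown for brevity */
}

In these terms, a conflict is simply one transaction's write set intersecting another transaction's read or write set.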
Eager conflict detection processes conflicts progressively as a transaction executes, while lazy conflict detection defers detection to the end of the transaction. Depending on the TM implementation, conflicts can be detected at various levels of granularity: hardware TM implementations usually detect conflicts at cacheline granularity, while software TM implementations often detect conflicts at the level of a data object.

Conflict resolution used to be considered part of conflict detection, because the detection of a conflict is followed closely by the action that resolves it. However, conflict resolution has gradually evolved into a third design dimension besides version management and conflict detection, owing to its criticality to forward progress and performance. Conflict resolution is governed by a conflict resolution policy, which selects the victim transactions in a conflict and decides the proper action to take on them. The basic requirement for conflict resolution is to guarantee that the concurrent execution of multiple transactions is deadlock-free and livelock-free. Moreover, a good conflict resolution policy should also strike a balance among performance, fairness and efficiency.

TM systems can be implemented in hardware [RG02, AAK+05, MBM+06, YBM+07, HWC+04, CCC+07, RRW08, LMG09, LMG10], in software [SATH+06, HLMS03, HLM06], or in a hybrid approach [MTC+07, DFL+06, KCH+06, DCW+11]. Although TM was first introduced as a hardware mechanism, software TM (STM) initially saw greater popularity due to the lack of hardware support for TM in commodity systems. STM implements all the transactional semantics in software, so its main advantage is flexibility: it does not require any hardware support. Its major disadvantage, however, is performance, due to the inevitable overhead of the software runtime that provides the basic TM functionality. STM has also made compromises to the transactional semantics [HPST06, CBM+08]. Due to these limitations of STM, Hardware TM (HTM) has gained increasing momentum. Research on HTM has largely focused on performance [MBM+06, YBM+07, CCC+07, LMG09, LMG10, BRM10, NTGA+12], implementation issues [JTV10, SYHS07, BDLM07], transaction scheduling [YL08, BDM09, BDM11], and hardware-software interplay [SDS08, RHL05, RHP+07]. These efforts have paved the way for HTM to be present in commodity systems [TC08, HOF+12, JSG12, YHLR13]. Nonetheless, HTM has several limitations. First, it often incurs considerable implementation and execution overhead. Second, HTMs are inherently inefficient at supporting unbounded transaction execution. Several research proposals therefore pursue a hybrid approach to implementing TM. A basic design principle of hybrid transactional memory systems is to execute transactions using the hardware mechanism and fall back to software if the hardware mechanism fails.
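As an illustration of this hardware-first, software-fallback principle (not a mechanism proposed in this dissertation), the sketch below wraps a critical section with Intel's RTM intrinsics and retreats to a conventional lock after repeated hardware aborts. The retry count is arbitrary, and a production-quality version would also need to read the fallback lock inside the transaction so that the hardware and software paths remain mutually exclusive.

#include <immintrin.h>   /* _xbegin/_xend; compile with -mrtm */
#include <pthread.h>

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

void atomic_region(void (*critical_section)(void)) {
    for (int attempt = 0; attempt < 3; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            critical_section();   /* runs as a hardware transaction */
            _xend();              /* commit */
            return;
        }
        /* hardware transaction aborted; status encodes the cause */
    }
    pthread_mutex_lock(&fallback_lock);   /* software fallback path */
    critical_section();
    pthread_mutex_unlock(&fallback_lock);
}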
2.3 Hardware Transactional Memory and Its Taxonomy

Hardware transactional memory implements version management, conflict detection and conflict resolution in hardware. An augmented cache is the most popular approach to accommodating the pre-transaction and speculative data versions [HEM93, MBM+06, HOF+12]. Conflict detection usually piggybacks on the cache coherence protocol to minimize implementation overhead [HEM93, MBM+06, CCC+07]. Conflict resolution can be implemented in hardware or in a software contention manager [SIS05].

The design space of HTM can roughly be divided into three dimensions: version management, conflict detection and conflict resolution. These three dimensions are used below to classify the various HTM implementations.

2.3.1 Eager vs. Lazy Version Management

In an eager version management HTM, memory is updated immediately when a transaction issues a store request. In the meantime, the pre-transaction data version is moved into a software-managed log [MBM+06], a dedicated hardware buffer [LMG08] or another cacheline [LMG09]. To ensure atomicity, the update to memory must not be visible to the other concurrent transactions, so eager version management HTMs usually resort to pessimistic concurrency control to prevent the speculative memory block from being accessed by other transactions. Transactions commit by simply discarding the pre-transaction versions, since the new data versions already exist in memory; commits are therefore fast under eager version management. However, when a transaction aborts, the pre-transaction versions have to be restored in memory. This operation often requires software intervention, imposing a latency penalty, so transactions abort slowly in HTMs adopting eager version management.

In a lazy version management HTM, the transactional updates to memory are buffered in the cache [LMG10] or in a dedicated hardware buffer [HWC+04]. The buffered updates are not propagated to memory until the transaction commits, and during execution the speculative updates are not visible to other concurrent transactions. Therefore, unlike eager version management HTMs, HTMs using lazy version management need not enforce pessimistic concurrency control on transactional updates. This distinction between the eager and lazy approaches is crucial to performance, implementation and the design choice for conflict detection. When a transaction commits, the buffered speculative versions must become globally visible to all concurrent transactions. Traditionally, lazy version management suffers a long commit latency, one of the more significant performance pathologies of HTM systems. Committing multiple transactions in parallel under lazy version management is implemented in [CCC+07, TPK+09, NTGA+12], at the cost of non-negligible hardware overhead and implementation complexity. Since the pre-transaction versions remain in memory, no rollback is incurred when a transaction aborts.
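The minimal sketch below contrasts the two version-management policies for a single transactional store. It merely restates the tradeoff described above (eager: fast commit, slow abort; lazy: fast abort, slow commit); the structures and fixed capacities are hypothetical simplifications, not a design from this dissertation.

#include <stdint.h>

#define CAP 64  /* toy capacity; a real HTM spills or aborts on overflow */

typedef struct { uint64_t *addr[CAP]; uint64_t old_val[CAP]; int n; } undo_log_t;
typedef struct { uint64_t *addr[CAP]; uint64_t new_val[CAP]; int n; } write_buf_t;

/* Eager: update memory in place and log the pre-transaction version. */
void eager_tx_store(undo_log_t *log, uint64_t *addr, uint64_t val) {
    log->addr[log->n] = addr;
    log->old_val[log->n++] = *addr;  /* save pre-transaction version */
    *addr = val;                     /* speculative version goes to memory now */
}

/* Eager abort (slow path): restore pre-transaction versions in reverse. */
void eager_tx_abort(undo_log_t *log) {
    while (log->n > 0) { log->n--; *log->addr[log->n] = log->old_val[log->n]; }
}
/* Eager commit is fast: just discard the log (log->n = 0). */

/* Lazy: buffer the update privately; memory is untouched until commit. */
void lazy_tx_store(write_buf_t *buf, uint64_t *addr, uint64_t val) {
    buf->addr[buf->n] = addr;
    buf->new_val[buf->n++] = val;    /* speculative version stays private */
}

/* Lazy commit (slow path): drain the buffer into memory. */
void lazy_tx_commit(write_buf_t *buf) {
    for (int i = 0; i < buf->n; i++) *buf->addr[i] = buf->new_val[i];
    buf->n = 0;
}
/* Lazy abort is fast: just discard the buffer. */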
2.3.2 Eager vs. Lazy Conflict Detection

As discussed, a conflict occurs when multiple transactions access the same data and at least one transaction attempts to modify the data. Eager conflict detection detects conflicts progressively, on each individual memory access, as the transaction issues it. It is pessimistic in that it assumes conflicts are frequent. A majority of eager conflict detection mechanisms are implemented on top of the cache coherence protocol and detect conflicts at the granularity of a cacheline; any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost [HEM93]. Since every memory access from a transaction must be communicated to all potentially conflicting transactions, the communication overhead of eager conflict detection can be substantial.

Figure 2.2: Examples of eager conflict detection using a coherence protocol.

Figure 2.2 depicts eager conflict detection using the MESI (Modified, Exclusive, Shared, Invalid) directory protocol. The requester transaction issues a GETX (request for exclusive access) to the directory (step 1), which replies to the requester with data (step 2). The directory state of the block is set to busy (i.e., incoming requests to the same block are blocked). Then, the request is forwarded to all the nodes currently sharing the block (step 3). Depending on the outcome of conflict detection and resolution, the sharing transactions respond with either a NACK (negative acknowledgement) or an ACK (step 4). Once it has received all the responses, the requester sends an UNBLOCK message to the directory to conclude the request (step 5). If all the responses are ACKs, the requester transaction continues executing. If one of the responses is a NACK, the requester transaction usually stalls and keeps retrying the request until all the high-priority sharer transactions have finished executing. In what follows, a transaction that sends a NACK message is called a nacker transaction, or simply a nacker.

Lazy conflict detection defers conflict detection to the commit phase of transactions, optimistically assuming that conflicts are rare. The committing transaction must broadcast its write set to all concurrent transactions so that conflicting transactions can be aborted. Broadcasting the entire write set once avoids the frequent, cacheline-sized, latency-sensitive coherence communication incurred by eager conflict detection, thereby potentially improving tolerance to network latency. However, broadcasting can easily become a scalability bottleneck.

2.3.3 Reactive vs. Proactive Conflict Resolution

Traditionally, TM systems react to conflicts once they have occurred and been detected. In particular, a conflict between two transactions is resolved reactively by stalling or aborting one of the transactions in favor of the other. Rajwar and Goodman [RG02] proposed a timestamp mechanism to decide which transaction wins a conflict.
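The sketch below illustrates the flavor of such a timestamp policy: the older transaction (smaller timestamp) wins, and the younger one stalls or aborts. It paraphrases the oldest-wins idea rather than reproducing the exact mechanism of [RG02]; the field and function names are illustrative.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t timestamp;  /* assigned at transaction begin; smaller = older */
    bool     aborted;
} tx_t;

/* Resolve a conflict between a requester and a current holder of the data.
   Returns true if the requester may proceed. */
bool resolve_conflict(tx_t *requester, tx_t *holder) {
    if (requester->timestamp < holder->timestamp) {
        holder->aborted = true;  /* requester is older: abort the younger holder */
        return true;
    }
    return false;                /* requester is younger: it is nacked and retries */
}

If a transaction keeps its original timestamp across retries, it eventually becomes the oldest transaction in the system, which is how such schemes avoid livelock.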
This timestamp mechanism and its variations have been adopted in multiple contemporary HTM implementations [MBM+06, YBM+07, LMG09, LMG10, WGW+12]. Scherer and Scott [SIS05] propose a cluster of policies to guide reactive conflict resolution in STM. Among those policies, the Karma policy decides transaction priorities based on the amount of work done so far by each transaction. Another flavor of reactive conflict resolution attempts to preserve the work of both transactions in a conflict: DATM [RRW08] and LEVC [PB09] allow data to be forwarded between uncommitted transactions so that conflicting transactions can commit safely, and RETCON [BRM10] attempts to repair output data based on remotely modified input data instead of stalling or aborting the reader transactions. However, due to the complexity of tracking the symbolic link between input and output data, the RETCON mechanism is used mainly for auxiliary data accessed by short, simple transactions.

Proactive conflict resolution essentially focuses on scheduling transactions so as to reduce conflicts and improve performance. Yoo and Lee [YL08] proposed Adaptive Transaction Scheduling (ATS), which throttles concurrent transaction execution and serializes those transactions that are highly susceptible to conflicts. Blake et al. [BDM09] proposed Proactive Transaction Scheduling (PTS), which dynamically identifies thread pairs that are likely to conflict and enforces a more appropriate schedule to avoid the conflicts. Blake et al. [BDM11] later proposed Bloom-Filter Guided Transaction Scheduling (BFGTS), which employs a Bloom filter to capture the memory access locality of an individual transaction and uses that information to schedule conflict-free transaction execution.

Almost all HTM implementations must provide reactive conflict resolution for the correctness of transaction execution. Proactive conflict resolution is optional, but often desired in a high-performance HTM.

2.3.4 Eager vs. Lazy HTM

Depending on the design choices for version management and conflict detection, contemporary HTM implementations can be categorized into two main types: eager HTM, which adopts an eager approach to both version management and conflict detection, and lazy HTM, which adopts a lazy approach to both. Neither performs better universally across all applications. They differ in the following aspects:

- Concurrency
- Wasted computation
- On-chip bandwidth utilization
- Requirements on conflict resolution

Concurrency: Eager HTM must apply pessimistic concurrency control to transactions' write sets in memory so that other concurrent transactions cannot access the speculative data versions. As a result, multiple reader transactions can be serialized behind a writer transaction. Since lazy HTM updates memory only at commit, the writer transaction cannot serialize the reader transactions: if the readers reach the commit phase before the writer, all of the transactions can commit, increasing overall throughput. Figure 2.3 illustrates one example of the multiple-reader, single-writer scenario. In this example, the lazy HTM has committed three transactions by the time the writer transaction commits, whereas the eager HTM allows only the writer transaction to commit during the same period. As will be discussed later, the producer-consumer sharing pattern in some benchmarks benefits greatly from the increased concurrency of lazy HTM.

Figure 2.3: Lazy HTM vs. Eager HTM on a multiple reader and single writer scenario.

Wasted computation: Although lazy HTM has higher concurrency, it incurs more discarded transaction computation as a side effect, because potentially conflicting transactions can execute side by side for a longer time before the conflict is finally discovered and resolved when one of them reaches the commit phase. Eager HTM can reduce discarded transaction progress through early detection of conflicts [SD09], and is especially effective in handling multiple writer transactions. Figure 2.4 illustrates one such example: the computation done by the three subsequent writers is discarded in the lazy HTM, whereas the eager HTM incurs no wasted work because the potentially conflicting writers are stalled promptly.

Figure 2.4: Lazy HTM vs. Eager HTM on a multiple writer scenario.

On-chip bandwidth utilization: HTM relies on low-latency on-chip communication for conflict detection, usually in the form of coherence messages. In eager HTM, a transaction must notify peer transactions of its memory accesses whenever it issues a memory request, so its performance is more sensitive to network latency.
Moreover, the number of messages can be large. On the other hand, lazy HTM broadcasts a transaction's entire write set at the commit phase. Combining all the memory accesses can improve tolerance to network latency, since the number of messages on the network is reduced. The lazy approach to conflict detection therefore tends to consume less network bandwidth than the eager approach.

Conflict resolution: Eager HTMs typically require sophisticated conflict resolution schemes to achieve forward progress and high performance. In lazy HTM, by contrast, conflict resolution is less critical: the committer-win policy, in which the committing transaction always wins the conflict, is sufficient to achieve forward progress and good performance.

2.4 Contemporary HTM Designs

In this section, we describe some contemporary HTM designs that are representative of the state of the art.

2.4.1 Lazy HTM Designs

Transactional Coherence and Consistency (TCC) [HWC+04] is proposed as a new programming model for shared memory multiprocessors. The novelty of this model stems from the fact that atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. The premise is that parallel programming can be greatly simplified by reasoning about concurrency control and coherence at the granularity of a transaction instead of individual memory instructions.

TCC is a typical lazy HTM design using lazy version management and lazy conflict detection. It buffers speculative modifications from transactions in the L1 cache. A victim buffer is used to keep track of speculative accesses in the case of cache overflow. The speculative modifications are not visible to the rest of the system until the transaction commits. To track the memory references made by transactions, each cacheline is augmented with a read bit and a modified bit. The read bit is set when the corresponding cacheline is read by a transaction; similarly, the modified bit is set when the corresponding cacheline is modified by a transaction. When a write committed by another processor is received and the write address matches the address of a local cacheline with its read bit set, a conflict between the local transaction and the remote committing transaction is detected. TCC forces the local transaction to abort in favor of the committing transaction. Once a transaction is ready to commit, it must arbitrate system-wide for permission to propagate its speculative modifications into the memory hierarchy. After winning the arbitration, the transaction combines all its speculative writes into a single packet and broadcasts the packet to every node in the system. Broadcast ensures that any transaction that conflicts with the committing transaction is aborted.

Figure 2.5: Architecture of a processor with TCC support [HWC+04].

TCC offers three main advantages. First, it removes the need for a conventional cache coherence protocol that maintains coherence for every load/store request; instead, coherence is maintained at the granularity of a transaction. Therefore, the frequent arbitration and synchronization in conventional protocols are avoided, leading to a simplified hardware design. Second, TCC buffers speculative modifications in the private L1 cache, so no dedicated hardware is needed to support lazy version management.
Third, TCC combines the writes from a transaction and forwards them to the rest of the system in a single packet. Therefore, the network traffic generated by transactions can be significantly reduced.

Nonetheless, the scalability of TCC is limited by the system-wide commit arbitration and the commit-time broadcast. These operations are expensive and will incur a considerable latency penalty in future multiprocessors that are expected to accommodate hundreds or even thousands of processing cores. Furthermore, TCC serializes the commit phases of different transactions (no support for parallel commit), which drastically reduces transaction throughput. To overcome the scalability problem of TCC, Scalable TCC [CCC+07] was proposed. Scalable TCC leverages directories to provide several optimizations to TCC. First, multiple transactions are allowed to commit in parallel as long as they do not conflict at a directory. Second, a write-back protocol can be used, as the directory can redirect requests for the new data produced by a committed transaction to the L1 cache. Third, broadcast is removed, because the directory multicasts invalidations to the sharers.

More recently, EazyHTM [TPK+09] was proposed to further reduce the commit-time problems in TCC and Scalable TCC. The novelty of EazyHTM is that it decouples conflict detection from conflict resolution. In particular, EazyHTM performs eager conflict detection using the directory protocol and a dedicated core-to-core on-chip interconnect. During the execution of a transaction, it keeps track of conflicting transactions in a list. However, conflicts are not resolved until the conflicting transactions are ready to commit. The committing transaction first sends invalidation messages to the potentially conflicting transactions on its list through the core-to-core interconnect. Once it has received all the acknowledgements, the committing transaction makes its speculative writes globally available. If a transaction arrives at the commit point and finds its list of conflicting transactions empty, it can simply skip the notification phase and publish its speculative writes. This approach has three main advantages. First, it increases the concurrency of transaction execution by allowing more potentially conflicting transactions to execute simultaneously. Second, it allows multiple non-conflicting transactions to commit in parallel. Third, the pathological cascading waiting of eager conflict detection is avoided. Nonetheless, EazyHTM's scalability is still limited by the notification phase at commit time. Despite a fast core-to-core interconnect, sending notifications to hundreds of cores in a future kilo-core chip multiprocessor will incur a substantial latency. Besides, every first read and first write to any cacheline accessed by a transaction is effectively a miss in the private L1 cache for the purpose of correctness, so workloads with small transactions and infrequent conflicts can be burdened unnecessarily. Pi-TM [NTGA+12] was proposed to combat this problem. It augments the L1 cache with a single bit per line to track conflicts at the granularity of a cacheline. When a transaction aborts, the conflicting cachelines in its read set are invalidated in addition to the cachelines in its write set. Consequently, only the potentially conflicting cachelines are re-fetched from either the L2 cache or remote transactions, thereby improving performance.
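As an illustration of the lazy, commit-time conflict check that TCC-style designs perform, the sketch below tests a committing transaction's broadcast write set against the read bits of another node's L1 lines. The fixed table size and the modulo line index are simplifications of a real tag match, and none of this is the actual TCC hardware; aliasing in this sketch can only create false conflicts, which hurts performance but never correctness.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Illustrative sketch of TCC-style commit-time conflict detection: each L1
// line carries a read bit and a modified bit, a committing transaction
// broadcasts its write set, and every other node checks the broadcast
// addresses against its locally read lines.
constexpr size_t kNumLines = 512;   // lines tracked per L1 (assumption)

struct L1TxBits {
    std::bitset<kNumLines> readBit;      // set on transactional read
    std::bitset<kNumLines> modifiedBit;  // set on transactional write
};

size_t lineIndex(uint64_t addr) { return (addr >> 6) % kNumLines; }  // 64B lines

// Receiver side: true if the local transaction must abort because the
// committing transaction's write set intersects the local read set.
bool mustAbortOnCommitBroadcast(const L1TxBits& local,
                                const std::vector<uint64_t>& committedWrites) {
    for (uint64_t addr : committedWrites)
        if (local.readBit.test(lineIndex(addr)))
            return true;   // remote committed write to a line we read
    return false;
}
```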
2.4.2 Eager HTM Designs

LogTM [MBM+06] is a representative eager HTM that has gained great popularity. A running transaction propagates its speculative writes to the memory hierarchy and uses a software-managed log in virtual memory to hold the pre-transaction states of the cachelines that have been modified in the transaction. The use of a log to buffer pre-transaction states is the most distinguishing feature of LogTM. As the old data is not held in any level of the memory hierarchy, LogTM can simply handle the overflow of speculative writes by evicting the overflowing modifications out of the L1 cache. Consequently, LogTM does not require complex hardware mechanisms to handle overflow. LogTM piggybacks on the directory-based cache coherence protocol for eager conflict detection. Any memory access from a requesting transaction is forwarded to all the current sharers by the directory. Upon receiving the forwarded request from the directory, the transaction on the sharer node invokes a conflict detection procedure that checks whether the incoming request conflicts with the transaction's read or write set. To track the read set and write set, LogTM adds one read bit and one write bit to the tag of each cacheline, similar to TCC. If the receiving transaction detects a conflict, it has two choices to resolve it: 1) send an acknowledgement (ACK) message to the requester and abort itself, or 2) send a negative acknowledgement (NACK) message to the requester and continue executing. The conflict resolution policy makes this decision.

LogTM Signature Edition (LogTM-SE) [YBM+07] decouples version management and conflict detection from the L1 cache by tracking a transaction's read set and write set with hardware signatures. A signature uses hashing to encode a transaction's memory access information [SYHS07]. In order to track a transaction's read set and write set, two signatures are needed: a read signature and a write signature. Transactional load addresses are encoded into the read signature, while transactional write addresses are encoded into the write signature (see Figure 2.6a). When the transaction receives a request from another transaction, the same hash functions are applied to the memory address of the incoming request, and the result is matched against the receiving transaction's read signature and write signature to detect conflicts (see Figure 2.6b).

Besides LogTM-SE, there are several other signature-based TM designs. Ceze et al. [CTTC06] also proposed signature-based memory address disambiguation. Chi Cao Minh et al. [MTC+07] used signatures to support conflict detection and provide strong isolation for STM. Choi and Draper [CD11] proposed to unify the read signature and write signature into a single signature to reduce hardware complexity and power dissipation. Although a unified signature scheme can introduce read-read conflicts, their effect is shown to be insignificant and can be removed by using a small assistant signature.

Figure 2.6: Operation of hardware signature to support conflict detection: (a) a local load or store inserts its address into the read or write signature; (b) an incoming request is tested against both signatures, and a conflict is declared on a match.
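The following is a minimal sketch of such a hash-based signature, essentially a small Bloom filter over cacheline addresses. The signature size and the two hash functions are arbitrary illustrative choices, not the parameters of LogTM-SE or [SYHS07].

```cpp
#include <bitset>
#include <cstdint>

// Minimal sketch of a hardware-style read/write signature: a Bloom filter
// over cacheline addresses in the spirit of LogTM-SE.
class Signature {
    static constexpr size_t kBits = 1024;
    std::bitset<kBits> bits_;

    // Two cheap hashes over the cacheline address (addr >> 6 for 64B lines).
    static size_t h0(uint64_t line) { return (line * 0x9E3779B97F4A7C15ULL) % kBits; }
    static size_t h1(uint64_t line) { return ((line >> 7) ^ line) % kBits; }

public:
    void insert(uint64_t addr) {
        uint64_t line = addr >> 6;
        bits_.set(h0(line));
        bits_.set(h1(line));
    }
    // May return false positives (false conflicts) but never false negatives,
    // which preserves correctness at some performance cost.
    bool mayContain(uint64_t addr) const {
        uint64_t line = addr >> 6;
        return bits_.test(h0(line)) && bits_.test(h1(line));
    }
    void clear() { bits_.reset(); }  // at transaction begin/commit/abort
};

// Conflict test for an incoming request: a remote read conflicts with our
// write signature; a remote write conflicts with either signature.
bool conflicts(const Signature& rd, const Signature& wr,
               uint64_t addr, bool remoteIsWrite) {
    return wr.mayContain(addr) || (remoteIsWrite && rd.mayContain(addr));
}
```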
Sanchez et al. [SYHS07] provided a thorough performance and trade-off analysis of signature implementation techniques.

As both LogTM and LogTM-SE buffer the pre-transaction version in the log, these systems can incur long abort recovery times, because rollback is handled by software that restores the pre-transaction states into the memory hierarchy. To combat this problem, which is general to any log-based HTM, FASTM [LMG09] uses a novel cache protocol to buffer transactional modifications in the first-level cache and to keep the non-speculative values in the higher levels of the memory hierarchy. Consequently, abort recovery is accelerated substantially, as it does not require software intervention to restore pre-transaction values from the log, provided that the speculative modifications do not overflow the L1 cache. Meanwhile, FASTM also keeps a log of the pre-transaction values so that overflow can be handled gracefully by walking the log in a similar fashion to LogTM. However, FASTM requires re-engineering the cache coherence protocol to provide fast abort recovery, which might be too expensive to be adopted.

2.4.3 Hybrid Policy HTM Designs

Besides pure eager and pure lazy designs, HTMs that provide the flexibility to decide conflict detection and version management policies at runtime have been proposed recently to reap more performance benefit by matching workload characteristics with the optimal hardware execution mode. DynTM [LMG10] can execute transactions in either lazy mode or eager mode. In the lazy (eager) mode, transactions use lazy (eager) version management and lazy (eager) conflict detection. To allow the most concurrency, DynTM executes a transaction in lazy mode by default. However, the transaction is aborted and re-executed in eager mode if its speculative modifications overflow the L1 cache. DynTM implements a Transaction Mode Selector to record transactions that frequently abort or overflow in lazy mode and to select eager mode for those transactions. ZEBRA [TGNA+11] provides policy flexibility at the granularity of cachelines. The hardware keeps track of conflicting cachelines during transaction execution. Cachelines that experience conflicts are managed lazily: speculative modifications to those cachelines are redirected to a dedicated hardware buffer and committed to memory at the end of the transaction. ZEBRA requires one more bit (the C-bit) to be added to the L1 cacheline tag. If a conflict is detected on a cacheline, its C-bit is set, and the C-bit is not reset until the cacheline is evicted. Any cacheline with its C-bit set is managed lazily.

2.5 Summary

Previous research on HTM has paved the way for HTM being implemented in several commodity processors, and the three design dimensions, namely version management, conflict detection, and conflict resolution, have been explored. Nonetheless, it is observed from the evaluation results of the latest HTM implementations that the performance overhead of HTM under frequent conflicts is still substantial. More research should focus on mitigating the negative impact of transaction conflicts on performance. Furthermore, as energy has become a crucial constraint in chip multiprocessor design as well as in exascale machine design, energy-efficient HTM is the next logical step in the evolution of HTM designs. Therefore, improving the performance-per-joule of HTM designs forms the crux of the problem this research attempts to tackle.
The next three chapters discuss the research outcomes and proposals towards high-performance and energy-efficient HTM design.

Chapter 3

An Analytical Model for Communication Cost

This chapter describes a model for analyzing communication costs in transactional systems. The model categorizes costs based on the high-level operations and events in the system, such as cache coherence and transaction conflicts. Thus, it can provide valuable insight into viable approaches for reducing communication cost by optimizing those operations and events. The model is generic enough to be applicable to general CMP architectures beyond CMPs with HTM support.

3.1 Categorization of On-Chip Communication

On-chip communication in transactional systems is mainly attributed to data transfer and the corresponding control. Ideally, the on-chip communication of a transactional data request would be the round-trip cost of transferring data between the home node and the requesting node. Here, a home node is defined as the node that either caches the requested data or hosts the memory controller that fetches the data from off-chip memory. This cost is termed the inherent cost, which is the minimal cost for a processing element to fetch data into its private memory (e.g., L1 cache). A uniprocessor achieves this minimal cost trivially. However, for shared memory CMPs, the cache coherence protocol usually incurs extra on-chip communication that is necessary for correctness. In a directory-based coherence protocol, when the data is not shared by any other cores, the home node can respond to requestors without initiating further communication; in this case, the data request incurs only the inherent cost. Otherwise, if the data is shared (owned), the home node has to ask the sharers (owner) to invalidate their private copies and respond to the requestor. The communication between the home node and the sharers (owner) constitutes the coherence cost, which is essential for achieving cache coherence. In general, the inherent cost and coherence cost are incurred in almost all shared memory CMPs.

In a transactional system, multiple requests from a thread to shared memory are packed into a transaction (a chunk of requests) that is executed atomically and in isolation from other transactions. Transactional requests mainly take the form of coherence requests with extra information fields in the message. As with plain coherence requests, transactional requests incur the inherent cost and the coherence cost. In addition, extra on-chip communication is required. This additional cost is termed the transactional cost, which includes the conflict cost, the squash cost and the utility cost.

The conflict cost is incurred by transactional requests that fail due to conflicts between the requestors and other concurrent transactions in the system. Each failed request incurs communication to the home node and to the sharer nodes. In HTM systems that allow transaction stalling, a request can fail multiple times before eventually succeeding in obtaining data access permission, thereby making the conflict cost a multiple of the inherent cost and coherence cost combined.

The squash cost is the aggregate communication cost of transactions that are aborted. Data accesses in each transaction incur inherent, coherence and conflict costs in the process of requesting data.
When the transaction aborts due to conflicts, the associated costs contribute to the overall squash cost, which is essentially a waste of network bandwidth.

Figure 3.1: Categorization of communication in a transactional system (inherent, coherence, conflict and squash costs).

The utility cost is the communication required to facilitate the mechanisms of a particular HTM design. This cost is highly dependent on the specific system. For instance, Scalable TCC [CCC+07] uses a specific message type for the committing transaction to request a commit ID and to probe the target directory. In EazyHTM [TPK+09], the abort messages from a committing transaction to the peer transactions on its racer list are another example of the utility cost. Because the quantification of the utility cost must be based on a particular HTM system of interest, the subsequent cost analysis focuses mainly on the conflict and squash portions of the transactional costs.

3.2 Relation of Different Types of Communication Cost

Figure 3.1 summarizes the different types of communication costs in a typical transactional system. For an arbitrary coherence request from an arbitrary transaction, the minimal communication cost is the inherent cost. Its coherence cost depends on how many nodes share (own) the requested data block, because the home node must forward the request to the sharers (owner) for invalidation according to the coherence protocol. If the request is rejected by at least one of the remote nodes due to conflicts, the request is regarded as failed, converting all of its costs into conflict cost. Thus, a successful request's communication cost includes the conflict cost associated with its failed instances plus the inherent and coherence costs of the last instance, which succeeds in obtaining the data. However, if the dynamic transaction instance issuing the successful request eventually aborts, the cost associated with the request is counted as squash cost. Otherwise, if the transaction commits, the request is considered materialized, because the machine state changes associated with the request are made non-speculatively visible. The cost of a materialized transactional request includes the squash costs of any transaction aborts plus the inherent, coherence and conflict costs associated with the successful request in the committed transaction. Overall, the inherent and coherence costs of a request are converted into conflict cost if the request fails; the inherent, coherence and conflict costs of a request are converted into squash cost if the transaction issuing the request aborts. The total communication cost of a materialized request is the sum of its inherent, coherence, conflict and squash costs.

3.3 Quantification of Communication Cost

The communication cost can be quantified in terms of network hop count. The hop count is a proper abstraction of the underlying communication fabric and a clear indicator of the latency and energy of the communication. To decouple the generic analysis from any specific thread layout while still capturing the impact of the communication fabric, we use the average hop count $h$ as the basic unit to measure node-to-node communication cost. Thus, the inherent cost of a request is

$$C_{inherent} = 2h \quad (3.1)$$

The coherence cost covers the forwarding from the home node and the corresponding responses to the requestor, which can be calculated as

$$C_{coherence} = n_{fwd} \cdot 2h \quad (3.2)$$

where $n_{fwd}$ is the number of nodes to which the home node forwards the request.
The conflict cost is calculated as

$$C_{conflict} = \sum_{i=1}^{n_{retry}} \left( C_{inherent} + C_{coherence(i)} \right) \quad (3.3)$$

where $n_{retry}$ is the number of request retries before the request is satisfied successfully without conflict. The coherence cost of each retry depends on the current retry number, as the set of nodes that must receive the forwarding from the home node changes as execution proceeds.

The squash cost is calculated as

$$C_{squashed} = \sum_{j=1}^{n_{restart}} \left( C_{inherent} + C_{coherence(j)} + C_{conflict(j)} \right) \quad (3.4)$$

where $n_{restart}$ is the number of restarts a transaction incurs before it commits. Both the coherence and conflict costs depend on the current restart count, because $n_{fwd}$ and $n_{retry}$ vary as execution proceeds.

The overall communication cost incurred by a transactional data request can therefore be calculated as

$$C_{total} = 2h + n_{fwd} \cdot 2h + \sum_{i=1}^{n_{retry}} \left( 2h + n_{fwd(i)} \cdot 2h \right) + \sum_{j=1}^{n_{restart}} \left( 2h + n_{fwd(j)} \cdot 2h + \sum_{i=1}^{n_{retry(j)}} \left( 2h + n_{fwd(i)} \cdot 2h \right) \right) \quad (3.5)$$

If we assume that $n_{fwd}$, $n_{retry}$ and $n_{restart}$ are independent variables, the cost function can be reduced to

$$C_{total(reduced form)} = (1 + n_{fwd})(1 + n_{retry})(1 + n_{restart}) \cdot 2h \quad (3.6)$$

3.4 Key Factors to Reduce Communication Cost

The quantitative analysis of the communication cost of transaction execution identifies the key factors that contribute to the cost. The cost can thus be controlled by acting on the individual factors.

$n_{fwd}$: The number of nodes that receive the coherence forwarding from the home node is determined by the set of nodes currently sharing the requested data block. The sharing degree is specified mainly by the parallel program, and there are well-established software techniques to reduce it without compromising performance, such as privatization and sub-blocking. On the hardware side, $n_{fwd}$ can be reduced especially in conflict detection, where not all sharers need to receive the forwarding as long as one sharer can handle the conflict properly. Our proposed PUNO technique in Chapter 6 exemplifies a hardware design that reduces $n_{fwd}$.

$n_{retry}$: The number of request retries before obtaining data access permission is largely determined by two factors. The first is the transaction characteristics: if the transactions in an application conflict frequently, $n_{retry}$ increases, and coarse-grain transactions run longer, forcing other transactions to retry more times. The second is the conflict detection and resolution mechanism of the HTM system. For instance, if conflicts are detected lazily and resolved with the committer-win policy, the committing transaction's requests always succeed without the need to retry. Another example is fine-grain backoff after a request is rejected.

$n_{restart}$: The number of transaction restarts before committing is, similar to $n_{retry}$, determined by both the transaction characteristics of the application and the HTM design. In general, transaction aborts should be avoided when possible, because they are detrimental to performance and energy efficiency.

$h$: The average hop count is determined by the topology of the underlying communication fabric. Reducing the hop count not only shortens communication latency but also saves energy, as fewer routers and links need to be traversed. Researchers resort to richly-connected, low-diameter networks to decrease average hop counts; two representatives of such designs are the Flattened Butterfly [KDA07] and Multidrop Express Channels [GHKM09].
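For concreteness, the short program below evaluates the reduced-form model of Equation (3.6) for a hypothetical request; the parameter values are invented solely to show how the four factors compound multiplicatively.

```cpp
#include <iostream>

// A small sketch that evaluates the reduced-form cost model of Eq. (3.6).
// The parameter values below are made up purely for illustration; they are
// not measurements from any system in this thesis.
double reducedTotalCost(double avgHops, double nFwd, double nRetry, double nRestart) {
    // C_total = (1 + n_fwd)(1 + n_retry)(1 + n_restart) * 2h
    return (1.0 + nFwd) * (1.0 + nRetry) * (1.0 + nRestart) * 2.0 * avgHops;
}

int main() {
    double h = 4.0;  // assumed average hop count of, e.g., a mesh
    // Conflict-free request with two sharers: only inherent + coherence cost.
    std::cout << reducedTotalCost(h, 2, 0, 0) << " hops\n";   // 24
    // Same request with 3 retries and 1 restart: cost grows multiplicatively.
    std::cout << reducedTotalCost(h, 2, 3, 1) << " hops\n";   // 192
    return 0;
}
```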
Subsequent chapters introduce several hardware techniques that reduce the communication cost in HTM systems by acting on different subsets of the key factors identified above.

Chapter 4

Selective Eager-Lazy HTM

Hardware Transactional Memory (HTM) systems implement version management and conflict detection in hardware to guarantee that each transaction executes atomically and in isolation. The choice of version management and conflict detection mechanisms has a significant impact on performance. Individual design choices also have different communication costs, as they have a first-order impact on the $n_{retry}$ and $n_{restart}$ terms of the communication cost model presented in Chapter 3. In general, HTM implementations fall into two categories, eager systems and lazy systems. Due to this inflexibility, neither eager nor lazy systems always perform better than the other over a wide range of transactional workloads [LMG10]. On one hand, eager systems facilitate fast transaction commit but suffer performance pathologies such as cascading stalls and frequent aborts due to their pessimistic concurrency control on transactional writes. On the other hand, lazy systems can exploit more concurrency, at the potential cost of more wasted work and commit-time performance pathologies such as commit serialization. In addition, lazy conflict detection utilizes network bandwidth more efficiently, as a transaction refrains from contacting other transactions for conflict detection until it reaches the commit phase.

It is observed in a wide range of TM applications that more than 55% of transaction aborts are due to conflicts on a small set of memory blocks (three memory blocks, in particular). Based on this observation, we argue that, by managing only a small portion of a transaction's write set lazily, an eager HTM system can achieve the same level of concurrency and communication benefit as lazy systems while avoiding the lazy systems' performance pathologies and implementation overhead.

We propose Selective Eager-Lazy HTM (SEL-TM) to improve performance and reduce communication cost by taking advantage of the favorable characteristics of both eager and lazy HTM systems while mitigating their respective performance pathologies. By dynamically dividing the write set of a transaction into eagerly- and lazily-managed memory addresses, SEL-TM enables each transaction to manage highly-contended memory blocks lazily for increased concurrency and to manage the rest of the memory blocks eagerly for less waste and an accelerated commit process. With SEL-TM, we make the following contributions:

- A new HTM design is presented that supports simultaneous eager and lazy version management within a transaction. We demonstrate that this hybrid eager-lazy approach increases performance and reduces on-chip communication.
- An efficient hardware scheme is described that profiles every dynamic transaction and discovers critical conflict points at runtime.
- An adaptive strategy is discussed and evaluated that selects the memory addresses in each dynamic transaction for lazy version management based on runtime profiling information.

In Section 4.1, we analyze the benchmark characteristics that motivate the proposed work. It is first shown how eager HTM's pessimistic concurrency control on speculative writes can limit the number of concurrent transactions. Then, it is demonstrated that conflicts among transactions typically concentrate on a small set of data blocks.
In Section 4.2, the detailed SEL-TM design is presented, focusing on version management, conflict detection and resolution, and commit arbitration. In Section 4.3, the dynamic mechanism that selects memory hot spots for lazy management and tailors the management policy at the granularity of a transaction is presented, together with the hardware implementation of the key SEL-TM structures for overhead analysis; the RTL implementation is synthesized to obtain the area and power cost of deploying SEL-TM on a computing chip. In Section 4.4, the experimental methodology and results are shown, demonstrating that SEL-TM achieves better average performance than two baseline HTMs that use fixed policies for version management and conflict detection.

4.1 Motivation

We study the STAMP benchmark suite [MCKO08], which is widely used to evaluate TM designs. It includes eight scientific and commercial workloads. Details about the benchmarks and their input sets are provided in Section 4.4.

Lazy HTMs can exploit more concurrency than eager systems. Eager HTM systems propagate speculative data versions to memory immediately when the memory access requests are issued. Consequently, pessimistic concurrency control is enforced on the speculatively modified data in memory. Unfortunately, pessimistic concurrency control limits concurrency, as other transactions are stalled or aborted when they try to access memory blocks that have been modified speculatively. We measure the average position of transactional writes with respect to the length of the dynamic transactions that issue each store; the result is shown in Figure 4.1. The earlier a transactional write is issued within a transaction, the longer the corresponding data block is locked, so other transactions can be stalled longer, further reducing concurrency.

Figure 4.1: Average store position with respect to the transaction length.

The number of conflicts on each memory block is counted in all eight benchmarks. A conflict can incur a transaction stall or abort; here, only the conflicts that cause transaction aborts are counted. Figure 4.2 shows the percentage of transaction aborts due to conflicts on the three most-contended memory blocks in each workload. It is observed from Figure 4.2 that transaction conflicts tend to concentrate on a small set of memory blocks in most workloads. In particular, 97% of the transaction aborts in Bayes are caused by conflicts on only three memory blocks. This characteristic is common in shared memory programs, since threads spend most of their time working on their private data sets and update global data structures less frequently. Based on this observation, we argue that it is feasible for an eager system to harvest a concurrency and communication benefit similar to that of lazy systems by managing a small set of highly-contended memory blocks lazily. As managing the conflict hot spots in memory lazily does not enforce pessimistic concurrency control on those memory blocks, more concurrent transactions can access the data simultaneously, potentially allowing more transactions to commit.
Although applying lazy version management and conflict detection to less-contended memory locations could further increase concurrency, the benefit is quite marginal, since fewer transactions conflict on those memory locations.

Figure 4.2: Normalized number of transaction aborts on the three most contended memory blocks.

Weighing the limited concurrency benefit against the cost of maintaining the speculative states for these less-contended memory locations (more hardware overflow and longer commit latency), it is more efficient to manage such memory locations eagerly. The lazy management of a bounded set of memory blocks not only increases concurrency but also reduces the overhead stemming from lazy systems, such as long commit latency and overflow of speculative states.

4.2 The Static SEL-TM Design

Figure 4.3 shows an overview of the proposed SEL-TM, which is based on an eager system similar to LogTM [MBM+06].

Figure 4.3: Architectural overview of SEL-TM (SEL-TM-specific modules in circled frames).

The log-based approach of SEL-TM provides eager version management for most memory addresses in a transaction's write set. However, SEL-TM also uses a gated store buffer to provide lazy version management for a selected set of memory addresses that are highly contended among transactions. Before a transaction begins, the SEL Manager, a distributed hardware scheme, makes two decisions: first, whether the transaction should selectively manage a set of memory addresses lazily or simply adopt pervasive eager version management; second, which addresses are to be lazily managed in the transaction. The selected lazy addresses are stored in the Lazy Address List (LAL), while the LAL Signature summarizes these addresses. Both the LAL and the LAL Signature are empty if the transaction is to be executed in a pure eager mode. During transaction execution, the Lazy Store Buffer (LSB) holds speculative modifications to the lazily managed memory addresses. Before commit, the transaction must procure a commit token if it has buffered speculative state in the LSB; otherwise, the transaction can commit in parallel with other committing transactions. If a system supports multithreading, all the SEL-TM modules should be regarded as part of the thread state and treated as such during thread swaps. The following paragraphs describe SEL-TM's key mechanisms in detail.

Version Management: SEL-TM supports both eager and lazy version management at the granularity of a cacheline. On the one hand, SEL-TM follows the log-based approach to eagerly manage most memory blocks in a transaction's write set. Speculative values are stored immediately into the memory hierarchy, and pre-transaction values are kept aside in a cacheable log that can be used by a software handler during rollback. To prevent other transactions from accessing a speculative value, its address is added to the transaction's write set to ensure isolation; the transaction can then detect conflicts upon receiving offending requests. On the other hand, SEL-TM applies lazy version management to a selected set of memory addresses that are determined by the SEL Manager for each dynamic transaction instance. In the rest of this chapter, the addresses of these memory blocks are referred to as lazy addresses, and a deferred memory request to a lazy address is referred to as a lazy request. Speculative modifications to lazy addresses are stored aside in the LSB while the pre-transaction state remains in memory. Subsequent stores from the same dynamic transaction to the same lazy address overwrite the existing buffered value in program order, and subsequent loads from the same dynamic transaction get the value forwarded from the buffer. Since the number of lazy addresses is chosen so as not to exceed the capacity of the LSB, the lazily-managed speculative modifications cannot overflow the hardware buffer. As a speculative modification is effectively isolated in the LSB, the transaction does not need to add the store address into its write set to maintain isolation on the data; therefore, offending requests do not trigger conflicts with the buffered store. At commit time, the buffered value is written to memory. Meanwhile, the directory forwards the lazy request to potentially conflicting threads, and the store is globally performed once all conflicts are resolved. SEL-TM's control flow for processing transactional loads and stores is illustrated in Figure 4.4.

Figure 4.4: The control flow of processing transactional loads/stores in SEL-TM.
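The sketch below illustrates this bi-modal store path under the two-entry LAL/LSB configuration evaluated later in this chapter. The structure and member names are hypothetical simplifications (word-granularity data instead of full cachelines, and a stubbed eager path), not RTL from the actual design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical sketch of SEL-TM's bi-modal store path: stores whose cacheline
// address is in the Lazy Address List (LAL) are redirected to the Lazy Store
// Buffer (LSB); all other stores take the eager, log-based path.
struct LsbEntry {
    bool     valid = false;
    uint64_t lineAddr = 0;
    uint64_t data = 0;            // real hardware buffers a full cacheline
};

struct SelTmCore {
    std::array<uint64_t, 2> lazyAddressList{};  // lazy cacheline addresses
    std::array<LsbEntry, 2> lsb{};

    bool isLazy(uint64_t lineAddr) const {
        return lineAddr == lazyAddressList[0] || lineAddr == lazyAddressList[1];
    }

    // Transactional store: buffer lazily, or fall back to the eager path.
    void txStore(uint64_t lineAddr, uint64_t value) {
        if (isLazy(lineAddr)) {
            for (auto& e : lsb) {
                if (!e.valid || e.lineAddr == lineAddr) { // allocate or overwrite
                    e = {true, lineAddr, value};          // NOT added to write set
                    return;
                }
            }
        }
        eagerStore(lineAddr, value);
    }

    // Transactional load: forward from the LSB if a buffered store exists;
    // otherwise the load is issued to memory eagerly.
    std::optional<uint64_t> lsbForward(uint64_t lineAddr) const {
        for (const auto& e : lsb)
            if (e.valid && e.lineAddr == lineAddr) return e.data;
        return std::nullopt;
    }

    void eagerStore(uint64_t /*lineAddr*/, uint64_t /*value*/) {
        // Elided: log the pre-transaction value, write the new value into the
        // memory hierarchy, and add the address to the write set.
    }
};
```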
Conflict Detection: SEL-TM piggybacks on the conventional cache coherence protocol for conflict detection. The cache directory forwards coherence requests to any potentially conflicting transactions. A receiving transaction matches the destination address of an incoming request against its read/write sets to detect conflicts eagerly. Conflict detection on transactional stores to lazy addresses is deferred, since such stores are redirected to the LSB and their destination addresses are not added to the write set. Conflict detection on the buffered lazy requests is performed at commit time, when the transaction propagates the deferred stores to memory and the directory forwards the lazy requests to other threads. Throughout the commit phase, the destination addresses of buffered store requests are included in the write set to ensure isolation on the speculative states in memory. Transactional loads, on the other hand, are issued to memory eagerly regardless of whether their destination addresses are lazy, and load addresses are eagerly added to the read set. So conflict detection on transactional loads is always performed eagerly in SEL-TM.

Conflict Resolution: Once a conflict is detected, SEL-TM resolves it eagerly based on a modified timestamp policy. Due to the simultaneous bi-modal execution, eager and lazy requests co-exist: the issue of a store request is postponed until the commit phase if its memory address is in the Lazy Address List, otherwise a store request is issued eagerly, and load requests are always issued eagerly. SEL-TM distinguishes between eager and lazy requests for better performance, even though this is not required for correctness. Scherer and Scott [SIS05] proposed to assess the amount of work a transaction has completed when deciding whether to abort it. SEL-TM gives lazy stores a higher priority over eager requests because lazy stores are issued at the end of a transaction, and it would be a significant waste to abort an "almost finished" transaction. So a transaction issuing a lazy store request always stalls or aborts any other conflicting transaction. Since only the committing transaction can issue lazy stores and the commit policy allows only one committing transaction at a time, conflicts between transactions that issue lazy store requests to the same address are not possible. Conflicts between transactions that eagerly access the same memory address are resolved using a basic timestamp policy. Table 4.1 summarizes SEL-TM's conflict resolution policy. To distinguish between eager and lazy requests, we augment memory request messages with one bit. This modification is minimal, since no extra cacheline state or coherence actions are required.

Table 4.1: Summary of conflict resolution policy in SEL-TM

                          LD Req / Eager ST Req    Lazy ST Req
  LD Req / Eager ST Req   Older Tx wins            Tx with Lazy ST wins
  Lazy ST Req             Tx with Lazy ST wins     Not possible
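A compact way to express the policy in Table 4.1 is the following hypothetical resolution function; it assumes timestamps where smaller means older, as in the basic timestamp policy.

```cpp
#include <cstdint>

// Sketch of the resolution rule in Table 4.1: a lazy store (issued by the
// committing transaction) beats any eager request; between two eager
// requests, the older timestamp wins. Names are illustrative.
enum class Winner { Requester, Local };

Winner resolveConflict(bool requesterIsLazyStore, uint64_t requesterTs,
                       bool localIsLazyStore, uint64_t localTs) {
    // Two lazy stores to the same address cannot conflict: only one
    // transaction holds the commit token at a time.
    if (requesterIsLazyStore) return Winner::Requester;  // committing tx wins
    if (localIsLazyStore)     return Winner::Local;
    // Both eager: basic timestamp policy, older (smaller timestamp) wins.
    return (requesterTs < localTs) ? Winner::Requester : Winner::Local;
}
```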
Since only the committing transaction can issue lazy stores and the commit policy requires only one committing transaction at a time, it is not possible to have con- flicts between transactions that issue lazy store requests to the same address. Conflicts between transactions that eagerly access the same memory address are resolved using a basic timestamp policy. Table 4.1 summarizes SEL-TM’s conflict resolution policy. To distinguish between eager and lazy requests, we augment memory request messages with one bit. This modification is minimized since no extra cacheline state or coherence actions are required. Commit Arbitration: SEL-TM allows transactions that have not written to the LSB to commit in parallel. However, for those transactions that have buffered their specula- tive states in the LSB, a globally unique commit token must be obtained by a transaction before entering the commit phase. Requests to the commit token from different trans- actions are processed on a first-in, first-served basis. The committing transaction is given a higher priority in conflict resolution to ensure forward progress. This commit scheme preserves more concurrency since it minimizes the number of transactions issu- ing destructive lazy requests and transactions aborting after having issued lazy requests. A traditional lazy system has performance pathologies due to commit arbitration and validation. Although SEL-TM’s commit procedure shares some similarities with some lazy systems in this regard, the performance overhead is much smaller. The underlying 46 reason for lazy systems’ pathologies at commit time is the long latency to propagate the entire write set to memory. However, SEL-TM has an accelerated commit phase because only a small portion of the write set is lazily managed. So the latency to flush buffered speculative modifications is greatly reduced. 4.3 The Dynamic SEL-TM Design SEL-TM requires the write set of a transaction be divided into eagerly- and lazily- managed addresses based on the contention rate of each address. Since highly contended memory blocks in a workload are not always self-evident, both static and dynamic identification methods merit investigation. The static approach requires that either the programmer assumes the responsibility to annotate “hot spot”, or program analysis is performed in advance. Placing the burden on programmers increases the difficulty of writing TM code, which goes against the TM objective of easing parallel programming. On the other hand, transactional application profiling remains an open research area. Zyulkyarov et al. have demonstrated the effectiveness of various profiling techniques on conflict point discovery [ZSH + 10], and we expect a full-fledged profiling framework for TM applications in the near future. Nevertheless, some conflicts are implementation- dependent. So the static approach might miss conflicts which only occur in specific TM implementations. One such example is a false conflict, which stems from a coarser data granularity that the underlying TM implementation uses to detect conflicts. The dynamic approach discovers the highly contended memory blocks at runtime with built-in mechanisms that monitor the status of transaction executions. In this sec- tion, we describe the SEL Manager, a distributed hardware mechanism that dynamically identifies the lazy addresses for each transaction based on the contention rate collected at runtime. 
It is also responsible for determining whether a dynamic transaction should be executed with selective eager-lazy or pervasive eager version management. We first present the architectural overview of the SEL Manager, then explain the underlying mechanisms, and finally discuss the policies that support the decision making.

4.3.1 Architectural Overview

The SEL Manager consists of three major functional modules: Conflict Counting, Transaction Profiling and the Lazy Address Selector. The Conflict Counting module maintains a list of the highly contended memory addresses, from which the lazy addresses are selected based on a "most-used-recently" policy. Meanwhile, the Transaction Profiling module profiles the dynamic instances of each static transaction. The transaction profile is used by the Lazy Address Selector to decide whether a dynamic transaction qualifies for eager-lazy management. Figure 4.5 shows an overview of the SEL Manager.

Figure 4.5: Architectural overview of the SEL Manager.

4.3.2 Conflict Counting

The two hardware structures of the Conflict Counting module are shown in Figure 4.5. The History Information Table (HIT) stores the addresses of the most contended memory blocks observed during past execution. The Transaction Status Register (TSR) collects information for the currently running transaction in order to update the HIT when the transaction commits.

In addition to a memory block address, every HIT entry has a valid bit, which indicates whether the entry contains valid information, and an 8-bit saturating usage counter. The usage counter is incremented when a committed transaction has stored to the corresponding address. It is important to have a usage counter for each address, since it provides history information as well as enabling replacement upon HIT overflow. HIT overflow occurs when the HIT is full and a new address gets promoted from the TSR; the HIT entry with the smallest usage counter is evicted, and in the case of multiple qualified victim entries, the victim is selected randomly. The algorithm to update the HIT is described in Figure 4.6.

The TSR collects execution statistics for the current dynamic transaction. To provide an accurate estimation of the contention rate on each memory block that a transaction has modified, each TSR entry has, besides the store address, two 8-bit saturating counters, rcounter and kcounter. While rcounter counts the number of remote coherence requests to the corresponding address, kcounter counts the number of remote transactions killed (aborted) when the local transaction issues a lazy store request to that address. All the entries in the TSR are reset before a transaction begins. When the transaction issues a store, the store address is added to the TSR. When the transaction receives a coherence request to an address that is found in the TSR, the corresponding rcounter is incremented. When an ACK message from a remote conflicting transaction is received after the local transaction issues a lazy store request, the corresponding kcounter is incremented. At commit time, any TSR entry that has the maximum rcounter value and a sufficiently small kcounter value is promoted to the HIT as a lazy address candidate. The rationale behind this promotion criterion is to select the most contended memory address while conservatively avoiding addresses that might cause too many aborts once lazily managed. The algorithm to update the TSR is described in Figure 4.6.

Figure 4.6: Algorithm to update the HIT and TSR.
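The sketch below captures the gist of this bookkeeping in software form: TSR entries accumulate rcounter values, and at commit the most contended address is promoted into a HIT with least-used replacement. The sizes, the kcounter threshold and the tie-breaking are illustrative assumptions rather than the exact hardware algorithm of Figure 4.6.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Software rendering of the Conflict Counting bookkeeping (illustrative).
struct TsrEntry { uint64_t addr; uint8_t rcounter = 0, kcounter = 0; };
struct HitEntry { bool valid = false; uint64_t addr = 0; uint8_t usage = 0; };

struct ConflictCounting {
    std::vector<TsrEntry> tsr;       // cleared when a transaction begins
    std::vector<HitEntry> hit{8};    // 8-entry HIT (the recommended config)

    void onTxStore(uint64_t addr) {              // track a transactional store
        for (auto& e : tsr) if (e.addr == addr) return;
        tsr.push_back({addr});
    }
    void onRemoteRequest(uint64_t addr) {        // remote coherence request
        for (auto& e : tsr)
            if (e.addr == addr && e.rcounter < 255) { e.rcounter++; return; }
    }
    // At commit: promote the max-rcounter entry unless its lazy stores
    // killed too many peer transactions (kcounter guard).
    void onCommit(uint8_t kThreshold) {
        auto hot = std::max_element(tsr.begin(), tsr.end(),
            [](const TsrEntry& a, const TsrEntry& b) { return a.rcounter < b.rcounter; });
        if (hot == tsr.end() || hot->kcounter >= kThreshold) return;
        for (auto& h : hit)                       // already present: bump usage
            if (h.valid && h.addr == hot->addr) { if (h.usage < 255) h.usage++; return; }
        auto victim = std::min_element(hit.begin(), hit.end(),  // least-used replaced
            [](const HitEntry& a, const HitEntry& b) {
                return (a.valid ? a.usage : 0) < (b.valid ? b.usage : 0);
            });
        *victim = HitEntry{true, hot->addr, 1};
    }
};
```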
To avoid TSR overflow, its size should match the write set of the largest transaction. There is, however, an apparent tension between limiting TM-dedicated hardware resources and the future trend of coarser transactions with large write sets. To improve its scalability, the TSR records a store address only upon receiving a remote request to that address. As observed in our experiments with the STAMP benchmarks, an optimized 32-entry TSR never overflows. Furthermore, a hybrid TSR scheme can always direct any TSR overflow to a software-managed per-transaction log.

4.3.3 Transaction Profiling

The Transaction Profiling Table (TPT) keeps track of the length of every static transaction in clock cycles by averaging the lengths of its dynamic instances. As shown in Figure 4.5, each TPT entry stores the memory address of the TxBegin instruction, which uniquely identifies a static transaction (XactID), the average length of that transaction (AvgXactLen) and the total number of dynamic instances executed so far (NumExed). Since SEL-TM timestamps every transaction, the length of a dynamic instance (CurXactLen) is calculated by subtracting its timestamp from the cycle time at which it commits. The new average length is then calculated using the following formula:

$$AvgXactLen_{new} = \frac{AvgXactLen_{old} \cdot NumExed + CurXactLen}{NumExed + 1} \quad (4.1)$$

The TPT must also scale with future workloads. We can resort to a software-managed TPT in the case of hardware TPT overflow. The performance overhead of this hybrid solution is marginal, as the operation to update the TPT is not on the critical path. The only timing requirement is that the TPT must be updated before the next transaction begins, since the Lazy Address Selector needs the transaction length to decide whether the transaction should be executed with eager-lazy management.
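As a concrete illustration of Equation (4.1), a software analogue of the TPT update might look as follows; the map-based table and method names are assumptions for illustration only.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of the TPT running-average update of Eq. (4.1), keyed by the
// address of the TxBegin instruction (XactID). A plain map stands in for
// the fixed-size hardware table.
struct TptEntry {
    uint64_t avgXactLen = 0;   // average length in cycles
    uint64_t numExed = 0;      // dynamic instances executed
};

class TransactionProfilingTable {
    std::unordered_map<uint64_t, TptEntry> table_;
public:
    // Called at commit: curXactLen = commit cycle - transaction timestamp.
    void onCommit(uint64_t xactId, uint64_t curXactLen) {
        TptEntry& e = table_[xactId];
        e.avgXactLen = (e.avgXactLen * e.numExed + curXactLen) / (e.numExed + 1);
        e.numExed++;
    }
    uint64_t avgLength(uint64_t xactId) const {
        auto it = table_.find(xactId);
        return it == table_.end() ? 0 : it->second.avgXactLen;
    }
};
```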
4.3.4 Lazy Address Selector

The Lazy Address Selector is essentially a decision-making engine that integrates information from the underlying Conflict Counting and Transaction Profiling mechanisms. Before each transaction begins, it decides whether to execute the transaction with eager-lazy management and which memory addresses should be lazily managed. These two decisions are made according to two policies. The first policy requires that short transactions be executed with pure eager version management and all other transactions with eager-lazy version management, since deferring the stores in a short transaction might not increase concurrency. The second policy requires that, if a transaction qualifies for eager-lazy management, the address with the maximum usage counter value in the HIT be selected for lazy version management. With this policy, we exploit temporal locality as well as the aggregated history of transaction execution.

While selecting a lazy address from the HIT by comparing usage counters is straightforward, identifying a short transaction requires a careful definition of "short". Since the length of transactions varies across applications, we do not attempt an absolute delimitation between short and long transactions. Our proposed approaches are based on the average length of each static transaction recorded by the Transaction Profiling module. Alternative formulas to identify short transactions are provided below.

Formula 1.0: The length of a short transaction that should use pure eager management must satisfy the following condition:

$$XactLen < \frac{XactLen_{longest} + XactLen_{shortest}}{2} \quad (4.2)$$

Here $XactLen_{shortest}$ and $XactLen_{longest}$ are the average lengths of the shortest and longest static transactions, respectively, from the Transaction Profiling Table. The right-hand side of Formula 1.0 is referred to as $XactLen_{avg}$ in the following discussion.

It is observed that some workloads have a few transactions that are more than two orders of magnitude longer than their peer transactions. To prevent an oversized transaction from disqualifying other mid-size transactions for selective eager-lazy management, two alternative formulas are proposed.

Formula 1.1: The length of a short transaction that should use pure eager management must satisfy the following condition:

$$XactLen < \min(STT,\; XactLen_{avg}) \quad (4.3)$$

Formula 1.2: The length of a short transaction that should use pure eager management must satisfy the following condition:

$$XactLen < \begin{cases} STT, & \text{if } XactLen_{longest} / XactLen_{shortest} > 100 \\ XactLen_{avg}, & \text{otherwise} \end{cases} \quad (4.4)$$

The Small Transaction Threshold (STT), which is pre-determined, is an absolute upper bound on the length of a short transaction. These formulas are designed to be computationally simple to avoid long-latency operations. Their performance impact is evaluated and discussed in the next section.
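The three tests translate directly into simple predicates, sketched below with hypothetical names; the STT value is a free parameter (the evaluation later uses 200 cycles).

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative predicates for the short-transaction tests above.
// 'stt' is the pre-determined Small Transaction Threshold in cycles.
bool isShort_f10(uint64_t len, uint64_t shortest, uint64_t longest) {
    return len < (longest + shortest) / 2;                 // Formula 1.0
}
bool isShort_f11(uint64_t len, uint64_t shortest, uint64_t longest, uint64_t stt) {
    return len < std::min(stt, (longest + shortest) / 2);  // Formula 1.1
}
bool isShort_f12(uint64_t len, uint64_t shortest, uint64_t longest, uint64_t stt) {
    // Formula 1.2: fall back to the absolute bound only when one transaction
    // dwarfs the others by two orders of magnitude.
    bool skewed = shortest > 0 && longest / shortest > 100;
    return len < (skewed ? stt : (longest + shortest) / 2);
}
```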
4.4 Evaluation

4.4.1 Methodology

The baseline machine in our simulation is a 16-core chip multiprocessor. Each core has private L1 I- and D-caches, both write-back. All cores share a unified 8MB L2 cache and 4GB of main memory. Memory coherence is maintained by a directory-based MESI protocol with sticky states. The detailed system configuration is listed in Table 4.2.

Table 4.2: System Configuration
  Core:         16 cores, in-order, CPI = 1
  L1 Cache:     split, 32KB, 4-way associative, latency = 1 cycle
  L2 Cache:     unified, 8MB, 8-way associative, latency = 32 cycles
  Memory:       4GB, latency = 500 cycles
  Interconnect: 2D mesh, link latency = 3 cycles

The GEMS simulator [MSB+05] and the SIMICS infrastructure [MCE+02] are used to simulate hardware support for transaction execution. Each core is augmented with register checkpointing, read/write signatures, transaction logging and the SEL-TM hardware components, including a Lazy Store Buffer, a Lazy Address List and a SEL Manager, among others. Perfect read/write signatures are used to exclude false conflicts. We use the STAMP [MCKO08] benchmark suite to evaluate the performance of SEL-TM. Table 4.3 summarizes the input parameters, which are typical for transactional memory evaluations.

Table 4.3: Benchmark Input Parameters and Characteristics
  Benchmark  Input Parameters                  Tx Time  Abort Rate  r-set   w-set  Avg Tx Length
  Bayes      32 var, 1024 records, 2 edge/var  90.58%   96.53%      93.19   50.37  125945
  Intruder   2k flow, 10 attack, 4 pkt/flow    75.83%   56.08%      7.47    3.50   1490
  Labyrinth  32*32*3 maze, 96 paths            99.92%   98.72%      145.71  92.28  373619
  Yada       1264 elements, min-angle 20       99.69%   38.63%      52.96   31.61  35854
  Genome     32 var, 1024 records              79.07%   3.06%       32.39   3.32   4792
  Kmeans     16K seg. 256 gene. 16 sample      7.12%    4.69%       6.23    1.75   306
  SSCA2      8k nodes, 3 len, 3 para edge      10.22%   0.33%       2.99    1.99   131
  Vacation   16K record. 4K req. 60% coverage  97.38%   46.24%      64.63   18.62  17002

For the analysis, we compare our design with two HTM systems with fixed version management and conflict detection policies. In addition, SEL-TM itself has two alternatives, static and dynamic SEL-TM. The four simulated systems are listed below.

Eager-Eager system (EE): LogTM-SE is used as the simulated eager system. It uses a perfect signature to track the read and write sets. Timestamps are used to resolve conflicts and avoid cyclic dependences between transactions.

Lazy-Lazy system (LL): The simulated lazy system uses an infinite-size buffer for speculative states and a committer-win policy for conflict resolution. This system uses a commit-token policy similar to that of SEL-TM to exclude any commit-time performance pathologies not present in SEL-TM.

Static SEL-TM (SEL-TM-S): The static alternative of SEL-TM requires that the lazy addresses be specified in advance by a programmer. Every transaction must apply lazy management to the specified memory blocks while managing the remaining addresses eagerly. Although SEL-TM can manage multiple lazy addresses within each transaction, we demonstrate the performance impact of a SEL-TM configuration that manages up to two lazy addresses per transaction. As a result, the LSB and the Lazy Address List have two entries each. Each entry of the LSB can store one cacheline of data in addition to the 32-bit address field used for associative search, and each entry of the Lazy Address List stores a cacheline address that is to be managed lazily. Future work will explore the impact of more lazy addresses.

Dynamic SEL-TM (SEL-TM-D): This SEL-TM alternative employs the SEL Manager to identify the lazy addresses at runtime. The SEL Manager also determines whether a transaction should be executed with eager-lazy version management. It uses a 32-entry TSR and a 200-cycle Small Transaction Threshold.

4.4.2 Performance Analysis of Static SEL-TM

Figure 4.7 compares the performance of static SEL-TM with the simulated eager and lazy HTM systems.

Figure 4.7: Static SEL-TM execution time, normalized (LL, EE, SEL-TM-S; high- and low-contention workloads).

While no HTM system performs better than the others across the whole range of applications, static SEL-TM performs better in seven of the eight applications. On average, static SEL-TM improves performance by 9.5% over the EE system and by 25.6% over the LL system. Considering SEL-TM's low implementation overhead, its performance impact strongly justifies its deployment.

Static SEL-TM delivers almost the same performance as the eager system on Labyrinth, while both are outperformed by the lazy system. After analyzing Labyrinth's source code, we find that a static transaction loads the entire global maze grid and, right before it commits, stores to a small portion of the grid. So the store from one dynamic instance of that transaction always conflicts with other instances' read sets. Consequently, deferring the store request to commit time changes the commit order but cannot improve performance by allowing more transactions to commit; as a result, static SEL-TM performs only as well as the eager system. The lazy system, on the other hand, outperforms the other two because (1) aborts are the common case in Labyrinth and (2) the lazy system aborts faster. Several optimizations to Labyrinth have been proposed, such as early release and privatization, that would allow SEL-TM to better exploit concurrency for higher performance. The transaction behavior in Yada is quite different from that in Labyrinth.
One of the two main global data structures in Yada is a task queue that contains objects waiting to be processed. A transaction reads an object from the task queue, processes the object and conditionally stores an object back to the task queue. Since only a portion of the transactions store an object back while the others only read the queue, deferring the store request allows more read-only transactions to commit, thus increasing concurrency as well as performance.

To assess the effectiveness of SEL-TM in increasing concurrency, the normalized count of transaction aborts due to conflicts on lazy addresses is shown in Figure 4.8. It is observed that, on average, SEL-TM avoids 72% of the aborts on the lazy addresses by deferring conflict detection on transactional stores to these addresses. Therefore, more potentially conflicting transactions can execute concurrently and ultimately commit.

Figure 4.8: Transaction abort count on the top-3 conflict hot spots in memory, normalized (EE, SEL-TM-S).

4.4.3 Performance Analysis of Dynamic SEL-TM

The overall performance of dynamic SEL-TM is presented in Figure 4.9.

Figure 4.9: Dynamic SEL-TM execution time, normalized (LL, EE, SEL-TM-D; high- and low-contention workloads).

Dynamic SEL-TM achieves more than 20% performance improvement on average over either pure eager or pure lazy HTM in workloads with high and medium contention rates. As future workloads tend toward coarser transactions with higher contention rates, dynamic SEL-TM is readily equipped to achieve higher performance. On the other hand, SEL-TM performs as well as LogTM in benchmarks with low contention (Genome and SSCA2 spend considerable amounts of execution time in barrier synchronization; Kmeans spends less than 10% of its total cycle time in transactions). Workloads with low contention benefit less from the additional concurrency of SEL-TM, as the objective of SEL-TM is to mitigate conflicts among transactions for higher performance.

In particular, dynamic SEL-TM outperforms static SEL-TM as well as the fixed-policy HTMs under the producer-consumer sharing model, which is very common in parallel workloads. One such workload is Intruder in the STAMP benchmark suite. In Intruder, most conflicts occur between two transactions: one producer and one consumer. The producer transaction tries to push an element into a queue, while the consumer pops the queue if the queue is not empty. Since the push and pop operations read the metadata of the queue and attempt to modify the push and pop positions, respectively, the memory block storing the metadata is fiercely contended among transactions. Moreover, after the push operation, the producer transaction performs a long-latency tree-rebalancing task, during which the speculatively-modified queue metadata must be isolated from other transactions. Consequently, all the following consumer transactions are blocked by the producer if eager version management is used for the data block that stores the queue metadata. Dynamic SEL-TM performs better in the producer-consumer model for three reasons. First, the Conflict Counting module effectively identifies the queue's metadata as the most contended memory block. Second, the producer transaction defers its metadata modification so that more consumers can access the queue without being blocked.
Once the consumers find the queue empty, they can move on to other tasks and check back later. Thirdly, the SEL Manager prevents a short consumer transaction from deferring its metadata modification when it pops the queue; thus, a consumer cannot issue a destructive lazy request that wastes a large amount of work by aborting the long-running producer transaction. In general, this producer-consumer model is common in many parallel programs. Hence, dynamic SEL-TM can exploit significantly more concurrency in such cases by lazily managing the shared buffer space.
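The following toy sketch (hypothetical code, not Intruder's actual implementation; conflict detection is elided) illustrates why deferring the producer's metadata store keeps consumers unblocked during the producer's long rebalancing work:

    class Tx:
        """Toy transaction: stores to lazy blocks are buffered until commit."""
        def __init__(self, lazy_blocks=()):
            self.lazy_blocks, self.deferred = set(lazy_blocks), {}
        def read(self, mem, block):
            return self.deferred.get(block, mem[block])
        def write(self, mem, block, value):
            if block in self.lazy_blocks:
                self.deferred[block] = value   # lazy: invisible until commit
            else:
                mem[block] = value             # eager: conflicts detected now
        def commit(self, mem):
            mem.update(self.deferred)          # deferred stores performed here

    mem = {'queue_meta': 0}
    producer = Tx(lazy_blocks={'queue_meta'})
    producer.write(mem, 'queue_meta', producer.read(mem, 'queue_meta') + 1)
    # ... long tree-rebalancing work happens here inside the producer ...
    consumer = Tx()           # short transaction: kept eager by the SEL Manager
    assert consumer.read(mem, 'queue_meta') == 0  # not blocked by the producer
    producer.commit(mem)
    assert mem['queue_meta'] == 1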
Overall, some transaction behaviors favor eager management while others favor lazy management. Our hybrid approach in SEL-TM aims at taking advantage of both management styles to improve HTM performance. According to our experiments, dynamic SEL-TM achieves an average 14% improvement over LogTM-SE and an average 22% improvement over the fixed lazy system. Moreover, SEL-TM shows significant parallel speedup on the 16-core baseline architecture over sequential execution. Figure 4.10 shows the speedup of the simulated HTM systems on a 16-thread processor over single-thread execution. As observed, the SEL-TM design scales with the number of threads.

Figure 4.10: Speedup on a 16-core processor over sequential execution.

4.4.4 Impact on Network Traffic

Lazy conflict detection is regarded as more efficient in utilizing network bandwidth because transactions wait until commit time to communicate with their peers to resolve conflicts. In comparison, in eager conflict detection, transactions notify each other immediately upon potentially conflicting data accesses. Thus, by adding laziness to an eager system, SEL-TM attempts to reap the traffic benefit of lazy systems without the known complexities of lazy designs. Figure 4.11 shows the transactional traffic of the static and dynamic SEL-TM normalized to the eager baseline HTM.

Figure 4.11: Normalized transactional on-chip traffic (transactional data vs. control traffic for EE-Base, SEL-TM-S, and SEL-TM-D).

As observed, both static and dynamic SEL-TM reduce the traffic in the four high-contention workloads by 22%. The reduction indicates energy savings as well as a less congested on-chip network. While SEL-TM reduces both the data and control traffic, it has a more significant impact on the control traffic. This observation is expected, as the conflict detection communication, which benefits from the hybrid policy in SEL-TM, mainly consists of control messages exchanged by conflicting transactions to negotiate for data access permission. High-contention workloads exhibit more substantial traffic reduction compared with low-contention workloads. Transactions in high-contention scenarios contact each other much more frequently to resolve conflicts, while transactions in low-contention scenarios barely need to negotiate for data access permission. As a result, the high-contention workloads benefit more from SEL-TM, as it mainly reduces the conflict detection traffic. As future HTM systems are expected to execute coarse-grain transactions on hundreds or even thousands of cores, high contention will be the norm. Across all the workloads, SEL-TM achieves a 12% reduction in transactional traffic.

4.4.5 Sensitivity Analysis

The History Information Table (HIT) in the Conflict Counting module stores lazy address candidates and their usage information. Ideally, an infinite-size HIT would record the addresses of the most contended memory blocks observed in every dynamic transaction. A limited-size HIT, on the other hand, can only track the hot spots of recently executed dynamic transactions. Consequently, the accuracy of lazy address selection depends on the size and replacement policy of the HIT. The performance results of four HIT configurations are shown in Figure 4.12. Only three benchmarks are shown, since the other benchmarks are insensitive to changes in the HIT configuration.

Figure 4.12: Impact of HIT configuration on SEL-TM performance (HIT sizes of 8 and 32 entries, with and without the least-used-replaced policy).

It is observed that SEL-TM with an 8-entry HIT performs almost as well as the alternative with a 32-entry HIT if a "least-used-replaced" replacement policy is adopted. Therefore, the 8-entry HIT is the better candidate for hardware implementations with a constrained transistor budget.
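The replacement policy is simple to state in code. The sketch below (hypothetical interface) keeps a usage count per candidate block and evicts the least-used entry when the table is full:

    class HistoryInformationTable:
        """Tracks conflict hot-spot candidates; evicts the least-used entry."""
        def __init__(self, capacity=8):           # 8 entries suffice (Fig. 4.12)
            self.capacity = capacity
            self.usage = {}                       # block address -> usage count

        def record_conflict(self, block):
            if block not in self.usage and len(self.usage) == self.capacity:
                del self.usage[min(self.usage, key=self.usage.get)]
            self.usage[block] = self.usage.get(block, 0) + 1

        def lazy_candidates(self, n=1):
            # the most contended block(s) become the lazy addresses
            return sorted(self.usage, key=self.usage.get, reverse=True)[:n]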
The following evaluation of dynamic SEL-TM uses a 32-entry HIT with a "least-used-replaced" replacement policy. The lazy address in our simulation is selected based on the contention on the corresponding cache line: the cache line address causing the most aborts is selected for lazy version management and conflict detection. In the baseline SEL-TM system, we lazily manage up to one cache line. However, many benchmarks have high contention on multiple memory locations, so we also simulate SEL-TM supporting multiple lazily managed cacheline addresses. We informally define the number of lazily managed cacheline addresses as the laziness of SEL-TM. After inspecting the simulation traces of all the benchmark programs, we find that most of the aborts are caused by conflicts on up to three memory blocks. For instance, transactions in Kmeans conflict on only one of a few cluster centers. So the laziness in our simulated system ranges from 1 to 3.

Figure 4.13: Normalized performance vs. number of addresses being lazily managed.

Figure 4.13 shows the normalized performance of SEL-TM at different laziness levels. For most of the benchmarks, the performance of SEL-TM degrades as laziness increases. At the commit phase, a transaction globally performs all its lazy stores to memory, and all the other conflicting transactions are aborted. If the number of lazy stores is increased, more conflicts occur at commit, which leads to the aborting of more transactions. Therefore, increasing the number of lazy stores can cause more aborts and degrade performance.

4.5 Summary

In this work, we propose SEL-TM, a new HTM implementation that supports complementary version management and conflict detection policies at the granularity of cache lines. The highly contended portion of a transaction's write set is managed lazily for increased concurrency, while other addresses are managed eagerly for accelerated commit. A gated store buffer provides support for bounded lazy version management that defers transactional stores to conflict "hot spots" until the end of transactions to facilitate more reader-writer and writer-writer sharing. The SEL Manager, a distributed hardware scheme, is designed to dynamically discover the most contended addresses in a transaction's write set and determine whether each dynamic transaction should adopt hybrid eager-lazy version management. Despite the simplicity of the SEL-TM design, our experimental results show that it achieves an average performance improvement of 14% over LogTM-SE and 22% over a simulated lazy system.

Chapter 5 In-Network Traffic Regulation for HTM

While the previous chapter investigated the communication and performance benefits of a dynamically configurable execution policy in the HTM system, this chapter explores the benefit of HTM and NOC co-design in achieving efficient communication. As discussed in Chapter 2, HTM systems rely heavily on the on-chip communication fabric for inter-transaction communication. However, network bandwidth utilization in transactional execution has been largely neglected in HTM designs. In this work, we explore the interaction between the HTM paradigm and Networks-on-Chip (NOCs). We identify a major source of superfluous network traffic: transactional requests that fail due to conflicts. This problem adversely affects network performance and energy efficiency. As observed in a wide spectrum of workloads, 39% (up to 79% for a specific application) of the transactional requests fail, which renders 58% of the transactional network traffic futile. To combat this problem, a novel in-network filtering mechanism is proposed. Transactional requests that have a high probability of failing are filtered out in-network by the intelligent router as early as possible to save energy. Experimental results show that our design reduces total network traffic by 21% on average for a set of high-contention benchmarks representative of future TM applications, thereby reducing energy consumption by an average of 24%. Meanwhile, contention in the coherence directory is reduced by 68% on average. These improvements are achieved with only 5% area added to a conventional on-chip router design.

5.1 Motivation

5.1.1 HTM-NOC Interplay

Transactions fetch data and communicate with each other via the on-chip network. TM-induced network traffic often takes the form of coherence messages. As the messages are injected into the network, they are encapsulated into short or long packets, which are further divided into flow control digits, or flits. In typical on-chip networks, short packets (e.g., coherence read requests and acknowledgement responses) are single-flit, while long packets (e.g., coherence read responses and write requests) have multiple flits. Once injected into the network, a packet is forwarded hop-by-hop by routers to the destination node. After being reassembled at the destination node, the coherence messages are ejected from the network. Then, the transaction at the destination is notified of receiving a message from the remote transaction. We characterized the on-chip network traffic in terms of router traversals of the flits in the TM applications from the STAMP benchmark suite (see Section 4.1 for experiment details).
The results are presented in Figure 5.1. As observed, TM-induced traffic accounts for an average of 58% of the total NOC traffic. Transactional control and data traffic are approximately equal, which indicates that a large fraction of the on-chip communication in HTM systems is not for actual data transfer but for maintaining transaction semantics. The technique proposed in this work mainly targets the transactional control traffic used to detect transaction conflicts.

Figure 5.1: On-chip network traffic categorization of the STAMP benchmarks (transactional control, transactional data, and non-transactional traffic).

5.1.2 Communication in Conflict Detection

Data access conflicts between transactions could lead to incorrect program execution. A conflict occurs when two or more concurrent transactions access the same data and at least one access is a write [BMV+07]. Any coherence protocol capable of detecting accessibility conflicts can also detect transaction conflicts [HEM93]. Directory-based protocols provide scalable solutions to cache coherence due to the unicast nature of their communication [JP09]. The directory can be distributed among all the nodes by statically mapping a cacheline address to its home node. The home node is responsible for ordering coherence requests to the same cache block. The majority of HTM designs assume directory protocols for conflict detection. Our work follows suit so that the proposed design can be readily migrated to such HTMs. Nonetheless, the proposal is also applicable to systems adopting snooping protocols on a totally ordered broadcast network. In general, the eager and lazy conflict detection schemes have their own benefits regarding on-chip communication overhead. This work mainly targets the wasted traffic in eager conflict detection. However, the basic principle is applicable to lazy conflict detection as well, where committing transactions usually use eager conflict detection to protect their write sets.

Figure 5.2: HTM conflict detection. R: Requester; DIR: home node directory; S: Sharer. (Messages: ① TxGETX, ② DATA, ③ Fwd_TxGETX, ④ NACK/ACK, ⑤ UNBLOCK.)

When a transaction is executing, each load address (store address) is added to the transaction's read set (write set). Upon receiving a request from another transaction, the transaction checks the request against its read and write sets to see if any conflict occurs. Conflicts are resolved by serializing the execution of conflicting transactions. The execution order of conflicting transactions is determined by conflict resolution policies: a conflicting transaction with lower priority stalls or aborts while one with higher priority continues executing. Figure 5.2 depicts TM conflict detection using the MESI (Modified, Exclusive, Shared, Invalid) directory protocol. The requester transaction issues a GETX to the directory (①), which replies to the requester with data (②). The directory state of the block is set to busy (i.e., incoming requests to the same block are blocked). Then, the request is forwarded to the nodes currently sharing the block (③). Depending on the outcome of conflict detection and resolution, the sharing transactions respond with either a NACK (negative acknowledgement) or an ACK (④). Upon receiving
If one of the responses is a NACK, the requester transaction stalls and keeps retrying the nacked request until all the high-priority sharer transactions have finished executing. In what follows, the transaction that sends a NACK message is often called nacker transaction or nacker. The node on which a transaction is executed is referred to as the transaction’s host node. 5.1.3 False Forwarding False forwarding occurs when a transaction’s coherence request, before being nacked eventually, initiates numerous messages from the requestor to the directory, from the directory to each sharer/owner, and from each sharer/owner to the requestor. False for- warding wastes energy since nacked requests do not contribute to the continued execu- tion of transactions. Here, we estimate the energy waste of a nacked coherence request by counting the hops in terms of router traversals needed to accomplish the request. Equation (1) gives the average hop count of a coherence request. H CoherenceRequest =H avg +2S avg H avg +H avg (5.1) Here, H avg is the average hop of a flit in the network and S avg is the average number of sharers of a memory block. The first term counts the hops of the request to the directory. The second term counts the hops incurred by forwarding and acknowledging. The last term counts the hops of an UNBLOCK message to the directory. In a 4x4 2D mesh network under uniform random traffic, H avg is 3.6 (including the router into which a flit is injected). Assume the requested block is read-shared by 4 nodes. Then, 69 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Percentage Bayes Intruder Labyrinth Yada Genome Kmeans SSCA2 Vacation Nacked GETX Nacked GETS Acked GETX Acked GETS Figure 5.3: Breakdown of GETS/GETX coherence requests from transactions to the directory. a GETX incurs 36 hops on average. The GETS (request for shared access) needs less hops (14 hops) as the directory only forwards the request to at most one node that owns the data. Each hop, which involves router and link traversal, consumes a sizable amount of energy. Unfortunately, the energy is wasted in the case of false forwarding. To estimate the extent of false forwarding, we first track the GETS/GETX coherence requests generated by transactions in a representative HTM system. Figure 5.3 presents the breakdown of requests based on the outcome of the requests. Across all eight work- loads, nacked requests account for 39% of all the requests from transactions. So, more than one third of all the TM-induced coherence requests incur false forwarding. Fur- ther, we categorize transactional communication into two types, namely, effective and abortive. The effective transactional communication is the acknowledged coherence requests and the associated responses in transactional execution, whereas the abortive transactional traffic is the nacked coherence requests and the associated responses. As observed in Figure 5.4, abortive transactional communication accounts for an average 70 0% 20% 40% 60% 80% 100% Normalized Network Traffic Abortive TX Traffic Effective TX Traffic Figure 5.4: Breakdown of on-chip transactional traffic. 87.6% of the total transactional on-chip communication in four high-contention applica- tions. Across all eight applications, 56.7% of the transactional on-chip communication is abortive. This observation indicates the significance as well as the benefit of mitigat- ing false forwarding, as a substantial portion of abortive transactional communication is contributed by false forwarding. 
5.2 The TMNOC Design

TMNOC is essentially an in-network filtering mechanism that proactively filters out transactional requests with a high probability of incurring false forwarding. The filtering mechanism is implemented via a co-design approach of HTM and NOC. We identify three mechanisms to support such functionality. First, the HTM should provide concise yet expressive information on transaction conflicts for the routers to track current conflicts and predict potential conflicts. Second, a cost-effective communication mechanism must be devised to deliver the information to the on-chip routers. Third, the routers must store and use the conflict information to regulate TM traffic. In the subsequent discussion, the three mechanisms are presented in turn. Based on these mechanisms, the traffic regulation policy is described, followed by walk-through examples and further discussion.

5.2.1 NOC-aware HTM

To enable a network to track and predict conflicts between transactions, the conflict resolution policy used by the HTM should be straightforward for the NOC to adopt. Moreover, concise and expressive information on transaction conflicts must be prepared for the NOC to track conflicts. Other aspects of the HTM design (e.g., version management, read/write set implementation, and overflow handling) are orthogonal and thus complementary to the proposed design. Any HTM design that piggybacks on the coherence protocol for conflict detection can be augmented in a similar way [MBM+06, LMG09, LMG10].

Conflict resolution: TMNOC adopts time-based conflict resolution [RG02]. Conflicts are resolved by stalling or aborting the younger transaction in favor of the older one. Each transaction is assigned a timestamp when it begins. The timestamp is attached to all inter-transaction communication (coherence messages). Besides ensuring forward progress and providing good performance [SIS05], the time-based policy provides a global transaction ordering that is straightforward for the on-chip network to use when detecting potential conflicts.

Conflict Trace Registers: We define a conflict trace as the sufficient yet minimal piece of information to i) describe conflicts among transactions and ii) enable other system components (e.g., on-chip routers) to detect potential conflicts. A generic conflict trace consists of:
- the address of the memory block in the conflict;
- metadata (e.g., priority and host node) of the transaction that is given priority in a conflict resolution;
- the Data Access Status (DAS) of the memory block, specifying whether the transaction with higher priority holds the block in read-shared or write-exclusive state.

Figure 5.5: Format of the CT-Register (fields: ADDR, HOST NODE, TIMESTAMP, DAS).

The L1 cache controller is augmented with a set of Conflict Trace Registers (CT-Registers) to record conflicts encountered by outstanding requests. Figure 5.5 depicts the CT-Register. Every outgoing coherence request is assigned a CT-Register. If the request is nacked due to a conflict, the conflict trace obtained from the NACK message is stored into the associated CT-Register. The extension of NACK messages to supply all the pieces of information needed to construct conflict traces is discussed below. When multiple NACKs to a request are received, the conflict trace from the latest NACK overwrites the previous one in the associated CT-Register. The number of CT-Registers is bounded by the number of outstanding data requests that miss in the local L1. As processors usually support a limited number of outstanding L1 misses (e.g., Intel Itanium 2 supports 8 [Ita]), the area overhead of the CT-Registers remains low.
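A minimal sketch (hypothetical names) of the conflict trace record and the CT-Register update rule just described, with one register per outstanding L1 miss:

    from dataclasses import dataclass

    @dataclass
    class ConflictTrace:
        addr: int        # memory block in the conflict
        host_node: int   # node executing the prioritized (nacker) transaction
        timestamp: int   # the nacker transaction's priority
        das: str         # 'R' (read-shared) or 'W' (write-exclusive)

    class CTRegisterFile:
        def __init__(self, outstanding_misses=8):   # e.g., one per L1 MSHR
            self.regs = [None] * outstanding_misses

        def on_nack(self, miss_id, trace):
            self.regs[miss_id] = trace              # the latest NACK wins

        def on_unblock(self, miss_id):
            trace, self.regs[miss_id] = self.regs[miss_id], None
            return trace                            # piggybacked on UNBLOCK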
5.2.2 Coordination between HTM and NOC

Coherence messages from transactions are injected into the network as HTMs piggyback onto the cache coherence protocol to detect conflicts. Furthermore, the on-chip routers can easily examine in-transit coherence messages. Thus, the coherence messages are a cost-effective vehicle for delivering the conflict traces from the HTM to the routers. For this purpose, three coherence messages are extended. First, the NACK messages from the nacker transactions to the conflicting transactions already contain almost all the information (i.e., memory address, timestamp and host node of the nacker transaction) needed to construct conflict traces. As shown in Figure 5.6(a), a single DAS bit is added to the NACK message to specify whether the data in conflict is currently read-shared or held write-exclusively by the nacker transaction. Besides, a single BYRTR (By Router) bit is added to indicate whether the NACK is initiated from a router, as TMNOC allows the routers to nack requests (as described later). When a destination node receives a NACK with BYRTR set, the coherence controller at the destination neither waits for acknowledgements from other nodes nor sends an UNBLOCK message to the directory; in this particular case, the request was nacked by an enroute router and has not yet been serviced at the directory. Second, the UNBLOCK message, which is destined to the directory to conclude a request, is extended to carry the content of the CT-Register associated with the request. A VBIT (valid bit) is needed since the embedded conflict trace is valid only if the request was nacked by a transaction due to a conflict. Third, as the network attempts to regulate TM traffic, transactional requests must be distinguished from non-transactional requests: a 1-bit TXREQ (transactional request) field is attached to coherence request messages (e.g., GETS and GETX), and TM requests have TXREQ set to 1. Figure 5.6 summarizes the extended protocol messages. Due to the wide on-chip channels, the extended messages can still be encapsulated into short packets, so the cost is minimized. The message extensions do not change the protocol behaviors originally implemented in the multiprocessor.

Figure 5.6: Extended coherence protocol messages to support coordination between HTM and NOC. (a) NACK: ADDR, MSG TYPE, SRC NODE, DEST NODE, TIMESTAMP, DAS, BYRTR. (b) UNBLOCK: ADDR, MSG TYPE, SRC NODE, DEST NODE, TIMESTAMP, HOST NODE, DAS, VBIT. (c) Coherence request: ADDR, MSG TYPE, SRC NODE, DEST NODE, TXREQ. Field semantics: TXREQ: whether the request is issued within a transaction; DAS: whether the address is in the nacker's read or write set; BYRTR: whether the NACK is initiated from a router; HOST NODE: the node executing the nacker transaction; VBIT: whether the embedded conflict trace is valid.
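The extensions can be pictured as three message formats on top of a common header, as in the following sketch (hypothetical field names mirroring Figure 5.6):

    from dataclasses import dataclass

    @dataclass
    class CoherenceMsg:
        addr: int
        msg_type: str     # e.g., 'GETS', 'GETX', 'NACK', 'UNBLOCK'
        src_node: int
        dest_node: int

    @dataclass
    class TxRequest(CoherenceMsg):
        txreq: bool = True          # issued within a transaction

    @dataclass
    class Nack(CoherenceMsg):
        timestamp: int = 0          # nacker transaction's priority
        das: str = 'R'              # block read-shared or write-exclusive
        byrtr: bool = False         # True if the NACK was initiated by a router

    @dataclass
    class Unblock(CoherenceMsg):
        timestamp: int = 0
        host_node: int = 0          # nacker transaction's host node
        das: str = 'R'
        vbit: bool = False          # is the embedded conflict trace valid?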
5.2.3 In-Network Conflict Tracking

The on-chip routers should be able to store the conflict traces provided by the HTM through the in-transit coherence messages. For this purpose, each router is augmented with a Conflict Trace Buffer (CT-Buffer) (see Figure 5.7(a) and (b)). The CT-Buffer is the key structure coupling the on-chip network with the HTM. Each CT-Buffer entry stores one conflict trace regarding a memory block. The time when the conflict trace arrives is recorded to handle replacement and improve prediction accuracy (as described below). In addition, each entry is augmented with a valid bit. The CT-Buffer uses 2-way set-associative mapping. To reduce energy and area overhead, the conflict traces in the CT-Buffer can be shared by all input ports in a router. Moreover, the number of the CT-Buffer's read/write ports can be less than the number of input ports in a router if the area budget is tight, as the probability that the packets at the head of multiple input ports are all transactional requests that access the CT-Buffer in the same cycle is relatively low. In the rare case of contention on a read/write port, the overflowing requests are simply forwarded as normal.

Figure 5.7: (a) Router microarchitecture (TMNOC-specific structures, including the TMNOC logic and Conflict Trace Buffer, in bold rectangles); (b) CT-Buffer entry format (Addr, Timestamp, Src Node, DAS, Arrival Time, Valid); (c) router pipeline organization (RC, VA, SA, ST, with the TMNOC Operation (TO) in parallel with RC).

5.2.4 The TMNOC Logic

The TMNOC logic manages the CT-Buffer and performs proactive filtering on in-transit coherence requests. It operates in parallel with route computation to avoid additional delay in the critical path of a router (in the case of lookahead routing in some router designs, the TMNOC operation can instead overlap with virtual channel allocation or switch allocation). The router pipeline is presented in Figure 5.7(c), assuming a canonical 4-stage pipeline [JP09]. Now, we discuss the functions of the TMNOC logic.

CT-Buffer management: The TMNOC logic examines every incoming packet. If the packet carries an UNBLOCK message with a valid conflict trace (VBIT is set) and is destined to the directory on the node to which the router is attached, the embedded conflict trace is buffered in the router's CT-Buffer. If a valid conflict trace regarding the same memory block already exists, it is updated provided the new conflict trace records a younger nacker transaction; the freshness of the conflict trace can be preserved by always tracking a younger nacker, as conflict traces become stale once the nacker transactions have finished. If no valid conflict trace is found, the new one is buffered. CT-Buffer replacement is handled by evicting the entry with the earlier arrival time within the set of 2 entries. As the router tracks conflict traces regarding only memory blocks whose home node is attached to the router, requests can be filtered out only by the home node router (i.e., the router attached to the home node). This is an intuitive design choice, since the requests will always be destined to home nodes. A more aggressive scheme will be introduced shortly.
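The CT-Buffer management policy can be sketched as follows (hypothetical interface; the set-index computation is simplified, and trace objects carry addr and timestamp fields as in the CT-Register sketch above):

    class CTBuffer:
        WAYS = 2
        def __init__(self, sets=16):                 # 32 entries, as evaluated
            self.nsets = sets
            self.sets = [[] for _ in range(sets)]

        def _ways(self, addr):
            return self.sets[addr % self.nsets]      # simplified set indexing

        def update(self, trace, now):
            ways = self._ways(trace.addr)
            for entry in ways:
                if entry['trace'].addr == trace.addr:
                    # keep the youngest nacker: its trace stays fresh longest
                    if trace.timestamp > entry['trace'].timestamp:
                        entry['trace'], entry['arrival'] = trace, now
                    return
            if len(ways) == self.WAYS:
                # replacement: evict the entry with the earlier arrival time
                ways.remove(min(ways, key=lambda e: e['arrival']))
            ways.append({'trace': trace, 'arrival': now})

        def lookup(self, addr):
            for entry in self._ways(addr):
                if entry['trace'].addr == addr:
                    return entry
            return None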
Transaction request filtering: Upon receiving a packet carrying a coherence request from a transaction (TXREQ=1), the TMNOC logic searches the CT-Buffer for a conflict trace regarding the requested block. If nothing is found, the router continues forwarding the request as normal. Otherwise, the TMNOC logic uses the matching conflict trace to predict whether the request will be rejected by the nacker transaction recorded in the conflict trace. The prediction requires two steps. The first step is to evaluate the freshness of the conflict trace. As discussed, a conflict trace in the CT-Buffer becomes stale once the nacker transaction has finished, so it is important to verify that the nacker is still active. The latency and energy overhead of directly contacting the nacker is prohibitive, so TMNOC implements a local timeout mechanism to invalidate stale conflict traces. As the arrival time of each conflict trace is recorded in the CT-Buffer, the router identifies a stale conflict trace if the trace has stayed in the CT-Buffer longer than a threshold cycle count (i.e., the timeout threshold). Theoretically, a conflict trace is invalid once the conflicting transaction finishes execution. Thus, the transaction's running length in cycles is a close approximation of the optimal timeout threshold. We propose a cost-effective approach that exploits the tight coupling between cores and routers in a tiled many-core architecture. Specifically, each router is augmented with a single 32-bit Timeout Threshold register that is mapped into the processor core's memory space. This register provides the flexibility of controlling the timeout threshold in either hardware or software. In the hardware approach, the processor tracks the transaction length (TxLen) based on the timestamps of the TBEGIN and TEND instructions. Then, the processor updates the Timeout Threshold (TT) register using the following formula:

$TT_{new} = \frac{TT_{old} + TxLen_{latest}}{2}$   (5.2)

Alternatively, the register can be updated by the operating system in supervisor mode; software can leverage more sophisticated algorithms to derive the timeout threshold. The prediction proceeds to the second step if the nacker is predicted to be active. The type of the request (transactional read or write) and the data access status of the conflict trace (read-shared or write-exclusive by the nacker transaction) are used to detect a potential conflict that violates the "single-writer-multi-reader" invariant. Upon a conflict, the requester and nacker's timestamps are compared. If the requester is older (i.e., has higher priority), the request is forwarded as normal. Otherwise, the router stops forwarding and discards the packet. Meanwhile, a router-initiated NACK message (BYRTR=1) is sent to the requester. To prevent a transaction from being nacked by itself, routers do not nack a request from the host node of the nacker transaction recorded in the matching conflict trace. Figure 5.8 provides the detailed procedure implemented in the TMNOC logic to decide whether to filter out in-transit requests.

Figure 5.8: Flowchart depicting transactional request filtering in TMNOC logic. REQTYPE and TXREQ are fields in the in-transit request message; DAS is from the matching conflict trace.
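The two-step prediction, together with the hardware threshold update of Equation (5.2), can be summarized in the following sketch (hypothetical names; entry follows the CT-Buffer sketch above, and the request is assumed to carry the requester's timestamp):

    def should_filter(req, entry, now, timeout_threshold):
        """Decide whether an in-transit transactional request is dropped."""
        if not req.txreq or entry is None:
            return False                         # not transactional / no trace
        if now - entry['arrival'] > timeout_threshold:
            return False                         # trace presumed stale: forward
        trace = entry['trace']
        if req.src_node == trace.host_node:
            return False                         # never nack the nacker's node
        if req.msg_type != 'GETX' and trace.das != 'W':
            return False                         # read-read sharing: no conflict
        return req.timestamp > trace.timestamp   # younger requester gets nacked

    def update_timeout_threshold(tt_old, txlen_latest):
        return (tt_old + txlen_latest) // 2      # Equation (5.2), moving average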
TMNOC-aggressive: In the above scheme, transactional requests can be filtered out only by the home node router. Here we propose a more aggressive design that allows requests to be nacked by any enroute router. In our discussion, the aforementioned scheme and this more aggressive scheme are referred to as TMNOC-base and TMNOC-aggressive, respectively. The only difference between the two TMNOC variants lies in the CT-Buffer management policy. In TMNOC-aggressive, the on-chip router not only records the conflict traces embedded in the UNBLOCK messages destined to the node to which the router is attached (same as TMNOC-base), but also extracts conflict traces from any in-transit NACK messages the router has forwarded. As the routers can record conflict traces regarding any memory block, transactional requests can in turn be filtered out by any router along the route to the home node. Consequently, more energy savings can be attained by further reducing the network traffic. TMNOC-aggressive needs a larger CT-Buffer, since the routers are allowed to buffer conflict traces for any block. To alleviate buffer contention and guarantee forward progress, the routers are forbidden to extract conflict traces from NACK messages that are initiated from routers. While conflict traces are captured aggressively from enroute NACK messages, TMNOC-aggressive also actively invalidates existing traces in the CT-Buffer. In addition to the timeout mechanism discussed before, routers use enroute ACK messages to invalidate buffered conflict traces. Specifically, if the nacker transaction in a recorded conflict trace sends an ACK message to a requester of the memory block, the nacker transaction no longer conflicts with other transactions on the memory block; thus, the corresponding conflict trace is stale and can be invalidated. Other than the difference in CT-Buffer management policy, both TMNOC variants follow the same procedure to filter out transactional requests (see Figure 5.8).

5.2.5 Operation Walk-through

Figure 5.9: Operation examples. (a) and (b): TMNOC-base. (c) and (d): TMNOC-aggressive. All the requests, responses and coherence states are with regard to the same cache block. Dir: directory.

Update CT-Register (Figure 5.9(a)): Transaction A (TxA)@node1 sends a GETX to the directory@node2. The request is forwarded to two sharers at node3 and node4. Both nodes respond with NACKs. The NACK from node3 is recorded into the CT-Register of node1, since the nacker transaction on node3 is younger than the nacker on node4.

Update CT-Buffer (Figure 5.9(a)): After receiving responses from both sharers, node1 sends an UNBLOCK message to the directory. The content of the CT-Register is attached. The router at the home node records the conflict trace into its CT-Buffer. The arrival time of the conflict trace is 500.

Home node router nacks a request (Figure 5.9(b)): TxD@node5 sends a GETX to the directory@node2. The router@node2 predicts the GETX to be nacked by TxB@node3, so the GETX is dropped instead of being forwarded to the directory, and a NACK is sent to node5. Since the NACK is from a router, node5 does not update its CT-Register.

Router buffers an in-transit NACK (Figure 5.9(c)): The NACK message from node3 to node1 is forwarded by the router@node6, which buffers the conflict trace in its CT-Buffer.

Enroute router nacks a request (Figure 5.9(d)): TxD@node5 sends a GETX to the directory@node2.
The GETX flows through the router@node6, which predicts the GETX to be nacked by TxB@node3. So the router@node6 drops the GETX instead of forwarding the request, and a NACK is sent to node5. Since the NACK is initiated from a router, node5 does not update its CT-Register.

5.2.6 Discussion

Correctness. In TMNOC, the routers filter out coherence requests that are predicted to be rejected by the nacker transactions recorded in the CT-Buffers. A misprediction occurs i) when a nacker transaction is predicted inactive though it is still active, or ii) when a nacker transaction is predicted active though it has already finished. In the first case, the router forwards the request as normal, and correctness is guaranteed by the HTM system. In the second case, the router could nack the request conservatively while the request may encounter no conflicts. However, the router cannot block the request for long, as the nacker transaction is predicted inactive after the corresponding conflict trace has stayed in the router's CT-Buffer for a certain amount of time. So livelock (lack of forward progress) due to misprediction never occurs in TMNOC. Overall, as coherence requests from transactions are either forwarded to the HTM system or nacked by the routers, TMNOC does not affect the correctness of transaction execution (strong isolation and conflict serializability), which is guaranteed by the conflict detection mechanism in the HTM.

In-network vs. in-directory. The filtering provided by TMNOC-base could be placed in front of the coherence directory, while the filtering provided by TMNOC-aggressive cannot. There are two main reasons to implement filtering in the network instead of in the directory. First, the on-chip routers are better poised for a fresh and broad view of conflicts through eavesdropping on the inter-transaction communication. Second, filtering the traffic in-network as early as possible could achieve further energy savings and avoid disrupting the directory controller. Besides, if the directory is distributed among the home nodes, each node (instead of each router) would be equipped with the CT-Buffer and logic, so the hardware cost of placing the filter in the directory would be similar to that of TMNOC. As a result, for a fixed hardware budget, placing the filter in the network can be more effective. Also, placing the filter in routers might incur even less hardware overhead in concentrated NOCs [BD06].

5.3 Evaluation

5.3.1 Methodology

The efficacy of TMNOC is evaluated with cycle-accurate full-system simulation using SIMICS [MCE+02] and the Ruby memory model [MSB+05]. Garnet [AKPJ09] is used to model the timing of the on-chip network, while the Orion power model [WZPM02] is used to estimate the energy consumption of the routers and links in the network.
Results are presented for all eight workloads in the STAMP benchmark suite [MCKO08], which is widely used to evaluate HTM designs. Table 5.1 lists the input parameters and characteristics of each benchmark.

Table 5.1: Benchmark input parameters and characteristics

Benchmark | Input parameters                  | Tx Time | Abort Rate | r-set size | w-set size | Avg Tx Length (cycles)
Bayes     | 32 var, 1024 records, 2 edge/var  | 90.58%  | 96.53%     | 93.19      | 50.37      | 125945
Intruder  | 2k flow, 10 attack, 4 pkt/flow    | 75.83%  | 56.08%     | 7.47       | 3.50       | 1490
Labyrinth | 32x32x3 maze, 96 paths            | 99.92%  | 98.72%     | 145.71     | 92.28      | 373619
Yada      | 1264 elements, min-angle 20       | 99.69%  | 38.63%     | 52.96      | 31.61      | 35854
Genome    | 32 var, 1024 records              | 79.07%  | 3.06%      | 32.39      | 3.32       | 4792
Kmeans    | 16K seg, 256 gene, 16 sample      | 7.12%   | 4.69%      | 6.23       | 1.75       | 306
SSCA2     | 8k nodes, 3 len, 3 para edge      | 10.22%  | 0.33%      | 2.99       | 1.99       | 131
Vacation  | 16K record, 4K req, 60% coverage  | 97.38%  | 46.24%     | 64.63      | 18.62      | 17002

The baseline tiled CMP architecture for our experiment is depicted in Figure 5.10. Each of the 16 nodes consists of an in-order SPARC core with a private L1 and a shared L2. The shared L2 is organized as a static non-uniform cache architecture [KBK02] that uses the directory-based MESI protocol to maintain coherence. The width of coherence control messages is 64 bits. The L2 cacheline tags are augmented with directory entry state. The processor core provides hardware support for log-based HTM. Pre-transaction states are written to a software-managed log while speculative states are propagated to the memory hierarchy eagerly. Pre-transaction states are also stored to a dedicated buffer for fast abort recovery. After receiving a NACK, transactions back off for a fixed period of 20 cycles before retrying. The performance of the baseline HTM is comparable to contemporary eager HTM designs (e.g., FASTM [LMG09]) that manage both data versions in cache for fast abort and commit. The 2D mesh on-chip network uses dimension-order routing and credit-based virtual channel flow control. Multiple virtual networks are used to avoid protocol-level deadlock. The routers are pipelined into 4 stages. The system configuration is listed in Table 5.2.

Table 5.2: System configuration

Unit      | Value
Core      | in-order, single-issue, 16 SPARC V9 cores
L1 Cache  | 32 KB, 4-way associative, write-back, 1-cycle
L2 Cache  | 8 MB, 8-way associative, 20-cycle latency
Coherence | MESI protocol, static cache bank directory
Memory    | 4 GB, 4 memory controllers, 200-cycle latency
Network   | 4x4 2D mesh, DOR, 1-cycle link
Router    | 4-cycle, 5 vnets, 4 VCs/vnet, 4-flit/VC
TMNOC     | 32-entry CT-Buffer per router

Both TMNOC-base and TMNOC-aggressive are modeled in the simulator. The Garnet router model is augmented with the TMNOC logic and CT-Buffer. Since the TMNOC logic works in parallel with route computation, the router latency is not affected. For energy estimation, we implemented and synthesized the TMNOC design in 40nm technology. The power dissipation of the SRAM-based CT-Buffer is estimated using CACTI [MBJ07]. Based on the obtained results, we modified Orion to carefully account for the energy overhead of TMNOC in 40nm technology at 0.9V on-chip voltage. For the overhead of extending the coherence messages, no extra flit is needed: the 128-bit flit size used in the simulations (most current NOCs have 128-bit or 256-bit channel widths) is large enough to accommodate the extended fields.

Figure 5.10: Simulated chip multiprocessor architecture. TMNOC augmentations are marked with bold rectangles.

Figure 5.11: Normalized cycle count when the directory is busy serving transactional requests (B: baseline w/o TMNOC; T: TMNOC-base; T+: TMNOC-aggressive).
5.3.2 Reduction in Directory Blocking

Figure 5.11 shows the impact of TMNOC on the number of cycles the directory is blocked by coherence requests from transactions. The values are obtained by accumulating the cycles during which directory entries stay in the busy transient state while servicing transactional requests. It is observed that TMNOC-base reduces TM-induced directory blocking by 43% on average (up to 87%), and TMNOC-aggressive reduces the blocking by 66% on average (up to 88%). The reduction in directory blocking allows more requests to be serviced by the directory instead of waiting or being rejected, thereby increasing the concurrency in the memory system. Another observation is that high-contention benchmarks show a significant reduction in the cycles the directory is blocked by transactional write requests. This indicates that a large portion of transactional write requests are filtered out, as transactions in high-contention benchmarks tend to update shared data frequently. Since write requests usually have a large energy footprint on the network (as discussed in Section 2.3), it is expected that TMNOC will provide significant energy savings in high-contention workloads.

5.3.3 Reduction in Network Energy

One of the primary goals of this work is to improve the energy efficiency of the on-chip network in support of HTM operations. Figure 5.12 shows the normalized energy consumption of the network, including routers and links.

Figure 5.12: Normalized network energy.

It is observed that TMNOC-base reduces the network energy consumption in high-contention benchmarks by 20% on average (up to 35%), while TMNOC-aggressive reduces it by 24% (up to 39%). Across all the benchmarks, TMNOC-base and TMNOC-aggressive reduce average network energy consumption by 12% and 15%, respectively. The energy savings of TMNOC-base are achieved by avoiding the forwarding of requests from the directory to other concurrent transactions, whereas TMNOC-aggressive achieves additional savings by eliminating the hops from the requester to the home node. Since the majority of the traffic and energy waste is due to directory forwarding (multicasting to several nodes) rather than the requests to the home node (unicast between two nodes), TMNOC-base achieves much of the energy savings, with relatively small incremental benefits from the more aggressive scheme. However, it is worth noting that TMNOC-aggressive does not incur extra overhead for the extra energy savings. High-contention benchmarks exhibit more energy savings for two reasons. First, high-contention benchmarks have more requests being nacked, causing more energy waste due to false forwarding (see Figure 5.3) in the baseline system, which offers TMNOC more energy-saving opportunities through mitigating false forwarding. Second, frequent conflicts provide the routers with plenty of information about transaction conflicts, hence increasing the prediction accuracy of the TMNOC logic. Besides contention rate, the type of the filtered-out requests also affects how much energy can be saved.
For instance, in the Vacation benchmark, TMNOC-base filters out a large portion of transactional reads according to Figure 5.11. However, the energy savings are marginal, since GETS requests are not the major source of energy waste in false forwarding. On the other hand, Bayes has an energy reduction of 38%, since a large number of transactional writes in Bayes are filtered out by TMNOC. Otherwise, those transactional writes would initiate extensive communication between multiple nodes before eventually being nacked, wasting a considerable amount of energy. Overall, both TMNOC variants achieve the goal of improving NOC energy efficiency.

5.3.4 Reduction of Network Traffic

The interconnection traffic has a fundamental impact on the network energy consumption. Figure 5.13 shows the normalized interconnection traffic measured in router traversals by flits.

Figure 5.13: Normalized network traffic.

It is observed that TMNOC-base reduces the interconnection traffic in high-contention benchmarks by 16% on average (up to 28%), while TMNOC-aggressive reduces it by 24% on average (up to 39%). Across all the benchmarks, TMNOC-base and TMNOC-aggressive reduce interconnection traffic by 10% and 12%, respectively. The reduction in interconnection traffic translates directly into energy savings.

Figure 5.14: Hop count distribution (measured in router traversals by flits).

Figure 5.14 shows the distribution of network flits according to their hop counts. It is observed that both TMNOC variants reduce the proportion of long-distance flits through proactive filtering while increasing the proportion of short-distance flits. This trend is particularly noticeable in applications with high contention, which hence exhibit substantial reductions in network traffic. Compared with TMNOC-base, TMNOC-aggressive further increases the proportion of 1- and 2-hop flits due to its more aggressive policy of filtering out in-transit requests as early as possible. This observation demonstrates the effectiveness of TMNOC in regulating network traffic. As CMPs become increasingly distributed, the impact of in-network filtering on long-distance flits as well as overall network traffic will become more and more substantial.

5.3.5 Impact on Performance

Although TMNOC shows the potential to increase concurrency in the memory system, the proactive filtering could nack a transaction's request conservatively, thereby stalling the transaction needlessly. This situation happens when the router decides to nack a request based on a previous NACK from a transaction that has already finished. Such conservative nacks may degrade overall performance and potentially offset the benefit of increased concurrency in the memory system. Figure 5.15 shows the normalized execution time. It is observed that TMNOC does not impose a performance penalty on the system in order to regulate the network traffic in transactional systems. On the contrary, Bayes and Intruder exhibit performance improvements of 18% and 12%, respectively, indicating further energy savings in the cores.
These performance improvements stem from the fact that TMNOC reduces the contention on the directory by mitigating false blocking, i.e., directory entries needlessly held busy by requests that are eventually nacked. Workloads with a small set of memory addresses fiercely contended among transactions (i.e., memory hot spots) benefit the most from the alleviation of false blocking, as requests to a hot spot are serviced more promptly instead of being blocked unnecessarily. Bayes and Intruder are two such workloads. Although Yada has a high contention rate, it shows negligible improvement in performance, as it does not exhibit the bottleneck of a few memory hot spots. In Labyrinth, each transaction reads the entire global maze grid at the beginning and writes to part of the grid at the end. This sharing pattern effectively serializes the transaction execution, preventing the workload from taking advantage of the reduced directory contention. Due to the in-order execution and well-optimized parallel applications in our experiment, the memory subsystem is not fully stressed; consequently, the reduction of directory busy cycles is not fully translated into performance improvement. Nevertheless, as future chips are expected to grow substantially in core count, with applications incorporating a large number of threads and transactions to exploit those cores, contention on shared data will inevitably become intensive, implying more performance improvement potential for TMNOC.

Figure 5.15: Normalized execution time.

5.3.6 Sensitivity Study

CT-Buffer size. The microarchitectural trade-off between performance and hardware overhead is mainly affected by the size of the CT-Buffer. A larger CT-Buffer can store conflict traces regarding more cache blocks, leading to potentially more accurate filtering of transactional requests. We explore the sensitivity of TMNOC to the size of the CT-Buffer in terms of overall execution time. As CT-Buffer read/write operations are not on the router critical path, the increased access latency of a larger CT-Buffer does not affect the router latency. Figure 5.16 shows the impact of CT-Buffer size on the overall execution time. It is observed that the majority of the benchmarks, especially those with low contention rates, are not sensitive to the size of the CT-Buffer. This is mainly because those benchmarks have a small set of memory hot spots. For the TM workloads evaluated, a small CT-Buffer is sufficient to achieve significant energy savings and effective traffic regulation.

Figure 5.16: Normalized execution time and network traffic vs. number of CT-Buffer entries.

CT-Buffer timeout threshold. Recall that the TMNOC logic leverages a timeout mechanism to invalidate stale conflict traces in the CT-Buffer. A conflict trace is considered invalid if it has stayed in the CT-Buffer longer than the timeout threshold. Theoretically, the lifetime of a conflict trace should be no longer than the lifetime of the conflicting transaction.
A threshold value that is too small reduces the effectiveness of TMNOC by limiting the routers' ability to identify conflicts, whereas an unnecessarily large threshold introduces unwarranted NACKs from the routers. Thus, the timeout threshold is an important design parameter; TMNOC uses the moving average of the transaction running length as the timeout threshold. This dynamic approach enables the TMNOC logic to adapt to the varying transaction characteristics at different execution phases. Here, we study the sensitivity of TMNOC to the timeout threshold by disabling the dynamic approach and assigning a static threshold value instead. The results are presented in Figure 5.17; all results are normalized to the baseline.

Figure 5.17: Normalized execution time and network traffic vs. static timeout threshold.

Two observations can be made. First, both the execution time and network traffic are relatively insensitive to the timeout threshold when its value is less than 2000 cycles. Two applications (Intruder and Kmeans) exhibit substantial degradation as the threshold value increases beyond 2000. As these two applications mainly consist of fine-grain transactions that typically finish within 1000 cycles, a large timeout threshold gives the conflict traces an unnecessarily longer lifetime than the actual conflicting transactions themselves, thereby introducing more stale conflict traces. The routers nack requesting transactions based on the stale conflict traces, which leads to performance degradation. The second observation is that no single threshold value delivers good performance across the full spectrum of applications evaluated in our experiments. For instance, the optimal threshold value for Intruder is approximately 250 cycles, while the value is 5000 cycles for Labyrinth. This observation emphasizes the need for a dynamic mechanism in determining the timeout threshold.

5.3.7 Area Overhead

The additional storage and processing logic in the on-chip routers introduce area overhead. We estimate the area of the CT-Buffer using a commercial memory compiler. The buffer is implemented as a 32 x 64-bit dual-port SRAM. We implement the TMNOC logic at the RTL level. The virtual channel router implementation is based on the open-source design from Stanford University [Bec12]. The router configurations are identical to those used in the full-system simulation, as shown in Table 5.2. The design is synthesized using Synopsys Design Compiler targeting TSMC 40nm technology. The clock frequency is set to 1 GHz. Table 5.3 reports the estimated area overhead of TMNOC. TMNOC incurs a reasonable 4.6% area overhead relative to the virtual channel router. This area overhead is justified by the energy savings and performance improvement of TMNOC.

Table 5.3: Result of area overhead estimation

Component            | Estimated area (um^2)
Baseline router      | 145901
Conflict Trace Cache | 6563
TMNOC Logic          | 162
Overhead             | 4.6%

5.4 Summary

We explore the largely neglected interaction between HTMs and NOCs. In the process, we identify false forwarding, a potential energy and performance pitfall that wastes network bandwidth.
To reduce this bandwidth waste in transactional systems, we propose a novel in-network filtering mechanism that exploits the co-design of HTMs and NOCs to regulate transactional network traffic. In TMNOC, the on-chip routers track conflicts between transactions by monitoring in-transit TM traffic. Then, the routers use the conflict information to filter out transactional requests as early as possible, before the requests incur false forwarding. Evaluation results from full-system simulation show that the proposed mechanism reduces network traffic by 21% on average over a set of high-contention benchmarks, which translates into an average energy savings of 24% and a directory contention reduction of 68%. The implemented TMNOC mechanisms add only a 5% area overhead to a conventional NOC router.

Chapter 6 Predictive Unicast and Notification

The in-network filtering technique discussed in the previous chapter prevents pathological coherence requests from reaching the home node. However, once such requests are serviced by the home node, they often initiate unnecessary coherence forwarding. This chapter describes a hardware technique to further suppress this unnecessary and disruptive coherence forwarding. As shown in the analytical model presented in Chapter 3, the number of nodes that receive coherence forwarding messages has a first-order impact on the communication cost in transactional systems. In this chapter, we contribute Predictive Unicast and Notification (PUNO), a hardware technique that converts multicast forwarding to unicast while still guaranteeing execution correctness. PUNO further suppresses frequent polling from blocked transactions with pseudo-notification from the blocking transactions. Specifically, the blocking transactions provide an estimated finish time along with the NACK messages, so that the blocked transactions can back off until the notified time before retrying the request. PUNO achieves two key benefits. First, it restrains the forwarding and retry of transactional requests; consequently, network traffic is reduced. Second, it isolates transactions from disruptive forwarding so as to avoid unnecessary aborting caused by failed requests; as a result, transaction throughput improves.

6.1 Introduction

A conflict occurs when more than one concurrent transaction accesses the same data and at least one access is a write. Conflicts violate the isolation semantics and have catastrophic consequences for correctness. HTMs typically implement contention management to detect and resolve conflicts. As a cache coherence protocol can detect data access conflicts, the majority of HTM designs, including commercial implementations (e.g., IBM System z [15]), piggyback onto the coherence protocol (typically directory-based) for conflict detection. However, there is an intrinsic difference between the cache coherence scheme and transaction execution. The participating entities of cache coherence are processors with equal priority. In contrast, the participating entities of transaction execution in an HTM are transactions with different priorities. This difference results in a mismatch between the coherence scheme and conflict detection. In the coherence scheme, a GETX (request for exclusive access) is always forwarded exhaustively (multicast) to the entire set of sharer nodes so that all the sharers will invalidate their private data copies.
However, in conflict detection, not all sharers need to receive the request if a high-priority sharer can detect and resolve the conflict properly. As the HTM piggybacks onto the cache coherence protocol to forward the GETX from a requester transaction to all the sharer transactions, the sharers with higher priority than the requester nack the request while the sharers with lower priority acknowledge the request and abort themselves to avoid conflicts. However, if the request is eventually nacked (i.e., the conflicts do not materialize), any aborted transactions on low-priority sharers could have continued their execution. In other words, the aborting is unnecessary, as no conflicts involving the aborted sharers actually materialized. This pathological aborting behavior is identified as false aborting, which wastes energy and degrades performance because 1) valid transaction computation is discarded needlessly and 2) the multicast of transactional write requests to all the sharers generates superfluous inter-transaction communication. According to our study of a spectrum of TM workloads, 92% of the transaction aborts are caused by transactional GETX requests, and 41% of these requests incur false aborting.

False aborting is essentially caused by the protocol's exhaustive multicast of transactional GETX requests to the entire set of sharer transactions. Unfortunately, it is impractical to tackle false aborting by re-engineering an HTM-specific cache protocol due to the exorbitant cost. In this chapter, we introduce Predictive Unicast and Notification (PUNO), a novel hardware mechanism to mitigate false aborting. First, the directory unicasts (instead of multicasts) a transactional request to the sharer that is predicted with high confidence to nack the request, so that conflicts can be detected and resolved with minimal interference in transaction execution while maintaining correctness. Second, the transaction at the unicast destination proactively notifies the nacked requester of the time at which the requested cacheline is expected to become available, so the requester can refrain from polling the sharers too frequently, thereby further mitigating false aborting. The functionalities of PUNO do not require re-engineering the coherence protocol. Our evaluations using full-system simulation show that PUNO reduces transaction aborting by 61% on average (up to 89%) in a set of high-contention benchmarks that are representative of future TM workloads. Consequently, the on-chip network traffic is reduced by 32% on average (up to 67%) while the execution time is reduced by 12%. These improvements are achieved with a meager 0.41% area overhead.

Figure 6.1: Comparison between the cache coherence scheme and transaction execution. DIR: home node directory. (a) coherence protocol handling of a GETX request; (b) contention management mechanism handling the GETX request. Explosion marks indicate transaction conflicts.

6.2 Motivation

As discussed above, false aborting occurs when the exhaustive multicast of a transactional GETX request aborts several low-priority sharer transactions before the request is eventually nacked by a higher-priority sharer, so any transaction aborts caused by the nacked GETX are unnecessary. Figure 6.1 compares the detection of data races in plain cache coherence with the detection of transaction conflicts.
In the cache coherence scheme, the requestor always obtains the desired data access permission by invalidating the data copies at the remote sharer nodes. In conflict detection, however, upon receiving the same coherence forwarding, a sharer transaction can respond with either a NACK (negative acknowledgement) if it has higher priority than the requester or an ACK if it has lower priority. After receiving all the responses, the requester (TxA in Figure 6.1) sends an UNBLOCK message to conclude the request. As long as one of the responses is a NACK, the requester transaction stalls. In the example (Figure 6.1(b)), the aborting of transactions C and D is false aborting because the request that caused the aborting is eventually nacked.

We assess the gravity of false aborting by tracking the coherence requests from transactions in a set of high-contention benchmarks running on a representative HTM design (see Section 6.4.1 for experiment details). Figure 6.2 shows a breakdown of transactional write requests: an average of 41% of those requests incur false aborting. Figure 6.3 shows the distribution of the number of transactions aborted unnecessarily due to false aborting. For example, in Intruder, 5 transactions are aborted unnecessarily in 10% of the false aborting cases. The long tail of the distribution indicates that false aborting can seriously disrupt transaction execution, as it causes a considerable number of transactions to be aborted unnecessarily. Thus, the potential energy and performance gain from reducing false aborting is substantial.

Figure 6.2: Breakdown of the GETX requests from transactions (with and without false aborting).

Figure 6.3: Distribution of the number of transactions being aborted unnecessarily due to false aborting.

However, combating false aborting is challenging. Mitigating false aborting using a conventional coherence protocol is difficult, as a conventional coherence protocol has no notion of transactions and their relative priorities. Thus, transactional GETX requests are always forwarded to all the sharers to ensure that conflicts are resolved, even though the multicast may disrupt transaction execution unnecessarily and incur false aborting. A TM-specific cache protocol is likewise an impractical solution due to the exorbitant cost. Even supposing such a TM-aware protocol existed, it remains unclear how the protocol would decide which sharers can be exempted from receiving the GETX requests without jeopardizing correctness.

6.3 Predictive Unicast and Notification

6.3.1 Basic Idea

PUNO is based on two important observations regarding false aborting. First, the exhaustive multicast of a transactional GETX request to the sharers is needless as long as the conflict caused by the request can be resolved by a sharer with higher priority than the requester. Second, a nacked requester transaction cannot proceed until the nacker sharer transaction finishes executing, as an immediate retry of the request will still be rejected by the nacker. PUNO takes advantage of these two observations by 1) replacing the multicast with a predictive unicast to the high-priority sharer and 2) performing proactive notification to the nacked requester with regard to when to poll the sharers again.
When the directory receives a GETX from a transaction, it unicasts the request to the sharer transaction (i.e., the unicast destination) that is predicted with high probability to nack the request. Upon an accurate prediction, the transaction at the unicast destination resolves the conflict by nacking the request. In addition, the nacker transaction proactively notifies the requester of the remaining time the nacker is expected to run. When the requester receives the NACK and the attached notification, it refrains from retrying the request until the time at which the nacker transaction is expected to finish.

Figure 6.4: Comparison of transaction executions in (a) the conventional scheme and (b) PUNO.

Figure 6.4 compares PUNO with the conventional scheme. In the example, a cacheline is read-shared among three transactions (TxA, TxC and TxD). TxB wishes to write to the cacheline. TxB has a higher priority than TxC and TxD, but a lower priority than TxA. In the conventional scheme (Figure 6.4(a)), the GETX from TxB is forwarded by the directory to all three sharers. The request is nacked by TxA; however, it causes false aborting, as TxC and TxD are aborted unnecessarily. TxB keeps polling the sharers and succeeds with the request when TxA finishes. The polling exacerbates false aborting, as TxC and TxD are aborted several times. In contrast, in Figure 6.4(b), PUNO directs the directory to unicast the GETX request to TxA, which is predicted to nack the request. TxA nacks the request and notifies TxB with an estimate of its remaining running time. Consequently, TxB enters backoff and does not retry the request until TxA commits. PUNO reduces inter-transaction communication and increases transaction throughput by allowing TxC and TxD to commit along with TxA.

While the basic idea is conceptually straightforward, the effectiveness of PUNO depends on accurate prediction of the unicast destination and a reliable scheme to estimate a transaction's running time. The corresponding mechanisms are described in the subsequent discussion, followed by the protocol support and operation examples.

6.3.2 Unicast Destination Prediction

To support unicast destination prediction, each directory is augmented with hardware structures to track the priority of the active transaction on each sharer node. As shown in Figure 6.5(a), each directory entry is augmented with a Unicast Destination pointer (UD pointer) that records the id of a sharer node for the cacheline; the sharer node recorded in the UD pointer has the highest priority among all the sharers of the cacheline. The UD pointer is also used to retrieve the priority of that sharer from the Transaction Priority Buffer (Prio-Buffer). The Prio-Buffer has P entries to record the priorities of all P nodes, as sketched below.

Figure 6.5: (a) Directory augmentation to support unicast destination prediction; added hardware structures in bold rectangles. r-cnt: rollover counter; v-cnt: validity counter. (b) State transitions of the validity counter.
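To make the augmentation concrete, the following C sketch shows one plausible layout of these structures. Field names and widths follow Figure 6.5 where recoverable; the type names (prio_entry_t, directory_punostate_t, dir_entry_t) are illustrative assumptions, not the actual hardware description.

#include <stdint.h>

#define P 16   /* number of nodes, matching the 16-node system evaluated later */

/* One Prio-Buffer entry per node (Figure 6.5(a)): a 16-bit transaction
 * priority (a timestamp; smaller = higher priority) plus a 2-bit validity
 * counter used by the timeout mechanism. */
typedef struct {
    uint16_t priority;   /* timestamp of the node's current transaction */
    uint8_t  v_cnt;      /* 2-bit validity counter (0 = invalid) */
} prio_entry_t;

/* Directory-side state assumed by PUNO: the per-directory Prio-Buffer and
 * the 32-bit rollover counter. */
typedef struct {
    prio_entry_t prio_buf[P];
    uint32_t     r_cnt;          /* rollover counter driving the timeouts */
} directory_punostate_t;

/* Per-cacheline directory entry, augmented with the UD pointer naming the
 * highest-priority sharer of the line. */
typedef struct {
    uint8_t ud_ptr;              /* node id of the highest-priority sharer */
    /* ... existing directory entry fields: state, owner, sharer list ... */
} dir_entry_t;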
As transactional coherence requests carry the host node and the priority of the sending transaction, the directory extracts this (host node, priority) pair from the received requests to continuously update its Prio-Buffer.

The prediction of the unicast destination for a transactional GETX request relies on the UD pointer and the Prio-Buffer. Upon receiving such a request, the UD pointer is accessed in parallel with the directory entry. The UD pointer is then used to access the Prio-Buffer to retrieve the priority of the sharer (Priority_sharer) that has the highest priority among all the sharers of the requested cacheline. Next, the directory compares Priority_sharer with Priority_requester (obtained from the request). If the sharer transaction has higher priority, the directory predicts that the sharer will resolve the conflict by nacking the request (i.e., the other sharers need not receive the request), and therefore unicasts the GETX to the host node of the high-priority sharer. Otherwise, if the requester has higher priority, the directory forwards the request to all the sharers as normal.

A priority becomes stale if a new transaction on the remote node begins executing before the directory has received a request from the new transaction with which to update that node's priority in the directory's Prio-Buffer. Stale priority information can cause misprediction of the unicast destination. Thus, an adaptive timeout mechanism is implemented to validate newly updated priorities and invalidate stale priorities in the Prio-Buffer. As shown in Figure 6.5(a), the directory is augmented with one 32-bit rollover counter for the entire directory and a 2-bit validity counter per Prio-Buffer entry. Upon overflow, the rollover counter generates a timeout signal that triggers the state transitions of all the validity counters. The timeout period used by the rollover counter is determined dynamically based on the average transaction length obtained from a hardware mechanism (discussed in a subsequent section). Figure 6.5(b) depicts the state transitions of the validity counter. When the rollover counter generates a timeout signal, all the non-zero validity counters are decremented by 1, decreasing the validity of the associated priorities. When the directory updates a priority, the corresponding validity counter is incremented. Hence, priorities that have not been updated for a long period of time have small validity counters, whereas recently updated priorities have larger ones. Only priorities with validity counters greater than 1 are considered valid and used for the prediction of the unicast destination. If a node's priority has a validity counter of value 0, the time interval (as observed by the directory) between two requests from that node is long, which may indicate that the transactions on that node are long. So, when the directory receives a request from such a node and updates its priority, the validity counter is incremented twice to allow a longer timeout period for the potentially long-running transaction. This adaptivity to transaction characteristics strengthens the timeout mechanism for workloads with a large variance in transaction length.
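Putting the pieces of this subsection together, the directory-side behavior can be sketched in C as follows, reusing the types from the previous sketch. The function names are hypothetical, and pipeline details (such as the parallel access of the UD pointer and the directory entry) are omitted.

/* Returns the node id to unicast to, or -1 to fall back to the normal
 * multicast, for a GETX carrying req_priority. Smaller priority value
 * means higher priority (older transaction). */
static int predict_unicast(const directory_punostate_t *d,
                           const dir_entry_t *e, uint16_t req_priority)
{
    const prio_entry_t *s = &d->prio_buf[e->ud_ptr];

    /* Use the recorded priority only if its validity counter says it is
     * fresh (greater than 1), and only if the sharer outranks the requester. */
    if (s->v_cnt > 1 && s->priority < req_priority)
        return e->ud_ptr;          /* unicast to the predicted nacker */
    return -1;                     /* multicast to all sharers as normal */
}

/* Invoked on each rollover-counter timeout: age all validity counters. */
static void on_timeout(directory_punostate_t *d)
{
    for (int i = 0; i < P; i++)
        if (d->prio_buf[i].v_cnt > 0)
            d->prio_buf[i].v_cnt--;
}

/* Invoked when a request from node n carrying priority p is observed. */
static void update_priority(directory_punostate_t *d, int n, uint16_t p)
{
    prio_entry_t *s = &d->prio_buf[n];
    /* A counter at 0 hints at a long-running transaction on n, so the
     * update bumps the counter twice (saturating at 3, the 2-bit maximum). */
    uint8_t inc = (s->v_cnt == 0) ? 2 : 1;
    s->priority = p;
    s->v_cnt = (uint8_t)((s->v_cnt + inc > 3) ? 3 : s->v_cnt + inc);
}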
6.3.3 Handling Misprediction

A misprediction occurs when the GETX request is unicasted to a sharer transaction that has a lower priority than the requester. Misprediction, if not handled properly, can create a correctness problem. Consider the transaction execution in Figure 6.1(b): a misprediction happens if the GETX request from TxA on Node0 is unicasted to TxC on Node2, which has a lower priority than the requester. TxC acknowledges the request as it has lower priority. Consequently, TxA could write to the cacheline while the other two sharers (TxB and TxD) have no knowledge of the write. To guarantee correctness, a misprediction of the unicast destination is handled conservatively by having the mispredicted sharer transaction nack the GETX, which forces the requester to retry the request. To help the retry and improve prediction accuracy, a misprediction feedback mechanism is implemented. The mispredicted sharer (e.g., TxC in the previous example) informs the requester (e.g., TxA) of the misprediction via the NACK message. The requester then notifies the directory of the misprediction through the UNBLOCK message (the coherence message extension is discussed shortly), so that the directory can invalidate the stale priority in its Prio-Buffer that caused the misprediction. The directory then updates the UD pointer to a new sharer.

While guaranteeing correctness, this misprediction handling could potentially hurt performance. However, the impact is marginal for three reasons. First, the prediction accuracy is high (90%+ hit rate in simulation). Second, some NACKs due to misprediction are true positives anyway, as the request could have been nacked by other sharers had it not been unicasted. Third, the invalidation and UD-pointer update performed by the directory upon receiving the misprediction feedback incur no performance penalty, as they are off the critical path of the coherence messages.

Figure 6.6: Structure of the transaction length buffer and computing logic.

6.3.4 Notification

PUNO further suppresses false aborting with a notification mechanism. Upon a unicast destination hit, the nacker transaction running at the unicast destination node notifies the requester of its expected remaining running time (T_est, in cycles) through the NACK response. As the requester cannot proceed until the nacker finishes, it can leverage the notification to decide whether to back off. If T_est minus twice the average cache-to-cache latency is positive, that value is used as the backoff period to prevent the requester from retrying the nacked request immediately. The average cache-to-cache latency can be pre-determined based on the network topology. The requester-side decision is sketched below.
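The following C fragment illustrates the requester-side use of the notification under the rule just described. The latency constant and the function name are illustrative assumptions, not values from the evaluated system.

#include <stdint.h>

/* Average cache-to-cache latency in cycles, pre-determined from the
 * topology; 16 is an illustrative placeholder value. */
#define AVG_C2C_LATENCY 16

/* Called at the requester when a NACK carrying a notification arrives.
 * t_est is the nacker's expected remaining running time in cycles.
 * Returns the number of cycles to back off before retrying (0 means
 * retry under the default policy). */
static uint32_t backoff_from_notification(int32_t t_est)
{
    /* Two cache-to-cache latencies are subtracted so that the retry,
     * issued when the backoff expires, reaches the sharers roughly when
     * the nacker is expected to commit. */
    int32_t backoff = t_est - 2 * AVG_C2C_LATENCY;
    return (backoff > 0) ? (uint32_t)backoff : 0;
}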
The effectiveness of notification depends on accurate tracking of transactions' running lengths. With a sub-optimal backoff, the requester either spends too much time waiting or retries too soon, causing additional false aborting. Because transaction lengths vary significantly even within one application, averaging the lengths of all past transactions is not sufficient. The proposed design instead tracks the average length of each individual static transaction using a per-node hardware structure called the Transaction Length Buffer (TxLB), as depicted in Figure 6.6. A static transaction is a transaction as defined in the code with TX_BEGIN and TX_END; such a static transaction is usually executed multiple times, and each execution is a dynamic instance of the static transaction. Each static transaction has a TxLB entry that tracks the average length of its past dynamic instances. The static transaction id (assigned at compile time) is used to index the TxLB. When a dynamic instance commits, its length (DynTxLen) is obtained by subtracting its beginning cycle time from the current cycle time. Then, the static transaction length (StaticTxLen) in the TxLB is updated using the following formula:

StaticTxLen_new = (StaticTxLen_prev + DynTxLen) / 2        (6.1)

This formula places more weight on recent dynamic instances, so that the TxLB closely tracks recent execution. The hardware overhead of tracking transaction lengths is low for two reasons. First, the TxLB is small, as workloads usually have a limited number of static transactions; for instance, Bayes, the workload with the largest number of static transactions in the STAMP benchmark suite, has 15 static transactions in total. Second, the computation of the average transaction length is straightforward: the divide-by-2 operation can be implemented as simple shifting logic, as depicted in Figure 6.6.

6.3.5 Protocol Support for PUNO

PUNO requires minimal modification to the coherence protocol. While a few coherence messages are extended to carry extra information, the complex protocol state transitions remain unchanged, and no extra coherence states (stable or transient) are needed. Hence, PUNO can work with the coherence protocols of many existing HTM designs.

Three coherence messages are extended (see Figure 6.7). First, the GETX message is extended with 1 bit (U-bit) to indicate whether it is a unicast request; the U-bit is set by the directory when the request is unicasted to a sharer. In some protocol variations, the directory sends invalidations to the sharers instead of forwarding the GETX, in which case the U-bit can simply be added to the invalidation messages. Second, the NACK message is extended with the notification from the unicast destination to the requester; the notification is the number of cycles indicating the nacker transaction's remaining running time. In addition, a misprediction bit (MP-bit) is added to support the misprediction feedback discussed in Section 6.3.3. Third, the UNBLOCK message is extended with a misprediction bit (MP-bit) and an MP-node field that specifies the mispredicted unicast destination. Owing to the wide on-chip channels, the extended messages fit into the existing flits, requiring no extra flits to be transmitted through the network, so the communication overhead is minimized.

Figure 6.7: Protocol message extensions to support PUNO: (a) GETX with a 1-bit U-bit; (b) NACK with a 16-bit notification and a 1-bit MP-bit; (c) UNBLOCK with a 1-bit MP-bit and an MP-node field. Each message retains its ADDR, MSG TYPE, SRC NODE, DEST NODE and TIMESTAMP fields.
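A C sketch of the extended message formats follows, using the field layout recoverable from Figure 6.7. The structs are purely illustrative, since the real messages are protocol flits rather than C structures, and exact field widths beyond those stated in the text are assumptions.

#include <stdint.h>

/* Common coherence message header fields (already present in the
 * baseline protocol). */
typedef struct {
    uint64_t addr;
    uint8_t  msg_type;
    uint8_t  src_node;
    uint8_t  dest_node;
    uint16_t timestamp;           /* transaction priority */
} coh_msg_hdr_t;

/* PUNO extensions, per Figure 6.7. All additions fit in spare bits of
 * the existing flit, so no extra flits are required. */
typedef struct {
    coh_msg_hdr_t hdr;
    unsigned int u_bit : 1;       /* (a) GETX: set if this is a unicast request */
} puno_getx_t;

typedef struct {
    coh_msg_hdr_t hdr;
    uint16_t notification;        /* (b) NACK: nacker's remaining cycles (16 bits) */
    unsigned int mp_bit : 1;      /* misprediction feedback to the requester */
} puno_nack_t;

typedef struct {
    coh_msg_hdr_t hdr;
    unsigned int mp_bit : 1;      /* (c) UNBLOCK: misprediction feedback */
    uint8_t mp_node;              /* node id of the mispredicted destination */
} puno_unblock_t;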
6.3.6 Operation Example

This subsection provides several walk-through examples to illustrate how the predictive unicast and the notification work collaboratively to mitigate false aborting.

Figure 6.8: PUNO operation examples. All the coherence messages and states are with regard to the same cacheline. DIR: directory. C: comparator. Key operations are highlighted.

Directory updates the Prio-Buffer (Figure 6.8(a)): when the directory receives transactional GETS requests (TxGETS) from the three sharer nodes, it updates its Prio-Buffer and increments the corresponding validity counters from 1 (invalid) to 2 (valid). The UD pointer points to the priority of Node1 because Node1 has the highest priority (a smaller timestamp indicates higher priority).

Directory predicts the unicast destination (Figure 6.8(b)): when the directory receives the transactional GETX (TxGETX) from Node2, it follows the UD pointer to get the highest priority among the sharers. As the requester's priority is lower than Node1's priority recorded in the Prio-Buffer, the directory sends the TxGETX only to Node1.

Unicast destination sends notification to the requester (Figure 6.8(c1)): upon receiving the Fwd_TxGETX request, Node1 resolves the conflict by nacking the request. Node1's average transaction length is retrieved from the TxLB, and the remaining running length is computed by subtracting the cycles the transaction has already run from its average length. This information is attached to the NACK message to Node2, and the transaction at Node2 enters backoff upon receiving the notification.

Unicast destination provides misprediction feedback to the directory (Figure 6.8(c2)): now suppose that the previous transaction (timestamp=100) on Node1 has finished executing and a new transaction (timestamp=180) has started, but the directory is not yet aware of the new transaction and hence still forwards the TxGETX to Node1. Upon receiving the request, Node1 detects a misprediction of the unicast destination, as the local transaction has a lower priority than the requester. Node1 nacks the request to guarantee correctness. Due to the misprediction, no notification is provided to Node2; instead, the MP-bit of the NACK is set for misprediction feedback. After receiving the NACK, Node2 sets the MP-bit and MP-node in the UNBLOCK message. When the directory receives the misprediction feedback, it invalidates Node1's priority in the Prio-Buffer entry, and the UD pointer is updated to point to Node3, which now has the highest priority.

6.4 Evaluation

6.4.1 Methodology

We conduct cycle-accurate full-system simulation using SIMICS [MCE+02] and the GEMS tool set [MSB+05] to assess the impact of PUNO. Garnet [AKPJ09] is used as the on-chip network timing model. We present results for all eight workloads from the STAMP benchmark suite [MCKO08], which is widely used to evaluate HTM designs. STAMP is representative of future TM workloads as it uses coarse-grain transactions on sophisticated data structures such as red-black trees and graphs. Table 6.1 lists the details of the benchmarks. The baseline processor architecture for our experiments is depicted in Figure 6.9. Each of the 16 nodes comprises a SPARC core with a private L1 and a slice of the shared L2.
The shared L2 follows the static non-uniform cache architecture [KBK02] and maintains coherence using a MESI directory protocol similar to the SGI Origin protocol [LL97]. Every memory block is statically assigned to a home node based on its memory address. The processor implements hardware support for a log-based HTM in which pre-transaction states are written to a software log while speculative states are propagated to the memory eagerly. The baseline also uses a hardware buffer to store the pre-transaction states to support fast abort recovery. Conflicts are detected eagerly using the coherence protocol. When a conflict is detected, the receiver transaction resolves it using the time-based policy [RG02]. To mitigate conflicts, a nacked requester node waits a fixed 20 cycles before retrying the request. The performance of the baseline HTM is comparable to that of contemporary eager HTM designs (e.g., FASTM [LMG09]). The underlying 2D mesh on-chip network uses dimension-order routing and virtual channel flow control. The pertinent characteristics of the system configuration are listed in Table 6.2.

We implemented PUNO on top of the baseline system in the simulator. It takes one cycle for the directory to access the Prio-Buffer and one cycle to determine whether to unicast the request. The remaining PUNO operations (i.e., notification and the Prio-Buffer and UD-pointer updates) add no latency, as they either run in parallel with the rest of the system or operate off the critical path. PUNO is also compared against two other existing mechanisms that can reduce transaction aborts: 1) Random backoff [SIS05]: aborted transactions enter a randomized linear backoff before restarting, and transactions that abort frequently back off longer (a sketch follows the configuration tables below). 2) Read-Modify-Write predictor (RMW-Pred) [BMV+07]: transactions exhibiting the read-modify-write memory access pattern request exclusive permission upon the load, thereby avoiding an abort due to the later dueling write. Each node has a RMW predictor that tracks up to 256 load instructions.

Table 6.1: Benchmark input parameters
Benchmark    Input Parameters                      Abort %
Bayes        32 var, 1024 records, 2 edge/var      97.1%
Intruder     2k flow, 10 attack, 4 pkt/flow        77.6%
Labyrinth    32*32*3 maze, 96 paths                98.6%
Yada         1264 elements, min-angle 20           47.9%
Genome       32 var, 1024 records                  1.3%
Kmeans       16K seg, 256 gene, 16 sample          7.4%
SSCA2        8k nodes, 3 len, 3 para edge          0.3%
Vacation     16K record, 4K req, 60% coverage      38%

Table 6.2: System configuration
Unit         Value
Core         16 Sun UltraSPARC III+ cores, 1GHz
L1 Cache     32 KB, 4-way associative, write-back, 1-cycle
L2 Cache     8 MB, 8-way associative, 20-cycle latency
Coherence    MESI protocol, static cache bank directory
Memory       4 GB, 4 memory controllers, 200-cycle latency
Network      2D mesh, DOR, VC flow control, 4-stage router
PUNO         16-entry Prio-Buffer; 32-entry TxLB
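For reference, the randomized linear backoff comparison point can be sketched as follows. This is our reading of such a scheme with illustrative constants; it is not the exact implementation from [SIS05].

#include <stdlib.h>
#include <stdint.h>

/* Randomized linear backoff: the backoff ceiling grows linearly with the
 * number of consecutive aborts, and the actual wait is randomized to
 * de-synchronize repeatedly conflicting transactions. BASE_CYCLES is an
 * illustrative constant. */
#define BASE_CYCLES 20

static uint32_t random_linear_backoff(uint32_t consecutive_aborts)
{
    uint32_t ceiling = BASE_CYCLES * (consecutive_aborts + 1);
    return (uint32_t)(rand() % ceiling);   /* wait in [0, ceiling) cycles */
}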
6.4.2 Reduction in Transaction Abort

One of the main design objectives of PUNO is to mitigate unnecessary transaction aborts. Figure 6.10 shows the impact of PUNO on transaction aborts. On average, PUNO reduces transaction aborts by 43% (up to 98%) compared with the baseline. In particular, PUNO is effective in reducing aborts caused by transactional GETX requests, which are the main cause of transaction aborts in these workloads. PUNO achieves a significant abort reduction in the high-contention benchmarks (61% fewer aborts). This result is expected, as workloads with high contention usually incur more false aborting due to frequent transactional writes and extensive read-read sharing.

Figure 6.9: The baseline chip multiprocessor architecture. PUNO augmentation in bold rectangles.

PUNO incurs an average of 17% fewer aborts than random backoff, indicating that the notification-guided backoff scheme of PUNO is more effective in avoiding conflicts. In the random backoff scheme, the backoff period is determined by local transaction statistics such as the number of retries. Yet the backoff period should essentially depend on the remote nacker transaction with which the local transaction has data conflicts. In PUNO, a local transaction receives reliable information in the notification from the remote transaction, so it can better optimize the backoff period.

Previous work [BMV+07] demonstrates the effectiveness of RMW-Pred in reducing conflicts in expertly optimized workloads with very low contention and fine-grain transactions. Our evaluation results echo this finding, as RMW-Pred reduces transaction aborts significantly in Kmeans and SSCA2, both of which consist of short transactions with few conflicts (for instance, the transaction abort rate is 0.3% in SSCA2). However, the results also suggest that RMW-Pred is inefficient in workloads with frequent conflicts among coarse-grain transactions. RMW-Pred tends to convert read-read sharing into write-read conflicts by obtaining exclusive permission upon loads. As the abort rate is already very sensitive to the number of conflicts in most contemporary and expected future TM applications, RMW-Pred exhibits many more transaction aborts (e.g., 2X more in Vacation) than the other mechanisms.

Figure 6.10: Normalized transaction abort count (aborts caused by GETS vs. GETX, per mechanism and benchmark).

6.4.3 Reduction in Network Traffic

Figure 6.11 shows the normalized on-chip network traffic measured in router traversals by all the network flits. As can be observed, PUNO eliminates 33% (up to 68%) of the traffic in the high-contention benchmarks compared with the baseline scheme.
Across all the workloads, the network traffic is reduced by an average of 17%. The traffic reduction stems from three facts. First, PUNO replaces the wasteful multicast of GETX requests with unicast when possible. Second, the notification mechanism suppresses unnecessary transaction polling. Third, the reduction in transaction aborts translates into less futile traffic from aborted transactions.

In comparison with random backoff, PUNO reduces the network traffic by 34% in the high-contention workloads. As both random backoff and PUNO significantly reduce transaction aborts (see Figure 6.10), the difference in network traffic reduction largely comes from the difference in the traffic from committed transactions. The backoff period is an important factor in determining the traffic generated by those transactions, and suboptimal backoff periods can negatively impact the overall traffic: if a transaction returns from backoff too soon, it is nacked again due to similar conflicts, thereby generating more useless traffic. In comparison, PUNO saves more traffic because the notification from the nacker transaction effectively refines the backoff period.

Figure 6.11: Normalized on-chip network traffic.

6.4.4 Reduction in Directory Blocking

When the directory forwards a GETX request to the sharer nodes, it cannot service subsequent requests to the same cacheline until the requester receives responses from all the sharers and sends an UNBLOCK message to the directory. Reducing the amount of time the directory is blocked can therefore yield performance gains in workloads bounded by memory bandwidth. Figure 6.12 shows the impact of PUNO on directory blocking. The values are obtained by averaging the number of cycles during which directory entries stay in a blocking transient state while servicing transactional GETX requests. As can be seen, PUNO eliminates 18% (up to 42%) of such directory blocking compared with the baseline. The improvement mainly comes from the directory minimizing the number of contacted sharer nodes by means of unicast. Statistically, the expected waiting time for a response from a single node is shorter than that for responses from multiple nodes; thus, the latency through the chain of communication between the directory, the sharer nodes, the requester and finally the directory is reduced with a minimized set of sharer nodes. In particular, the transactions in Labyrinth read the entire global maze grid and write to a small portion of it, so the writer transactions in the baseline must wait for responses from a large number of sharers before unblocking the directory. In contrast, the predictive unicast significantly reduces the number of sharer transactions that must respond to the request, thereby reducing the waiting time; consequently, PUNO incurs 42% less directory blocking in Labyrinth. The reduction in directory blocking allows more requests to be serviced instead of waiting, increasing the concurrency in the cache system.

Figure 6.12: Normalized cycle count when the directory is busy servicing transactional GETX.

6.4.5 Impact on Performance

Figure 6.13 presents the normalized execution time. PUNO achieves an average of 12% (up to 31%) performance improvement over the baseline scheme in the high-contention workloads. Across all the workloads, PUNO improves performance by an average of 8%.
The performance advantage of PUNO stems from its success in suppressing false aborting, which pathologically causes unnecessary transaction aborting; some of the transactions that are unnecessarily aborted in the baseline can therefore continue executing and commit.

Figure 6.13: Normalized execution time, broken down into backoff, stalled, aborting, aborted Tx, commit Tx, barrier and non-tx cycles.

Compared with the random backoff scheme, PUNO performs consistently better in all the workloads. Although random backoff can improve performance over the baseline in some high-contention workloads (e.g., Bayes and Intruder), it degrades performance in Labyrinth, which represents workloads with extremely high contention. Analyzing the execution statistics of random backoff in Figure 6.13 reveals that transactions in Labyrinth spend more cycles in backoff than in executing transactions. It is worth noting that, although conservative backoff mitigates aborts in Labyrinth (see Figure 6.10), it hurts performance by limiting the concurrency among transactions. Hence, compared with PUNO, the random backoff scheme can be ineffective in the presence of extreme contention.

Compared with the RMW-Pred scheme, PUNO performs better in six out of eight workloads, while the performance advantage of RMW-Pred in the remaining two workloads (Kmeans and SSCA2) is very marginal (less than 1.6%). As discussed in Section 6.4.2, RMW-Pred performs well in Kmeans and SSCA2 as it can mitigate conflicts in workloads with very low contention. However, as observed in Figure 6.13, RMW-Pred incurs a performance penalty (1.83X slowdown) in high-contention workloads due to the extra conflicts caused by upgrading GETS requests early on.

Note that the performance improvement is not necessarily proportional to the reduction in transaction aborts. For instance, PUNO eliminates more than 90% of the transaction aborts and reduces the execution time by 31% in Bayes, while it eliminates 40% of the aborts but reduces the execution time by only 5% in Yada. The reduction in transaction aborts may not translate directly into a performance advantage, since transactions that survive an abort can still be stalled by conflicts with other transactions. Execution statistics show that transactions in Bayes, while aborting much less frequently under PUNO, experience 1.6X more stalling cycles. Similar trends are found in other workloads such as Intruder, Labyrinth and Yada.

6.4.6 Transaction Execution Efficiency

The tradeoff between aborting and stalling reveals an opportunity to improve the efficiency of transaction execution: transactions can be stalled instead of aborted to avoid discarding valid transaction computation. To evaluate the efficiency of transaction execution, we measure the number of cycles spent in transactions that commit and the number of cycles spent in transactions that are aborted due to conflicts. The former metric is named the good transaction effort, while the latter is named the discarded transaction effort.
The ratio of the two efforts, the G/D ratio, signifies whether the system can execute transactions with minimal waste. A large G/D ratio indicates that a significant amount of transactional computation is valid and eventually committed to memory. In contrast, a small G/D ratio suggests that a sizable amount of transactional computation is wasted. Figure 6.14 shows the G/D ratio of the four designs. On average, the G/D ratio of PUNO is higher than those of the baseline, random backoff and RMW-Pred schemes by 1.65X, 1.24X and 2.11X, respectively. This result highlights that PUNO incurs less computational waste due to its capability to mitigate false aborting.

Figure 6.14: Normalized transaction G/D ratio indicating the efficiency of transaction execution (the larger the better).

6.4.7 Hardware Overhead

The implementation of PUNO introduces area and power overhead. The Prio-Buffer, TxLB and UD pointers are the main contributors of the extra area and power dissipation. We estimate the area and power of these structures using a commercial memory compiler with a clock frequency of 2.3GHz and a Vdd of 0.9V. Table 6.3 reports the area and power estimation of PUNO targeting 65nm technology. The configuration of the Prio-Buffer and TxLB is identical to that used in the full-system simulation, as shown in Table 6.2. The area and power of the UD pointers are overestimated, as each pointer is set to 8 bits instead of 4 bits due to constraints of the memory compiler. The overhead estimation is derived by comparing against the Sun Rock processor [TC08], a 16-core chip multiprocessor with HTM support. The Rock processor is clocked at 2.3GHz and fabricated in 65nm technology; each of its 16 cores has an area of 14 mm^2 and a power dissipation of 10W. The estimation in Table 6.3 shows that PUNO adds less than 0.41% area and 0.31% power, which further justifies its deployment in future HTM designs.

Table 6.3: Area and power overhead estimation
Components    Area (10^-3 mm^2)   Power (mW)
Prio-Buffer   4.70                7.28
TxLB          5.38                7.52
UD pointers   47.4                16.43
Overall       57.48               31.23
Overhead      0.41%               0.31%

6.5 Summary

HTM designs typically piggyback onto the cache coherence protocol for conflict detection. In this chapter, we identify an intrinsic mismatch between the coherence protocol and the eager conflict detection of HTM, which leads to a performance and energy pitfall called false aborting. We propose the Predictive Unicast and Notification (PUNO) scheme to combat false aborting. First, PUNO replaces the wasteful multicast of transactional write requests with unicast, thereby preventing the requests from unnecessarily disrupting the execution of concurrent transactions. Second, a proactive notification scheme restrains transaction polling, thereby further suppressing false aborting. PUNO does not require modification to coherence protocol states or transitions. Full-system simulation demonstrates that, compared with a conventional high-performance HTM design, our approach reduces transaction aborts by 61% in benchmarks representative of future TM applications. Meanwhile, the network traffic is reduced by 32%. These improvements are achieved with a mere 0.41% area overhead.
Chapter 7

Consolidated Conflict Detection

The false aborting and false forwarding problems (among many other pathological communication patterns in transaction execution) stem from the need to detect conflicts among transactions. In fact, the transactional control traffic exceeds the actual data transfer traffic in many workloads, indicating a large amount of communication overhead in current conflict detection schemes. Detecting conflicts between transactions requires intensive core-to-core communication. According to our characterization of a set of TM applications, the overhead traffic due to conflict detection can amount to more than 50% of the total transactional traffic. Understanding the on-chip network bandwidth utilization of such mechanisms is important, as the energy and latency cost of routing packets across a chip is growing alarmingly. In this chapter, we investigate the traffic in a typical conflict detection mechanism and identify numerous sources of overhead. To mitigate the overhead, we propose Consolidated Conflict Detection (C2D), a novel hardware technique that consolidates conflict detection at a logically central (but physically distributed) agent to reduce the bandwidth utilization of conflict detection.

C2D offers the following benefits. First, it significantly reduces the bandwidth requirement of eager conflict detection, potentially closing the bandwidth gap between the eager and lazy conflict detection mechanisms. Second, the concept of consolidation can improve the scalability of conflict detection, especially as core-to-core communication grows increasingly expensive in terms of power and time.

7.1 Introduction

Two architectural features are essential for a conflict detection mechanism. First, the system must track the memory locations being accessed transactionally (i.e., the transaction footprint). Second, the system must test any transaction's requests against other transactions' footprints to identify conflicts. In practice, HTM systems (both academic proposals [AAK+05, MBM+06, LMG10] and commodity designs [JSG12, YHLR13]) typically distribute the bookkeeping of the transaction footprint to individual cores using either extra state bits in the L1 cache or hardware signatures [YBM+07]. In general, conflicts can be detected and resolved either eagerly, when transactions access the memory, or lazily, when transactions commit. In the eager approach, when there is a data request, the sharer cores are interrogated via coherence messages to detect conflicts between their local transactions and the requestor. Detected conflicts are resolved with the Ack/Nack (negative acknowledgment) mechanism of the underlying coherence protocol.

Understanding the bandwidth utilization of the conflict detection mechanism is important, as the interplay between the on-chip network and the HTM system can significantly affect the performance/joule of the microprocessor. In our study, we identify the traffic overheads in eager conflict detection, which aggravate the already problematic coherence traffic. The first major overhead comes from the interrogation process, in which numerous cores are contacted to detect conflicts against the requestor. When the request fails due to conflicts, these cores continue as if the request had never been issued; the sole reason they are notified of the request is that conflicts can only be detected distributively by the individual cores.
Ideally, a core should become aware of a conflicting request only when its transaction execution must be throttled to avoid the conflict. The interrogation process initiates inter-transaction communication more frequently than necessary. Another traffic overhead is the data transfer to requestors whose requests fail due to conflicts. Such data is discarded by the requestors, rendering the transfer futile and wasted. Despite the waste, the home node sends data to requestors speculatively in order to hide the latency of detecting conflicts distributively; without the speculation, the cache access latency would increase from 3-hop to 4-hop. According to our experiments with a representative eager HTM running a set of scientific and commercial applications, the interrogation and the wasted data transfer account for 20% (up to 37%) and 36% (up to 50%) of the total transactional traffic, respectively. It is worth noting that this traffic overhead also plagues lazy HTMs that enforce pessimistic concurrency control on the write set of committing transactions. The traffic overhead is expected to grow as CMPs become increasingly distributed and applications with coarse-grain transactions emerge.

Figure 7.1: Examples comparing (a) data race detection, (b) distributive conflict detection and (c) consolidated conflict detection. Explosion marks indicate the operation of race condition detection in (a) and conflict detection in (b) and (c).

Limiting the energy consumption of on-chip networks is imperative, as networks account for a substantial portion of the system energy budget even with elaborate low-power circuit techniques [HDV+11, HVS+07]. Unfortunately, distributive conflict detection exacerbates the energy problem through its excessive traffic overhead, which not only implies higher dynamic energy but also nullifies power-gating efforts to reduce static energy, since routers have less opportunity to be gated off [CP12].

To this end, we propose Consolidated Conflict Detection (C2D), a micro-architectural technique to minimize the bandwidth utilization of conflict detection. Conceptually, transactions send their requests to a logically central agent for conflict detection. The agent bookkeeps sufficient transactional metadata to detect and resolve conflicts, thereby removing the need to interrogate remote cores. Moreover, by mapping the agent to the home node, the home node no longer needs to speculate on the fate of the requests. That is, the home node does not send data to requestors whose requests fail due to a conflict, so the wasted data transfer is eliminated without penalizing the cache latency. The consolidation of conflict detection does not create a scalability bottleneck, because the logically central agent is physically distributed across the home nodes; conflicts on different partitions of the shared memory are processed by different home nodes in parallel. The implementation complexity is at most moderate.
No change is required to the underlying coherence protocol except a few more bits in the coherence messages. Our evaluation shows that the proposed technique reduces 35% (up to 66%) of the network traffic in an eager conflict detection scheme, thereby reducing the network energy by 27% (up to 43%). Performance is improved by 2.7% on average. Moreover, a consolidated eager conflict detection approach generates 20% less traffic than a lazy HTM, closing the gap between the bandwidth utilization of eager and lazy conflict detection.

The contributions of this work are three-fold. First, we identify numerous inefficiencies in the bandwidth utilization of the eager conflict detection mechanism, which could become a fundamental limiting factor when deploying HTM on large CMPs. Second, we propose a novel technique to minimize the communication in conflict detection; to the best of our knowledge, this work is the first to address the traffic overhead of HTM conflict detection. Third, we evaluate the proposed technique in extensive full-system simulations to demonstrate its effectiveness in reducing the network traffic of eager conflict detection.

7.2 Background and Motivation

This section first contrasts conflict detection with the detection of race conditions in the cache coherence scheme. We then show that the inherent difference between these two tightly coupled schemes leads to excessive coherence traffic, and quantify the traffic overhead of a conventional conflict detection mechanism.

7.2.1 Detecting Data Races in Cache Coherence

The key functionality of a cache coherence protocol is to detect data races (i.e., data access conflicts) between processor cores. This discussion assumes a directory-based write-invalidate protocol, as it provides a viable solution for scaling cache coherence. The directory is a logically central structure that tracks which cores hold a private copy of a data block. This core ownership information enables a race condition to be detected at the home node. Once a race condition is detected, the home node sends invalidation messages to the sharer cores, demanding that they relinquish the requested data block. The sharer cores must be contacted because the requestor always wins the race. Figure 7.1(a) illustrates the race condition detection. A key insight is that the detection of data races is centralized: the home node bookkeeps the necessary information and handles the races.

7.2.2 Conflict Detection in HTM

The detection of a conflict in HTM is analogous to the detection of a race condition, as both prevent conflicting accesses to shared memory; thus conflict detection typically piggybacks onto the coherence protocol. However, the home node cannot identify the active transactions that have accessed a data block, because the directory tracks core ownership instead of transaction ownership. (It is assumed that conflict detection between transactional and non-transactional accesses is provided.) So, unlike data races, conflicts are generally not detected at the home node. Typical HTM systems distribute the conflict detection capability to the individual cores. Each core tracks the read and write address sets of its local transactions, among other metadata. When the home node receives a request, it interrogates (via coherence invalidation) the cores in the sharer vector to detect conflicts. The interrogated cores detect conflicts between any local transaction and the requestor.
If a conflict is detected, the core can either send a Nack (negative acknowledgement) message to the requestor and continue its execution, or send an Ack (acknowledgement) and abort any local transaction. Different cores receiving the same interrogation can make different decisions to resolve the conflict. As long as one core nacks the request, the requestor stalls its transaction and retries the request. Although the requestor-always-wins approach of cache coherence could be used to resolve transaction conflicts, previous studies [BMV+07, JSG12] show that more sophisticated conflict resolution policies are necessary for such concerns as forward progress and performance. Hence, eager HTM designs [MBM+06, LMG10, WGW+12] typically implement more effective policies that arbitrate between transactions based on certain prioritization heuristics. Without loss of generality, the baseline HTM in this chapter adopts the time-based policy [RG02], which prioritizes older transactions (smaller timestamps).

Figure 7.2: Breakdown of transactional traffic (effective traffic, interrogation overhead and data speculation overhead, per benchmark).

Figure 7.1(b) depicts an example of the conflict detection. P0 and P1 signal conflicts between their local transactions and the requestor, as the store address is found in the local transactions' read sets, and each resolves its conflict using the timestamps. P6 skips conflict detection as it is not in an active transaction. Note that the conflicts are detected distributively.

7.2.3 The Communication Overhead of Conflict Detection

A conflict between two parties, namely the requestor and the set of sharer transaction(s), can be resolved by stalling or aborting one party while allowing the other party to proceed. If the requestor is to proceed with its request, the communication cost of interrogation is inevitable, as the sharer transactions must be notified to abort themselves. Nonetheless, if the sharer transactions are to proceed, they can remain ignorant of the request. These transactions are interrogated by the home node anyway because the home node itself is unable to detect conflicts. In this case, the cost of interrogation is not well justified and thus should be minimized.

Figure 7.3: Broadcast traffic as a percentage of the total transactional traffic (broadcast interrogations with and without conflict).

Here, we define the interrogation overhead of conflict detection as the coherence requests sent to (and the responses received from) the various cores to detect conflicts that stop the requestors. In Figure 7.1(b), the Inv, Ack and Nack messages are all interrogation overhead. This overhead can be removed if the home node can detect conflicts. Moreover, the data speculation overhead is the wasted data traffic used to speculatively transfer data to requestors whose requests fail due to conflicts. In Figure 7.1(b), the data packet from the home node to P5 constitutes the data speculation overhead, as P5's GetX request fails. The root cause of the data speculation overhead is that the home node itself cannot detect the conflicts encountered by the requests it serves; it therefore sends data speculatively, assuming the requests will survive conflict detection.
The speculation hides the latency of interrogating remote cores for conflict detection at the cost of superfluous traffic.

The home node relies on the directory to identify the sharer processors to be interrogated for conflict detection. When a data block is evicted from the shared cache, the corresponding directory entry (part of the cacheline tag) is also evicted. In unbounded HTMs [AAK+05, MBM+06, BDLM07], when a transaction requests the evicted block, conflicts are possible because the block could still belong to the memory footprint of another active transaction. However, the home node cannot identify the processors that have accessed the block, so it conservatively broadcasts the interrogation to all cores but the requestor. The bandwidth utilization of this broadcast is significant, and the approach is too conservative, especially for applications with infrequent contention, as most of the cores receiving the broadcast interrogation do not conflict with the requestor nor even execute a transaction. We define the broadcast overhead as the broadcast interrogations to processor cores (and the responses from them) that do not conflict with the requestor. Note that conflicts can still be detected correctly if the broadcast overhead is removed.

We quantify the traffic overhead of an eager conflict detection scheme on a 16-core CMP (experiment details are given in the evaluation section of this chapter). As observed in Figure 7.2, the interrogation overhead and the data speculation overhead account for an average of 20% and 36%, respectively, of the total transactional traffic. The aggregated overhead traffic even exceeds the effective traffic in two applications. Figure 7.3 shows the cost of broadcasting interrogations as a percentage of the total transactional traffic: an average of 27% (up to 56%) of the transactional traffic is broadcast overhead, and only a very small fraction of the broadcast interrogations encounter a conflict, so the broadcast is indeed too conservative. While the first two overheads mainly plague workloads with moderate to high contention, the broadcast overhead is more substantial in low-contention workloads. Thus, the communication overhead of conflict detection affects a wide range of applications regardless of the frequency of conflicts.

7.3 Consolidated Conflict Detection

The key thesis of this proposal is the re-assignment of conflict detection from the individual conflicting cores to the home node, so that the on-chip bandwidth utilization of the conflict detection mechanism can be minimized. The home node is an ideal point for consolidated conflict detection, as all coherence requests are serialized at the home node. With the necessary metadata of concurrent transactions, the home node is capable of correctly and promptly detecting conflicts, requiring no interrogation of remote cores. If a request should be nacked, the home node directly nacks the requestor and skips the data transfer to the requestor; the other cores are completely isolated from the request. Communication to the sharer transactions is initiated only when they must abort or stall to make way for the requestor. A simple example of the C2D scheme is illustrated in Figure 7.1(c). Unlike the conventional scheme in Figure 7.1(b), conflicts are detected at the home node using the C2D logic. The home node resolves the conflict by nacking P5's request, as its transaction is younger. Note that neither interrogation nor speculative data transfer is initiated. Also, P1's transaction is not aborted by the nacked request, thereby preserving more concurrency. A sketch of this home-node check follows.
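A minimal C sketch of the home-node decision in Figure 7.1(c), assuming the per-core metadata introduced in the following subsections (a transactional sharer vector and per-core transaction timestamps); all names are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define NCORES 16

/* Metadata assumed available at the home node: a transactional sharer
 * vector for the requested block and the timestamp of the active
 * transaction on each core (smaller = older = wins). */
typedef struct {
    uint16_t tsv;              /* bit i set: core i's transaction shares the block */
    uint32_t ts[NCORES];       /* per-core transaction timestamps */
} c2d_meta_t;

/* Returns true if the home node should nack the GetX from req_core; in
 * that case no interrogation or speculative data transfer is sent. */
static bool c2d_nack_getx(const c2d_meta_t *m, int req_core)
{
    for (int i = 0; i < NCORES; i++) {
        if (i == req_core || !(m->tsv & (1u << i)))
            continue;
        /* An older (higher-priority) sharer transaction wins: the home
         * node nacks the requestor itself. */
        if (m->ts[i] < m->ts[req_core])
            return true;
    }
    return false;   /* requestor wins; abort/invalidate the sharers */
}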
7.3 Consolidated Conflict Detection

The key idea of this proposal is to re-assign conflict detection from the individual conflicting cores to the home node so that the on-chip bandwidth consumed by conflict detection can be minimized. The home node is an ideal point for consolidated conflict detection, as all coherence requests are serialized there. With the necessary metadata of concurrent transactions, the home node is capable of correctly and promptly detecting conflicts, requiring no interrogation of remote cores. If a request should be nacked, the home node directly nacks the requestor and skips the data transfer. Hence, the other cores are completely isolated from the request. Communication to the sharer transactions is initiated only when they should abort or stall to make way for the requestor.

A simple example of the C2D scheme is illustrated in Figure 7.1(c). Unlike the conventional scheme in Figure 7.1(b), conflicts are detected at the home node using the C2D logic. The home node resolves the conflict by nacking P5's request, as its transaction is younger. Note that neither interrogation nor speculative data transfer is initiated. Also, P1's transaction is not aborted by the nacked request, thereby preserving more concurrency.

The nucleus of the C2D design is a cost-effective mechanism to bookkeep transactional metadata at the home node, which enables the home node to handle conflicts. A minimal set of transactional metadata essential for conflict detection and resolution includes:

1. Transaction ownership of data blocks.
2. The ordering to serialize conflicting transactions.

[Figure 7.4: Architectural overview of a CMP tile. Bold rectangles indicate C2D-specific components.]

Provided the above metadata is readily available at the home node, the C2D logic can (1) identify conflicting transactions and (2) resolve the conflicts. In the subsequent discussion, the bookkeeping mechanisms are described. Then, we discuss the C2D logic that processes conflicts.

7.3.1 Morphable Ownership Tracker

The transaction ownership of a data block is the set of sharer transactions whose read or write set contains the block address. The home node tracks the transaction ownership so as to identify conflicting transactions. This feature is provided by the Morphable Ownership Tracker (MOS-Tracker). Its organization is shown in Figure 7.4. Conceptually, the MOS-Tracker has P entries to bookkeep the P sets of addresses being accessed transactionally by each of the P processors. Each entry is equivalent to a combined read and write set. If a processor is not executing a transaction, its MOS-Tracker entry is an empty set. It is important to note that each set is a subset of a transaction's memory footprint, as it tracks transactional accesses only to the addresses mapped to the home node. As the home node handles conflicts only on those addresses, it is unnecessary to track the transaction ownership of other addresses. Performing a set-membership test of an address against the MOS-Tracker entries generates the transactional sharer vector, a bit vector identifying the processors executing a sharer transaction. The transactional sharer vector and the coherence state together indicate the data sharing activity of concurrent transactions (described in Table 7.1).

The MOS-Tracker supports three operations.

Insert. An address is inserted into an entry when the home node receives a coherence message from a transaction to unblock the directory. All the transactional accesses that encounter an L1 miss can be tracked this way. For other accesses served from the L1, the requestors explicitly notify the home node with hit notifications, which go through the load-store queue. The traffic overhead of hit notifications is marginal for two reasons. First, the shared data accessed within transactions often exhibits migratory sharing behavior [GW92, SBS93], so it is common that the first access to the data in a transaction misses in the L1 and goes to the home node. Second, repeated hit notifications are prevented by marking the already-notified cachelines with extra L1 state bits, as the sketch below illustrates. Quantified results of the notification traffic are presented in Section 7.5.
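A possible realization of the duplicate-suppression bit is sketched below. The cache model, the names, and the map-based storage are illustrative assumptions; a real L1 would carry the bit in its tag array.

#include <cstdint>
#include <unordered_map>

// One extra "txNotified" state bit per L1 line guarantees that the home node
// hears about a transactional L1 hit at most once per transaction.
struct L1Line {
    bool txNotified = false;   // the extra state bit proposed by the scheme
};

struct L1Model {
    std::unordered_map<uint64_t, L1Line> lines;    // stand-in for a tag array

    // Returns true if a hit notification must be sent to the home node.
    bool onTransactionalHit(uint64_t addr) {
        L1Line& ln = lines[addr >> 6];             // 64B line granularity
        if (ln.txNotified) return false;           // home node already knows
        ln.txNotified = true;
        return true;                               // first tx hit: notify once
    }

    void onTransactionEnd() {                      // clear bits for the next transaction
        for (auto& kv : lines) kv.second.txNotified = false;
    }
};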
Test. A set-membership test of an incoming address against all the P entries in the MOS-Tracker generates the P-bit transactional sharer vector. If the directory has a valid sharer vector for the requested block, the test operation can skip the entries corresponding to cores not present in the sharer vector, as the transactional sharer vector is a subset of the sharer vector. Only when the sharer vector is invalid does a MOS-Tracker test walk through all the sets. In Section 7.4.1, we present a design that allows the tests of individual entries to proceed in parallel.

Reset. The MOS-Tracker clears the entry corresponding to a core when the home node is notified by that core of the end of its transaction. This notification is discussed in the next section.

Table 7.1: Data sharing activities and conflicting requests at different combinations of coherence state and transactional sharer vector (TSV)

Coherence State | TSV Size | Data Sharing Activity | Conflicting Requests
NP | 0  | Data is not present in the on-chip cache. No active transaction has accessed it. | None
NP | 1  | Data is not present in the on-chip cache. An active transaction has accessed (read or written) it. | GetS, GetX
NP | >1 | Data is not present in the on-chip cache. Multiple active transactions have read it. | GetX
S  | 0  | Data is shared. No active transaction has accessed it. | None
S  | >=1 | Data is shared. One or more active transactions have read the block. | GetX
M  | 0  | Data is modified and not in any private cache. No active transaction has accessed it. | None
M  | 1  | Data is modified and not in any private cache. A transaction has accessed (read or written) it. | GetS, GetX
MT | 0  | Data is modified and in a private cache. No active transaction has accessed it. | None
MT | 1  | Data is modified and in a private cache. A transaction has written to it. | GetS, GetX

Motivated by the observation that a large fraction of transactions have a small set of read and write addresses while a few transactions have a large one, we design the MOS-Tracker as a morphable structure that tracks a small set of addresses precisely while tracking an unbounded set of addresses conservatively. When the number of accesses to track is small, the MOS-Tracker uses a limited-pointer scheme to bookkeep the addresses precisely. As transactions progress, the number of transactional accesses can overflow the pointer slots. Under such circumstances, the MOS-Tracker morphs into a set of Bloom filters [Blo70] to track the addresses conservatively but without false negatives. As in a conventional transaction signature, false positives in the MOS-Tracker can cause conservative serialization of transactions but do not jeopardize correctness or forward progress. Recall that the morphable design completely avoids false positives when the address set is small. A cost-effective implementation of the MOS-Tracker is presented in the next section; the sketch below summarizes its behavior.
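The behavior of the two formats and the morph between them can be captured compactly. The sketch below is a functional model only (the hash functions, slot count, and filter size are illustrative); the actual SRAM organization is presented in Section 7.4.1.

#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int P = 16;                    // cores, i.e., MOS-Tracker entries
constexpr std::size_t PTRS = 8;          // pointer slots per entry
constexpr std::size_t FILTER_BITS = 512; // 2 hash functions x 256 buckets

// One per-core entry: exact address pointers until overflow, then a Bloom
// filter that is conservative but never produces a false negative.
struct MosEntry {
    bool filterMode = false;
    std::vector<uint64_t> ptrs;
    std::bitset<FILTER_BITS> filter;

    static std::size_t h1(uint64_t a) { return (a >> 6) % 256; }
    static std::size_t h2(uint64_t a) { return 256 + ((a >> 14) % 256); }

    void insert(uint64_t a) {
        if (!filterMode) {
            for (uint64_t p : ptrs) if (p == a) return;      // already tracked
            if (ptrs.size() < PTRS) { ptrs.push_back(a); return; }
            filterMode = true;                               // pointer overflow: morph
            for (uint64_t p : ptrs) { filter.set(h1(p)); filter.set(h2(p)); }
            ptrs.clear();
        }
        filter.set(h1(a)); filter.set(h2(a));
    }
    bool test(uint64_t a) const {
        if (filterMode) return filter.test(h1(a)) && filter.test(h2(a));
        for (uint64_t p : ptrs) if (p == a) return true;     // exact match
        return false;
    }
    void reset() { ptrs.clear(); filter.reset(); filterMode = false; }
};

// Testing an address against all P entries yields the transactional sharer vector.
std::bitset<P> transactionalSharers(const std::array<MosEntry, P>& t, uint64_t a) {
    std::bitset<P> tsv;
    for (int i = 0; i < P; ++i) if (t[i].test(a)) tsv.set(i);
    return tsv;
}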
7.3.2 Transaction Status Table

The home node is augmented with a Transaction Status Table (XACTS-Table) to record the status of concurrent transactions. It tracks each transaction's rank in the global transaction ordering, which is used for conflict resolution. The conflict resolution policy is largely orthogonal to C2D: any policy is applicable as long as it provides a unanimous transaction ordering throughout the system. Without loss of generality, C2D adopts the time-based policy [RG02]. A timestamp is generated at the beginning of each transaction and is tagged onto every coherence message from the transaction. Transactions with smaller timestamps win the conflict resolution. The structure of the XACTS-Table is depicted in Figure 7.4. The table is indexed by core id. Each entry has a TS field to record the timestamp and a TX bit to indicate the validity of the timestamp.

Maintaining the XACTS-Table would seem to require every transaction to explicitly notify the home nodes of its beginning and end. However, this requirement can be relaxed based on the observation that transactions cannot conflict on a data block if they have not requested the data from the home node. So, the home nodes need not update a transaction's status in the XACTS-Table until they receive a request (or a hit notification) from the transaction. As a result, a transaction does not explicitly notify its home node when it begins; the home node updates the transaction's timestamp when a request (or a hit notification) from the transaction is received. When a transaction ends, it notifies only the home nodes to which it has sent requests. To identify the home nodes contacted by a running transaction, each core is augmented with a simple bit vector called the home node mask. When the core sends a transactional request or hit notification to a home node, the corresponding bit is set. When the transaction ends, the home node mask identifies the home nodes that should receive a notification of the transaction end. The sketch below illustrates this lazy registration protocol.
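The lazy registration and the home node mask interact as in the following sketch. The class names and call sites are assumptions made for illustration; the thesis prescribes only the TS/TX entry format and the per-core bit vector.

#include <array>
#include <bitset>
#include <cstdint>

constexpr int P = 16;    // one core and one home node per tile

struct XactsEntry { bool tx = false; uint64_t ts = 0; };   // TX bit + TS field

struct HomeNode {
    std::array<XactsEntry, P> xactsTable;                  // indexed by core id
    // The first request (or hit notification) from a transaction registers its timestamp.
    void onTxRequest(int core, uint64_t ts) { xactsTable[core] = {true, ts}; }
    void onTxEnd(int core) { xactsTable[core].tx = false; } // timestamp now invalid
};

struct Core {
    std::bitset<P> homeMask;   // home nodes contacted by the running transaction

    void sendTxRequest(std::array<HomeNode, P>& homes, int home,
                       int myId, uint64_t myTs) {
        homeMask.set(home);                       // remember this home node
        homes[home].onTxRequest(myId, myTs);      // piggybacked timestamp
    }
    // On commit or abort, notify only the home nodes recorded in the mask.
    void endTransaction(std::array<HomeNode, P>& homes, int myId) {
        for (int h = 0; h < P; ++h)
            if (homeMask.test(h)) homes[h].onTxEnd(myId);
        homeMask.reset();
    }
};

Because a transaction that touches only a few home nodes sets only a few mask bits, the end-of-transaction notifications scale with the transaction's footprint rather than with the core count.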
7.3.3 The C2D Logic

Now, we discuss the C2D logic that detects and resolves conflicts using the transactional metadata from the MOS-Tracker and the XACTS-Table. It operates in parallel with the shared cache access, thereby not penalizing the cache latency.

When a home node receives a coherence data request, the C2D logic retrieves the transactional sharer vector of the requested block from the MOS-Tracker. The requestor is masked out of the vector. If the vector is empty, the request is conflict-free because no other transaction is accessing the data, so the home node proceeds to serve the request as in a normal coherence protocol. Otherwise, the home node examines the combination of the vector, the coherence state and the request type to detect any violation of the single-writer-multiple-reader invariant, which signals a conflict between the requestor and the sharers. Table 7.1 describes the state combinations that lead to conflicts. Note that the C2D logic conservatively signals a conflict when there is a GetS request to a block that is in the NP state and has been accessed in an active transaction. Nonetheless, the home node no longer has to assume that all cores conflict with the requestor when the requested block is in the NP state.

Once conflicts are detected, the C2D logic resolves them by serializing the conflicting transactions. The transaction timestamps in the XACTS-Table determine the serialization ordering: transactions with larger timestamps (younger) are serialized after those with smaller timestamps (older). Unlike the distributive scheme that enforces the ordering between the requestor and individual sharer transactions separately, the C2D logic treats the sharer transactions as a group. Specifically, if the requestor should be serialized after at least one sharer transaction, then all the sharer transactions can proceed safely: the home node sends a Nack to the requestor, causing it to stall. More concurrency is preserved, as the consolidated conflict resolution avoids the myopic aborting of some sharer transactions by requests that are nacked by other sharers. If all the sharer transactions should be serialized after the requestor, the home node sends the sharers an invalidation message with a special flag (AbortFlag) asserted. Upon receiving the invalidation, the sharer transactions either abort or trap into a contention manager, and each sharer sends an Ack message to the requestor. Figure 7.5 summarizes the conflict detection and resolution procedure implemented in the C2D logic; a behavioral sketch follows the figure. Note that the operations can be readily pipelined for higher throughput.

[Figure 7.5: Procedure of the C2D logic to detect and resolve conflicts. AF: AbortFlag. TSV: Transactional Sharer Vector.]
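The decision procedure can be expressed as follows. This is a simplified functional sketch: the conflict predicate is condensed into a single flag computed from Table 7.1 (including its conservative NP-state case), and all identifiers are illustrative.

#include <bitset>
#include <cstdint>

constexpr int P = 16;

enum class Decision { Serve, NackRequestor, AbortSharers };

// Consolidated detect-and-resolve at the home node, following Figure 7.5.
// 'tsv' comes from the MOS-Tracker, 'ts' from the XACTS-Table, and
// 'conflictPerTable71' is the state/TSV/request-type check of Table 7.1.
Decision c2dResolve(std::bitset<P> tsv, int requestor,
                    bool conflictPerTable71,
                    const uint64_t ts[P], uint64_t reqTs) {
    tsv.reset(requestor);                       // mask the requestor out
    if (tsv.none()) return Decision::Serve;     // no sharer transaction: conflict free
    if (!conflictPerTable71) return Decision::Serve;  // e.g., readers sharing S-state data
    // Group resolution: if ANY sharer is older than the requestor, the whole
    // sharer group proceeds and only a Nack is sent (no interrogation at all).
    for (int i = 0; i < P; ++i)
        if (tsv.test(i) && ts[i] < reqTs) return Decision::NackRequestor;
    // The requestor is older than every sharer: invalidate them with AbortFlag.
    return Decision::AbortSharers;
}

The group decision is what distinguishes C2D from the distributive scheme: a single older sharer shields the entire group, and no sharer is aborted by a request that is ultimately nacked.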
7.4 Implementation Details

7.4.1 Implementing the MOS-Tracker

The MOS-Tracker bookkeeps one address set per core, either in a limited-pointer format when the set is small or in a Bloom filter format when the set size exceeds the pointer capacity. The implementation must address two main challenges. First, the common-case latency of MOS-Tracker operations should be bounded by the L2 access latency (usually between 10 and 20 cycles) so that conflict detection does not penalize the L2 access. Second, the morphing between the two formats should be transparent; otherwise, requests could be blocked by the C2D logic due to the unavailability of transaction ownership information. These challenges must be met with area- and energy-efficient hardware.

The proposed MOS-Tracker implementation is presented in Figure 7.6. The basic storage structure is a simple single-ported SRAM array. In the pointer mode, each row contains pointer slots addressable with the column address. The first slot always points to the next free pointer slot in the row. One logical MOS-Tracker entry can span one or more rows; here, we consider a straightforward one-to-one mapping between rows and entries. The test operation walks through the rows in search of the target address and outputs the P-bit transactional sharer vector. A wide read port and parallel comparison accelerate the test. The sharer vector and the test output from the filter-mode SRAMs help skip rows that need not be tested, so the test typically finishes within the L2 access latency.

[Figure 7.6: A hardware implementation of the MOS-Tracker.]

In the filter mode, the SRAM combines the per-core Bloom filters, as all the filters use the same hash function. The hashed address reads out a P-bit vector, where each bit corresponds to a bucket in a per-core Bloom filter. The test operation requires a single read access. Inserting an address into core i's filter requires two accesses: one reads out the P-bit vector, and the other writes back the modified P-bit vector with the ith bit set to one. Resetting core i's filter clears the ith bit in each P-bit word. [SSN+89] describes a cost-effective circuit design of a small SRAM with a flush-clear function.

A minimalist MOS-Tracker consists of two SRAM arrays: one in filter mode and the other initially configured in pointer mode. The final transactional sharer vector is generated by AND-ing the P-bit vectors from the SRAMs. More filter-mode SRAMs can help reduce false positives: k filter-mode SRAM arrays with k hash functions constitute a set of parallel Bloom signatures [SYHS07] for the P cores, where k is a design parameter. One pointer-mode SRAM is usually sufficient to track the accesses of fine-grain transactions; allocating additional SRAM arrays for pointer storage yields diminishing returns and elongates the test latency.

The pointer-mode SRAM morphs to filter mode when the pointers overflow a row. Morphing can be performed by a software handler in two steps. First, the address pointers are flushed to temporary storage. Then, these addresses are re-inserted into the empty Bloom filter. The morphing SRAM remains offline until the handler returns. The MOS-Tracker can still track transaction ownership during this window, as the filter-mode SRAMs have virtually continuous availability, so morphing is transparent to the C2D logic. The morphed SRAM can return to pointer mode when no core is in a transaction. If frequent morphing is expected or detected, the SRAM can be pinned to filter mode by the programmer or a runtime mechanism. Figure 7.7 depicts the MOS-Tracker's execution phases.

[Figure 7.7: Execution phases of the MOS-Tracker.]

The minimalist design with two SRAM arrays delivers good overall performance according to our experiments. Each SRAM in the design has 4096 bits organized as a 16x16 array with a word width of 16 bits. The pointer-mode SRAM stores 8 address pointers (32 bits each) in a row. If both SRAMs are in filter mode, the per-core filter is equivalent to a parallel Bloom signature with 2 hash functions and 256 bits per hash function. Combining the MOS-Trackers in all 16 home nodes, the aggregated address tracking capability is either 128 addresses per core or an 8Kbit signature per core. The total number of bits required is 128K. The hardware cost is compensated by the removal of the conventional signatures. The filter-mode organization is sketched below.
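The interleaved filter organization admits a direct model. In the sketch below, each SRAM carries a seed standing in for its bit-selection hash function; the row and word geometry mirrors the minimalist design, but the names are assumptions.

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int P = 16;              // cores, hence the word width in bits
constexpr std::size_t ROWS = 256;  // buckets per hash function

// One filter-mode SRAM interleaving all P per-core Bloom filters:
// row = bucket index, bit i of the row = core i's bucket.
struct FilterModeSram {
    uint64_t seed = 0;             // distinguishes the k hash functions
    std::vector<std::bitset<P>> rows = std::vector<std::bitset<P>>(ROWS);

    std::size_t hash(uint64_t addr) const { return ((addr >> 6) ^ seed) % ROWS; }

    std::bitset<P> test(uint64_t addr) const { return rows[hash(addr)]; } // one read
    void insert(int core, uint64_t addr) { rows[hash(addr)].set(core); }  // read-modify-write
    void resetCore(int core) {                                            // column flush-clear
        for (auto& r : rows) r.reset(core);
    }
};

// With k such SRAMs (k hash functions), the transactional sharer vector is
// the AND of the k per-SRAM vectors, as in a parallel Bloom signature.
std::bitset<P> combinedTest(const std::vector<FilterModeSram>& srams, uint64_t a) {
    std::bitset<P> v; v.set();
    for (const auto& s : srams) v &= s.test(a);
    return v;
}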
7.4.2 Analysis of False Positives in C2D

It is well known that false positives in Bloom-filter-based signatures indicate conflicts where no actual conflict exists. Thus, false positives are usually detrimental to performance, causing unnecessary stalls or aborts. In distributive conflict detection, false positives occur in either the read or the write signature; in the consolidated scheme, they occur in the MOS-Tracker. We use an analytical model to compare the frequency of false positives in the two conflict detection schemes assuming a similar hardware budget. Specifically, our model considers: i) two parallel Bloom signatures [SYHS07] tracking the read and write set of a transaction, and ii) one MOS-Tracker entry tracking the address set of a transaction. Each parallel Bloom signature uses k bit-selection hash functions and k SRAMs, each with m/k bits. n_r addresses are inserted into the read signature and n_w addresses into the write signature. Let us assume the probability of an address being a read address is P_r = n_r/(n_r + n_w). By extending the formal model proposed by Sanchez et al. [SYHS07], the probability of a false positive in the signatures is:

P_{FP\_SIG} = \frac{n_r}{n_r+n_w}\left(1 - e^{-\frac{n_r k}{m}}\right)^k + \frac{n_w}{n_r+n_w}\left(1 - e^{-\frac{n_w k}{m}}\right)^k \qquad (7.1)

The MOS-Tracker uses the same k hash functions and k SRAMs, each with 2m/k bits, so that the total bit count is equivalent to that of the two signatures. However, as the MOS-Tracker bookkeeps p address sets (one per transaction), only 2m/(kp) bits per hash function are dedicated to tracking one transaction's address set. Assuming a transaction's memory footprint is uniformly distributed among the p home nodes, the probability of a false positive in the MOS-Tracker is:

P_{FP\_MOS} = P_{of}\left(1 - e^{-\frac{(n_r+n_w)k/p}{2m/p}}\right)^k = P_{of}\left(1 - e^{-\frac{(n_r+n_w)k}{2m}}\right)^k \qquad (7.2)

where P_{of} is the probability of pointer overflow. P_{FP} is plotted in Figure 7.8 with P_{of} fixed at one to obtain the upper bound. As shown, if a fixed number of bits is invested to implement either the read and write signatures or the MOS-Tracker, the latter has less frequent false positives, as the combined tracking of the read and write set in one bit field improves the signature utilization [CD13].

[Figure 7.8: Probability of a false positive as a function of the number of addresses being inserted (panels for P_r = 0.75 and 0.95, m = 1024, k = 1 and 4, P_of = 1).]

The snippet below reproduces the two curves numerically.
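For readers who want to reproduce the comparison, the snippet below evaluates Equations (7.1) and (7.2) directly; the parameter values match one panel of Figure 7.8, and P_of is pinned to 1 as in the plotted upper bound.

#include <cmath>
#include <cstdio>

// Equation (7.1): false-positive probability of separate read/write signatures.
double pFpSig(double nr, double nw, double m, double k) {
    double pr = nr / (nr + nw), pw = nw / (nr + nw);
    return pr * std::pow(1.0 - std::exp(-nr * k / m), k)
         + pw * std::pow(1.0 - std::exp(-nw * k / m), k);
}

// Equation (7.2): false-positive probability of one MOS-Tracker entry.
double pFpMos(double nr, double nw, double m, double k, double pOf = 1.0) {
    return pOf * std::pow(1.0 - std::exp(-(nr + nw) * k / (2.0 * m)), k);
}

int main() {
    const double m = 1024, k = 4, pr = 0.75;    // one Figure 7.8 configuration
    for (double n = 200; n <= 2000; n += 200) { // total addresses inserted
        double nr = pr * n, nw = (1.0 - pr) * n;
        std::printf("n=%4.0f  P_FP_SIG=%.3f  P_FP_MOS=%.3f\n",
                    n, pFpSig(nr, nw, m, k), pFpMos(nr, nw, m, k));
    }
    return 0;
}

Because the combined entry spreads n_r + n_w insertions over 2m bits while the read signature packs n_r insertions into m bits, the MOS-Tracker curve stays below the signature curve whenever the read/write split is skewed.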
7.5 Evaluation

7.5.1 Methodology

We evaluate the efficacy of the proposed C2D technique using cycle-accurate full-system simulation [MCE+02]. The Ruby memory timing module [MSB+05] models a detailed memory subsystem. We use Garnet [AKPJ09] and Orion 2.0 [KLPS09] to model the timing and power of a packet-switched on-chip network, respectively. For our analysis, we collect execution statistics from at least twenty runs of each workload in the STAMP benchmark suite [MCKO08], which is extensively used by the TM community. Table 7.2 lists the input parameters of the benchmarks.

Table 7.2: Benchmark input parameters

Benchmark | Input Parameters | Abort Rate
Bayes | 32 var, 1024 records, 2 edge/var | 97.1%
Intruder | 2k flow, 10 attack, 4 pkt/flow | 77.6%
Labyrinth | 32x32x3 maze, 64 paths | 98.6%
Yada | 1264 elements, min-angle 20 | 47.9%
Genome | 32/512 nucleotides, 16384 segments | 1.3%
Kmeans | 65k pts, 32d, 16 clusters, thld 0.1e-4 | 7.4%
SSCA2 | 16k nodes, 9 len, 9 para edge | 0.3%
Vacation | 1M record, 4K req, 60% coverage | 38%

The simulation uses a tiled CMP as the baseline architecture. Each tile has an in-order SPARC core, L1 data and instruction caches, and an address-interleaved L2 bank. The L2 is inclusive. Cache coherence is maintained by a directory-based MESI protocol. The directory is distributed among the tiles by augmenting the L2 tags with directory information. The tiles are connected by a 2D-mesh router network. The canonical 4-stage router uses virtual-channel flow control and dimension-ordered routing. The link width is 128 bits. A coherence control message (64 bits) is transmitted in 1 flit, and a coherence data message (576 bits) requires 5 flits. We estimate the router and link energy for 40nm technology and a 0.9V on-chip voltage.

Table 7.3: Baseline system

Core | 16 SPARC V9 cores, 2GHz, in-order, CPI=1
L1 Cache | 32KB, split I/D, 4-way associative, write-back, 1-cycle latency
L2 Cache | 8MB, 8-way associative, 16 interleaved banks, 20-cycle latency
Coherence | MESI protocol, static cache bank directory
Memory | 4GB, 4 memory controllers, 200-cycle latency
Network | 4x4 2D mesh, DOR, 4 flits/VC, 2 VCs/vnet, 5 vnets, 2GHz 4-stage router, 128-bit links
HTM | 2x2Kbit signatures/core, 2 bit-selection hash functions, 256-entry log buffer, 25-cycle backoff

The baseline HTM system follows the log-based approach [MBM+06] for eager version management. A hardware log buffer is used to accelerate abort recovery. Cores track the transaction read and write sets with two 2Kbit parallel Bloom signatures [SYHS07]; each signature uses two bit-selection hash functions. Conflicts are detected eagerly and distributively by piggybacking onto the coherence protocol. The protocol is extended with sticky states [MBM+06] to allow transactions to overflow the L1. Conflicts are resolved in hardware using the time-based policy. A backoff mechanism delays transaction restarts by a fixed time to mitigate conflicts. Detailed system parameters are shown in Table 7.3.

For the evaluation of C2D, we apply the C2D technique to the baseline eager conflict detection (Eager-Base) approach to implement consolidated eager conflict detection (Eager-C2D). The coherence controller is augmented with a XACTS-Table, a MOS-Tracker and the C2D logic. Each MOS-Tracker has two 4Kbit SRAMs with the same hash functions as the baseline. These mechanisms operate in parallel with the L2 access, thereby not penalizing the 20-cycle L2 response latency. We also present results from a lazy HTM system (Lazy-Base), as lazy systems are considered more gracious in bandwidth utilization due to the laziness of their conflict detection. The lazy HTM is as described in [BMV+07].

7.5.2 Impact on Network Traffic

Figure 7.9 shows the impact of the various conflict detection schemes on network traffic, measured as router traversals by flits. All results are normalized to Eager-Base. As observed, the C2D technique reduces the network traffic by 39% (up to 57% in Labyrinth) when applied to the eager conflict detection scheme. Specifically, C2D reduces the transactional control traffic of the baseline eager system by 60% because the consolidation eliminates the need to interrogate remote transactions in order to detect conflicts. In particular, Labyrinth exhibits an 81% reduction in transactional control traffic. In Labyrinth, each thread executes coarse-grain transactions that read a global maze grid, calculate a path between two points, and add the path to the grid, so a writer transaction conflicts with all concurrent sharer transactions. In the conventional scheme, all the sharer transactions must be interrogated to detect the conflicts. Enabled by C2D, the home node can detect the conflicts, thereby significantly reducing the control messages devoted to interrogation. Meanwhile, C2D reduces the transactional data traffic by 34%, as it obviates the data transfer from the home node to requestors whose requests are nacked.

As the broadcast overhead exists only in unbounded HTM designs, we also performed experiments configuring C2D to remove all but the broadcast overhead in order to examine its impact on best-effort HTMs. The resulting average traffic reduction is 25% instead of 39%.
The traffic reduction in the four high-contention workloads is 38% instead of 40%. So, both unbounded and best-effort HTMs can benefit similarly from C2D.

Lazy HTM systems typically exhibit more gracious bandwidth utilization than eager systems due to the laziness of their conflict detection. However, lazy conflict detection does have undesirable characteristics, such as complicated overflow handling and longer abort recovery. The results in Figure 7.9 confirm that Lazy-Base generates 19% less traffic than Eager-Base. Nonetheless, Eager-C2D generates 24% less traffic than the lazy system. Thus, the C2D technique preserves the benefits of detecting conflicts eagerly without paying the extra bandwidth cost. Lazy systems can also benefit from C2D, as they too detect conflicts eagerly to isolate the memory updates of committing transactions; we leave such a study to future work.

[Figure 7.9: Normalized on-chip network traffic.]

Figure 7.10 shows the normalized coherence message count for each application. Eager-C2D reduces the coherence messages by 55% compared with Eager-Base and by 45% compared with Lazy-Base. The reduction in coherence message count translates directly into coherence traffic savings. So, unlike the TMNOC scheme, which reduces network traffic by reducing the average hop count [ZCCD13], C2D achieves traffic savings by reducing the coherence messages injected into the network.

[Figure 7.10: Normalized coherence message count.]

As discussed, processor cores need to notify the home node of two events: 1) the transaction end and 2) a transactional L1 hit. The notification traffic is already accounted for in the transactional control traffic of Eager-C2D in Figure 7.9. Here, we show the two types of notification traffic as a percentage of the total network traffic in Table 7.4. On average, the transaction end notifications account for 1.61% of the traffic, and the L1 hit notifications account for 2.16%. The added traffic is dwarfed by the traffic savings of the C2D scheme.

Table 7.4: Notification traffic in Eager-C2D

Benchmark | TX End | L1 Hit
Bayes | 0.06% | 0.22%
Intruder | 4.98% | 6.70%
Labyrinth | 0.04% | 0.20%
Yada | 0.24% | 1.09%
Genome | 2.07% | 2.91%
Kmeans | 1.14% | 1.66%
SSCA2 | 3.41% | 2.03%
Vacation | 0.91% | 2.48%

7.5.3 Reduction in Network Energy

One of the fundamental goals of the C2D technique is to reduce the network energy consumed by conflict detection. Figure 7.11 presents the normalized energy consumption of the on-chip network (including routers and links). On average, Eager-C2D achieves an energy saving of 27.1% compared with the distributed conflict detection in Eager-Base. Lazy-Base has a smaller network energy footprint than Eager-C2D in four workloads. However, the lazy scheme consumes significantly more energy than both eager schemes in Yada and SSCA2. Overall, Eager-C2D consumes 19% less network energy than Lazy-Base.

[Figure 7.11: Normalized network energy.]
The savings come mainly from reduced dynamic energy, as C2D reduces the number of flits transmitted over the network. Further investigation is needed to see whether even more energy savings could be achieved, given that fewer flits could result in substantial idle periods in the routers, making power gating feasible for static energy savings.

7.5.4 Impact on Performance

Figure 7.12 shows the normalized execution time. On average, the C2D scheme reduces execution time by 2.7%. Six workloads exhibit performance improvements of up to 16.3%. Two workloads (Intruder and SSCA2) are slowed down by around 5%. As discussed, C2D can avoid transaction aborts caused by unsuccessful requests. However, if a transaction needs to abort eventually, delaying the abort causes the transaction either to take longer to recover (Intruder) or to stall peer transactions longer (SSCA2). The results of these two workloads echo the findings in [CD10] that early termination of ultimately conflicting transactions can sometimes improve performance. Nonetheless, the majority of the workloads in our evaluation do benefit from the increased concurrency. As a matter of fact, the performance potential of C2D is not fully revealed here because network bandwidth is not a bottleneck in the evaluated workloads. The reduction in network traffic will lead to more significant performance improvements in use cases with intensive core-to-core communication (e.g., highly multithreaded applications and workload consolidation).

[Figure 7.12: Normalized execution time.]

The relative performance between Lazy-Base and the two eager systems varies from one workload to another. This variation also leads to the fluctuation in Lazy-Base's network energy consumption (see Figure 7.11). On average, Lazy-Base is 1.12x slower than Eager-C2D. However, many dimensions besides conflict detection (e.g., version management) contribute to the performance difference between lazy and eager HTMs [HCU+07, BMV+07]; the design tradeoffs within these dimensions are explored in a plethora of research proposals. Our technique focuses on bandwidth utilization and is shown to close the bandwidth utilization gap between eager and lazy systems with negligible impact (even a slight improvement) on overall performance.

7.5.5 Reduction in Conflict Detection

Performing conflict detection is an energy-intensive operation. Moreover, it can stall instruction retirement [JSG12], and the ongoing execution is further disrupted if the subsequent conflict resolution needs to trap into a contention manager. Thus, frequent conflict detection at each tile is undesirable. Figure 7.13 shows the number of conflict detections performed at each tile. We present results only from the applications with moderate to high contention; the results from the other applications show a trend similar to that of Yada. As observed, C2D reduces the number of conflict detections performed at most tiles, and the total number of conflict detections is reduced significantly. Thus, consolidating conflict detection reduces how often the processor cores are distracted to perform it.
Another important observation from Figure 7.13 is that the consolidation of conflict detection does not create a performance or traffic bottleneck, because no tile performs significantly more conflict detections than in the distributed scheme. In the C2D scheme, the per-home-node conflict detection count is essentially determined by the memory layout of the transactional data: home nodes of the conflict hot spots handle more conflicts. For instance, in Bayes and Intruder, 90% of the conflicts that cause transaction aborts occur on three memory blocks (see Figure 4.2). Thus, these two applications have a few home nodes that perform more frequent conflict detection. In comparison, the conflict hot spots in Labyrinth and Yada are more evenly distributed in the memory space, and so are the per-home-node conflict detections.

[Figure 7.13: Conflict detections performed at each tile.]

7.5.6 Sensitivity Study

The MOS-Tracker entry size is the bit field size dedicated to tracking the address set of one transaction. The entry size can affect the effectiveness of the C2D scheme, as it determines the probability of false positives. In the pointer mode, a larger entry can track more address pointers for a transaction, so the entry is less likely to morph into the imprecise filter mode. In the filter mode, a larger bit field results in fewer collisions when storing different hash values, thereby lowering the probability of false positives. We conducted a sensitivity analysis of the network traffic as well as the performance with varying MOS-Tracker entry sizes. In this study, the SRAMs are pinned to the filter mode to simulate the worst-case scenario (i.e., always imprecise). The results are shown in Figure 7.14, normalized to Eager-Base.

[Figure 7.14: Sensitivity to MOS-Tracker entry size.]

Regardless of the entry size, Eager-C2D consistently generates less network traffic than Eager-Base. A majority of the applications see a slight reduction in network traffic as the bit field size increases. The reduction is within 5% except for Bayes, which sees 19% less traffic when the entry size increases from 256 bits to 2048 bits. The execution time shows a similar sensitivity to the entry size: the performance change is within 7% in all applications but Bayes. The reason Bayes is sensitive to the MOS-Tracker entry size is that it mainly uses coarse-grain transactions with nearly 150 accesses (up to 3750 accesses) to distinct cache lines. In fact, transactions in Bayes are much coarser than the transactions in the remaining STAMP workloads. Thus, a small MOS-Tracker entry cannot track such a sizable set of addresses with a sufficiently low probability of false positives.
As the entry size increases beyond 1024 bits, the traffic and performance stabilize, indicating that a 1024-bit MOS-Tracker entry at each home node is adequate to track the address sets of the coarse-grain transactions in Bayes. Overall, both the traffic and the performance of the evaluated applications are relatively insensitive to the entry size of the MOS-Tracker; however, upcoming applications with coarse-grain transactions will benefit from a larger entry size.

7.5.7 Hardware Cost Estimation

We generated the 16x16x16-bit single-ported SRAM using a commercial memory compiler targeting 65nm TSMC bulk CMOS technology. The SRAM operates at 2GHz and supports power gating. The area of the SRAM array is 14346.9 um2, and the access time is less than half a cycle. We estimate the area overhead based on the 16-core Rock processor, which provides HTM support and is manufactured in 65nm bulk CMOS technology [TC08]. If the Rock processor augmented each of its 16 nodes with two of the 4Kbit SRAMs, it would add a meager 1.16% to the overall area.

7.6 Summary

The on-chip network bandwidth utilization of the conflict detection mechanism is important to the performance per joule of HTM-capable microprocessors. In this chapter, we first analyzed the on-chip traffic of a conventional conflict detection mechanism and found that the traffic overhead of conflict detection accounts for a significant portion of the total traffic. The root cause of this inefficiency is the distribution of the conflict detection capability to individual cores across the entire chip. To reduce the bandwidth utilization of conflict detection, we propose a novel technique that consolidates conflict detection into the logically central (but physically distributed) home nodes so that a home node can handle conflicts correctly and promptly without initiating further on-chip communication. While full-system simulations show that C2D has a negligible impact on performance (1.03x speedup), it dramatically reduces network traffic and can be implemented with a meager area overhead. The consolidation technique, when adopted in the conventional eager conflict detection mechanism, reduces the network traffic by 35%, thereby saving 27% of the network energy.

Chapter 8

Conclusion

As chip multiprocessor architectures rapidly evolve to incorporate abundant (possibly heterogeneous) processing elements to exploit thread-level and task-level parallelism, the capability to correctly and efficiently synchronize concurrent accesses is of paramount importance. The concept of Transactional Memory is an elegant programming paradigm that can facilitate developer- and performance-friendly synchronization in multithreaded applications. Hardware Transactional Memory accelerates transaction execution with specialized hardware components implemented in the processor. As technology scaling has continued to bring down the transistor cost, the latest products from major microprocessor vendors (e.g., IBM and Intel) have started to incorporate such HTM features. Although HTM has been researched intensively in the past decade, my research focuses on HTM for two reasons. First, recent evaluation results of industrial HTM implementations suggest that the design space of HTM is still far from well understood. Second, the tight coupling of HTM and the evolving parallel architecture continuously produces ample research opportunities.
8.1 Summary

The central theme of this dissertation is reducing the communication costs in HTM systems, as technology scaling trends dictate that data movement will be the dominant factor in energy dissipation in future machines. We derive a cost function for the on-chip communication in a generic transactional system from an analytical model. The cost function identifies the key contributing factors, thereby indicating viable avenues toward efficient communication in HTM systems.

Around this central theme, four hardware techniques are proposed to address the problem from different aspects of the system. In Selective Eager-Lazy HTM (SEL-TM), we propose a design with a mixed eager and lazy management policy at the granularity of a cacheline for individual transactions. SEL-TM attempts to capture the concurrency and communication benefits of a lazy policy by managing highly contended cachelines lazily, while avoiding the lazy policy's commit overhead and implementation complexity by managing the majority of a transaction's memory footprint eagerly. In TMNOC, we leverage an HTM-NOC co-design to implement an in-network filtering scheme that removes superfluous transactional requests caused by conflicts. The TMNOC routers dynamically monitor the conflicts between concurrent transactions and proactively drop requests with a high probability of failing due to conflicts. The TMNOC technique is effective in reducing inter-transaction communication at a meager hardware cost. In Predictive Unicast and Notification (PUNO), we identify a mismatch between the cache coherence protocol and the conflict detection mechanism, which leads to disruptive inter-transaction communication. The PUNO technique opportunistically replaces the exhaustive coherence forwarding from the home node with a unicast to the node that can handle the conflict. Moreover, transactions notify dependent transactions with the estimated time when the data will be available, suppressing frequent polling from the dependent transactions. In Consolidated Conflict Detection (C2D), we propose a novel conflict detection scheme that decouples individual nodes from the task of conflict detection. Instead, the task is performed by logically central but physically distributed agents mapped to the home nodes. The main advantage is reduced communication, because a home node need not contact remote nodes in order to detect conflicts. With cost-effective hardware to track transaction metadata at the home node, C2D achieves its efficacy with only marginal hardware costs.

8.2 Looking Forward

In general, HTM is an attractive solution to replace traditional lock primitives, and we have already seen commodity processors with the capability to execute transactions. Looking forward, there are still non-trivial obstacles to overcome before the TM concept becomes a powerful programming construct in practice. In what follows, I try to identify a few of these obstacles based on my research experience as a doctoral student.

The first issue is programmability. Although it has long been claimed that ease of programming is TM's main merit, the effort required to write TM programs on current HTM-capable microprocessors may prompt developers to reach the contrary conclusion. Those processors provide only "best-effort" transaction execution, thereby requiring developers to also provide a non-transactional version of the code as a fallback path.
If TM cannot fulfill its original promise of programmability, it faces a major hurdle preventing its large-scale adoption.

Secondly, tools should be developed for HTM performance debugging. As constantly observed in our experiments with the TM workloads from the STAMP benchmark, performance can be extremely sensitive to the transaction interleaving. Changes in the hardware configuration or, more likely in real systems, external events can cause non-negligible performance variation even for the same piece of code on the same machine configuration. Thus, software tools are needed so that developers can identify or even eliminate the sources of such variation.

Thirdly, infrastructure for high-fidelity simulation should be readily accessible to researchers. A few simulators from various research groups model HTM in detail. These simulators have been developed over almost a decade with different assumptions about HTM. Among them, GEMS provides a general and detailed HTM model along with elaborate modeling of the cache hierarchy and protocol, making it widely used in the research community. However, with its discontinuation, there is no clear substitute for modeling HTM, especially under mainstream ISAs. Moreover, architecting next-generation HTMs also requires a benchmark suite that can capture and summarize the common transaction characteristics of upcoming TM applications.

8.3 Reflection

In this section, I offer some reflection on the research in this dissertation with the benefit of hindsight. When I started to look into HTM as my research topic in 2009, there had already been a plethora of thought-provoking ideas published on the topic. With such abundant prior work, it is challenging for a rookie architect to develop a body of original research worthy of a PhD program. Initially, I devoted my effort to developing a high-performance conflict detection mechanism, as conflicts are the major source of performance loss. This effort partially led to the design of selective eager-lazy conflict detection in 2011. Techniques to achieve high performance had been studied extensively by then. Since late 2011, my research effort transitioned to investigating the interplay between HTM and the NOC, largely for three reasons. First, my adviser pointed out the potential of the topic. Second, the energy consumed by on-chip communication is becoming a major concern in chip design. Third, this specific topic had yet to be explored. This "interdisciplinary" study brought the rookie architect a fresh angle from which to look into HTM designs.

"Sequentiality is an illusion." Our brains think in parallel, yet we have grown used to instructing machines to operate in a sequential fashion. Bestowing parallelism upon machines certainly requires some means to bridge the difficulties of parallel programming, and many are exploring this wild and exciting frontier. Even though this thesis may be regarded as incremental or insignificant when we eventually trek through the wilderness to see the ocean, I am fortunate and grateful to be a tiny part of the exploration.

Bibliography

[AAK+05] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie. Unbounded transactional memory. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, HPCA '05, 2005.
[AKPJ09] N. Agarwal, T. Krishna, Li-Shiuan Peh, and N. K. Jha. Garnet: A detailed on-chip network model inside a full-system simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2009.

[ATKS06] Ali-Reza Adl-Tabatabai, Christos Kozyrakis, and Bratin Saha. Unlocking concurrency. Queue, 4(10):24–33, December 2006.

[BD06] James Balfour and William J. Dally. Design tradeoffs for tiled CMP on-chip networks. In Proceedings of the 20th Annual International Conference on Supercomputing, ICS '06, 2006.

[BDLM07] Colin Blundell, Joe Devietti, E. Christopher Lewis, and Milo M. K. Martin. Making the fast case common and the uncommon case simple in unbounded transactional memory. In Proceedings of the 34th International Symposium on Computer Architecture, 2007.

[BDM09] Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge. Proactive transaction scheduling for contention management. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, 2009.

[BDM11] Geoffrey Blake, Ronald G. Dreslinski, and Trevor Mudge. Bloom filter guided transaction scheduling. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA '11, 2011.

[Bec12] Daniel U. Becker. Efficient Microarchitecture for Network-on-Chip Routers. PhD thesis, Stanford University, 2012.

[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), July 1970.

[BMV+07] Jayaram Bobba, Kevin E. Moore, Haris Volos, Luke Yen, Mark D. Hill, Michael M. Swift, and David A. Wood. Performance pathologies in hardware transactional memory. In Proceedings of the International Symposium on Computer Architecture, 2007.

[BRM10] Colin Blundell, Arun Raghavan, and Milo M. K. Martin. Retcon: Transactional repair without replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, 2010.

[CBM+08] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory: Why is it only a research toy? Queue, 6(5):40:46–40:58, September 2008.

[CCC+07] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. A scalable, non-blocking approach to transactional memory. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, 2007.

[CD10] Woojin Choi and Jeffrey Draper. Locality-aware adaptive grain signatures for transactional memories. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010.

[CD11] Woojin Choi and Jeffrey Draper. Unified signatures for improving performance in transactional memory. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011.

[CD13] Woojin Choi and Jeffrey Draper. Improving utilization of hardware signatures in transactional memory. Parallel and Distributed Systems, IEEE Transactions on, 24(11), November 2013.

[CP12] Lizhong Chen and Timothy M. Pinkston. NoRD: Node-router decoupling for effective power-gating of on-chip routers. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, 2012.

[CTTC06] Luis Ceze, James Tuck, Josep Torrellas, and Calin Cascaval. Bulk disambiguation of speculative threads in multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA '06, 2006.
[DCW+11] Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, 2011.

[DFL+06] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel Nussbaum. Hybrid transactional memory. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, 2006.

[GHKM09] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Express cube topologies for on-chip interconnects. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, 2009.

[GW92] A. Gupta and W.-D. Weber. Cache invalidation patterns in shared-memory multiprocessors. Computers, IEEE Transactions on, 41(7), July 1992.

[HCU+07] T. Harris, A. Cristal, O. S. Unsal, E. Ayguade, F. Gagliardi, B. Smith, and M. Valero. Transactional memory: An overview. Micro, IEEE, 27(3):8–29, May 2007.

[HDV+11] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. Van Der Wijngaart. A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling. Solid-State Circuits, IEEE Journal of, 46(1):173–183, January 2011.

[HEM93] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th International Symposium on Computer Architecture, 1993.

[HLM06] Maurice Herlihy, Victor Luchangco, and Mark Moir. A flexible framework for implementing software transactional memory. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, 2006.

[HLMS03] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing, 2003.

[HOF+12] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and Changhoan Kim. The IBM Blue Gene/Q compute chip. Micro, IEEE, 32(2), March 2012.

[HPST06] Tim Harris, Mark Plesko, Avraham Shinnar, and David Tarditi. Optimizing memory transactions. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, 2006.

[HVS+07] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar. A 5-GHz mesh interconnect for a teraflops processor. Micro, IEEE, 27(5):51–61, September 2007.

[HWC+04] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional memory coherence and consistency. In Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA '04, 2004.

[Ita] Intel Itanium 2 Processor Reference Manual.

[JP09] Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks. Morgan & Claypool, 1st edition, 2009.

[JSG12] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional memory architecture and implementation for IBM System z. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, 2012.
[JTV10] S. A. R. Jafri, M. Thottethodi, and T. N. Vijaykumar. LiteTM: Reducing transactional state overhead. In Proceedings of the 16th International Symposium on High Performance Computer Architecture, 2010.

[KBK02] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[KCH+06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen. Hybrid transactional memory. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, 2006.

[KDA07] John Kim, William J. Dally, and Dennis Abts. Flattened butterfly: A cost-efficient topology for high-radix networks. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, 2007.

[KDK+11] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. Micro, IEEE, 31(5):7–17, September 2011.

[KLPS09] Andrew B. Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '09, 2009.

[Kni86] Tom Knight. An architecture for mostly functional languages. In Proceedings of the 1986 ACM Conference on LISP and Functional Programming, 1986.

[LL97] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA highly scalable server. SIGARCH Comput. Archit. News, 25(2), May 1997.

[LMG08] Marc Lupon, Grigorios Magklis, and Antonio González. Version management alternatives for hardware transactional memory. In Proceedings of the 9th Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, MEDEA '08, 2008.

[LMG09] Marc Lupon, Grigorios Magklis, and Antonio González. FASTM: A log-based hardware transactional memory with fast abort recovery. In Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques, 2009.

[LMG10] Marc Lupon, Grigorios Magklis, and Antonio González. A dynamically adaptable hardware transactional memory. In Proceedings of the 43rd International Symposium on Microarchitecture, 2010.

[MBJ07] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th International Symposium on Microarchitecture, 2007.

[MBM+06] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based transactional memory. In Proceedings of the 12th International Symposium on High Performance Computer Architecture, 2006.

[MCE+02] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35, 2002.

[MCKO08] Chi Cao Minh, JaeWoong Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In Proceedings of the International Symposium on Workload Characterization, 2008.

[MCS91] John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. In Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '91, 1991.
[MHS12] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache coherence is here to stay. Commun. ACM, 55(7):78–89, July 2012.

[MSB+05] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and David A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33, November 2005.

[MTC+07] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.

[NTGA+12] A. Negi, R. Titos-Gil, M. E. Acacio, J. M. García, and P. Stenström. π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory. In Proceedings of the 18th International Symposium on High Performance Computer Architecture, 2012.

[PB09] Salil M. Pant and Gregory T. Byrd. Limited early value communication to improve performance of transactional memory. In Proceedings of the 23rd International Conference on Supercomputing, ICS '09, 2009.

[RG02] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based programs. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, 2002.

[RHL05] Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing transactional memory. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA '05, 2005.

[RHP+07] Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Bhandari Aditya, and Emmett Witchel. TxLinux: Using and managing hardware transactional memory in an operating system. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, 2007.

[RRW08] Hany E. Ramadan, Christopher J. Rossbach, and Emmett Witchel. Dependence-aware transactional memory for increased concurrency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, 2008.

[SATH+06] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006.

[SBS93] Per Stenström, Mats Brorsson, and Lars Sandberg. An adaptive cache coherence protocol optimized for migratory sharing. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, 1993.

[SD09] Arrvindh Shriraman and Sandhya Dwarkadas. Refereeing conflicts in hardware transactional memory. In Proceedings of the 23rd International Conference on Supercomputing, ICS '09, 2009.

[SDS08] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled transactional memory support. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, 2008.

[SIS05] William N. Scherer III and Michael L. Scott. Advanced contention management for dynamic software transactional memory. In Proceedings of the 24th Symposium on Principles of Distributed Computing, 2005.

[SSN+89] K. Sawada, T. Sakurai, K. Nogami, T. Shirotori, Toshinari Takayanagi, T. Iizuka, T. Maeda, Junichi Matsunaga, H. Fuji, K. Maeguchi, K. Kobayashi, Tomoyuki Ando, Yoshiki Hayakashi, Akio Miyoshi, and Kazuyuki Sato. A 32 Kbyte integrated cache memory. Solid-State Circuits, IEEE Journal of, 24(4):881–888, August 1989.
Shirotori, Toshinari Takayanagi, T. Iizuka, T. Maeda, Junichi Matsunaga, H. Fuji, K. Maeguchi, K. Kobayashi, Tomoyuki Ando, Yoshiki Hayakashi, Akio Miyoshi, and Kazuyuki Sato. A 32 kbyte integrated cache memory. Solid-State Cir- cuits, IEEE Journal of, 24(4):881–888, Aug 1989. [Sub10] ASCAC Subcommittee. The opportunities and challenges of exascale computing. Technical report, US Department of Energy, 2010. [SYHS07] Daniel Sanchez, Luke Yen, Mark D. Hill, and Karthikeyan Sankaralingam. Implementing signatures for transactional memory. In Proceedings of the 166 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, 2007. [TC08] M. Tremblay and S. Chaudhry. A third-generation 65nm 16-core 32- thread plus 32-scout-thread cmt sparc processor. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE Inter- national, 2008. [TGNA + 11] Rub´ en Titos-Gil, Anurag Negi, Manuel E. Acacio, Jos´ e M. Garc´ ıa, and Per Stenstrom. Zebra: A data-centric, hybrid-policy hardware transac- tional memory design. In Proceedings of the International Conference on Supercomputing, ICS ’11, 2011. [TPK + 09] Saˇ sa Tomi´ c, Cristian Perfumo, Chinmay Kulkarni, Adri` a Armejach, Adri´ an Cristal, Osman Unsal, Tim Harris, and Mateo Valero. Eazyhtm: eager-lazy hardware transactional memory. In Proceedings of the 42nd International Symposium on Microarchitecture, 2009. [WGW + 12] Amy Wang, Matthew Gaudet, Peng Wu, Jos´ e Nelson Amaral, Martin Ohmacht, Christopher Barton, Raul Silvera, and Maged Michael. Eval- uation of blue gene/q hardware support for transactional memories. In Proceedings of the 21st International Conference on Parallel Architec- tures and Compilation Techniques, PACT ’12, 2012. [WZPM02] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and S. Malik. Orion: a power-performance simulator for interconnection networks. In Proceed- ings of 35th International Symposium on Microarchitecture, 2002. [YBM + 07] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris V olos, Mark D. Hill, Michael M. Swift, and David A. Wood. Logtm-se: Decoupling hardware transactional memory from caches. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA ’07, 2007. [YHLR13] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar. Performance evaluation of intel® transactional synchronization exten- sions for high-performance computing. In Proceedings of SC13: Interna- tional Conference for High Performance Computing, Networking, Storage and Analysis, SC ’13, 2013. [YL08] Richard M. Yoo and Hsien-Hsin S. Lee. Adaptive transaction scheduling for transactional memory systems. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’08, 2008. 167 [ZCCD13] Lihang Zhao, Woojin Choi, Lizhong Chen, and J. Draper. In-network traf- fic regulation for transactional memory. In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on, Feb 2013. [ZSH + 10] Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman S. Unsal, Adri´ an Cristal, Ibrahim Hur, and Mateo Valero. Discovering and understanding performance bottlenecks in transactional applications. In Proceedings of the 19th International Conference on Parallel Architectures and Compila- tion Techniques, PACT ’10, 2010. 168
Abstract
The architectural challenges of reaching extreme-scale computing necessitate major progress in designing high-performance and energy-efficient hardware building blocks, such as microprocessors. The chip multiprocessor (CMP) architecture has emerged as the preferred solution for exploiting increasing transistor density to sustain performance improvement. As core counts keep scaling up, developing parallel applications that reap commensurate performance gains becomes imperative and of paramount importance. Hardware Transactional Memory (HTM) promises increased productivity in parallel programming, yet recent research in academia and industry suggests that the design space and tradeoffs of HTM are still far from well understood. To pave the way for more HTM-enabled processors, two crucial issues in HTM design must be addressed: achieving high performance under frequent transaction conflicts, and designing energy-efficient HTM techniques. Both issues demand efficient communication during transaction execution. This dissertation contributes a set of hardware techniques to achieve efficient and scalable communication in such systems.

First, we contribute the Selective Eager-Lazy HTM system (SEL-TM), which leverages the concurrency and communication benefits of lazy version management while suppressing its complexity and overhead with eager management. This mixed-mode execution generates 22% less network traffic on high-contention workloads representative of upcoming TM applications and improves performance by at least 14% over either a pure eager or a pure lazy HTM.

Second, we contribute Transactional Memory Network-on-Chip (TMNOC), an in-network filtering mechanism that proactively filters out pathological transactional requests that waste network-on-chip bandwidth. TMNOC is the first published HTM-network co-design. Experimental results show that TMNOC reduces network traffic by 20% on average across the high-contention workloads, thereby reducing network energy consumption by 24%.

Third, we mitigate the disruptive coherence forwarding that arises in transactional execution when the cache coherence protocol is reused for conflict detection. We address the problem with a Predictive Unicast and Notification (PUNO) mechanism, which reduces transaction aborts by 43% on average and avoids 17% of the on-chip communication.

Fourth, we propose Consolidated Conflict Detection (C2D), a holistic solution that addresses the communication overhead of conflict detection with cost-effective hardware designs. Evaluations show that C2D, when used to implement eager conflict detection, eliminates 39% of the on-chip communication, yielding a corresponding energy saving of 27%.
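To make the conflict-detection theme above concrete, the sketch below models in software the signature-based read/write-set tracking and eager conflict check that many HTM designs implement in hardware. It is a minimal illustration only, not the C2D or PUNO hardware proposed in this dissertation; the class name TxSignature, the 1024-bit signature width, and the single hash function are assumptions chosen for readability.

    #include <bitset>
    #include <cstdint>
    #include <cstddef>

    // Minimal software model of per-core transactional signatures.
    // A real HTM implements these bit vectors and hashes in hardware
    // and checks them against incoming coherence requests.
    class TxSignature {
        static const std::size_t kBits = 1024;  // assumed signature width
        std::bitset<kBits> readSig;
        std::bitset<kBits> writeSig;

        // Hash a cache-block address into the signature; hardware designs
        // typically use several independent hashes to cut false positives.
        static std::size_t hash(std::uintptr_t addr) {
            return ((addr >> 6) * 2654435761u) % kBits;  // drop 64B block offset
        }

    public:
        void recordRead(std::uintptr_t addr)  { readSig.set(hash(addr)); }
        void recordWrite(std::uintptr_t addr) { writeSig.set(hash(addr)); }

        // Eager conflict check against a remote access: a remote write
        // conflicts with any local access; a remote read only with local writes.
        bool conflictsWith(std::uintptr_t addr, bool remoteIsWrite) const {
            const std::size_t h = hash(addr);
            return remoteIsWrite ? (readSig.test(h) || writeSig.test(h))
                                 : writeSig.test(h);
        }

        void clear() { readSig.reset(); writeSig.reset(); }  // on commit or abort
    };

Because the signatures are lossy, a hit may be a false positive; that can trigger an unnecessary abort but never a missed conflict, which is what makes compact signatures safe to build in hardware.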
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Improving the efficiency of conflict detection and contention management in hardware transactional memory systems
Efficient techniques for sharing on-chip resources in CMPs
Design of low-power and resource-efficient on-chip networks
Communication mechanisms for processing-in-memory systems
Dynamic packet fragmentation for increased virtual channel utilization and fault tolerance in on-chip routers
Energy efficient design and provisioning of hardware resources in modern computing systems
Resource underutilization exploitation for power efficient and reliable throughput processor
SLA-based, energy-efficient resource management in cloud computing systems
Energy proportional computing for multi-core and many-core servers
Demand based techniques to improve the energy efficiency of the execution units and the register file in general purpose graphics processing units
Improving efficiency to advance resilient computing
Improving reliability, power and performance in hardware transactional memory
Power efficient design of SRAM arrays and optimal design of signal and power distribution networks in VLSI circuits
Thermal analysis and multiobjective optimization for three dimensional integrated circuits
Enabling energy efficient and secure execution of concurrent kernels on graphics processing units
Towards green communications: energy efficient solutions for the next generation cellular mobile communication systems
Efficient memory coherence and consistency support for enabling data sharing in GPUs
Architectural innovations for mitigating data movement cost on graphics processing units and storage systems
Performance improvement and power reduction techniques of on-chip networks
Cache analysis and techniques for optimizing data movement across the cache hierarchy
Asset Metadata
Creator: Zhao, Lihang (author)
Core Title: Hardware techniques for efficient communication in transactional systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 07/01/2014
Defense Date: 04/25/2014
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: computer architecture, energy efficiency, microprocessor, on-chip network, parallel architecture, parallel programming, transactional memory
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Draper, Jeffrey (committee chair), Annavaram, Murali (committee member), Gupta, Sandeep K. (committee member), Nakano, Aiichiro (committee member), Pinkston, Timothy M. (committee member)
Creator Email: lihangzh@usc.edu, LIHANGZHAO@GMAIL.COM
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-427031
Unique Identifier: UC11286746
Identifier: etd-ZhaoLihang-2595.pdf (filename), usctheses-c3-427031 (legacy record id)
Legacy Identifier: etd-ZhaoLihang-2595.pdf
Dmrecord: 427031
Document Type: Dissertation
Rights: Zhao, Lihang
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA