DYNAMICALLY RECONFIGURABLE ON- AND OFF-CHIP NETWORKS

by

Bilal Zafar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

May 2011

Copyright 2011 Bilal Zafar

Dedication

I dedicate this dissertation to three people whose love, patience and prayers have sustained me through this journey: my father, Zafar, my mother, Samina, and my wife, Nausheen.

Acknowledgments

After all these years, it is a pleasure to finally thank all those who made this dissertation possible. But for the grace of the Almighty and their support, this would not have happened.

My thanks to my adviser, Dr. Jeff Draper. His advice and suggestions were always instrumental, his patience always a blessing, but perhaps what mattered the most was the faith he showed in me at a crucial juncture. And for that, I will always be grateful. Thanks also to Dr. Timothy Pinkston for his support, advice and help throughout my time at USC. He gave generously of his time and attention, and I learned a lot from him; to Professor Gandhi Puvvada for being an inspirational teacher, guide and mentor; and to Dr. D. N. (Jay) Jayasimha, with whom I worked at Intel but continue to learn from even today.

Thanks are due to members of my qualifying and dissertation committees, Drs. Viktor Prasanna, Aiichiro Nakano, Peter Beerel, Michel Dubois, and Leana Golubchik. Also instrumental in getting me across the finish line was the help and support of Dr. Alexander (Sandy) Sawchuk and Todd Brun. I would also like to acknowledge my collaborators, Jose Duato (Universidad Politécnica de Valencia), Aurelio Bermúdez, Yatin Hoskote (Intel), and David Skinner (Lawrence Berkeley National Laboratory).

A special note of gratitude goes out to Diane Demetras, Academic Advisor at the Department of Electrical Engineering at USC. She was always there to offer help and advice, and went beyond her responsibilities when I needed her most. I would like to thank the members of the two research groups I had the pleasure to be a member of, especially my dear friend Jeonghee Shin. The list of friends and colleagues who made this long journey worth every minute of it is too long, but to all of you, thank you.

And, finally, a thanks to the people who sacrificed the most: my family. In my parents, my children and wife, my in-laws and my sister, I have had the solid foundation that this long and sometimes difficult journey required. I asked too much from them, and they always gave more love, more encouragement and more support than I asked for. So, I close by thanking the Almighty for blessing me in so many ways, especially for my family.

Table of Contents

Dedication
Acknowledgments
List of Tables
Abstract

Chapter 1: Introduction
  1.1 Infiniband Architecture
  1.2 On-Chip Interconnection Networks
  1.3 Design Challenges for On-Chip Networks
  1.4 Research Contributions
  1.5 Organization

Chapter 2: Background and Related Work
  2.1 Dynamic Reconfiguration in Off-Chip Networks
    2.1.1 Reconfiguration-Induced Deadlocks
  2.2 Previous Work on Dynamic Reconfiguration
  2.3 Dynamic Reconfiguration in On-Chip Networks
  2.4 Characterizing Bandwidth Demand of Parallel Applications
    2.4.1 Platform
    2.4.2 Communication Behavior
  2.5 Previous Work on Power Reduction in On-Chip Networks

Chapter 3: Dynamic Reconfiguration over InfiniBand™
  3.1 The Double Scheme
  3.2 Exploitable Features of IBA
    3.2.1 Routing
    3.2.2 Service Levels and Virtual Lanes
    3.2.3 Forwarding Tables
    3.2.4 Queue Pairs
    3.2.5 Path Establishment
  3.3 Applying the Double Scheme to InfiniBand Architecture
  3.4 Proposed Reconfiguration
  3.5 Algorithm to Detect VL Drainage
    3.5.1 Implementation
  3.6 Example
  3.7 Performance Evaluation
    3.7.1 Simulation Platform
    3.7.2 Performance Comparison with Static Scheme
    3.7.3 Reconfiguration Cost
    3.7.4 FT Computation and Distribution (T_FT)
    3.7.5 Changing SL-to-VL Mapping (T_SL2VL)
    3.7.6 Detecting Channel Drainage (T_Drain)
  3.8 Summary

Chapter 4: On-Chip Interconnection Networks
  4.1 OCIN: What is different?
    4.1.1 Wiring
    4.1.2 Embedding in 2D
    4.1.3 Latencies are of a different order
    4.1.4 Power
  4.2 Metrics
    4.2.1 Wiring Density
    4.2.2 Wire Segments: Measure of Delay, Power
    4.2.3 Router Complexity
    4.2.4 Power
    4.2.5 Performance Measures
  4.3 Comparing Topologies
    4.3.1 Bi-directional Ring
    4.3.2 2D Mesh
    4.3.3 2D Torus
    4.3.4 3D Mesh
    4.3.5 3D Torus and Higher-Dimensional k-ary n-cube Topologies
    4.3.6 Direct Fat Tree
    4.3.7 Cube Connected Cycles
    4.3.8 Hierarchical Ring Topology
    4.3.9 Results
  4.4 Summary and Open Issues

Chapter 5: Cubic Ring Network
  5.1 Introduction
  5.2 Ring-Based Topologies
  5.3 Formal Description of Cubic Ring Topology
    5.3.1 Mixed-Radix and Isomorphic Cubic Ring Networks
  5.4 Routing Function for Cubic Ring Networks
    5.4.1 Preliminaries
    5.4.2 cRing Routing Function
    5.4.3 Flow Control
    5.4.4 Deadlock Freedom
  5.5 Characterization of cRing Networks
    5.5.1 Average Distance
    5.5.2 Maximum Distance
    5.5.3 Bisection Bandwidth
  5.6 Performance Evaluation
    5.6.1 Simulation Results
  5.7 Summary

Chapter 6: Dynamic Reconfiguration of Cubic Ring Networks
  6.1 Routing Function
    6.1.1 Torus Routing: R_torus
    6.1.2 cRing Routing: R_cRing
  6.2 Dynamic Reconfiguration of Cubic Ring Network
    6.2.1 Definitions
    6.2.2 R_torus to R_cRing Reconfiguration
    6.2.3 R_cRing to R_torus Reconfiguration
    6.2.4 R_cRing1 to R_cRing2 Reconfiguration
    6.2.5 R_cRing2 to R_cRing1 Reconfiguration
  6.3 Proof of Deadlock Freedom
    6.3.1 Sufficient Conditions for Deadlock-free Reconfiguration
    6.3.2 Deadlock Freedom of R_torus to R_cRing Reconfiguration
    6.3.3 Deadlock Freedom of R_cRing to R_torus Reconfiguration
  6.4 Implementation and Evaluation
    6.4.1 Power-gated Router Design

Chapter 7: Conclusions
  7.1 Future Work

Bibliography

List of Tables

1.1 On-chip networks for chip multiprocessors.
2.1 Comparison of the off times and lengths of off intervals in parallel applications.
3.1 Cost of various reconfiguration steps.
3.2 Scaling of T_FT and T_SL2VL with network size.
4.1 Comparison of link widths in off- and on-chip interconnects.
4.2 Maximum number of wires per tile edge at different wire pitch for tile size of 90Kλ × 90Kλ.
4.3 Maximum number of wires per tile edge at different wire pitch for tile size of 140Kλ × 140Kλ.
4.4 Comparison of four 64-node H-Ring topologies.
4.5 Comparing the cost of topologies with 64 nodes.
4.6 Comparison of performance of topologies with 64 endnodes.
6.1 Comparison of active area and static power of 5-port and 3-port routers in 90nm.
6.2 Comparison of active area and static power of 5-port router without power-gating and with power-gating in 90nm.

List of Figures

1.1 Share of interconnect families over time as a percentage of all systems in the June 2010 release of the Top 500 list.
2.1 Injected bandwidth of (a) GTC, and (b) PMEMD over the entire execution, viewed at 1s resolution.
2.2 Injected bandwidth of GTC viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
2.3 Injected bandwidth of AORSA2D viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
2.4 Injected bandwidth of IMPACT-T viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
2.5 Injected bandwidth of NAMD viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
2.6 Injected bandwidth of PARATEC viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
3.1 Linear Forwarding Tables in IBA.
3.2 Random Forwarding Tables in IBA.
3.3 The old and the new routing functions implemented on two independent sets of LID/port in a Random Forwarding Table.
3.4 An example IBA subnet.
3.5 CDG for the XY-routing function.
3.6 An IBA subnet consisting of eight switches and seven end nodes.
3.7 Simulation results for (a) static and (b) Double Scheme dynamic reconfiguration for the IBA subnet shown in Figure 3.6. Reconfiguration starts at time 61 sec.
3.8 The number of drainage GMPs versus the load rate.
4.1 An illustration of a 16-core tiled chip multiprocessor with routers (R) at each tile.
4.2 A (a) 3D mesh network (b) embedded on a 2D die.
4.3 Illustrating equivalent wiring budget across (a) unidirectional ring, (b) bidirectional ring, and (c) 2D Mesh/Torus topologies.
4.4 A comparison of link and crossbar power.
4.5 A 64-node Fat Tree network.
4.6 Layout of a 64-node Fat Tree network and one of the eight routers with one 8× port shown.
4.7 (a) 16-node H-Ring with one global ring and (b) its corresponding layout.
4.8 (a) 64-node H-Ring with one global ring and four taps per local ring, and (b) its corresponding layout.
5.1 A 64-node (a) 4-ary 3-cube torus network, (b) 4-ary 3-cube R-ring cubic ring network with R = {0001, 0101, 1111}, (c) the xy plane of the 4-ary 3-cube cRing network at z = 0, and (d) the yz plane at x = 0.
5.2 An 8-node 2,4-ary 2-cube R-ring cRing network with (a) R = {1010, 11}, (b) R = {0101, 11}, and (c) R = {1001, 11}.
5.3 The cRing routing algorithm.
5.4 Routing of a message in a 4-ary 3-cube cRing with R = {0001, 0001, 1111}.
5.5 Average distance for 16- and 64-node cRing networks in all configurations, with optimally placed global rings.
5.6 Percentage increase in average distance from the torus network in cRing networks of different sizes and different number of global rings.
5.7 Percentage of network links turned off and average distance for 256-node cRing networks in all configurations.
5.8 Performance results comparing a 16-node two-dimensional torus network, a cRing network, and a torus network that uses West-Last, East-Last routing, under (a) random uniform, (b) perfect shuffle and (c) local traffic with radius 4.
5.9 Performance results comparing a 64-node two-dimensional torus network, a cRing network and a torus network that uses West-Last, East-Last routing, under (a) random uniform, (b) perfect shuffle and (c) local traffic with radius 4.

Abstract

Computing systems are increasingly becoming communication bound. From mega-watt servers in data centers to mobile and embedded systems, the cost, performance and reliability of the communication fabric has rapidly become a first-order design concern in digital systems at every scale. At the same time, traditional solutions to connecting resources are fast approaching their natural limits, beyond which they will not be able to offer the requisite performance at acceptable area, power or dollar cost.

The growing need for a low-latency, high-bandwidth interconnect fabric for cluster systems has made InfiniBand Architecture (IBA)-compliant devices the fabric of choice in high-performance systems. IBA is an open, industry-standard specification that is designed specifically for high performance, quality of service guarantees and high robustness. For high robustness, availability and dependability, IBA-compliant networks must be able to adapt to changes in the topology caused by, for example, hot-swapping of components or faults in the network, without negatively impacting the end-to-end latency of application traffic. In this work, we present a mechanism for InfiniBand networks to reconfigure the routing algorithm without stopping injection of packets, discarding routable packets already in the network, or significantly reducing the available bandwidth. The proposed mechanism uses a deadlock-free dynamic reconfiguration scheme called the Double Scheme and features available in IBA to seamlessly update the active routing function in seven steps with very little overhead. Simulation results show that, using the proposed mechanism, InfiniBand networks remain available for application traffic during the entire reconfiguration process. This availability in the presence of failure comes at the cost of only a slight increase in the reconfiguration time.
Challenges facing traditional interconnects are not limited to off-chip interconnect fabrics like InfiniBand networks. Point-to-point wires, fully-connected crossbars and shared busses, which have been used as communication fabrics in multi-core processors, also face a myriad of challenges such as scalability, power dissipation and wiring complexity. Packet-switched on-chip interconnection networks have been proposed as a solution, and this work prunes the design space of topologies for these on-chip interconnection networks using relevant quantitative and qualitative metrics. Our analysis shows that 2D mesh, torus and hierarchical ring topologies meet most of the requirements for interconnects in moderate-sized many-core CMPs.

As the number of cores per die grows, power dissipation in the on-chip interconnection network will increasingly become a key barrier to scalability. Studies have shown that on-chip networks can consume up to 36% of the total chip power, while analysis of network traffic reveals that for extended periods of execution time, network load is well below the network capacity in many applications. We analyze seven representative parallel applications to quantify the temporal variability in the bandwidth demand of typical applications. Analysis shows that in most of the characterized applications the demand for network bandwidth varies across applications as well as over time during the execution of the same application. In five of the seven analyzed applications, the bandwidth demand is close to zero for over 90% of the execution time. More importantly, the off periods are 1ms or longer, which means that there is a significant opportunity to turn network resources on and off based on the temporal variations in the bandwidth demand of the application.

To address the power dissipation problem by exploiting the temporal variability in the bandwidth demand of applications, this work proposes the Polymorphic Cubic Ring (pcRing) topology. Enabled by a deadlock-free dynamic reconfiguration protocol, a pcRing network provides an elegant way to trade off network bandwidth for lower (static) power without a significant latency penalty. A complete formalism for the proposed Cubic Ring topology, the associated routing algorithm and a deadlock-free dynamic reconfiguration mechanism is presented in this work. Performance of 16- and 64-node cRing networks in various configurations is evaluated via simulation, and the power savings are quantified by synthesizing a polymorphic router with power-gated ports.

Chapter 1

Introduction

Computing systems are increasingly becoming communication bound. From mega-watt servers that are the backbone of cloud computing and Petaflop supercomputers, to mobile and embedded systems, the cost, performance and reliability of the communication fabric has rapidly become a first-order design concern in computing systems at every scale. At the same time, traditional solutions to connecting resources are fast approaching their natural limits, beyond which they will not be able to offer the requisite performance at acceptable area, power or dollar cost. This increasing gap between the projected cost, performance and reliability of existing communication fabrics, and what will be required in future systems, has led to significant innovations in the design of communication networks over the past decade. Communication networks that connect different devices within a computer have become especially important recently.
Referred to as "interconnection networks," such networks were traditionally studied as part of the architecture of large multiprocessor computers or telecommunication switches. However, as the degree of integration continues to increase according to Moore's law, the core concepts and principles of interconnection networks are now being applied to the communication fabric on the chip. This has led to the addition of a fourth domain – called On-Chip Interconnection Networks (OCINs) – to the traditional classification of interconnection networks as comprising three domains: System Area Networks (SANs), Local Area Networks (LANs) and Wide Area Networks (WANs) [PD06]. This work focuses on on-chip and system area networks that employ packet-switching as the underlying switching mechanism.

Packet-switched on-chip networks, also referred to as networks-on-chip (NoCs), are used to connect functional elements within a system-on-chip. These functional elements may be special-purpose hardware accelerators/engines, processing cores, memory blocks, or any combination of the three. The physical links for OCINs comprise metal wires etched on the surface of insulating layers over the semiconductor substrate housing the functional elements. The routers that switch packets over these wires share the substrate with the functional elements. OCINs that have been implemented in commercial chip multiprocessors typically connect from a few to a few tens of nodes (functional elements, in the case of OCINs), but this number is expected to increase as the integration density continues to double every two years.

Unlike OCINs, SANs have been researched and implemented since the earliest days of computing. As networks used to connect processors to the main memory and/or to each other in large multiprocessor systems, SANs implemented in today's high performance systems connect thousands of nodes over tens of meters of physical distance. Typically, physical links are optical or electrical cables and metal traces running through backplanes. Each router is implemented as a separate device connected to the network (common in cluster systems), a separate chip on the compute node, or as part of the microprocessor (e.g., the Alpha 21364 processor [MBL+01]).

Despite some important differences between OCINs and SANs, networks in both domains share the broad set of cost constraints and performance objectives that they are designed to meet. Maximizing the amount of traffic that the network can transport per unit of time (bandwidth) and minimizing the average end-to-end latency between all the source-destination pairs are key measures of performance in both domains. Similarly, the average power consumption of the network is the most stringent cost constraint that both on- and off-chip networks are designed to minimize. Also, by virtue of the fact that both OCINs and SANs are designed to carry similar interprocessor and processor-to-memory traffic, they also share the design space of preferred topologies, routing algorithms and flow control mechanisms. Owing to these similarities, both OCINs and SANs face similar challenges, and innovative approaches developed for networks in one domain are often useful for networks in the other domain. As networks become central to the design of computing systems, they must be able to adapt at run-time.
That is, the network must be able to re-draw the routing maps and route packets according to these new maps if a voluntary or involuntary change in the physical connectivity of the links occurs. This is important in SANs because, with tens of thousands of links and routers forming the network, failures of links or routers are not infrequent. And, if the network lacks the ability to adapt – or reconfigure – the entire system can become unusable when a single cable is unplugged or a device is swapped. Furthermore, in high performance SANs the ability to reconfigure is simply not adequate if it interrupts the flow of application traffic. Therefore, robust, highly dependable and high performance SANs must be able to reconfigure their routing paths without interrupting application traffic. In short, they require the ability to reconfigure dynamically.
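For concreteness, the basic ingredient that such schemes build on can be illustrated with a toy model. The sketch below is an assumption-laden illustration only (all names are hypothetical, and this is not the Double Scheme protocol developed in Chapter 3): a switch that keeps two forwarding-table banks can hold an old and a new routing function at the same time, so new packets can follow updated routes while packets already in flight continue on the routes they were injected under.

```python
# Illustrative sketch only: a double-buffered forwarding table that lets a
# new routing function be installed while the old one keeps serving
# in-flight packets. Names and structure are assumptions for illustration.
class DoubleBufferedForwardingTable:
    def __init__(self, num_destinations):
        self.banks = [[None] * num_destinations, [None] * num_destinations]
        self.active = 0                       # bank used for newly injected packets

    def lookup(self, destination, bank=None):
        # In-flight packets remember the bank they were injected under so
        # each packet follows one consistent routing function end to end.
        return self.banks[self.active if bank is None else bank][destination]

    def install(self, new_routes):
        spare = 1 - self.active
        self.banks[spare] = list(new_routes)  # write the new routing function
        return spare

    def activate(self, bank):
        # Switched only after packets using the old routes have drained.
        self.active = bank
```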
Given their smaller size and fixed wiring, failures of links or routers are not major concerns in on-chip networks. But on-chip networks must be able to reconfigure to exploit the variability in the demand for network bandwidth. By dynamically matching the available bandwidth to the application's demand, a more power-efficient on-chip network can be designed. Much like their system area counterparts, reconfigurable OCINs must also modify the routing paths without interrupting application traffic. Thus, dynamic reconfiguration is a highly desirable feature in high performance and power-efficient on-chip networks.

This work is focused on the problem of dynamic reconfiguration in both on-chip networks and off-chip system/storage area networks. For SANs, a reconfiguration protocol employing only the features available in the popular SAN architecture, InfiniBand Architecture (IBA), is proposed. The proposed protocol is based on an existing theory about deadlock-free dynamic reconfiguration in networks with arbitrary routing functions. Experimental results show that the proposed protocol provides an efficient mechanism for modifying the routing function in an IBA-compliant network at very low overhead. To the best of our knowledge, this is the first implementation of a dynamic reconfiguration protocol over any industry-standard network specification.

For on-chip networks, a new network architecture is proposed in this work. This new architecture, called the Polymorphic Cubic Ring Network, allows parts of the network to be turned on and off dynamically without interrupting the flow of traffic through the network. This allows dynamic power management in the OCIN. The most important contribution of this part of our work is in demonstrating how, by designing the network topology, routing function and flow control mechanism in tandem, an efficient and light-weight mechanism for dynamic reconfiguration can be established. We describe the proposed network architecture in detail and evaluate the performance and power savings of the Polymorphic Cubic Ring network.

1.1 Infiniband Architecture

InfiniBand™ Architecture (IBA) [Inf00] is an industry-standard specification for a low-latency, high-bandwidth switched interconnect fabric that has been gaining acceptance within the high performance computing (HPC) community, especially in the cluster server market. Figure 1.1 illustrates that over the past five years, the share of total performance (measured in floating point operations per second (flops) when running the Linpack Benchmark) of systems using InfiniBand has been steadily increasing. In the June 2010 release of the top500.org list, 41% of the flops delivered by the Top 500 machines came from machines using InfiniBand, and 14 of the top 25 systems use InfiniBand, including the Nebulae (ranked #2 in performance) and Roadrunner (#3) machines.

Figure 1.1: Share of interconnect families over time as a percentage of all systems in the June 2010 release of the Top 500 list.

As InfiniBand-based networks are deployed in high-end systems, these networks are expected to provide not only high-performance communication but also high reliability, availability and dependability. These attributes are increasing in importance with the emergence of bandwidth-hungry applications with strict latency requirements such as cloud computing, on-demand high-definition video/audio processing, distributed online transaction processing, database and decision support systems, and the like. Such applications impose a great demand on the communication subsystem not only to be high-performance but also to be highly robust.

To this end, the IBA specification includes various features and mechanisms useful for improving system reliability and availability. Features such as subnetwork management, service levels, data and control virtual lanes, table-driven routing, end-to-end path establishment, packet time-out and virtual destination naming are all useful for implementing reconfiguration functions in IBA networks. To be truly dependable, however, IBA-compliant networks should be capable of deadlock-free dynamic reconfiguration, i.e., able to efficiently adapt to changes in real-time if and when voluntary or involuntary changes occur. That is, IBA networks should remain up and running with high performance in the presence of hot-swapping of components, failure or addition of links/nodes, activation or deactivation of hosts/switches, partitioning or isolation of network components, etc., arising from changes in users' needs and/or system state. The reliability, availability, and performance predictability (overall, the dependability) of IBA-compliant cluster computing and storage systems depend critically on the network's ability to efficiently support such functions while maintaining certain service level targets.

1.2 On-Chip Interconnection Networks

Problems facing the traditional on-chip communication fabric have also become more prominent in recent years. Point-to-point wires were long the mainstay of on-chip communication in general-purpose processors. But, during the past decade, the fundamental paradigm in microprocessor design has shifted away from monolithic single-core processors optimized for higher frequencies, toward power-efficient multi-core designs. This shift has been enabled by shared on-chip interconnects, as wire delay scalability and on-chip power dissipation made point-to-point wires between cores infeasible.

A majority of the first generation commercial chip multiprocessors (CMPs), often implemented as "multi-core" microprocessors [SCG+08][DSC+07][Poo05][SKT+05][J.05][HHS+00], rely on shared medium interconnects such as busses, or simple switched interconnects like fully-connected crossbars, as the communication fabric between cores. The Cell Broadband Engine, which uses a ring, is a notable exception [AP07]. But, with its central arbitration, from a scalability standpoint, Cell's ring interconnect also suffers from limitations similar to shared interconnects in most other commercial CMPs.
As the number of cores implemented on the same die continues to grow and general-purpose microprocessors enter into the so-called "many-core" era [BDK+05], studies have shown that these simple interconnects will not be able to meet the area, power and performance demands of future CMPs [JZH07][KZT05]. Packet-switched on-chip networks have emerged as the de facto communication fabric for these highly parallel CMPs [TLM+04][Kar06][Sri07]. Packet-switching allows sharing of wiring resources between many communication flows, thereby providing high bandwidth at lower implementation cost. In addition, structured and localized wiring of OCINs simplifies timing convergence and enables robust designs to scale with device performance. Table 1.1 summarizes a few on-chip networks implemented in CMPs.

CMP               # Endnodes   Topology   Bisection Bandwidth   Frequency
MIT RAW           16           2D Mesh    217 Gb/sec            425 MHz
UT Austin TRIPS   40           2D Mesh    512 Gb/sec            500 MHz
Intel Teraflops   80           2D Mesh    2.6 Tb/sec            5 GHz

Table 1.1: On-chip networks for chip multiprocessors.

At the beginning of this decade, packet-switched on-chip networks were proposed as a replacement for point-to-point global wires in ASICs to address the wire delay scalability and design complexity problem. The goal was to decouple the communication infrastructure from logic cores in "systems-on-chip" (SoC) [DT01][BM02][GG00][SSM+01][KJM+02]. Design of tools [PCSV03][JMBM04][BJ05] and methodologies [HP03] for synthesizing and optimizing on-chip networks for SoCs continues to be an active area of research.

Unlike the on-chip networks for SoCs, however, the on-chip networks for general-purpose CMPs cannot be optimized for one application, task mapping or traffic pattern. They must be designed to provide acceptable performance (i.e., end-to-end latency and throughput) under a wide range of spatial and temporal traffic distributions. This makes on-chip networks for CMPs similar in their design requirements to interconnection networks that were traditionally used in high performance multiprocessor systems. Indeed, many principles of interconnection networks developed for multiprocessor systems of the 80s and 90s are applicable to on-chip networks for CMPs. But, how the architectural choices, such as the topology, routing algorithm, flow control, and router microarchitecture, for multiprocessor interconnection networks can be applied to on-chip networks is still an active research area.

1.3 Design Challenges for On-Chip Networks

Design of on-chip interconnection networks is challenged by three competing requirements: high performance (low latency and high bandwidth), low power dissipation and fault tolerance. To meet the performance, especially bandwidth, requirement, on-chip networks are often "over-designed," i.e., they are designed to have capacity much higher than what is necessary under most circumstances. For example, in [GKM+06] it was shown that the memory system network for the TRIPS microprocessor [Kar06] has a maximum bandwidth of over 3.5 Tb/sec. Profiling the usage of the network shows that much of this bandwidth is under-utilized. Seven of the twenty applications in the SPEC CPU2000 benchmark suite (art, bzip2, gzip, parser, swim, vortex and wupwise) have an average bandwidth demand of less than 1% of the maximum bandwidth. But, this excess bandwidth is not a result of poor design. Across applications there is a significant variation in the bandwidth demand.
For example, the difference between the average bandwidth requirements of swim and twolf is almost 50×. Also, during the execution of the same application, the bandwidth demand fluctuates significantly. The same study shows that for sixtrack, one of the applications in the SPEC CPU2000 suite, the instantaneous injection rate varies from 1 to over 22% of the maximum bandwidth during the course of execution. It is because of these variations at different granularities (i.e., across applications and within an application) and the well-known exponential increase in end-to-end latency of a network near saturation that on-chip networks are designed with excess capacity.

Excess network capacity, however, comes at the cost of wiring channels, silicon area and, more importantly, power consumption. For example, the on-chip network in the MIT Raw chip [KTMW03][TLAA03] consumes 36% of the total chip power, while the network in the Intel Teraflops processor [Sri07] consumes over 28% of the chip power. Routers are the main contributors to the network power consumption [Han03].

On-chip networks continue to consume power even when the bandwidth demands are very low. During idle cycles, some of the resources, like the inter-router links, continue to consume power to maintain synchronization between transmitter and receiver ports, while others, like input/output buffers, routing and arbitration logic and the crossbar, continue to consume leakage power. Furthermore, the energy requirement for transferring each packet remains roughly the same even when the network is nearly empty.

An ideal on-chip network should consume power only when it is delivering packets, not when it is waiting for packets to be injected. More precisely, only those network resources (links, buffers, crossbar, etc.) that are along the path of packets currently in the network should consume power, while the rest should be off (throughout this work, a network resource is said to be off when it has been power-gated; mechanisms proposed in [HGK09] for powering down various network resources are assumed, and a resource is on if it is operating normally, consistent with the terminology used in previous works such as [SP03], [SP04] and [SP07]). However, the state-of-the-art in on-chip networks is very different from this ideal model. On-chip networks in implemented designs are always ready to operate at their peak performance and consume a significant portion of the total chip power (40% and 28% of each core's power budget in the MIT Raw and Intel Teraflops processors, respectively). Given tight power constraints, there is a growing need for power-efficient on-chip networks with sufficient bandwidth. In [Sri07], for example, in order to prevent the network from becoming a bottleneck, it was argued that CMPs fabricated in 32nm must be able to provide bisection bandwidth of more than 2 TeraBytes/sec and consume under 10% of the total chip power.

One potential approach for meeting power-performance goals is to turn off network links, ports and parts of routers when the traffic injection rate is low, and turn them back on only when demand for bandwidth increases sufficiently. To do so, a flexible network topology is needed which can allow segments of on-chip routers (the combination of link, logic and buffers at both ends of a link is referred to as a segment [HGK09]) to be turned off and on based on statically- or dynamically-determined demand for network bandwidth.
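The kind of demand-driven policy implied here can be sketched in a few lines. The following is a hypothetical illustration only: the thresholds, window, and names are assumptions, not values or mechanisms from this work.

```python
# Hypothetical policy sketch: gate spare network segments off when the
# observed injection bandwidth stays below a low-water mark for a whole
# observation window, and wake them up when demand crosses a high-water
# mark. All parameters are assumptions for illustration.
def update_active_segments(active, injected_bytes, window_s,
                           low_bw, high_bw, min_active, max_active):
    demand = injected_bytes / window_s        # observed bandwidth in bytes/s
    if demand < low_bw and active > min_active:
        return active - 1                     # power-gate one more segment
    if demand > high_bw and active < max_active:
        return active + 1                     # bring a segment back on
    return active
```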
In this work, we propose such a network topology, called the Polymorphic Cubic Ring (pcRing) topology. This polymorphic topology provides a simple yet flexible infrastructure for on-demand bandwidth provisioning in on-chip networks. A pcRing network is obtained from a k-ary n-cube torus network by dynamically removing selected network links in all but one dimension. Unlike previous proposals, however, dynamically reconfigurable pcRing networks can trade off network power and bandwidth without a significant increase in the average distance between nodes. This flexibility is made possible by the properties of the topology, a simple routing algorithm and a reconfiguration scheme customized to the topology and the routing algorithm. As our analysis shows, a polymorphic cRing network can turn off up to 37% of network channels and associated buffers for less than a 5% increase in packet latency in a 256-node network. The proposed reconfiguration scheme can also be used to provide resilience against failures of links and router ports.
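The idea of removing whole rings from a torus can be illustrated as follows. This is a hedged sketch only: the formal cubic ring definition and the exact meaning of the ring set R are given in Chapter 5, and the two-dimensional reading of the notation below (one bit per ring) is an assumption made purely for illustration.

```python
# Hedged 2-D illustration of deriving a cRing-like configuration from a
# k-ary 2-cube torus. R["x"][j] keeps the horizontal ring of row j powered;
# R["y"][i] keeps the vertical ring of column i powered. (Assumed reading
# of the notation; see Chapter 5 for the actual formalism.)
def active_links_2d(k, R):
    links = []
    for x in range(k):
        for y in range(k):
            if R["x"][y] == "1":              # row ring y stays on
                links.append(((x, y), ((x + 1) % k, y)))
            if R["y"][x] == "1":              # column ring x stays on
                links.append(((x, y), (x, (y + 1) % k)))
    return links

# Example: keep every vertical ring but only one horizontal ring (row 0),
# so most x-dimension links and their buffers could be power-gated.
kept = active_links_2d(4, {"x": "1000", "y": "1111"})
```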
1.4 Research Contributions

This work looks at the problem of deadlock-free dynamic reconfiguration in both off- and on-chip networks and makes three important contributions:

- First, by applying a theory for dynamic reconfiguration of networks to one of the most important off-chip interconnection networks of the day – the InfiniBand network – we bridge the gap between theory and practice on the issue of dynamic reconfiguration. Despite their promise of increasing system availability and dependability, and a plethora of research on theoretical methods for doing so, dynamic reconfiguration schemes have not been implemented in real hardware. Our timely application shows an efficient way to implement dynamic reconfiguration in InfiniBand networks using only the mechanisms and features that IBA-compliant devices already have. To the best of our knowledge, this is to date the only implementation of deadlock-free dynamic reconfiguration on InfiniBand networks.

- Second, this work contributes to developing a clearer understanding of why on-chip networks that have been proposed for many-core CMPs are different from off-chip interconnects for multiprocessors and how they should be compared. Similar analyses have been conducted in other studies, but we introduce new metrics, analyze a wider range of topologies (including less well-understood topologies like Cube Connected Cycles) and, most importantly, consider variations of hierarchical ring topologies that have not been considered in prior studies. Our work, therefore, gives new insights about the opportunities and challenges in applying decades of research in interconnection networks for multiprocessor systems to on-chip networks for CMPs. Work reported in this dissertation was one of the first such efforts conducted in collaboration with industry (specifically, Intel Corporation) and has featured prominently in discussions about the interconnect for future many-core CMPs at Intel [Man07][Rat09].

- Finally, we motivate the need for and present a complete network architecture – topology, routing algorithm, router architecture, and a reconfiguration protocol – for a dynamically reconfigurable on-chip network. The proposed Polymorphic Cubic Ring network is designed based on new insights about the relationship between latency, bandwidth and the number of connected rings in k-ary n-cube torus networks, which allows for trading off network bandwidth for reduced network power without significantly increasing average latency. As such, the Cubic Ring network is a significant addition to the toolbox of designers of on-chip interconnection networks.

1.5 Organization

The rest of this dissertation is organized into six chapters. Chapter 2 reviews the basics of network reconfiguration and why the issue of deadlock avoidance during reconfiguration is important in interconnection networks. The communication behavior of representative scientific applications is analyzed to motivate the need for using dynamic reconfiguration of on-chip networks for power reduction. Our implementation of a deadlock-free dynamic reconfiguration scheme for InfiniBand networks is discussed in Chapter 3. Chapter 4 highlights the differences between on- and off-chip networks, presents metrics for comparing candidates for on-chip topologies and compares a wide range of network topologies. The Cubic Ring network, a new network topology and routing algorithm, is presented in Chapter 5. A dynamic reconfiguration protocol for Polymorphic Cubic Ring networks is discussed in Chapter 6. Finally, important conclusions from this work and possible future extensions to it are discussed in Chapter 7.

Chapter 2

Background and Related Work

The interconnection network is a critical component of any high-performance multiprocessor system or distributed computer system. It forms the communication backbone, and greatly impacts the communication performance, of such systems. Traditionally, direct and indirect topologies like fat tree, mesh, torus and hypercube were used in proprietary interconnection networks designed and used by individual vendors in their server systems. Examples of such networks include Sea Star used in the Cray T3D [cra93], Intel Paragon [int94], and Cray T3E [ST96]. However, with the emergence of cluster systems and storage-area networks, irregular/arbitrary topologies built using off-the-shelf switches became more popular. Examples of such networks include Autonet [SBB+91], Myrinet [NDR+95], ServerNet [Hor96], Fibre Channel [CKR95] and InfiniBand [Inf00]. These networks are favored in cluster systems for their low cost and greater flexibility. Of these networks, InfiniBand has been rapidly gaining prominence. Today, more than half of the supercomputers in the Top 500 list (the list of the 500 fastest supercomputers in the world, revised biannually and published at www.top500.org) use either InfiniBand or Gigabit Ethernet as their communication fabric.

These high-performance cluster systems require interconnection networks that not only offer high performance (low latency and high bandwidth) but are also highly reliable, available and dependable. These attributes are increasingly important for applications with stringent quality of service requirements, such as high-definition video processing and distributed online transaction processing. The reliability, availability and dependability requirements at the system level impose a great demand on the communication subsystem to be robust. That is, the communication fabric has to provide connectivity in the presence of transient faults or hot-swapping of components.
2.1 Dynamic Reconfiguration in Off-Chip Networks

A resilient network should have the ability to change when the state of the system changes. That is, it should be able to reconfigure and maintain connectivity and high performance when either the user disables/enables certain nodes or faults appear in the network. To support this, the routing function – which determines the path that packets take from their source to their destination – must be able to change while the system is up and running. A commonly employed approach to detect changes in the network is to periodically discover the network. If the network discovery indicates a change in the state of the network such that the routing function needs to be updated, a reconfiguration of the network is initiated. There are three common approaches to network reconfiguration:

- Off-line Reconfiguration
- Static Reconfiguration
- Dynamic Reconfiguration

In off-line reconfiguration, the system is reconfigured only at boot time, that is, before the applications start to use the network. This has the clear advantage of simplicity. If a change in the network state is discovered, the user is notified to bring the entire system safely down and restart. The off-line approach does, however, lead to lowered system availability and dependability.

In the static reconfiguration approach commonly found in Fibre Channel and Autonet switches, the routing function is reconfigured without a guarantee that packets already in the network will be delivered. Packets in the network are either discarded before the routing function is updated, or the routing function is updated and packets that cannot make progress toward their respective destinations are discarded after the reconfiguration is complete. The static approach requires soft link-level flow control for dropping and re-transmission of application packets. Hard link-level flow control regulates the flow such that switches are not allowed to discard or re-transmit packets. Most high performance interconnects for multiprocessors use hard link-level flow control to avoid copying messages within the network.

Both off-line and static reconfiguration approaches negatively impact application performance on account of reconfiguration. The dynamic reconfiguration approach, on the other hand, allows application packets to continue using the network while the reconfiguration is taking place. This is made possible by always providing a path for packets in the network to make progress toward their respective destinations. While no networking technology can guarantee that no packet will be discarded under any condition, discarding packets should be an exception in high-performance, high-reliability networks. But preserving connected paths for packets during reconfiguration is non-trivial, as it can lead to packet deadlocks. Deadlocks that result from network reconfiguration are referred to as reconfiguration-induced deadlocks.

2.1.1 Reconfiguration-Induced Deadlocks

Network resources are finite, so structural hazards on these resources are inevitable. These hazards delay packets by way of congestion and can even prevent them from making any progress altogether. The latter case is referred to as a deadlock. Deadlock occurs when there is a circular hold-and-wait dependency between packets in the network and certain resources (queues, physical channels or crossbar ports). This circular relation can last forever, unless some action is taken to resolve the dependency.
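Arguments about such circular dependencies are commonly made over a channel dependency graph (CDG), whose nodes are channels and whose edges record that the routing function can forward a packet from one channel to the next; a cycle in this graph signals a potential deadlock. The sketch below is illustrative only (the adjacency-dictionary representation is an assumption): it checks a CDG for cycles with a depth-first search, and during a reconfiguration the same check would be applied to the union of the dependencies of the old and new routing functions.

```python
# Sketch: detect a cycle in a channel dependency graph.
# cdg: dict mapping each channel to the collection of channels it can feed.
def has_cycle(cdg):
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {channel: WHITE for channel in cdg}

    def visit(channel):
        color[channel] = GREY
        for nxt in cdg.get(channel, ()):
            if color.get(nxt, WHITE) == GREY:
                return True               # back edge -> dependency cycle
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(cdg))
```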
Reconfiguration-induced deadlocks are those caused by the dependencies created in a network undergoing reconfiguration as a result of packets being routed under the influence of multiple routing functions. This can occur even if each of those routing functions is deadlock-free. When a network undergoes a reconfiguration of its routing function, it can render some existing packets unroutable. These are packets that were routed previously by the old routing function and may be occupying resources that are illegal under the current routing function. As such, they create ghost dependencies which can interact with the dependencies allowed by the new routing function to form closed dependency cycles.

2.2 Previous Work on Dynamic Reconfiguration

Several schemes have been proposed in the literature to combat the problem of reconfiguration-induced deadlocks, but none has been applied to the InfiniBand Architecture as yet. The NetRec scheme [DNV01] requires every switch to maintain information about nodes some number of hops away and is only applicable to wormhole networks. IBA does not provide mechanisms to keep track of such information in switches or channel adapters, and it is packet-switched, not wormhole switched. Partial Progressive Reconfiguration (PPR) [CBJ+01], which is applicable to cut-through networks, requires a sequence of synchronizing steps to progressively update old forwarding table entries to new ones while ensuring that no cycles form. As the forwarding tables are updated, certain input-to-output channel dependencies are progressively disabled. IBA does not take into account the input port when the output link for an incoming packet is computed. Therefore, PPR, in its original form, is not directly applicable to IBA networks. This problem of removing input-to-output channel dependencies can, however, be solved by using the Service Level to Virtual Lane (SL-to-VL) mapping tables. That is, packets arriving at input ports for which the corresponding output channels have been disabled can be dropped by directing them to the management virtual lane (IBA specifications require that any data packets in the management virtual lane be dropped). Discarding these packets can result in some localized performance degradation, which may not be acceptable in high performance networks. Link drainage schemes, similar to the one proposed in this work, are another option to ensure deadlock freedom without compromising performance. However, what the total cost of implementing PPR in IBA would be remains an unanswered question, as no mention of this is given in the literature.

Several researchers have proposed strategies for computing deadlock-free routing paths in IBA networks [SRD01, LFD01]. However, the only previous work that deals with deadlock-free reconfiguration for IBA-compliant networks considers static reconfiguration [BCQ+03]. There is no literature to date on solving the difficult problem of deadlock-free dynamic reconfiguration of IBA-compliant networks.

In this work, a straightforward method for applying the Double Scheme over InfiniBand networks is presented. We show how features and mechanisms available in the InfiniBand Architecture for other purposes can also be used to implement Double Scheme dynamic reconfiguration. This work allows InfiniBand networks to better support applications requiring certain quality of service (QoS) guarantees that do not well tolerate intermittent performance drop-offs, as would be the case without deadlock-free dynamic reconfigurability.
2.3 Dynamic Reconfiguration in On-Chip Networks

Availability and dependability are two key demands for off-chip networks in very large systems, but failures of links and routers in on-chip networks are extremely rare. While a mechanism to detect and isolate failed cores or routers is desirable in on-chip networks, the cost-benefit trade-off is very different. Given the very low probability of failures, it is entirely reasonable to perform the reconfiguration statically or off-line.

However, dynamic reconfiguration schemes can be used in on-chip networks primarily to address a completely different problem: network power. An ideal on-chip network should consume power only when it is delivering packets, not when it is waiting for packets to be injected. More precisely, only those network resources (links, buffers, crossbar, etc.) that are along the path of packets currently in the network should consume power, while the rest should be off (throughout this chapter, a network resource is said to be off when it has been power-gated; mechanisms proposed in [HGK09] for powering down various network resources are assumed, and a resource is on if it is operating normally, consistent with the terminology used in previous works such as [SP03], [SP04] and [SP07]). However, the state-of-the-art in on-chip networks is very different from this ideal model. On-chip networks in implemented designs are always ready to operate at their peak performance and consume a significant portion of the total chip power (40% and 28% of each core's power budget in the MIT Raw [KTMW03][TLAA03] and Intel Teraflops [Sri07] processors, respectively). Given tight power constraints, there is a growing need for power-efficient on-chip networks with sufficient bandwidth. In [Sri07], for example, in order to prevent the network from becoming a bottleneck, it was argued that CMPs fabricated in 32nm must be able to provide bisection bandwidth of more than 2 TeraBytes/sec and consume under 10% of the total chip power.

Dynamic reconfiguration can allow parts of the network to be logically removed from active routing. These parts (links, buffers, link drivers) can then be turned off via power-gating to save leakage power. In order for power reduction through dynamic reconfiguration of the network to work, however, applications must exhibit long enough periods of low bandwidth demand to justify the reconfiguration overhead. Once a reconfiguration mechanism is in place, it can be used to provide resilience against permanent faults as well [ZD10b].

2.4 Characterizing Bandwidth Demand of Parallel Applications

Previous studies have shown that the bandwidth demands of applications can vary significantly from each other. For example, in [WOT+95], it was shown that in a 64-node network, bandwidth demands of applications in the SPLASH benchmark suite vary from less than 0.125 Bytes/FLOP (Barnes) to 2 B/FLOP (FFT), representing a 16x difference across applications. This variation in the bandwidth demand across applications means that networks are often "over-designed." If the temporal variation occurs at a granularity that can be exploited using dynamic reconfiguration of on-chip networks, it may be possible to turn off significant portions of the network when the demand for bandwidth is low, turning them back on only when the demand is high [ZD10a].
In this work, we investigate the variation in the bandwidth demand within applications over the execution of an application. Applications with significant variability (e.g., applications with long periods of low bandwidth demand during the course of their execution) will point to the feasibility of power saving through dynamic reconfiguration. The focus of this study will, therefore, be on answering the following questions:

- Is there variability in the bandwidth demand of applications over their respective execution times?
- At what timescale is this variability observed?
- What is the length of intervals over which bandwidth demand varies in different applications?

To answer these questions, the communication behavior of six representative scientific applications is analyzed. These applications are: AORSA2D, GTC, IMPACT-T, NAMD, PARATEC and PMEMD. These applications were selected based on their importance to the HPC community.

AORSA2D
All ORders Spectral Algorithm (AORSA) is a highly-scalable global-wave solver code that was developed within the SCIentific Discovery through Advanced Computing (SciDAC) project [Sci]. It has been used to demonstrate how electromagnetic waves can be used for driving current flow, heating, and controlling instabilities in the plasma.

GTC
The gyro-kinetic toroidal code (GTC) is a particle-in-cell (PIC) simulation code for solving the linear or nonlinear gyro-kinetic equation. Part of the National Energy Research Scientific Computing Center's NERSC-6 Benchmark suite [NER08], the code is used in studying fusion, which has historically been an important area for the high-performance computing community.

IMPACT-T
Another code from the NERSC-6 Benchmark suite, IMPACT-T is a PIC code representing beam dynamics simulation. Part of a popular suite of codes for accelerator simulations, IMPACT-T is specifically used for photoinjectors and linear accelerators.

NAMD
NAMD is a molecular dynamics program for simulations of large biological molecular systems.

PARATEC
PARallel Total Energy Code (PARATEC) performs ab initio quantum-mechanical total energy calculations. According to NERSC, Density Functional Theory (DFT) codes similar to PARATEC account for 80% of all HPC cycles delivered to the Material Science community.

PMEMD
The Particle Mesh Ewald Molecular Dynamics (PMEMD) code is a core component of a molecular dynamics package called AMBER.

2.4.1 Platform

Results reported in this section were obtained from real executions of the aforementioned applications with 64-way concurrency on NERSC's Carver system [ASW]. Carver is an IBM iDataPlex system with 400 compute nodes. Each node contains two Intel Nehalem quad-core processors (3,200 cores total) and 24GB (320 nodes) or 48GB (80 nodes) of main memory. The nodes are connected via a 4× QDR InfiniBand network which is locally organized in a fat-tree topology, and globally as a 2D mesh. The system's theoretical peak performance is 34.2 TFlops.

2.4.2 Communication Behavior

Figure 2.1 shows the bandwidth injected over the course of the execution of two parallel applications, GTC (Figure 2.1(a)) and PMEMD (Figure 2.1(b)). The plots show injected bandwidth at a resolution of one second, which means that all the messages injected into the network in a given one-second window are aggregated into a single bytes-per-second value. Any variation in the injection rate within the one-second window is masked at this resolution.
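For concreteness, the aggregation used for these profiles can be sketched in a few lines of code. The trace format and names below are assumptions made for illustration, not the instrumentation actually used in this study.

```python
# Sketch: reduce each message to a (timestamp_s, bytes) pair and sum the
# bytes into fixed-width bins, so the same trace can be viewed at 1s,
# 100ms, 10ms or 1ms resolution.
from collections import defaultdict

def bin_injected_bytes(trace, resolution_s):
    bins = defaultdict(int)
    for timestamp_s, num_bytes in trace:
        bins[int(timestamp_s // resolution_s)] += num_bytes
    # (bin start time, bytes injected during that bin), in time order.
    return sorted((index * resolution_s, total) for index, total in bins.items())

# The same (toy) trace viewed at four different resolutions.
trace = [(200.0012, 4096), (200.0015, 8192), (200.9, 512)]
profiles = {res: bin_injected_bytes(trace, res) for res in (1.0, 0.1, 0.01, 0.001)}
```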
As the plots show, at this resolution GTC has an injection rate which is roughly constant at about 10 MB/sec, and PMEMD's at about 100 MB/sec, with spikes at the end of the execution in both applications. As such, the two applications appear to have almost no variability in their bandwidth demand over their respective execution times, except for the momentary increase in bandwidth at the very end.

Figure 2.1: Injected bandwidth of (a) GTC, and (b) PMEMD over the entire execution, viewed at 1s resolution.

This appearance of a constant injection rate, however, is an artifact of the resolution. By aggregating the injected bandwidth into one-second timeslots, variations at a finer granularity are masked. Figure 2.2 shows the injected bandwidth of GTC at three different resolutions: 100ms (Figure 2.2(a)), 10ms (Figure 2.2(b)) and 1ms (Figure 2.2(c)). (To make the plots readable, random time windows were selected for each plot. Results are similar at other points during the execution.) It is clear from these plots that at finer resolutions the injection rate is not constant and is, in fact, quite sporadic. At 1ms resolution, significantly long periods of inactivity are followed by short bursts of injections.

Figure 2.2: Injected bandwidth of GTC viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.

The variability in the injected bandwidth observed in Figure 2.2 is not unique to GTC. Other applications analyzed in this work exhibited similar variability, albeit at different timescales and to a different extent. Figures 2.3, 2.4, 2.5, and 2.6 show the injected bandwidth of AORSA2D, IMPACT-T, NAMD, and PARATEC at 100ms, 10ms and 1ms resolutions. Together they answer the first question posed above (Is there variability in the bandwidth demand of applications over their respective execution times?) in the affirmative for this representative set of applications.

To determine the suitability of network reconfiguration for power reduction, and to answer the remaining two questions (i.e., at what timescale is this variability observed? and what is the length of intervals over which bandwidth demand varies in different applications?), we capture the qualitative information in these plots in a quantitative form in Table 2.1. The entire application execution is divided into 1ms timeslots, starting from when the first message is injected into the network until the last message is injected. A timeslot is called on time if at least one message was injected during this timeslot, and off time if the injected bandwidth is zero. Column 2 of Table 2.1 notes the percentage of 1ms off timeslots as a fraction of all 1ms timeslots.
Columns 3 through 6 give an indication of the lengths of the off intervals. The fraction of off intervals that are between 1ms and 10ms long appears in column 3, those between 10ms and 100ms in column 4, those between 100ms and 1s in column 5, and those longer than 1s in column 6.

Application | % Off Time | Off Intvl 1-10ms | Off Intvl 10-100ms | Off Intvl 100ms-1s | Off Intvl > 1s
AORSA2D     | 97.74%     | 95.92%           | 1.31%              | 1.97%              | 0.80%
GTC         | 96.15%     | 71.31%           | 13.40%             | 15.29%             | 0.01%
IMPACT-T    | 99.87%     | 67.48%           | 25.35%             | 7.17%              | 0.01%
MIFX8       | 79.09%     | 96.51%           | 2.46%              | 0.97%              | 0.06%
NAMD        | 98.57%     | 97.71%           | 0.76%              | 0.00%              | 1.53%
PARATEC     | 95.51%     | 92.72%           | 5.58%              | 1.18%              | 0.52%
PMEMD       | 68.39%     | 89.53%           | 10.38%             | 0.08%              | 0.01%

Table 2.1: Comparison of the off times and lengths of off intervals in parallel applications.

It is clear from Table 2.1 that the vast majority of off intervals are less than 10ms long. We experimented with raising the threshold for an off interval from a total injected bandwidth of zero to 10% of the injected bandwidth of an on period. This did not change the distribution of interval lengths significantly, indicating that the distribution is fairly resilient to small perturbations.

Based on the results above, it can be concluded that significant variation exists in the injection bandwidth of the chosen set of applications. The dominant portion of this variation exists at 1ms granularity. Therefore, if network reconfiguration can be effected at nanosecond timescales, this variation can be exploited.

2.5 Previous Work on Power Reduction in On-Chip Networks

Previous work on dynamic power reduction of on-chip networks has focused mostly on network links. Proposals have been made to reduce leakage power in links by turning off selected links in 2D mesh networks [SP07], reducing the width of links [Mar04], and applying dynamic voltage and frequency scaling (DVFS) to network links [SK04]. Researchers have also proposed power-gating buffers [CP03] and segments of on-chip routers [HGK09]. While the work presented in [SP07] resembles ours most closely, polymorphic networks that use our proposed cRing topology are more suitable for exploiting coarse-grain variation in the bandwidth needs of applications. Recent studies, such as [MSSD08], have shown a greater potential for power reduction through reconfiguration at a coarse granularity (e.g., across applications) than at a fine granularity.

Figure 2.3: Injected bandwidth of AORSA2D viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.

Figure 2.4: Injected bandwidth of IMPACT-T viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.
Figure 2.5: Injected bandwidth of NAMD viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.

Figure 2.6: Injected bandwidth of PARATEC viewed at (a) 100ms, (b) 10ms and (c) 1ms resolution.

Chapter 3
Dynamic Reconfiguration over InfiniBand™

The InfiniBand Architecture (IBA) [Inf00] is a general-purpose interconnect standard designed to solve a wide spectrum of interprocessor communication and I/O problems associated with servers and cluster systems. In addition to providing low-latency and high-bandwidth point-to-point communication support, it also includes certain features and mechanisms useful for improving system reliability and availability. Features such as subnetwork management, multiple service levels, separate data and control virtual lanes, node-based table-driven routing, end-to-end path establishment, packet time-out and virtual destination naming are all useful for implementing reconfiguration functions in IBA networks.

To be truly dependable, however, IBA-compliant networks should be capable of efficiently adapting to changes in real time if and when voluntary or involuntary changes occur. That is, IBA networks should remain up and running with high performance in the presence of hot-swapping of components, failure or addition of links/nodes, activation or deactivation of hosts/switches, etc., arising from changes in users' needs and/or system state. The reliability, availability and performance predictability of IBA-compliant cluster computing and storage systems depend critically on the network's ability to efficiently support such functions while maintaining certain service level targets.

To this end, the design of deadlock-free routing and reconfiguration strategies that are compatible with the IBA specifications becomes highly critical, as some applications are not designed to tolerate packet loss at the link level¹. Therefore, IBA networks that employ deadlock-free routing strategies must also provide deadlock-free reconfiguration mechanisms to guarantee that no link-level flow control anomalies, including deadlocks, will affect network performance. In other words, reconfiguration-induced deadlocks, however infrequent they may be, cannot be ignored in a highly dependable interconnect subsystem such as InfiniBand.

The Double Scheme [PPD00] provides a straightforward way of updating a network's routing function in a deadlock-free manner when the network undergoes dynamic reconfiguration.
It is generally applicable to virtual cut-through (packet- switched) networks, independent of the routing function or topology being imple- mented. Many variations of the scheme exist, however the basic idea behind the scheme can be summarized as follows. At all times, packets are routed under the influence of one and only one routing function—either the old routing function (R old ) existing before reconfiguration or the new one (R new ) corresponding to the new configuration, but never both. This is accomplished simply by spatially and/or temporally separating the routing resources used by each routing function into two sets: one used exclusively byR old and the other by R new . By allowing dependencies to exist from one set of resources to the other but not from both at any given time, a guarantee on deadlock freedom during and after reconfiguration can be proved [PPD00]. 1 Deadlock can occur when packets block cyclically waiting for resources while holding onto other resources indefinitely [Dua93, Dua95]. If allowed to persist,deadlocks can bring the entire system to a standstill, making it vitally important for both the routing algorithm and the reconfiguration technique to guard against them. 30 In this work, a straightforward method for applying the Double Scheme over Infini- Band networks is presented. We show how features and mechanisms available in Infini- Band Architecture for other purposes can also be used to implement Double Scheme dynamic reconfiguration. This work allows InfiniBand networks to better support appli- cations requiring certain quality of service (QoS) guarantees that do not well toler- ate intermittent performance drops-offs, as would be the case without deadlock-free dynamic reconfigurabilty. 3.1 The Double Scheme The Double Scheme [PPD00] provides a systematic way of updating a network’s routing function in a deadlock-free manner when the network undergoes dynamic reconfigura- tion. The basic idea behind the scheme can be summarized as follows: At all times, packets are routed under the influence of one and only one routing function—either the old routing function (R old ) existing before reconfiguration or the new one (R new ) cor- responding to the new configuration, but never both. This is accomplished by spatially and/or temporally separating the routing resources used by each routing function into two sets: one used exclusively by R old and the other by R new . By allowing dependen- cies to exist from one set of resources to the other but not from both at any given time, a guarantee on deadlock freedom during and after reconfiguration can be proved. One possible scenario on how this could work is the following. The routing function before reconfiguration,R old , allows packets to be injected into a connected set of routing resources (designated as C old ) supplied by R old . Once the need for a reconfiguration event is determined and the new routing functionR new is computed, a connected set of routing resources (designated asC new ) is required to become available for use by newly injected packets routed under the influence of R new which supplies those resources. 31 This could be done by allowing packets in a subset of C old resources to be delivered to their destinations while not injecting new packets into that subset, which essentially drains those resources. Non-routable packets encountering any physical disconnectivity which may have caused the need for reconfiguration can be discarded. 
As packets are no longer injected into any of the C_old resources once the C_new resources are in use, the C_old resources eventually become free and can be incorporated into the set of C_new resources once completely empty, nullifying R_old.

In order for Double Scheme dynamic reconfiguration to be applied to a network, support for the following must exist: (a) support for subjecting some packets to one routing function (or routing subfunction) and other packets to a different routing (sub)function throughout packet lifetime in the network; (b) support for initiating, detecting and notifying drainage of resource sets in the network and network interfaces; and (c) support for changing (updating) the prevailing routing function across the network and network interfaces without dropping existing routable packets. For optimization purposes, there should also be support for segregating and re-integrating connected subsets of resources from a unified set so that resources can be used efficiently during the common state of no network reconfiguration. Below, many of the inherent features and mechanisms in the InfiniBand Architecture that can be exploited to achieve the above are described.

3.2 Exploitable Features of IBA

InfiniBand [Inf00] is a layered network architecture that employs switched, point-to-point links for the interconnect fabric. An IBA network is composed of one or more subnetworks, a.k.a. subnets, through which communication is done using routers. A subnet is the smallest functional composition of IBA-compliant components which can operate independently. End-nodes within a subnet are interconnected through switches, and each end-node has one or more channel adapters (CAs) attached directly to it. These CAs serve as the source and terminus of IBA packets. Each subnet is managed autonomously by an associated subnet manager (SM). The SM is responsible for the discovery, configuration and maintenance of components associated with a particular subnet. Only one master SM is active at any given time for each subnet, and passive subnet management agents (SMAs) residing in IBA network components are used to communicate with the master SM through a set of well-defined protocols, referred to as the subnet management interface (SMI).

Like any modern communication architecture, the IBA communication stack is divided into physical, link, network and transport layers. Services provided by each layer are mostly specified in the specification, with some room for vendor-specific implementation of the architecture. IBA defines five service types: Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD) and Raw Datagram (RAW). Details about the differences between these service types are beyond the scope of this work, but are given in [Inf00] and [ZPD03].

3.2.1 Routing

Routing in IBA is source-based but implemented in a distributed manner using forwarding tables residing in each switch. Each CA or switch/router port has a globally unique identifier (GUID)—a physical name—but can have up to 128 local identifiers (LIDs)—a logical name—associated with it. A 3-bit link mask control (LMC) value can be used to map multiple LIDs to the same physical port. LIDs in the range of BaseLID to BaseLID + 2^LMC − 1 map to the same port to allow destination renaming [SRD01]. For instance, if a CA port has an LMC value of 3 and a hex base address of 0x0010, then addresses 0x0010 to 0x0017 all map to the same physical port.
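The LID arithmetic above, and the way the Double Scheme exploits it, can be made concrete with a short sketch. The even split of the LID range into two sets is illustrative only; the actual assignment policy is left to the SM.

```python
def lid_range(base_lid, lmc):
    """LIDs BaseLID .. BaseLID + 2^LMC - 1 all map to the same physical port."""
    return list(range(base_lid, base_lid + (1 << lmc)))

# Example from the text: BaseLID = 0x0010, LMC = 3 -> 0x0010 .. 0x0017.
lids = lid_range(0x0010, 3)

# A possible (hypothetical) partition of that range into two disjoint LID sets,
# one used exclusively by R_old and one reserved for R_new -- the kind of spatial
# separation of routing resources the Double Scheme relies on.
old_lids, new_lids = lids[:len(lids) // 2], lids[len(lids) // 2:]
```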
The mapping of GUIDs to LIDs allows components in the network to persistently identify other components either logically or physically. How and where this mapping is done is not specified; it is assumed that the SM (possibly with the help of SMAs) can maintain this mapping function. Since there is a unique entry corresponding to each LID address in the forwarding tables, as described below, multiple logical addresses pointing to the same physical port can be used to implement source multipath routing, which allows packets to reach destinations using different paths in the network.

3.2.2 Service Levels and Virtual Lanes

IBA allows packets to be distinguished by service class into one of sixteen different service levels (SLs). The packet's SL is contained in the local routing header (LRH) of the packet. IBA also allows packets to traverse the physical links of a network using different virtual lanes. A virtual lane is a representation of a set of transmit and receive buffers on a link. Up to sixteen virtual lanes (VL0-VL15) are allowed, but a minimum of two (VL0, VL15) are required by all ports. VL15 is used only for control packets, whereas the other virtual lanes (VL0-VL14) are used for data packets. Virtual lane assignment exists only between ports at each end of a link, and virtual lane assignment on one link is independent of the assignment on another link, given by a 4-bit VL field in the LRH of a packet. The actual number of data virtual lanes used by ports, and how packets of a given service level map to virtual lanes, is determined by the SM at the time that the network is configured or reconfigured. The VL assignment of a packet is not necessarily the same across a subnet. Packets of a given service level may need to be switched between different VLs as they traverse the network. Service level to virtual lane (SL-to-VL) mapping is used to change the VL assignment of packets as they traverse a subnet.

The SM is responsible for configuring a 16-entry, 4-bit-wide SL-to-VL mapping table associated with each CA or switch port, in which each entry indicates the VL to be used for the corresponding SL. The SL-to-VL mapping table can be read and modified by the SM through the subnet management methods SubnGet() and SubnSet(). This allows, among other things, the possibility of two packets with the same SL and VL arriving from different input ports yet destined to the same output port to be assigned different VLs.

3.2.3 Forwarding Tables

In addition to the SL-to-VL mapping tables, the SM is also responsible for configuring forwarding tables in CAs and switches. Routing functions in IBA are implemented explicitly through forwarding tables using LIDs. Forwarding tables are composed of a set of entries addressed by the LID of a packet's LRH such that a matching entry specifies the output port that should be used by the packet. They can be organized as Linear (shown in Figure 3.1) or as Random (shown in Figure 3.2) Forwarding Tables.

Figure 3.1: Linear Forwarding Tables in IBA (64-entry port blocks of 8-bit port fields; max. 48K unicast and 16K multicast entries).

Figure 3.2: Random Forwarding Tables in IBA (16-entry LID/Port blocks of 32-bit LID/Valid/LMC/Port records; max. 48K unicast entries).

Alternatively, a special mechanism for exchange of control packets between the SM entities could also be used.
This mechanism, called Directed Routes, allows the sender to specify the complete path that the packet must take from the source node to the desti- nation node and back. The Directed Routes mechanism implemented in IBA also allows packets to be routed using the normal LID routing on either side of the directed route. Linear forwarding table (LFT) entries are configured by the SM through an attribute modifier. This modifier is a pointer to a list of 64 forwarding table entries or port block elements, where each entry or element is an 8-bit port identifier to which packets with LIDs corresponding to this entry are forwarded. All entries in a LFT are in sequential (linear) order starting from the address specified in the attribute modifier. The level of granularity at which the LFT can be modified is one block, i.e., 64 entries. Assuming the SM can do a read-modify-write on each block, entries within a particular block would be unavailable for lookup only during the time of writing back the block, in the worst case. Random forwarding tables (RFTs) provide greater flexibility since a finer granularity is used for table modification. The attribute modifier for RFTs points to a block of only 16 LID/Port block elements to which this attribute applies. Also, unlike LFTs, consecutive entries in RFTs are not necessarily in sequential addressing order. This flexibility comes at the cost of more complex implementation and higher 36 access time, as RFTs may require implementation using content addressable memories. Furthermore, the block size for RFTs is smaller than that for LFTs, 16 entries in case of RTFs as opposed to 64 for LFTs. Hence, distribution of the forwarding tables takes longer (and more SMPs) in case of RFTs than it does for LFTs. In the worst case, this cost difference is bounded by O(n), where n is the number of switches in the network. In Section 3.4 we discuss the tradeoffs between using RFTs verses LFTs in more detail. 3.2.4 Queue Pairs The virtual interface between IBA hardware and an IBA consumer process is the send and receive queue pair (QP). Each port can have up to 2 24 QPs which are operationally independent from one another. Connection-oriented service types bind QPs at the send- ing and receiving ends whereas datagram (connectionless) service types target QPs at the receiving end by specifying the QP number along with the target CA’s port LID. QPs are not directly accessible to consumer processes; instead, consumer processes use “verbs” to submit work requests (WRs) to a send queue or a receive queue. The CA pro- cesses this request and places it on the respective queue. IBA allows CAs to pre-allocate QPs or allocate them in response to a request for communication. For simplicity, this paper assumes that queue pairs are allocated only when a request for communication is received. 3.2.5 Path Establishment When a consumer wants to establish a path, it queries the SM with a “PathRecord” request. The initiator may specify the destination LID (DLID) in the request message, in which case the SM will return only the information related to the path or paths to the specified destination. Alternatively, it may not specify the DLID value, in which case the 37 SM will return information for all the ports that are reachable from the source. Having received path record information, the initiator sends a request message to the target. The requested service type’s service level is placed in the request message. 
Should the request be accepted, the target responds with a response message containing the QP number (for connection oriented services) or end-to-end context (for reliable and unreliable datagram services). The target may reject the request by sending a reject message or the target may specify a different set of variables on which communication can be done through a response message. The communication establishment sequence is completed when the initiator sends a “Ready to Use” message to the target. 3.3 Applying the Double Scheme to InfiniBand Archi- tecture The support necessary to implement the Double Scheme over InfiniBand was mentioned at the end of Section 3.1. Here, specific mechanisms introduced in Section 3.2 that can be used to provide that support and how those mechanism should be used to implement the Double Scheme is described [PZD03][ZPBD03]. Spatially separating resources used by each routing function can be supported by assigning two sets of LID addresses for each GUID. As routing in IBA is source-based and dependent on LID addresses, this allows two different routing functions to route packets, one using one set of LIDs and the other using the other set. This, in effect, means that only half of the total possible number of LIDs and routing table entries are useable during normal operation, which should not typically be a problem. It is not necessary to divide LIDs equally among the routing functions, but this may be preferred to allow source multipath capability in both routing functions. In the extreme case, 127 38 out of the maximum of 128 allowed LIDs per port may be used by one routing function and only one by the other. In that case, the second routing function which has only one available destination LID will not have any source adaptivity. The drainage of resources can be supported by allowing only half (or any restricted number) of the SLs to be available to packets at any given time outside of reconfigura- tion. During reconfiguration when both routing functions exist in the network simultane- ously, packets under the influence of one routing function use one set of SLs while pack- ets under the influence of the other use the other set of SLs. 2 During normal operation, these SLs can be mapped to all the available VLs, allowing the optimization mentioned in Section 3.1 to be supported as well. During reconfiguration, the SM can modify the SL-to-VL mapping to allow a set of VLs to be drained. The SM can also initiate a “Send Queue Drain” to drain QPs [Pfi00]. The drainage state of VLs and QPs can be tracked and notified by the SM. Finally, changing the prevailing routing function can be supported by having the SM update the forwarding table and GUID-to-LID mapping. By exploiting these IBA fea- tures and mechanisms, Double Scheme dynamic reconfiguration can be accomplished with a sequence of steps, as presented below. 3.4 Proposed Reconfiguration The proposed reconfiguration procedures is explained below. Notice that some of these steps can be executed in parallel, as explained later in the chapter. 1. Initiate reconfiguration: The need for reconfiguration is established by the SM. Reconfiguration could be triggered by a physical link or node being down or up, 2 It is expected that this would not cause any significant QoS degradation as most implementations are likely not to use all sixteen service levels at once. 
which is either detected by the SMA in a neighbor switch and notified to the SM via the IBA Trap mechanism, or is detected directly by the SM during network sweeping [BCJ+03]. Exactly how this is done is beyond the scope of this study. We, therefore, assume that the SM is notified of the need to reconfigure by some IBA-supported mechanism. Subnet management packets which cannot be routed due to physical disconnectivity are discarded using IBA's packet timeout mechanism.

2. Forwarding Table Computation: As the reconfiguration initiates, the SM re-discovers the network and, based on the new topological information, computes the forwarding tables. Depending on the complexity of the routing function and the size of the network, table computation can potentially be the most time-consuming step of the reconfiguration process.

3. Modify SL-to-VL Mapping: The SM reads the SL-to-VL mapping tables from each port of a CA or switch and modifies them such that the set of SLs currently being used by packets maps to only half the VLs (or any restricted number of VLs between 1 and 14). The basic idea is to drain at least one VL for packets that will be using the new routing function. Subnet management packets continue to use VL15.

4. Update Forwarding Tables: The SM updates the forwarding table entries at the switches and CAs that correspond to the LID addresses used for the new routing function. This can be done using a process similar to that used during network initialization. If forwarding is implemented using RFTs, updates can be done without obstructing current routing operations, since the old and the new routing functions can be implemented on two independent sets of LID/port blocks, as shown in Figure 3.3.

Figure 3.3: The old and the new routing functions implemented on two independent sets of LID/port blocks in a Random Forwarding Table.

If, however, LFTs are implemented, the SM will have to do a read-modify-write on each port block that needs to be modified. Packets may not be able to be forwarded concurrently with the update if their destination LIDs lie in the block being written back. This presents a tradeoff between using more flexible RFTs that can be modified transparently versus using simpler LFTs whose port blocks may become unavailable for a short period of time during reconfiguration. For the purpose of this study, we assume RFTs are used. The performance of both steps 3 and 4 can be improved if the SM stores the current SL-to-VL mapping tables and forwarding tables.

5. Modify PathRecord and GUID-to-LID Mapping Information: Once all the forwarding tables have been updated, the SM modifies the PathRecord information for each port such that it now supplies the set of SLs that were previously unused. By doing this, the SM ensures that any new packets being injected into the network use only the VLs that are reserved for them. Notice that by changing the PathRecord information, the SM forces all newly formed QPs to comply with the new set of SLs; QPs which had been formed earlier and which contain messages with old SLs will have to be drained using the "Send Queue Drain" mechanism invoked by the SM. In parallel with this is the modification of the GUID-to-LID mapping by the SM.
The addresses which were previously unused (but within the valid range of Base LID+2 LMC for each port) are now supplied. Recall that the new routing func- tion is implemented in the forwarding tables on this set of LIDs. So, supplying these addresses as destination addresses essentially means that packets will now be routed using the new routing function. It is important that the modification of the PathRecord and GUID-to-LID information be performed in synchronism for a particular node so that newly injected packets with LIDs corresponding to the new routing function are placed in Send queues with the appropriate SL. These modifications may be performed asynchronously over the network. 6. Detect Old VL Drainage: Modification of the PathRecord and GUID-to-LID mapping information essentially implies that no new packet will be injected into the network such that it uses the old routing function. However, some old packets might exist in the network and, in order to ensure deadlock freedom, these packets must be drained before new packets can be allowed to use all VLs. Therefore, the SM must detect drainage of old VLs in a systematic fashion before proceeding with the restoration of SL-to-VL mapping. To this end, we propose a drainage algorithm in Section 3.5 which is applicable to any deterministic routing function because drainage is based solely on channel dependency properties of the new routing function. 42 7. Restore the SL-to-VL mapping: Once the network has been drained of packets using the old routing function, the SM can restore the SL-to-VL mapping tables at all nodes such that they now provide all available VLs to packets using the new routing function. A process similar to that described in the first part of Step 3 can be used to restore SL-to-VL mapping. 3.5 Algorithm to Detect VL Drainage By modifying the SL-to-VL mapping tables such that they allow the use of only a restricted set of VLs for packets with SLs associated with the old routing function, VLs for the new routing function can be drained in a single hop. However, before packets using the new routing function can be allowed to use these VLs, complete drainage of these VLs across the entire subnet must be guaranteed. This is because the actions of the steps given previously need not be carried out synchronously across the entire network. A particular node can maintain the state of buffers at its input and output ports and, thus, detect local drainage of VLs, but it has no way of knowing whether or not it will receive more packets on these VLs from its neighboring nodes. There needs to be some form of synchronization between the nodes in order to detect drainage across the entire subnet. Presented here is a simple yet general algorithm that can be used to detect virtual lane (or channel) drainage across the network. The algorithm uses the channel depen- dency information available in the deadlock-free new routing function in order to deter- mine which channels must be drained. This information is represented in the form of a directed graph, which encodes the dependencies between the channels as allowed by the routing function. By systematically collecting channel drainage information at indi- vidual nodes along this dependency graph, channel drainage across the entire network for that particular routing function can be achieved. 43 The key data structure in this algorithm is the channel dependency graph (CDG) [DS87] [DYN02], which gives dependency relations between different chan- nels. 
A CDG is simply a directed graph in which the vertices (or nodes) of the graph are the channels connecting the various routing nodes of the network. Each bidirectional channel is represented as two independent nodes in the graph. Arcs in the CDG represent the dependencies between channels. For example, an arc from channel c_i to c_j indicates that a packet can request c_j while holding resources associated with c_i. In order for a deterministic routing function to be deadlock-free, the CDG must be acyclic [DS87].

The drainage algorithm can be implemented in an IBA subnet with the following steps.

Step 1: The SM computes the CDG for the routing function to be implemented. IBA's source-based routing is deterministic for all DLIDs; therefore, the CDG must be cycle-free in order for the routing function being implemented to be deadlock-free. Using the CDG, the SM builds a list of all valid paths along the edges of the CDG. The first switch in each path list is the switch connected to a source channel (i.e., a source node, from the standpoint of the CDG), whereas the last switch is the switch connected to the leaf channel (node) in the CDG.

Step 2: Having built the path list, the SM sends control packets to the switches at the head of the list. Switches respond to these control packets with the number of packets in each VL. Upon receiving the reply messages, the SM determines whether or not a switch has been drained of packets in the old VLs. If the switch is not drained, another drainage packet is sent to it, and the process repeats until the switch has been drained of old packets.

Step 3: Once a switch is found to be drained of old packets, the next switch in the path list is drained. If drainage information for this node has already been received (through a different path), the SM drains the following switch. This check guarantees that no redundant drainage queries are made throughout the network.

Step 4: The drainage process continues until all switches in the CDG have been drained. Note that by collecting the drainage information along the paths indicated by the CDG, the SM ensures that no old packets exist in the network.

3.5.1 Implementation

The SM has knowledge of the routing function to be implemented and, therefore, can compute the CDG. From an implementation standpoint, the key issue is that of sending the control packets by the SM to the hosts connected to switches at the originating end of the source channels in the CDG and then collecting them from the leaf channels. These packets have to traverse all the legal paths in the CDG before they reach the leaf channels, which forward them to the SM. Since only the SM is assumed to have knowledge of the CDG, a combination of LID-based and directed routing could be used to accomplish this task. The number of control packets sent to these source channels and gathered by the SM is equal to the total number of possible paths to leaf channels accessible from source channels. The SM can specify the forward route for these packets, i.e., the route to the source channel, simply by specifying the LID of the host connected to the switch and using forwarding tables. However, the return path from the leaf node is specified as a directed route. Although routing of directed-route packets involves some processing at each node, which results in an increase in the time to route the packet, it must be noted that the number of these directed-route packets is equal to the number of leaf channels in the CDG. Also, these packets use directed routes only on the return path.
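A compact way to see the control flow of Steps 1-4 is the sketch below. It is only an illustration of the traversal order, not the SM implementation: the CDG is represented as a plain adjacency map over channels, and switch_of() and poll_vl_occupancy() are hypothetical stand-ins for the SM's topology database and the per-switch VL occupancy query.

```python
import time

def source_channels(cdg):
    """Channels with no incoming arcs in the (acyclic) CDG adjacency map."""
    has_pred = {c for deps in cdg.values() for c in deps}
    return [c for c in cdg if c not in has_pred]

def enumerate_paths(cdg, sources):
    """All source-to-leaf channel paths along the edges of the CDG."""
    paths = []
    def walk(chan, path):
        nexts = cdg.get(chan, [])
        if not nexts:                      # leaf channel
            paths.append(path)
            return
        for nxt in nexts:
            walk(nxt, path + [nxt])
    for src in sources:
        walk(src, [src])
    return paths

def drain_old_vls(cdg, switch_of, poll_vl_occupancy, poll_interval=0.001):
    """Poll switches in path order until the old VLs are empty everywhere.

    switch_of(channel) maps a CDG node to the switch feeding it, and
    poll_vl_occupancy(switch) returns the number of old-VL packets at that
    switch; both are assumed helpers standing in for the control-packet exchange.
    """
    confirmed = set()
    for path in enumerate_paths(cdg, source_channels(cdg)):
        for chan in path:
            sw = switch_of(chan)
            if sw in confirmed:
                continue                   # no redundant drainage queries (Step 3)
            while poll_vl_occupancy(sw) > 0:
                time.sleep(poll_interval)  # re-query until this switch drains (Step 2)
            confirmed.add(sw)
    return confirmed
```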
3.6 Example

As an example, let us consider an IBA subnet with nine switches connected in a 2-D mesh topology. Six of these switches connect to a channel adapter, as shown in Figure 3.4. For simplicity, let us assume that SL1 through SL8 are allocated to the current routing function, while the remaining SLs are reserved for the new one. Also, let the number of data VLs per physical channel across the subnet be equal to four. To illustrate the various steps of the reconfiguration process, we will assume that the source-based deterministic routing function implemented on this network has to be reconfigured from XY-routing to YX-routing. Both of these routing functions are deadlock-free independently.

Figure 3.4: An example IBA subnet.

Notice that both routing algorithms are defined on a C × N domain [DS87], i.e., these routing functions take into account the input channel and the destination node to compute the output channel of a packet in the network. In IBA, forwarding is defined on an N × N domain [Dua93] because the forwarding tables consider only the current and destination nodes of a packet to determine its output port. In a C × N based routing algorithm, if the incoming port is not considered while making the routing decision, routing rules cannot be enforced and, thus, deadlock freedom cannot be guaranteed. However, previous work reported in [LFD01] has shown that C × N based routing algorithms can be implemented on IBA by use of destination renaming. The basic idea is given below.

Given any C × N → C routing table, when there is some switch that supplies two different paths for packets arriving at different input channels but destined for the same node/host, the destination of one of them is modified by selecting an unused address within the valid range of addresses assigned to that destination node/host. As the destination addresses of these packets are now different from the point of view of the switches within the subnet, they can be routed along the different paths without considering the input channel. This technique undoubtedly has an impact on the size of the forwarding tables (and, consequently, the time for table lookup), and the maximum number of renamings that can be done is limited by the maximum available number of addresses for each host. For the implementation of the Double Scheme that we are proposing, each host must have at least one LID address reserved for the new routing function while reconfiguration of the network is in progress. From this point onwards, we shall consider all the addresses associated with a host that are being used by the old or the new routing function as including the addresses required for renaming. Note that this reservation of addresses for renaming is necessary only for those cases where the routing function to be implemented is defined on the C × N domain.

At the start of the reconfiguration process, consider that the master SM, residing on CA-E, establishes the need for reconfiguration of the subnet. The SM reads in the SL-to-VL mapping table from each CA and switch port, and modifies it such that SL1 through SL8—which are the SLs being used by the current routing function—map to only two of the four available data VLs at each channel, i.e., VL0 and VL1. Once the modification has been done, the tables are written back to their respective ports. Concurrently, the SM starts updating the forwarding tables at the switches.
In the case of RFTs, the SM reads in the LID/Port blocks that are not being used by the current routing function, modifies them to include entries corresponding to the new routing function, validates these entries, and then writes the modified blocks back to their respective tables. In the case of LFTs, each 64-entry port block may contain addresses corresponding to both the current and the new routing functions, and thus has to be modified. A port block is unavailable at most during the time that the SM writes the modified block back to the forwarding table.

Once the SL-to-VL mapping table at each CA and switch port has been modified and the forwarding tables updated, the SM atomically updates the PathRecord and GUID-to-LID mapping information corresponding to each CA in the subnet. This information resides with the SM; however, the CAs may have cached the information, depending on the particular implementation. In that case, the update has to be done at each CA port in the network. Note that the PathRecord and GUID-to-LID information corresponding to different CA ports may be modified in any order. At this point, the CAs start injecting packets which are routed using the new routing function. Old packets may still be in flight in the network using the old routing function (i.e., old LIDs). Since the SL-to-VL mapping at each port has already been updated, an old packet can remain in a VL dedicated to new packets for at most one hop. Also, once the PathRecord and GUID-to-LID mapping corresponding to all CA ports in the subnet have been updated, no more packets with old DLIDs will be injected into the network.

As the final step in the reconfiguration process, the SM must allow packets using the new routing function to use all four data VLs. However, before new packets can be allowed to use VL0 and VL1, the SM must ensure complete drainage of these VLs across the subnet. The SM uses the drainage algorithm described in Section 3.5 to systematically gather VL drainage information from each port. The SM begins by computing the CDG for the old routing function (i.e., XY-routing), as shown in Figure 3.5.

Figure 3.5: CDG for the XY-routing function (channel nodes are labeled Cxy; source, leaf, and non-leaf/non-source channels are distinguished in the figure).

Next, the SM sends General Management Packets (GMPs) to all switches that are connected to one or more CAs in the subnet. (IBA specifications call the vendor-specific management packets "General Management Packets", as the term "Subnet Management Packet" is restricted to the subnet management class defined in the specifications. Similarly, management methods such as Get() and Set() for vendor-specific attributes are called VendorGet() and VendorSet(), as opposed to SubnGet() and SubnSet().) Each of these GMPs executes a VendorGet() method to read occupancy information about the VLs being used by data packets (VL0 and VL1). The IBA specification does not define a management attribute that directly provides this information. However, VL occupancy information is required for the calculation of the credit limit specified in the specification. Therefore, it is safe to assume that vendors will provide a mechanism by which the SM can retrieve the status of the different virtual lanes at each port. A vendor-specific GMP must also be defined to retrieve this information.
We defined a management attribute called VLInfo, which provides the total number of packets in each VL on all the ports of a switch. The SMA at each switch responds to the SM with a response packet (using the VendorGetResp() method) containing VL occupancy information from all ports. Upon receiving drainage confirmation from a source node, the SM sends the VendorGet() GMPs to the next node(s) in the CDG. For example, once the SM receives VL drainage confirmation from switch 0, it sends a GMP to switch 1 to check drainage of VL0 and VL1 at its ports. A path is considered drained when the SM receives VendorGetResp() GMPs from the switches connected to the corresponding leaf nodes (channels) in the CDG. Finally, the SM modifies the SL-to-VL mapping along each path that has been drained of old packets. The reconfiguration process is completed once the SL-to-VL mapping across the entire subnet has been changed such that packets using the new routing function are allowed to use all available virtual lanes.

3.7 Performance Evaluation

In this section, we evaluate the cost and performance of dynamic reconfiguration using the Double Scheme over an InfiniBand subnet. The performance advantage of the Double Scheme, i.e., lower latency and higher throughput for application traffic during and after reconfiguration, has been reported in [PPD00], as previously explained. Therefore, in this work we only investigate the IBA-specific cost of implementing the Double Scheme. "Cost" here refers to the time taken and/or the number of SMPs exchanged to complete the reconfiguration process. It is important to note that InfiniBand SMPs use a dedicated management VL (VL15) and that packets in the management virtual lane get priority over those in data VLs. As a result, SMPs do not encounter significant queueing delays and, therefore, the number of SMPs exchanged for a particular subnet-wide operation is a measure of the total time spent on that operation.

3.7.1 Simulation Platform

The platform used for simulating the Double Scheme consists of a physical- and link-layer IBA model developed using OPNET Modeler™ 9.1.A [opn]. We have modelled 1x links, 4-port switches and end nodes with one channel adapter per end node. Each port has two data VLs and one management VL. The switches support both random and linear forwarding tables; however, for reasons explained earlier in this chapter, we only considered RFTs in this study. In the case of Double Scheme reconfiguration, SMPs were exchanged using directed routing even where LID routing is possible. This issue is discussed later in this section. All network topologies were randomly generated and use restricted up*/down* routing [BCQ+03]. Except for the data shown in Table 3.2, all other simulation results correspond to the IBA network shown in Figure 3.6. The network consists of eight switches and seven end nodes connected in a randomly generated topology. Each source node generates uniform random traffic at a rate of 145,000 data packets/sec, which translates to roughly 25% of the saturation load rate as each packet uses a payload of 256 bytes.

3.7.2 Performance Comparison with Static Scheme

Figure 3.7 shows the IBA subnet of Figure 3.6 undergoing reconfiguration triggered at simulation time 61 sec. Both static and Double Scheme dynamic reconfiguration are shown for comparison purposes.

Figure 3.6: An IBA subnet consisting of eight switches and seven end nodes.

In the case of static reconfiguration (Figure 3.7(a)), all network ports are brought to the INITIALIZE state before the forwarding tables are updated.
(According to the IBA specifications, a port in the INITIALIZE state accepts management packets but not data packets.) This results in the dropping of approximately 15,000 data packets at the load rate mentioned above. Furthermore, no data packets are injected into or delivered by the network during this period. This drop in network throughput is highlighted in Figure 3.7(a). The total time taken for static reconfiguration is approximately 65 milliseconds and the total cost, in terms of SMPs, is 388 SMPs. It is important to note that as the network size or the applied load rate increases, the number of data packets dropped by the static scheme also increases.

In the case of Double Scheme dynamic reconfiguration (Figure 3.7(b)), the total reconfiguration time is 87.67 milliseconds and the total number of management packets exchanged between the network entities is 716 packets. (This count includes SMPs and vendor-specific GMPs used for drainage; in this section, we use the term SMPs for both subnet and general management packets.) The important differences between the static and Double Scheme dynamic reconfiguration results are that, for the Double Scheme, no data packets are dropped, and the latency and throughput of the network remain unaffected throughout the reconfiguration process. Packets continue to be injected and routed during the entire reconfiguration period. It must be noted that the overhead of the additional management packets in the case of the Double Scheme is also negligible: at the application data injection rate of 145,000 packets/node/sec used in this experiment, the increase in the number of management packets from 388 for static to 716 for the Double Scheme corresponds to less than 0.04% of the total network load.

Figure 3.7: Simulation results for (a) static and (b) Double Scheme dynamic reconfiguration for the IBA subnet shown in Figure 3.6. Reconfiguration starts at time 61 sec.

3.7.3 Reconfiguration Cost

Next, we look at the composition of the total reconfiguration time for dynamic reconfiguration. This analysis provides some insight into which reconfiguration steps are most costly in terms of time and/or management packets exchanged. For an IBA subnet, the total reconfiguration time using the Double Scheme (T_total) can be expressed as:

T_total = T_FT + T_Drain + T_SL2VL    (3.1)

where T_FT is the cost (time) of FT computation and distribution, T_SL2VL is the cost of changing the SL-to-VL mapping (in steps 3 and 7), and T_Drain is the cost of collecting channel drainage information. Table 3.1 gives a breakdown of these costs for the 8-switch subnet at load rates below saturation.

Reconfiguration Step      | Cost (SMPs) | Cost (msec) | % of total time
FT Computation            | -           | 28.67       | 32.70%
FT Distribution           | 32          | 5.04        | 5.74%
Changing SL-to-VL mapping | 668         | 51.75       | 59.03%
Channel Drainage          | 16          | 2.21        | 2.52%
Total Reconfiguration     | 716         | 87.67       | 100%

Table 3.1: Cost of various reconfiguration steps.

3.7.4 FT Computation and Distribution (T_FT)

In essence, this is the base cost of reconfiguring the network. For static reconfiguration, in addition to this base cost, the cost of deactivating all ports (to discard all data packets) and later re-activating them will be incurred. As explained in Section 3.2, if RFTs are used the Double Scheme does not require the port state to be changed. Since this base cost remains unchanged across reconfiguration schemes, it is not the focus of our study.
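Equation 3.1 and Table 3.1 can be tied together with a small worked check. The sketch below simply recomputes the total and the per-step shares from the measured times, treating FT computation and distribution together as T_FT.

```python
def reconfiguration_breakdown(t_ft, t_drain, t_sl2vl):
    """Equation 3.1: T_total = T_FT + T_Drain + T_SL2VL, plus each term's share."""
    t_total = t_ft + t_drain + t_sl2vl
    shares = {name: 100.0 * t / t_total
              for name, t in (("T_FT", t_ft), ("T_Drain", t_drain), ("T_SL2VL", t_sl2vl))}
    return t_total, shares

# Numbers from Table 3.1 for the 8-switch subnet, in milliseconds:
# T_FT = 28.67 (computation) + 5.04 (distribution), T_Drain = 2.21, T_SL2VL = 51.75.
total_ms, shares = reconfiguration_breakdown(28.67 + 5.04, 2.21, 51.75)
# total_ms comes out to 87.67 ms (up to floating-point rounding), and
# shares["T_SL2VL"] is roughly 59%, matching the table.
```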
3.7.5 Changing SL-to-VL Mapping (T_SL2VL)

Unlike the FT computation and distribution cost, this component of the total cost is specific to Double Scheme dynamic reconfiguration and should, therefore, be minimized in order for the scheme to be attractive. Interestingly, the first change of the SL-to-VL mapping can happen in parallel with FT computation and distribution. Therefore, the cost of this change, in terms of time, is almost completely hidden for medium to large networks. As a result, the network only sees the latency of changing the SL-to-VL mapping once, i.e., in the last step of reconfiguration. Table 3.2 clearly shows this trend. For networks consisting of more than 16 switches, the FT computation time becomes the dominant factor.

Number of switches | T_FT (sec) | T_SL2VL (sec) | T_SL2VL as % of T_total
8                  | 0.0337     | 0.0517        | 59.03
12                 | 0.1144     | 0.1894        | 62.36
16                 | 0.2143     | 0.1046        | 32.80
32                 | 1.1009     | 0.1998        | 15.36
48                 | 8.6338     | 0.3115        | 3.48
64                 | 18.2012    | 0.4179        | 2.24
96                 | 19.3370    | 0.5777        | 2.90

Table 3.2: Scaling of T_FT and T_SL2VL with network size.

The time spent in restoring the SL-to-VL mapping (step 7) can be significantly reduced by using LID-routed SMPs instead of directed-routed ones. This, however, may not be possible in step 3, as the old routing function (which is active in step 3) may not be connected due to the change in the network.

3.7.6 Detecting Channel Drainage (T_Drain)

Compared to the static scheme, the cost of collecting drainage information is also an overhead and should, therefore, be minimized. As shown in Figure 3.8, the total time spent in ensuring that the network is completely drained of old packets is a strong function of network load. Figure 3.8 shows that the drainage cost increases rapidly as the network approaches saturation. However, we argue that, even with its high cost in terms of reconfiguration time and network bandwidth, our scheme is a favorable solution to the difficult problem of reconfiguration because of the following:

1. Static reconfiguration will result in a large number of packets being lost, thus negatively affecting application performance.

2. The probability of deadlocks increases as the network tends towards saturation [War99]. Therefore, a deadlock-susceptible dynamic reconfiguration scheme is more likely to cause reconfiguration-induced deadlocks at higher load rates, thus degrading overall network performance.

Figure 3.8: The number of drainage GMPs versus the load rate.

Furthermore, even though the number of drainage GMPs at higher load rates seems significant (over 1600 for an 8-switch network), the total bandwidth consumed by these GMPs is still only 0.12% of the application packets injected into the network each second.

3.8 Summary

In this chapter, a systematic method for applying the Double Scheme to IBA networks was presented. Three key challenges for implementing the Double Scheme over IBA networks were identified. A number of IBA features and mechanisms that address these challenges, and how they should be used, were also described. It was shown that spatial and/or temporal separation of resources—which is the basic idea behind the Double Scheme—can be accomplished in an IBA subnet by distinguishing sets of service levels and destination local identifiers used to route packets in the network. Drainage of resources can be accomplished under the direction of subnet management using various methods and attributes.
An algorithm was proposed that uses mechanisms allowed by the IBA specifications to accomplish selective resource drainage. It was also shown that dynamic update of forwarding tables and destination names is supported by IBA in a manner consistent with that needed for the Double Scheme. Finally, simulation results presented in this chapter show that the cost of implementing the Double Scheme on IBA, in terms of reconfiguration time and additional management packets, justifies the reliability and performance benefits achieved. As a result, this work enables InfiniBand networks to better support applications requiring certain QoS guarantees that would not well tolerate intermittent performance drop-offs, as would be the case without deadlock-free dynamic reconfigurability. Interesting future work could focus on optimizing various reconfiguration steps, such as developing more efficient drainage schemes based on exponential back-off rather than periodic polling.

Chapter 4
On-Chip Interconnection Networks

The opportunities afforded by Moore's Law - doubling transistor density with each process generation - are increasingly challenged by three problems: on-die power dissipation, wire delays and design complexity [BG04]. The tiled architecture approach [DT01][BM02] to designing chip multiprocessors (CMPs) proposes to alleviate these problems by dividing each die into a modest number of low-power tiles which are connected via a packet-switched fabric. Each tile can be a CPU core tile, a cache tile, a specialized engine tile, or some combination of the three. This modular approach enables rapid scaling in the number of cores implemented on a die, with tens of cores emerging [ABB+07] and hundreds of cores expected within a decade [Bor07]. These tiles are laid out in the horizontal and vertical dimensions by abutment using some interconnect topology, as shown in the example in Figure 4.1. This on-chip interconnection network thus becomes the primary "meeting ground" for various on-chip components such as cores, the memory hierarchy, specialized engines, etc. in CMPs.

When the core count grows to tens of cores on a die, the on-chip network becomes a potential bottleneck from both performance and power standpoints. And the design parameter that affects the power and performance of the network most profoundly is its physical topology. Selecting the topology is often the first step in designing a network because the rest of the design parameters (routing algorithm, flow control method, etc.) depend on the topology. Selecting a topology requires balancing the requirements (minimum bandwidth, maximum acceptable latency, etc.) with the available resources (power budget, wires or pins, etc.).

Figure 4.1: An illustration of a 16-core tiled chip multiprocessor with routers (R) at each tile, I/O and inter-die interfaces, and memory controllers.

Topologies for off-chip interconnects have been extensively studied and are discussed in detail in interconnection network textbooks [DT04][DYN02][PD06]. In the case of on-chip networks, however, most of the research, and almost all of the on-chip networks implemented in many-core CMP chips, use a very narrow set of topologies.
As dis- cussed in Section 1.2 and shown in Table 1.1, fully-connected crossbar, bi-directional ring, and 2D mesh have been the topologies of choice so far. In this chapter, we argue that there are essential differences (Section 4.1) between on-chip and the well-studied off-chip networks. These differences necessitate the intro- duction of a new set of metrics (Section 4.2) related to wiring and embedding, which affects delay, power and overall design complexity of on-chip networks. We apply these 60 metrics to analyze a wide range of well-known topologies (Section 4.3) and show inter- esting tradeoffs and non-intuitive results relating to the choice of topologies. Further- more, using the example of hierarchical ring topology, we demonstrate how key dif- ferences between on- and off-chip networks mean that innovative changes to standard topologies can be made to improve the properties of on-chip topologies. Research presented in this chapter is derived from one of the first studies conducted at Intel Corporation to systematically look at how on-chip networks differ from – qual- itatively as well as quantitatively – off-chip networks, and prune the design space of topologies using a methodology based on an expanded set of metrics that is application to interconnects for many-core CMPs [JZH07]. 4.1 OCIN: What is different? Comparing various interconnection networks has long been understood to be a difficult problem. Liszka [LAS97] has argued that “scientifically determining the best network is as difficult as saying with certainty that one animal (an alligator) is better than another (an armadillo).” Therefore, in comparing networks, the clear understanding of the oper- ating environment and an appropriate selection of metrics is extremely important. This section highlights important differences between on-chip networks for CMPs and off- chip networks for multiprocessor systems so as to motivate a different set of metrics for comparing topologies for on-chip implementations. 4.1.1 Wiring In off-chip networks, wires are intra-board (using printed circuit boards), inter-board (through backplanes), and across cabinets (through cables). Links are not wide primarily 61 because of pin-out limitations (Table 4.1). There is flexibility in their layout arising from physical reality (almost 3D space: many layer boards, multiple boards, relative freedom of wiring with backplanes and cabinets) and wire length is not a first order concern. In on-chip interconnects, on the other hand: ² Wiring is constrained by metal layers laid out in horizontal and vertical directions only. Typically, only the two upper layers are available for global interconnects (in the future, this could increase to 4, which nonetheless still imposes limitations). ² Wire lengths need to be short and fixed (spanning only a few tiles) to keep the power dissipation and the resistive-capacitive (RC) product low. This allows wires to get the same benefits as logic from technology scaling [HMH01]. ² Wires should not “consume space.” That is, they should not render the die area below them unusable since silicon continues to be at a premium. This implies that only a portion of the tile (over the last level cache, for example) is available for OCIN wiring. Dense logic (CPU, first level cache, etc.) provide less oppor- tunity for routing global interconnect wires and, especially, for repeater insertion. Hence, wiring density or “wires per tile edge” becomes an important measure. 
As will be seen, wiring density can preclude higher dimensional topologies as OCINs, even though they have nice graph-theoretic properties like high bisection bandwidth and low diameter. ² There is a notion of directionality: topologies that distribute wire bandwidth per node more evenly across all four edges of each tile are more attractive, since they avoid unbalanced wiring congestion in one dimension. For example, with a ring topology, wires tend to get used primarily in one dimension only. 62 Off-die Multiprocessors Link (bits) On-die Multicores Link (bits) Intel ASCI Red Paragon 16 MIT RAW (Scalar, Dynamic) 256, 32 IBM ASCI White (SP Power3) 9 STI Cell BE 144 Cray XT3 (SeaStar) 12 TRIPs (Operand, non-Operand) 110, 128 IBM Blue GeneL 1 Sun UltraSparc T1 128 Table 4.1: Comparison of link widths in off- and on-chip interconnects. 4.1.2 Embedding in 2D Every topology needs to be laid out in 2D on die (we do not consider “3D stacking” in this section, but revisit it later in this chapter). This means that with higher (greater than 2D) dimensional networks, topological adjacency does not lead to spatial adjacency. This has significant implications both on the wire delay and the wiring density. Consider 2D embedding of a 3D mesh, as shown in Figure 4.2. For the longest path (from 0,0 (brown) to 3,3(green)) the topological distance is 9 hops but 3 of these hops span half the length of the chip! Therefore, the distance in tile-span units is 18. Furthermore, long wires could affect the frequency of operation and will impact the link power. Finally, some of the tiles towards the center of the topological graph have up to four bidirectional links crossing at least one of their edges, while tiles around the edges have one. Large number of links crossing a tile edge may force the link width to be less than that required by the architecture. 4.1.3 Latencies are of a different order Consider the ratio of the latency of a link (setup, transmitter drive, actual transfer on wires, receiver and handoff) to that of a router (including buffering, switching and flow control) for a single flit. In classical multiprocessor systems, it is of the order of 2-4 for single board systems and increases with the introduction of the backplane and cabinets. 63 0,1 0,2 0,3 1,1 1,2 1,3 2,1 2,2 2,3 3,3 3,2 3,1 0,0 1,0 2,0 3,0 0,1 0,2 0,3 1,1 1,2 1,3 2,1 2,2 2,3 3,3 3,2 3,1 0,0 1,0 2,0 3,0 0,1 0,2 0,3 1,1 1,2 1,3 2,1 2,2 2,3 3,3 3,2 3,1 0,0 1,0 2,0 3,0 0,1 0,2 0,3 1,1 1,2 1,3 2,1 2,2 2,3 3,3 3,2 3,1 0,0 1,0 2,0 3,0 (a) 0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,0 1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,1 0,2 0,3 1,1 1,2 1,3 2,1 2,2 2,3 3,3 3,2 3,1 0,0 1,0 2,0 3,0 (b) Figure 4.2: A (a) 3D mesh network (b) embedded on 2D die. In OCINs, this ratio is 1 at best (in unbuffered networks) and is 0.4 to 0.2 or lower in implemented designs. In addition, in a tiled architecture with tens or hundreds of cores, the latency through the interconnect (roughly, (number of hops)£ (router delay + wire delay)) will be a significant portion of the total protocol message latency. For these two reasons, it is critically important to minimize the interconnect latency through proper router design under both no-load and loaded scenarios. 
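To make this concrete, a rough per-message latency model along these lines is sketched below; the cycle counts are illustrative placeholders, not measurements from any particular design.

```python
# Rough, illustrative model of end-to-end network latency: with no contention,
# total latency ~ hops * (router delay + link delay).  All cycle counts below
# are placeholders, not measurements of any particular design.
def message_latency_cycles(hops, router_cycles, link_cycles):
    return hops * (router_cycles + link_cycles)

# Off-chip-like regime: the link (board/backplane/cable) costs several times
# the router traversal, so link latency dominates.
off_chip = message_latency_cycles(hops=6, router_cycles=4, link_cycles=12)

# On-chip regime: a one-tile wire is as fast as (or faster than) the router
# pipeline, so the router delay and the hop count dominate total latency.
on_chip = message_latency_cycles(hops=6, router_cycles=4, link_cycles=1)

print(off_chip, on_chip)   # 96 vs. 30 cycles for the same hop count
```

The same hop count therefore translates into very different sensitivities: shaving router pipeline stages (or bypassing them at low load) pays off far more on-chip than off-chip.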
4.1.4 Power Power is a major factor to consider in OCINs since the power consumption of the inter- connect relative to the rest of the components on-die is a much higher than in classical multiprocessor systems. We expect that at least 10-15% of the die power budget will be devoted to the interconnect. OCINs for research prototypes, where power optimization is not a first-order objective, have shown even higher percentages. For example, in the MIT Raw processor, interconnection network consumes almost 35% of the total chip 64 power [KTMW03]. Note that with OCINs, interconnect power grows with the number of tiles. Studies have shown that for many common network design, the overall intercon- nect power is roughly evenly distributed between links, crossbars, and buffers [Han03], whereas for off-chip networks link power is the overwhelming component. Combined with the fact that OCIN power dissipation happens within the die, requiring efficient and costly heat removal, this calls for new techniques to deal with OCIN power. 4.2 Metrics While comparing different network topologies it is often difficult to find suitable met- rics or weighted set of metrics that a designer must use [CK94][LAS97]. In this chapter, in addition to traditional metrics used to compare topologies, we present a new set of simple metrics that can be used to compare topologies for on-die networks. The require- ments for this set of measures includes the ability to meaningfully compare topologies without requiring extensive calculations or simulations. Furthermore, it was required that the conclusions based on these measures be relatively robust and not change sig- nificantly the choice of topology, for example, with changes to the underlying micro- architecture. A few assumptions are made with regards to the wiring and tile size that are important to note. ² To make the analysis fair in terms of the total bandwidth available at a router, the total number of wires entering or leaving a router is held constant across all topologies, as shown in Figure 4.3. ² With regards to tile size, a 4-6 sq-mm core in the 45nm process is assumed (other researchers have made similar assumptions [KZT05]). This yields 40-60 tiles for 65 W Unidirectional Ring W (a) Bidirectional Ring W 2 W 2 W 2 W 2 (b) 2D Mesh 2D Torus W 4 W 4 W 4 W 4 W 4 W 4 W 4 W 4 (c) Figure 4.3: Illustrating equivalent wiring budget across (a) unidirectional ring, (b) bidi- rectional ring, and (c) 2D Mesh/Torus topologies a die size of 256 sq-mm. Using this data point, we calculate that a tile edge will be approximately 90K¸ to140K¸ in size, where ¸ is a technology independent parameter equal to half the nominal gate length. The number of tiles will approx- imately double in successive process generations. We assume that tiles are square and all tiles are of the same size. ² Finally, it is assumed that the interconnect operates at a single frequency (but not necessarily synchronously). 4.2.1 Wiring Density Definition: Wiring density is the maximum number of tile-to-tile wires routable across a tile edge. In some contexts, the equivalent measure of the number of links or channels crossing a tile edge will be used. Wiring density is a quantitative measure of the wiring complexity of a topology. It relies on the estimate of the pitch (width + spacing) for each wire in a particular metal layer. The upper metal layers are typically used for OCIN. 
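As a rough illustration of this measure, the sketch below estimates routable wires per tile edge from a drawn wire pitch, a routable fraction of the edge, and a pitch derating. The particular pitch values, the 40-60% routable fraction, and the 60% derating are the assumptions discussed in the next paragraph and tabulated in Tables 4.2 and 4.3 (the exact table entries depend on the precise derating applied).

```python
# Hedged sketch of the wiring-density estimate: routable wires per tile edge.
# The tile edge and pitch are in units of lambda (half the nominal gate length);
# the routable fraction and pitch derating are the assumptions discussed next.
def wires_per_tile_edge(edge_lambda, routable_fraction, pitch_lambda,
                        pitch_derating=0.60):
    # Derate the drawn pitch to leave room for vias, power and clock grids.
    effective_pitch = pitch_lambda * (1.0 + pitch_derating)
    return int(edge_lambda * routable_fraction / effective_pitch)

# Example: 90K-lambda tile edge, 40% of the edge routable, 16-lambda pitch.
print(wires_per_tile_edge(90_000, 0.40, 16))   # ~1400 wires under these assumptions
```

Dividing the result by the link width (e.g., 256 wires for a 16B bidirectional link) then gives the number of channels a tile edge can support.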
Since the global interconnect is usually not routed over dense logic (e.g., core logic and first-level cache), the routable length per edge is a fraction of the total tile edge. This fraction is assumed to be 40-60%; 40-60% usability implies that dense logic occupies 36-16% of the tile area. To make allowances for via density, power and clock grids, the effective pitch is increased by another fraction (60%). Consequently, the wiring density is calculated for wire pitches ranging from 16λ to 32λ, based on pitch estimates made in prior publications [HMH01][CLKI06][Itr].

Routable area
per tile edge     16λ     20λ     24λ     28λ     32λ
40%              1125     900     750     643     563
50%              1406    1125     938     804     703
60%              1688    1350    1125     964     844
Table 4.2: Maximum number of wires per tile edge at different wire pitches for a tile size of 90Kλ × 90Kλ.

In Table 4.2 we consider a tile edge of 90Kλ, and in Table 4.3 we consider a tile edge of 140Kλ, with the routable area per tile edge being 40-60% of the total edge length. Tables 4.2 and 4.3 show that, in the worst case, it will be possible to route over 550 wires per tile edge. Thus, for 16B bidirectional links (256 bits), a topology requiring more than 2 channels per tile edge will be problematic (assuming 1 wire/bit). Thus, contrary to the usual belief, we see that on-chip wire bandwidth could be at a premium. Our analysis assumes relatively small percentage areas for the dense logic. A higher percentage implies a further reduction in the wires available for the global interconnect. On the other hand, other factors, such as changing the aspect ratio of the tile or restricting underlying logic to lower metal layers to enable routing over the logic hierarchy, could improve the routability.

Routable area
per tile edge     16λ     20λ     24λ     28λ     32λ
40%              1750    1400    1167    1000     875
50%              2188    1750    1458    1250    1064
60%              2625    2100    1750    1500    1313
Table 4.3: Maximum number of wires per tile edge at different wire pitches for a tile size of 140Kλ × 140Kλ.

4.2.2 Wire Segments: Measure of Delay, Power

The wire segments (or wire lengths) needed to implement a topology are classified as short (spanning 1 tile), medium (spanning a few tiles, typically 2 or 3), and long (spanning a fraction of the die, typically 1/4 or more). The average length of a wire segment is given by the relation:

\text{avg. wire segment length} = \frac{\text{total length of wire segments needed by the topology}}{(\text{no. of routers}) \times (\text{no. of wires entering a router})} \qquad (4.1)

The longest segment determines the wire delay and, hence, is the determinant of the frequency of operation. The total length of wire segments (in tile-span units) yields the metal layer requirements of the topology. The average length of a wire segment (in tile-span units) is an indicator of the link power of the interconnect under the assumption of an equivalent wiring budget, as discussed earlier in this section.

4.2.3 Router Complexity

The router implementation complexity impacts the cost of design and validation, as well as the area and power consumption. It is primarily determined by whether the network is buffered or not, the number of input and output ports, the number of virtual channels necessary or desired, the crossbar complexity and the routing algorithm. Some of these factors are not easily quantifiable. Additionally, a design may be able to absorb the complexity in favor of desirable properties offered (such as routing path diversity, etc.). Since the router complexity is largely implementation-dependent, an important abstract measure to capture this complexity is the degree of the crossbar switch, k.
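The cost factors enumerated in the next paragraph (arbitration, area, latency and power as functions of the degree k) can be summarized in a small, hedged sketch; only the growth rates below follow the text, and the unit constants are arbitrary.

```python
# Hedged sketch of crossbar cost scaling with router degree k.  Only the growth
# rates follow the discussion in the text; absolute units are arbitrary.
def crossbar_cost(k):
    return {
        "arbitration": k ** 2,        # arbiter count and per-arbiter complexity each ~linear in k
        "area":        k ** 2,        # both crossbar dimensions grow linearly with k
        "latency":     k,             # input-to-output wire length grows linearly with k
        "power":       k ** 2,        # total wire capacitance grows quadratically with k
        "power_vs_3port": round((k / 3) ** 2, 2),   # normalization used in Section 4.2.4
    }

for degree in (3, 5, 7):              # ring, 2D mesh/torus and 3D mesh/torus routers
    print(degree, crossbar_cost(degree))
# e.g., a 5-port router gives power_vs_3port = (5/3)^2 ~ 2.8, as noted in Section 4.2.4.
```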
The crossbar switch is the centerpiece of the router in terms of area and functionality. The cost of the crossbar can be measured in terms of:

1. The arbitration control logic, which increases quadratically with degree k, because the number of arbiters and the complexity of each arbiter both increase linearly with k.

2. The area of the crossbar, which also increases quadratically with k, because both the X and Y dimensions of the crossbar grow linearly.

3. The latency through the crossbar, which grows linearly with k, since the distance and wire length through the crossbar from an input port to an output port grow linearly with the degree.

4. The power consumption, which grows quadratically with k, because the total wire capacitance within the crossbar grows quadratically with k.

Thus, the cost or complexity of the crossbar, and by extension that of the router, is closely tied to its degree.

4.2.4 Power

In OCINs, the chief sources of power consumption are: (1) the switching activity on the links, (2) the buffers in the router, and (3) the crossbar switch. The average wire segment length is an indicator of the link power (Section 4.2.2). Figure 4.4 compares link power for various topologies assuming the same switching activity for all topologies. The crossbar degree is an indicator of the switch power. We normalize the switch power to be 1.0 for a 3-port switch (one that would be used in a bidirectional ring). Thus, a 5-port router (used in a 2D mesh or 2D torus) would have a switch power of (5/3)^2 = 2.8. Another factor to note is that most of this power dissipation occurs due to switching activity in the crossbar and links, not device leakage. Thus, the effect of topology on activity can make a big difference to the total power. Characterizing buffer power is more difficult. For a buffered interconnect, the number of buffers and the corresponding power consumption increase linearly with the number of input ports, or degree, of the router. The number of buffers and how they are managed is extremely design-dependent. Finally, the relative distribution of power among the three sources needs to be taken into account and the proper weighted average computed.

Assuming the ring as the baseline, the relative trend in power growth is calculated for various topologies based on the above observations. Higher-dimension networks suffer from severe growth in either crossbar or link power. While clever design optimizations could reduce the power for any specific implementation, Figure 4.4 shows that more effort is required as the degree increases. On the other hand, if we compare energy consumption, which is the work done to move packets through the network, the higher average hop count of a network will adversely impact its energy consumption. This factor counters the higher power dissipation of higher-dimension networks.

Figure 4.4: A comparison of link and crossbar power.

4.2.5 Performance Measures

Performance metrics most commonly used in comparing network topologies are average and maximum number of hops and bisection bandwidth. Average and maximum number of hops serve as measures of network latency, even though they are only a partial indication of the actual message latency. The total message latency is the latency through the router times the number of hops.
The latency through the router includes the link (wire) latency and it varies with the load: typical conditions are no-load (espe- cially with bypassing of the router pipeline), light load and high load. A bisection of the interconnect is one which partitions the network into equal or nearly equal halves. The bisection bandwidth is the minimum bandwidth, measured as the number of channels (or links), over all possible bisections of the network. Under the constant switch bisec- tion that is used in this paper, simpler topologies such as un-buffered rings could have wider links than those that use high degree crossbars. 4.3 Comparing Topologies In this section, we compare several well-known topologies for OCINs, using the met- rics described in Section 4.2. For completeness, some critical characteristics of each topology are briefly discussed and issues with their on-chip implementation are also pre- sented. With buffered networks, it is assumed that each topology has the same virtual channel resources so that they can be used either for deadlock-freedom or for perfor- mance enhancements (to prevent head-of-line blocking). 71 4.3.1 Bi-directional Ring Bi-directional ring, which is a 1D torus, is the simplest topology considered in this paper. This simplicity comes at the cost of high average hop count (N/4, where N is the number of nodes) and low bisection bandwidth that remains constant at four uni-directional channels. In case of un-buffered rings routing at each node can be accomplished in a cycle. Since the ring channels are wider (2£ that of 2D mesh/torus) and the per-router delay is low the ring is a worthwhile candidate for small N. From the crossbar and wiring complexity issues, the ring has a clear advantage as shown in Table 4. The main drawbacks of the ring are its behavior as N increases: a) High average hop count. With un-buffered rings this leads to throttling at the source and high latencies, which could further be exacerbated if packets that are unable to sink are misrouted. B) Absence of routing path diversity has the potential to degrade performance under loaded conditions. C) The topology is inherently not fault tolerant to route around router or link failures. 4.3.2 2D Mesh 2D mesh is an obvious choice for tiled architectures, since it matches very closely with the physical layout of the die. It is a well-understood topology with relatively simple implementation and wiring density of 1 channel per tile edge. A low wiring density means that no stringent constraints on channel width are placed. Router for 2D mesh requires a careful design since high frequency designs are usu- ally pipelined. Pipeline bypassing at low loads is necessary to achieve low delay through a router [Han03]. The crossbar power is relatively high compared to the ring - split cross- bars (quadratic reduction in crossbar power traded for a linear increase in link power) could be a way to address this problem [NPK + 06]. 72 One of the main drawbacks of 2D mesh is the non-uniform topology view from node standpoint. That is, less bandwidth is available to nodes at corners and edges (i.e., fewer wiring channels enter/leave node) while these nodes have a higher average distance from other nodes. 4.3.3 2D Torus Adding wrap-around links to a mesh creates a torus topology which decreases the aver- age and maximum hop counts and doubling the bisection bandwidth. The wrap-around links, however, also double the number of wiring channels per tile edge to 2. 
The disad- vantage of long wires which span the length of the die is overcome by the technique of ”folding” which yields a maximum wires length spanning only 2 tiles. 4.3.4 3D Mesh Unlike 2D mesh, the 3D mesh and other higher dimensional topologies do not satisfy the adjacency property. Although, 3D mesh has lower average and maximum hop counts, the topology requires long wires (spanning half the die) and the wiring density is much higher near the center than along the edges, i.e., 4 channels per tile edge as shown in Figure 4.2. Finally, the crossbar degree of 7 has implications on router complexity, no-load latency and power. 4.3.5 3D Torus and Higher-Dimensionalk-aryn-cube Topologies 3D torus shares some of the advantages of the 2D torus such as symmetry in traffic distribution and improved distance measures. It has disadvantages similar to those for 3D mesh, such as crossbar degree of seven and long wires. The wiring density for 3D 73 torus is even worse than 3D mesh with 5 channels per tile edge. The wiring density and crossbar degree only gets worse in higher dimensional meshes and tori (k-ary n-cube withn > 3). On these accounts, these topologies have been weeded out, except in case of CMPs with very large number of cores where the average distance of 2D topologies becomes prohibitive. However, there is one variant of the hypercube (2-ary n-cube) - cube connected cycles - which merit further discussion (Section 4.3.7). 4.3.6 Direct Fat Tree Fat tree topology has a unique characteristic that allows all non-conflicting end-to-end connections to have full wire bandwidth. That is, any node can communicate with any other node in the network while having 100% per channel bandwidth available for this communication (as long as no other nodes are communicating with the two nodes involved). Interestingly, implementation of a direct fat tree network does not require long wires. For example, the maximum wire length is two tilespan. This is possible because tiles can be connected so that the routing resources can be shared. Figure 4.5 shows a 64-node direct fat tree network. Of the 16 routers in this arrange- ment, four have two 4£ ports, eight have two 4£ and one 8£ port, while four have two 4£ and two 8£ ports to them (all of this, in addition to the ports connecting the four adjacent nodes). Figure 4.6(c) shows one of the eight routers with one 8£ port. It is obvious that the implementation cost of the crossbar in this case is prohibitive. Further, as the network size grows so does the complexity (degree) of the router. Hence, this topology is also weeded out. 74 32 16 16 16 16 0 1 2 3 4 5 7 6 8 8 8 9 10 11 12 13 15 14 0 1 2 3 4 5 7 6 8 8 8 9 10 11 12 13 15 14 0 1 2 3 4 5 7 6 8 8 8 9 10 11 12 13 15 14 0 1 2 3 4 5 7 6 8 8 8 9 10 11 12 13 15 14 Figure 4.5: A 64-node Fat Tree network. 4.3.7 Cube Connected Cycles The Cube Connected Cycles (CCC) topology [PV81] is derived from the hypercube with each node in an n-dimensional hypercube being replaced by a ring of size n. Hence, an n-dimensional CCC has (n£2n) nodes. CCC exhibits some favorable qualities: good bisection bandwidth, acceptable aver- age and maximum latency and moderate crossbar degree. The cons for this topology include unevenly distributed channel bandwidth and the need for some long wires span- ning almost half of the chip length. 
However, the biggest drawback of this topology is that, unlike the other networks discussed so far, adding nodes to the network requires the rewiring and layout of the entire chip, since a node has to be added to each cycle 75 0 1 2 3 4 5 6 R 7 8 9 10 11 12 13 14 R 15 0 1 2 3 4 5 6 R 7 8 9 10 11 12 13 14 R 15 0 1 2 3 4 5 6 R 7 8 9 11 12 13 14 R 15 0 1 2 3 4 5 6 R 7 8 9 10 11 12 13 14 R 15 10 4-7 4-7 8-B X 0 1 X 2 3 Figure 4.6: Layout of a 64-node Fat Tree network and one of the eight routers with one 8£ port shown. in the hypercube. Finally, the ratio of nodes in an (n+1) dimensional CCC to an n- dimensional CCC is 2(1+1/n) – an inconvenient and high ratio for network growth (e..g, n = 4 yields 64 nodes; n= 5 yields 160 nodes), necessitating incomplete networks for many configurations of interest. 4.3.8 Hierarchical Ring Topology Topologies compared in Sections 4.3.1 through 4.3.7 are quite typical in off-chip inter- connects and some of them have been commonly proposed for on-die implementation as well. There are, however, variations of these basic topologies that can provide good cost-performance tradeoffs [BD06]. One such topology is the hierarchical ring topol- ogy. In this section, the hierarchical ring topology is discussed to illustrate how such non-traditional topologies can offer interesting alternatives for on-chip implementation. Arranging rings in a hierarchical manner has the potential to overcome some of the drawbacks of simple bidirectional rings discussed in Section 4.3.1 while retaining the key advantages of that topology. Hierarchical ring (H-Ring) topologies can be arranged 76 in several different ways, based primarily on the number of levels in the hierarchy (usu- ally two, three or four for network sizes up to 256 nodes). In this work, we introduce two new ideas that can be used to improve the bisection bandwidth and reduce average latency of the hierarchical ring topology. We also introduce a 3-tuple representation to uniquely describe various H-Ring configurations. A typical H-Ring network, as proposed in [RS97][HJ01], consists of three or more local rings, where each local ring connectsn endnodes. These local rings are connected together in a hierarchical arrangement, culminating at a single global ring. Previous studies have found that in off-die multiprocessor systems organization of the memory hierarchy as well as memory access pattern impacts the optimal hierarchy [HJ01]. In this work we do not attempt to find the optimal arrangement but focus on the cost and performance implications of embedding H-Ring topologies in 2D. 0 1 3 2 4 5 7 6 9 8 11 12 13 15 10 14 (a) 0 1 2 3 7 4 5 6 10 11 8 9 13 14 15 12 (b) Figure 4.7: (a) 16-node H-Ring with one global ring and (b) its coresponding layout. Figure 4.7(a) shows a two-level H-Ring with four nodes in each of the four local rings. Figure 4.7(b) shows how this 16-node H-Ring can be embedded on a 2D die. Note that maximum length of a wire segment is one tile. This layout can be extended 77 for 32, 64 and higher network sizes. Wire segments longer than one tile-span are not needed so long as the topology is organized as four adjacent local rings. To improve performance and fault tolerance, we propose two new design parameters to the basic H-Ring topology. The first parameter is the number of taps that a non-local ring has on the ring at the next lower level. For example, Figure 4.7(a) shows a 16-node H-Ring where the global ring has one tap on each global ring. 
Figure 4.8(a) shows a 64-node H-Ring where the global ring has four taps on each local ring. Figure 4.8(b) shows a wire-efficient layout of this topology that does not require wire segments longer than one tile-span. By placing the two pairs of taps at diametrically opposite ends of the local rings, the average distance from an endnode to a tap is reduced, but having four taps increases the average distance on the global ring. 1 2 3 4 5 6 7 8 0 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 0 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 0 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 0 9 10 11 12 13 14 15 (a) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 (b) Figure 4.8: (a) 64-node H-Ring with one global ring and four taps per local ring, and (b) its coresponding layout. The second parameter we propose is the number of ring in each level of hierarchy. Networks shown in Figures 4.7(a) and 4.8(a) both have one global ring connecting four 78 local rings. This may cause the global ring to become a performance bottleneck - a problem that having additional global ring can mitigate. Finally, in order to uniquely represent the various configurations of H-Ring topolo- gies, we propose a 3-tuple representation for each level of the hierarchy. The 3-tuple is n,r,t, wheren is the number of nodes at a particular level of the hierarchy,r is the num- ber of rings connecting this level to its next lower level, andt is the number of taps. For example, the 64-node H-Ring network shown in Figure 4.8(a) is denoted asf16,1,1g: f4,1,4g. Table 4.4 shows a comparison between four different H-Ring topologies, each with 64-nodes. At the lowest hierarchical level, each of these four topologies has 16 nodes (i.e., n = 16), connected via one ring (r = 1) and one tap per node on to the ring (t = 1). To conserve space, we only show the 3-tuple for the second level of the hierarchy in each case in Table 4.4. Parameter H-Ring f4,1,1g H-Ring f4,1,2g H-Ring f4,1,4g H-Ring f4,2,2g Max. Hop Count 18 12 12 10 Avg. Hop Count 9.5 8.6 5.7 6.8 Bisection BW 4 4 4 8 Routers w/ Degree 3,5 4, 60 8, 56 16, 48 16, 48 No. of Wire Seg. 68 68 72 72 Total Wire Seg. Len. 68 78 80 92 Avg. Wire Len. 11.1 1.2 1.25 1.44 Table 4.4: Comparison of four 64-node H-Ring topologies. Qualitatively, H-Ring topologies can offer better fault isolation and avoid single point of failure - two of the main drawbacks of the simple ring topology. Furthermore, H-Ring topologies with more than one global rings and taps per local ring can provide marginally higher bisection bandwidth and some path diversity. On the whole, H-Rings have improved performance metrics as compared to the simple ring and compare well 79 with other topologies. However, if applications are not mapped properly to the net- work’s hierarchy, the inter-ring interfaces can become major performance bottlenecks. Furthermore, since traffic must be buffered at least at the inter-ring interface, ensuring starvation freedom and deadlock avoidance is not as easy in H-Ring topologies as it is in simple rings. 4.3.9 Results To make the discussions concrete, Tables 4.5 and 4.6 present the values for metrics discussed earlier in the paper for an example 64 node network. 
Table 4.5 compares the cost of various topologies while Table 4.6 presents a comparison of the same 64-node example network using commonly used performance metrics Parameter Ring 2D Mesh 2D Torus 3D Mesh 3D Torus Fat Tree CCC Wire Seg.: L,M,S 0,0,64 0,0,112 0,96,32 0,128,16 64,64,64 0,128,0 26,8,62 Total No. of Wire Seg. 64 112 128 144 192 128 96 Total Wire Seg. Len. 64 112 224 272 448 256 182 Avg Wire Seg. Len. 1 1.75 3.5 4.25 7 4 2.8 Wiring Density 1 1 2 4 5 4 4 No. of Switches 64 64 64 64 64 16 64 Crossbar:Link Power 1 2.3 3.1 4.8 6.5 44.5 5 Table 4.5: Comparing the cost of topologies with 64 nodes. In addition to the standard performance metrics, Table 4.6 shows the latency measure for each network under no-load condition. Also, a recently-proposed measure of “effec- tive bandwidth” [DYN02] is used to compare the upper-limit on the effective bandwidth of each of the network under diverse traffic patterns. The following conclusions can be drawn from this investigation: 80 Parameter Ring 2D Mesh 2D Torus 3D Mesh 3D Torus Fat Tree CCC Avg. Distance 16 7 4 14 3 9 5.4 Max. Distance 32 14 8 7 6 11 10 Bisection BW 8 16 32 32 43 21 21 Eff. BW: Uniform 16 32 64 64 64 42 42 Eff. BW: Hot-spot 1 1 1 1 1 1 1 Eff. BW: Bit Comp. 8 16 32 32 43 21 21 Eff. BW: NEWS 22.4 64 64 64 64 64 64 Eff. BW: Transpose 7 28 56 56 64 36.8 36.8 Eff. BW: PS 8 32 64 64 64 42 42 Table 4.6: Comparison of performance of topologies with 64 endnodes. ² The fat tree topology is weeded out, because of the prohibitive crossbar cost among other issues. ² The cube connected cycles has relatively good measures, except for the wiring complexity. The latter combined with some of the qualitative cons mentioned earlier (scalable design) makes this topology unattractive. ² The 3D torus and 3D mesh have attractive metrics, except for wire complexity and a high crossbar degree. It is possible that these may become attractive candidates for very large number of cores (high hundreds, low thousands). ² The ring has poor measures, has the potential for interconnect bandwidth wastage, and suffers from other drawbacks (no path diversity, poor fault tolerance). Hence we find this topology not suitable for high node counts. ² The H-Ring, 2D mesh, and 2D torus have attractive topological properties as the table summary shows, though the H-Ring’s bisection bandwidth is borderline. 81 The hierarchical ring has some qualitative cons mentioned earlier (poor fault tol- erance, susceptibility to performance bottlenecks at interface between the rings, poor path diversity, etc). The 2D mesh and 2D torus are clear candidates for a more detailed study. While the mesh has simplicity on its side, the torus has superior topological and performance advantages, which could potentially be offset by wiring complexity. The hierarchical organization merits a deeper study requiring the impact of the coherence protocols on the overall complexity, which is outside the scope of this investigation. 4.4 Summary and Open Issues In this chapter, we have argued that OCINs for tiled architectures necessitates the intro- duction of new metrics related to wiring and embedding which affects delay, power, and overall design complexity. The application of these metrics has shown several interest- ing results: A) Wires are not plentiful and bandwidth is not cheap, especially, for higher dimensional networks - this is contrary to normal belief. 
B) Higher dimensional net- works, while topologically appealing, have serious limitations arising from long wire lengths needed to embed them on the 2D Silicon substrate: they could limit the fre- quency of operation and they consume power. Further, they typically need high degree crossbars which also exacerbates the wiring and power issues. C) The wiring analysis gives an indication of the wiring complexity of each topology. This could be used, for example, in analyzing if additional metal layers are needed to provide a richer intercon- nect considering the attendant costs and design decisions. The methodology presented here can not only be used to weed out topologies but also to examine if microarchitectural or circuit optimizations can be done to improve 82 the cost metrics when a topology has other desirable features. For example, if the cross- bar power is excessive in a 2D mesh, could alternative designs (e.g., partitioned cross- bar) be used to reduce the power? Since this is the first investigation which proposes a new methodology needed for analyzing the emerging area of on-die interconnects for tiled chip multiprocessors, there are a number of interesting areas for further study: A) The buffer component of power in the OCIN has not been considered in this chapter. B) The energy consumption of networks is another important area which has not been considered in this chapter in sufficient detail. C) An important class of networks which we have not considered are indirect networks. At first blush, these seem to require long wires spanning half the chip and may have issues with layout since they seem to need skewed aspect ratios. A deeper analysis of indirect networks is merited, however. Additionally, there are a number of other interesting variants of topologies we have not considered. For example, one interesting topology - a variant of the 2D mesh [BD06] - which reduces the average distance merits a study in accordance with the methodol- ogy we have outlined. D) Non-uniform sized tiles will affect router layout and wiring. Whether such aspects can be incorporated as a meaningful metric remains to be seen. E) Our analysis assumes that two metal layers are available for OCINs - the analysis can be extended in a straightforward manner if additional metal layers become available. If they do, different tradeoffs may exist and higher dimensional topologies could become appealing. F) A wire efficient embedding of the 3D mesh is possible where wires are at most 2 tile-spans long, but this embedding has an issue with design re-use. When a topology grows, the layout and wiring of the entire chip needs to be redone. Wire efficient embedding is a topic for further research. 83 Chapter 5 Cubic Ring Network 5.1 Introduction An ideal on-chip network should consume power only when it is delivering packets, not when it is waiting for packets to be injected. More precisely, only those network resources (links, buffers, crossbar, etc.) that are along the path of packets currently in the network should consume power, while the rest should be off 1 . However, the state-of-the-art in on-chip networks is very different from this ideal model. On-chip networks in implemented designs are always ready to operate at their peak performance and consume a significant portion of the total chip power (40% and 28% of each core’s power budget in MIT Raw [KTMW03][TLAA03] and Intel Teraflop [Sri07] processors, respectively). 
Given tight power constraints, there is a growing need for power-efficient on-chip networks with sufficient bandwidth. In [Sri07], for example, in order to prevent the network from becoming a bottleneck, it was argued that CMPs fabricated in 32nm must be able to provide bisection bandwidth of more than 2 TeraBytes/sec and consume under 10% of the total chip power. A potential approach for meeting power-performance goals is to turn off network links, ports and parts of the router when the traffic injection rate is low, and turn them back on only when demand for bandwidth increases sufficiently. The key observation 1 Throughout this chapter, a network resource is said to be off when it has been power-gated. Mech- anisms proposed in [HGK09] for powering down various network resources are assumed. A network resource is on if it is operating normally. This terminology is consistent with what was used in previous works, such as [SP03], [SP04] and [SP07]. 84 behind this approach is the temporal variance in network traffic injection rate across applications and during the execution of large scientific applications. In [WOT + 95], it was shown that in a 64-node network, bandwidth demands of applications in the SPLASH benchmark suite vary from less than 0.125 Bytes/FLOP (Barnes) to 2 B/FLOP (FFT), representing a 16x difference across applications. Similarly, our analysis of seven benchmark applications in Section 2.4 showed that during the execution of large paral- lel applications there are long periods of low traffic injection followed by bursts of high injection. To take advantage of this course-grain variability in the volume of traffic, a flexible network topology is needed which can allow segments 2 of on-chip routers to be turned off and on based on statically- or dynamically-determined demand for net- work bandwidth. In this work, we propose such a network topology, called the Cubic Ring (cRing) topology. This polymorphic topology provides a simple yet flexible infra- structure for on-demand bandwidth provisioning in on-chip networks. A cRing network is obtained from a k-ary n-cube torus network by removing selected network links in all but one dimension. The resulting topology – a hierar- chical arrangement of torus rings – can have different configurations, with normalized bisection bandwidth ranging from 1 bidirectional link to k (n¡1) ¡ 1. Unlike previous proposals, however, cRings can trade off network power and bandwidth without a sig- nificant increase in the average distance between nodes. This flexibility is made possible by the properties of the topology and a simple routing algorithm. The rest of the chapter is organized as follows: Section 5.2 surveys run-time power management schemes for on-chip networks proposed in the literature and compares the proposed cRing topology with other ring-based topologies. The cRing topology is for- mally described in Section 5.3 and characterized in its most general (n-dimensional) 2 The combination of link, logic and buffers at both ends of a link is referred to as a segment [HGK09]. 85 form in Section 5.5. A deadlock-free routing algorithm for cRing networks is described, and its deadlock-freedom is proven, in Section 5.4. In Section 5.6, performance evalua- tion of the cRing topology is presented. We conclude with a brief summary of the cRing topology and its applications in Section 5.7. 5.2 Ring-Based Topologies Several ring-based topologies have been proposed in previous work on shared memory multiprocessors. 
One of the most well-studied is the hierarchical slotted ring, which received significant attention several years ago in academia [HS94][ZY95][FVS95][HJ01] as well as in indus- try [CGB91][Wil87]. Early CMP designs commonly featured the 2D mesh topology [TLAA03][Sri07][Kar06][D. 07] because of its short (one-core length) network links and natural embedding in the planar 2D die layout. However, hierarchical rings have been shown to outperform 2D meshes in systems with up to 128 nodes in analytical [HJ01][HJ94] and experimental studies [RS97]. The performance gain is realized primarily because of two factors: the memory access locality of the workloads, and the simplicity of ring routers compared to mesh routers. The proposed cRing network is different from hierarchical rings in that it allows for more than one ring to connect two levels of the hierarchy. As such, the cRing topology avails both the locality-friendliness and simplicity of the hierarchical ring topology as well as the increased bisection bandwidth of the network through additional rings to connect each level of the hierarchy. 86 5.3 Formal Description of Cubic Ring Topology The cubic ring topology can best be described formally by comparing it with the more well-known k-ary, n-cube torus topology. A cRing network is obtained from a torus network by removing selected communication channels in all but one dimension. Stated formally, a torus network is a graph I T (N T ;C T ), where vertices of the graph N T rep- resent the nodes (routers) in the network, and the edges of the graph C T represent the communication channels (links) connecting the nodes. A cRing network I C (N C ;C C ), derived fromI T , hasN C =N T andC C ½C T . A bidirectional n-dimensional, radix-k torus, or k-ary n-cube torus, consists of N = k n nodes arranged in an n-dimensional lattice with k nodes along each dimension. Each node, A, has a uniquen-digit, radix-k address(a n¡1 ;a n¡2 ;:::;a 1 ;a 0 ) that denotes its location in the lattice [DT04]. Each node is connected via bidirectional channels to 2n nodes with addresses that differ by§1 (modk) in exactly one address digit. That is, any two nodes, A and B, are connected to each other iff b i = a i for all 0· i· n¡1 except one,j, whereb j = a j §1. This results in a total ofnk n bidirectional inter-node channels in the network, organized as nk n¡1 bidirectional rings each of size k. For example, consider the networks in the 4£4£4 lattice shown in Figure 5.1. We use Cartesian coordinates to simplify the graphical representation of the networks and their explanation. Figure 5.1(a) shows 64 nodes connected in a 4-ary 3-cube torus topology. Ann-dimensional, radix-k,R-ring cubic ring network, orn-k-R cRing for short, can be derived from the corresponding n-dimensional, radix-k torus by removing a subset of rings from the torus topology in a hierarchical fashion. If a ring connecting a given node to its neighbors in thei-th dimension is removed, all rings incident on the node in dimensionsi 0 >i are also removed. This property has the effect of creating a hierarchy of rings where each dimension becomes a level in the hierarchy. 
In this hierarchy, all 87 (a) (b) x y 0,3,0 0,3,1 0,3,2 0,3,3 0,2,0 0,2,1 0,2,2 0,2,3 0,1,0 0,1,1 0,1,2 0,1,3 0,0,0 0,0,1 0,0,2 0,0,3 z=0 plane (c) y z x=0 plane 3,0,0 3,1,0 3,2,0 3,3,0 2,0,0 2,1,0 2,2,0 2,3,0 1,0,0 1,1,0 1,2,0 1,3,0 0,0,0 0,1,0 0,2,0 0,3,0 (d) Figure 5.1: A 64-node (a) 4-ary, 3-cube torus network, (b) 4-ary, 3-cube,R-ring cubic ring network withR =f0001;0101;1111g, (c) thexy plane of the 4-ary, 3-cube cRing network atz =0, and (d) theyz plane atx=0. the rings in a given dimension are at the same hierarchical level and each ring in a given dimension has at least one node that is part of another ring in the next higher dimension, if there exists one. The network is fully connected if and only if at least one ring in each dimension is connected and all the rings in the lowest dimension, dimension 0, are connected. Dimension 0 does not imply a particular dimension in the Cartisan coordinates. Any dimension in which all rings are connected can become the lowest hierarchical level, or dimension 0. We discuss the ordering of dimensions in more detail in Section 5.3.1. In the notation for cRing networks, R =fr n¡1 ;r n¡2 ;:::;r 1 ;r 0 g, wherer i , 0· i· n¡1, is ak-bit string. Thel-th bit of each stringr i ,0·l·k¡1 (bit 0 being the LSB and bit k¡1 being the MSB) corresponds to a specific set of one or more torus rings in the i-th dimension, and the value of each bit indicates the presence (if value = 1) or 88 absence (if value = 0) of the corresponding set of rings (or ring, if i = n) in the cRing topology. That is, ifr i [l] = 0, the set of rings connecting nodes witha i¡1 = l in thei-th dimension is removed from the corresponding torus. This also means that nodes with a i¡1 = l do not have any active channel in the dimensions greater thani. We have that r 0 [l] = 1 for alll,0· l· k¡1, so the expressiona i¡1 = l, which is meaningless for i = 0, applies only fori > 0. Finally, for alli,0 < i· n¡1,r i [l] = 1 for at least one value ofl. Consider the example of the cRing network shown in Figure 5.1(b). This 64-node, 4- 3-R cRing network withR =f0001;0101;1111g shows how higher-dimensional cRing networks can be constructed. In this network, each x-dimension ring connects to two rings in the y dimension; at x = 0 and x = 2 as indicated by r 1 [0] = r 1 [2] = 1. This results in eight y-dimension rings connecting the sixteen x-dimension rings across the fourxy planes. Figure 5.1(c) shows thexy plane atz =0 as an example. Finally, eachy-dimension ring connects to one ring in thez dimension, as indicated byr 2 [0]=1. This results in the twoz-dimension rings connecting the eighty-dimension rings, across the twoyz planes atx = 0 andx = 2. Figure 5.1(d) shows theyz plane at x = 0. The plane atx = 2 would be identical, whereas the planes atx = 1 andx = 3 will not have any nodes connected in they orz dimension. The network shown in Figure 5.1(b) has three levels of hierarchy corresponding to the three dimensions. Rings in the x dimension constitute the lowest level of the hierarchy, rings in the y dimension constitute the next higher, and finally the rings in the z dimension constitute the highest level of the hierarchy. In the rest of the chapter, we refer to ring(s) in the lowest level (x-dimension in Figure 5.1(b)) as “local” ring(s), ring(s) at the next higher level (y-dimension in Figure 5.1(b)) as “intermediate” ring(s), and ring(s) in the highest level (z-dimension in Figure 5.1(b)) as “global” ring(s). 
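To make the R-string notation concrete, the following sketch (not from the dissertation) enumerates the links of a cRing directly from n, k and the r_i strings; for readability, the strings below are indexed left-to-right by the digit value l, rather than by the LSB-first bit numbering used in the text.

```python
from itertools import product

def cring_links(n, k, R):
    """Enumerate the bidirectional links of an n-dimensional, radix-k cRing.

    R = [r_0, r_1, ..., r_{n-1}], each a k-character string; R[i][l] == '1'
    keeps the dimension-i rings through nodes whose digit a_{i-1} equals l
    (r_0 must be all ones).  Nodes are tuples (a_0, a_1, ..., a_{n-1}).
    """
    def connected_in_dim(node, i):
        # Hierarchical rule: a node keeps its dimension-i ring only if it also
        # keeps its rings in every dimension below i.
        return all(R[j][node[j - 1]] == "1" for j in range(1, i + 1))

    links = set()
    for node in product(range(k), repeat=n):
        for i in range(n):
            if connected_in_dim(node, i):
                neighbor = list(node)
                neighbor[i] = (node[i] + 1) % k      # +1 neighbor on the dim-i ring
                links.add(frozenset((node, tuple(neighbor))))
    return links

# The 64-node example of Figure 5.1(b): local rings everywhere (r_0), y-rings
# at x = 0 and x = 2 (r_1), and z-rings at y = 0 (r_2).
links = cring_links(3, 4, ["1111", "1010", "1000"])
print(len(links))   # 104: sixteen x-rings, eight y-rings and two z-rings of 4 links each
```

The same routine can be pointed at other R configurations to check connectivity or to count how many links a given configuration removes relative to the full torus.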
A 89 cRing with n > 3 will have multiple “intermediate” levels, denoted as intermediate1 for ring(s) at i = 1, intermediate2 for ring(s) at i = 2, and so on. Network nodes are classified based on the highest level ring of which they are a part. So, in the example network shown in Figure 5.1(b), node (0,0,0) is a “global ring node” as the highest level ring of which it is a part is the global ring. Similarly, node (0,1,0) is an “intermediate ring node” and node (0,0,1) is a “local ring node.” Finally, the local ring on which the source (destination) of a packet lies is called the “source (destination) local ring” and packets for which the source and the destination lies on the same local ring are referred to as “local packets.” 5.3.1 Mixed-Radix and Isomorphic Cubic Ring Networks Like mixed-radix torus networks, each dimension in a cubic ring network may have a different radix. This will mean a differentk for each dimension. Figure 5.2(a) shows a 2,4-ary 2-cube R-ring cRing with R =f0101;11g. An isomorphic pair of this mixed- radix network is shown in Figure 5.2(b). Two cRing networks are isomorphic if the r i , 0· i· n¡1, of one network can be obtained by rotating the r i of the other. The network shown in Figure 5.2(c) is not isomorphic to the other two networks. Isomorphic cRing networks have identical performance characteristics like average and maximum hop count between two nodes and maximum throughput. 1,0 1,1 1,2 1,3 0,0 0,1 0,2 0,3 (a) 1,0 1,1 1,2 1,3 0,0 0,1 0,2 0,3 (b) 1,0 1,1 1,2 1,3 0,0 0,1 0,2 0,3 (c) Figure 5.2: An 8-node, 2,4-ary2-cubeR-ring cRing network with (a)R =f1010;11g, (b)R =f0101;11g, and (c)R =f1001;11g. 90 5.4 Routing Function for Cubic Ring Networks Hierarchical arrangement of rings in the cRing topology simplifies design of the routing algorithm. The route that each message takes from its source to its destination can be decomposed into (at maximum) two segments – the up segment and the down segment. In the up segment of the route, the message tries to reach the highest dimension in which the source and the destination differ. This is done by routing to the nearest node that connects in the next higher dimension, and so on. Once the highest dimension in which the source and the destination nodes differ is reached, the message travels down the hierarchy of rings, crossing dimensions in strictly decreasing order, reducing to zero the offset in one dimension before routing in the next until the destination node is reached. In effect, the route in the down segment is identical to that supplied by the e-cube routing function [DS87]. If the message is injected into the network at the node connected in the highest dimension in which the source and destination differ, there is no need for the message to travel up the hierarchy. Therefore, its route to the destination will consist only of the down segment. The cRing routing function described above is connected because each ring at every level of the hierarchy is guaranteed to have at least one node with a link to the next higher level (if there exists one). This means that a message can traverse dimensions in increasing order starting from the lowest dimension (i.e., the dimension in which all rings are connected) until the desired dimension is reached. Conversely, each node at a given level of the hierarchy is part of exactly one ring at the next lower level (if there exists one). 
This means that a message can follow the minimal path as it travels down the hierarchy, traversing dimensions in decreasing order until the destination node is reached.

5.4.1 Preliminaries

Before a formal description of the cRing routing function and the proof of deadlock freedom are presented, some notation is provided below.

• Let m be a message with source s = (s_{n-1}, s_{n-2}, ..., s_2, s_1, s_0) and destination d = (d_{n-1}, d_{n-2}, ..., d_2, d_1, d_0).

• Let Δ = {Δ_{n-1}, Δ_{n-2}, ..., Δ_1, Δ_0} be the offset between the source s and destination d of m, such that Δ_i = (d_i - s_i) mod k, for all 0 ≤ i ≤ n-1.

• Let h be the highest dimension in which the source and the destination differ. That is, Δ_h ≠ 0 and either h = n-1 or Δ_{h′} = 0 for all h < h′ ≤ n-1.

• Let u = (u_{n-1}, u_{n-2}, ..., u_1, u_0) be the current node where m is queued. The offset between the current node and the destination is δ = {δ_{n-1}, δ_{n-2}, ..., δ_1, δ_0}, where δ_i = (d_i - u_i) mod k, 0 ≤ i ≤ n-1. For u = s, δ = Δ.

• Let χ be the highest dimension in which the current node u is connected.

5.4.2 cRing Routing Function

The cRing routing algorithm is shown in Figure 5.3. Messages are created with their direction set to up. The Route-Message(m) procedure first determines the correct direction of the message (using the Get-Direction(m) procedure). If the direction is down, the message is routed using e-cube routing, traversing dimensions in decreasing order such that δ_i is reduced to zero in each dimension i. If the direction is up, the message is routed toward the nearest node that connects in a higher dimension. At each node in the network, the output channel that takes a message toward the node that connects in a higher dimension is fixed and independent of the message destination. Therefore, up messages do not undergo routing like the down messages do; the output channel for all up messages is known a priori.

Procedure Get-Direction(m)
    if (direction == down) return (down);
    else if (direction == up)
        if (χ < h) return (up);
        else return (down);

Procedure Route-Message(m)
    0. if (u == d), consume message
    1. direction = Get-Direction(m)
    2. if (direction == down), route using the e-cube hop
    3. if (direction == up), route along dimension χ toward a node with χ′, where χ′ > χ

Figure 5.3: The cRing routing algorithm.

To avoid routing-induced deadlocks across rings, two virtual channels, VC0 and VC1, are used in the network. VC0 is reserved for messages travelling in the up direction and VC1 is reserved for messages travelling in the down direction.

Example: As an example, consider the routing of a message from source s = (0,1,1) to destination d = (2,3,2) in the 4-3-R cRing network, with R = {0001, 0001, 1111}, shown in Figure 5.4. The offset between the source and the destination for this message is Δ = {2,2,1}, and h = 2.

The path taken by the message is: s = (0,1,1) → u1 = (0,1,0) → u2 = (0,0,0) → u3 = (1,0,0) → u4 = (2,0,0) → u5 = (2,3,0) → u6 = (2,3,1) → d = (2,3,2).

At the source s, χ = 0 because the node (0,1,1) is connected only in the 0th dimension (i.e., the x dimension). This means that χ < h, resulting in the message direction being set to up. This causes the message to be routed toward u1, the nearest node that connects to a higher dimension. At u1, χ = 1 but is still less than h, so the message direction remains up. At u1, the nearest node with a higher χ value is u2, so the message is routed toward u2. At u2, χ = h = 2, so the message direction is changed to down. All subsequent hops are governed by the e-cube routing algorithm. The δ at u2 is {2,3,2}, so the message first makes two hops in the z+ direction (to u3 and u4); then, at u4, the message makes a hop in the y- direction (to reach u5); and finally, at u5, the message makes two hops in the x+ direction to reach the destination d. From s to u1 and from u1 to u2, the message travels on VC0. At u2, the message switches to VC1 and remains on VC1 until it is consumed at the destination.

Figure 5.4: Routing of a message in a 4-ary 3-cube cRing with R = {0001, 0001, 1111}.

It is important to note that if the cRing routing algorithm were to be used to route messages on a torus topology, the effective routing function would be e-cube, since h ≤ χ at each (injection) node.

5.4.3 Flow Control

Bubble flow control [CBGV97] is used on both virtual channels to avoid deadlocks within rings. Bubble flow control requires that new messages being injected into the network, as well as messages making a turn from one dimension to the other, satisfy the bubble condition before they can make progress. The bubble condition dictates that a message can be injected into a ring (through the injection port or after making a turn) if and only if, after the injection, there is at least one empty message buffer (i.e., a bubble) in the set of queues for the whole ring in the dimension and direction requested by the to-be-injected message. By guaranteeing that at least one bubble exists in each ring in each direction at all times, it can be assured that messages will continue to make forward progress within each ring. A formal proof of how bubble flow control guarantees deadlock freedom can be found in [CBGV], and a mechanism for starvation prevention can be found in [CBGV97].

5.4.4 Deadlock Freedom

We now prove that cRing routing as described above is deadlock-free using only two virtual channels and bubble flow control.

Lemma 1: Ring cycles in each dimension are broken by using bubble flow control.

Proof: Bubble flow control works on the principle that a message can be injected into a ring iff, after the injection of the message, there remains at least one empty message buffer (or bubble) in the ring. It has been proven that bubble flow control is sufficient to avoid within-ring deadlocks in torus networks [CBGV].

Lemma 2: The cRing routing algorithm is deadlock-free for messages traveling in the up direction.

Proof: For deadlock to occur, there has to be a cyclic dependency on virtual channels acquired by the messages involved in the deadlock. Messages going in the up direction use VC0 and can request resources either in the current dimension or a higher dimension. This results in a total ordering of resources that messages in VC0 can request. In the routing example given in Figure 5.4, this ordering is x+/x- → y+/y- → z+/z-. This total ordering leads to a channel dependency graph which is acyclic except for the ring cycles, and we have shown in the proof of Lemma 1 that those cycles are broken by bubble flow control.

Lemma 3: The cRing routing algorithm is deadlock-free for messages traveling in the down direction.

Proof: A corollary to the statement of Lemma 2 is that messages going in the down direction, using VC1, also cannot cause a deadlock to occur, as there is a total order in which resources can be requested by the down messages. Again using the example given in Figure 5.4, the order in which queues can be requested by messages going in the down direction is z+/z- → y+/y- → x+/x-.
5.4.4 Deadlock Freedom

We now prove that cRing routing as described above is deadlock-free using only two virtual channels and bubble flow control.

Lemma 1: Ring cycles in each dimension are broken by using bubble flow control.
Proof: Bubble flow control works on the principle that a message can be injected into a ring iff, after the injection of the message, there remains at least one empty message buffer (or bubble) in the ring. It has been proven that bubble flow control is sufficient to avoid within-ring deadlocks in torus networks [CBGV].

Lemma 2: The cRing routing algorithm is deadlock-free for messages traveling in the up direction.
Proof: For deadlock to occur, there has to be a cyclic dependency on virtual channels acquired by the messages involved in the deadlock. Messages going in the up direction use VC0 and can request resources either in the current dimension or in a higher dimension. This results in a total ordering of resources that messages in VC0 can request. In the routing example given in Figure 5.4, this ordering is x+/x− → y+/y− → z+/z−. This total ordering leads to a channel dependency graph which is acyclic except for the ring cycles, and we have shown in the proof of Lemma 1 that those cycles are broken by bubble flow control.

Lemma 3: The cRing routing algorithm is deadlock-free for messages traveling in the down direction.
Proof: A corollary to the statement of Lemma 2 is that messages going in the down direction, using VC1, also cannot cause a deadlock to occur, as there is a total order in which resources can be requested by the down messages. Again using the example given in Figure 5.4, the order in which queues can be requested by messages going in the down direction is z+/z− → y+/y− → x+/x−. This results in an acyclic dependency graph with the same exception as stated in the proof of Lemma 2. Therefore, the reasoning of Lemma 2 holds for Lemma 3.

Theorem: The cRing routing algorithm is deadlock-free.
Proof: Routing of up messages (VC0) is deadlock-free (Lemma 2) and routing of down messages (VC1) is deadlock-free (Lemma 3). Hence, the only way the cRing routing function could be deadlock-susceptible is if there can be a cyclic dependency across messages in VC0 and VC1. Up messages can request resources reserved for the down messages when they switch their direction (i.e., there is a direct dependency from VC0 to VC1), but down messages always sink and cannot request resources reserved for up messages (i.e., there is no dependency from VC1 to VC0). As dependencies between VC0 and VC1 resources are acyclic, the cRing routing function is deadlock-free.

5.5 Characterization of cRing Networks

Unlike k-ary n-cube tori, cRing networks are not regular (i.e., all nodes do not have the same degree), and they possess neither edge nor node symmetry. This lack of regularity and symmetry makes the derivation of generalized expressions for analytical performance metrics such as average and maximum distance, network capacity and maximum channel load very difficult. Below, we present approximate measures of average distance, maximum distance and bisection bandwidth.

5.5.1 Average Distance

In formulating the average distance (or average hop count) for a cRing network, we exploit the hierarchical arrangement of the network and the up and down segments of a packet's path. To facilitate the derivation, we make two simplifying assumptions which are later removed. First, we assume that k is even. Second, it is assumed that the number of rings connecting dimension i to dimension i+1 is one, for all 0 ≤ i < n−1. In terms of the notation described in Section 5.3, this means that the number of 1's in each string r[i] is exactly one. The average distance between two nodes in a 2D cRing network (γ_2D) can be approximated by the following expression.

\gamma_{2D} = \frac{2k}{4} + \frac{k}{4}\left(1 - \frac{k}{k^2}\right) \qquad (5.1)

In a torus network, the average distance between any two nodes along the shortest path is given by nk/4. In cRing networks, once a packet reaches the highest level in the hierarchy in which its source and destination nodes differ, it can take the shortest path downwards. This means that the average distance for the downward segment of each packet's path is equal to nk/4. However, the upward segment of a packet's path adds additional distance to the total average distance in a cRing network. In Equation 5.1, we add this additional distance by taking the product of the average distance from any given node to the global ring node in each local ring (k/4) and the fraction of traffic that must exit the local ring (1 − k/k²), where k is the number of nodes in each ring and k² is the total number of nodes in a 2D cRing network. Generalizing Equation 5.1 to n dimensions gives us the following expression for average distance.

\gamma_{nD} = \frac{nk}{4} + \sum_{i=1}^{n-1} \frac{k}{4}\left(1 - \frac{k^i}{k^n}\right) \qquad (5.2)

Equation 5.2 considers cRing networks with only one ring in each dimension i, where i > 0. To generalize this expression further, cRing networks with more than one ring in each dimension (above dimension 0) must also be considered. This complicates the analysis, and we therefore approximate it by the following expression.

\gamma_{nD} \simeq \frac{nk}{4} + \sum_{i=1}^{n-1} \frac{k}{2 \cdot 2^{numR_i}}\left(1 - \frac{k^i}{k^n}\right) \qquad (5.3)

In Equation 5.3, the number of rings in dimension i is given by numR_i, and the expression assumes that the numR_i rings are maximally apart. Finally, in the above expressions, k/4 can be substituted by k/4 − 1/(4k) if k is odd.
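The closed-form approximation of Equations 5.2 and 5.3 can be evaluated directly; the sketch below does so under the stated assumptions (rings maximally apart, one term per dimension above 0). It is an estimate only, and configuration-specific values need not coincide with it exactly.

```python
# Sketch of the average-distance approximation (Equations 5.2 and 5.3) for a
# k-ary n-cube cRing with num_rings[i] maximally-spaced rings in dimension i,
# i = 1 .. n-1. Closed-form estimate only.

def ring_term(k):
    """Average distance around one k-node ring: k/4, or k/4 - 1/(4k) for odd k."""
    return k / 4 if k % 2 == 0 else k / 4 - 1 / (4 * k)

def avg_distance_cring(k, n, num_rings):
    """Approximate average hop count of a k-ary n-cube cRing (Equation 5.3)."""
    gamma = n * ring_term(k)                       # down (e-cube) segment, as in a torus
    for i in range(1, n):
        up_term = ring_term(k) / (2 ** (num_rings[i] - 1))   # distance to the nearest
        gamma += up_term * (1 - k**i / k**n)                  # connecting node, weighted
    return gamma                                              # by traffic leaving the ring

# Example: a 4-ary 2-cube with a single global ring (R = {0001, 1111}).
print(avg_distance_cring(4, 2, {1: 1}))   # 2.0 + 1.0 * (1 - 4/16) = 2.75
```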
5.5.2 Maximum Distance

Similarly, the maximum distance, or diameter, between two nodes can be approximated by the following expression.

\gamma_{max} \simeq \frac{nk}{2} + \sum_{i=1}^{n-1} \frac{k}{2^{numR_i}}\left(1 - \frac{k^i}{k^n}\right) \qquad (5.4)

5.5.3 Bisection Bandwidth

The hierarchical nature of cRing networks implies that, regardless of the configuration, the least-connected cut of the graph representing the network will always be along the highest dimension. Therefore, the bisection bandwidth of a cRing network is determined by the number of connected rings in the highest dimension, i.e., numR_{n-1} in the notation above.

5.6 Performance Evaluation

We evaluate the performance of cRing networks in various configurations and compare their performance to on/off networks proposed in the literature. But first, we quantify the power-bandwidth trade-off offered by cRing networks. To do so, we investigate the effect of adding/removing global rings on the average distance of a cRing network. Given that this work is focused on on-chip networks, we limit our analysis to 2D cRing networks, but the analysis can easily be extended to higher-dimensional networks.

Figure 5.5: Average distance for 16- and 64-node cRing networks in all configurations, with optimally placed global rings.

Figure 5.5 plots the average hop count of 16- and 64-node cRing networks in all possible (optimal) configurations. An optimal configuration is one in which the global rings are spaced maximally apart to reduce average distance. The plot shows an important feature of cRing networks: there is a significant decrease in average distance when the number of global rings is increased from 1 to 2. However, as the number of global rings is increased further, there is a diminishing effect on average distance. For example, the 64-node cRing network with only four global rings has only 2.75% higher average distance than the 64-node torus. This trend is less pronounced in smaller networks (e.g., the 16-node network also characterized in the plot) but becomes more pronounced in larger networks, as shown in Figure 5.6.

Figure 5.6 shows an important trend that fundamentally motivates cRing topologies: the number of global rings necessary to keep the average distance within 5% of the average distance of a torus network is insensitive to the size of the network. With only four of the 16 available global rings connected, a 16×16 (256-node) cRing network has an average distance of 8.3, which is only 3.75% higher than the average distance of the 16-ary 2-cube torus (average distance = 8). With only 6 global rings (i.e., with 31% of the links turned off), the increase in average distance drops to a mere 1.65%. This shows that additional global rings contribute mainly to increasing network bandwidth, rather than to reducing average distance. Therefore, by turning off these global rings when the bandwidth demands of the network permit it, a significant reduction in network (static) power consumption is achieved. Figure 5.7 illustrates this more clearly by plotting the percentage of network links that are turned off, as a fraction of the total number of links in the 16-ary 2-cube torus, along with the percentage increase in average distance from the equivalent torus.

Figure 5.6: Percentage increase in average distance from the torus network in cRing networks of different sizes and different numbers of global rings.

Finally, throughout this section we have assumed cRing networks with optimally placed global rings. Optimal placement, as mentioned earlier, implies that intermediate and global rings are placed maximally apart. We studied the impact of sub-optimal placement of global rings on the average distance in cRing networks and found that, in a 64-node cRing network, worst-case placement of global rings increases the average distance by up to 17%. This points to the importance of spacing global rings maximally apart.

Figure 5.7: Percentage of network links turned off and average distance for 256-node cRing networks in all configurations.

5.6.1 Simulation Results

The simulation environment used in this study is based on the flit-level network simulator SICOSYS (SImulator of COmmunication SYStems) [PGB02]. SICOSYS is a detailed network simulator written in C++ that incorporates key parameters of the low-level implementation and provides results close to those from Verilog/VHDL simulators, but at lower computational cost. In a previous study, it was shown that, compared to an RTL description of a router, results from SICOSYS simulations had less than 4% error in latency and even less in throughput. This accuracy came with up to 45× speed-up over RTL simulation. Simulations presented here use the basic architecture of a 4-stage Bubble Adaptive Router [PBG+99] with a link latency of one cycle. Two-flit packets were assumed, and simulations were run for 100,000 cycles with a warm-up period of 10,000 cycles. Three synthetic traffic patterns were used in these experiments: uniform random traffic, perfect shuffle traffic, and local (near-neighbor) traffic where packets are sent only to the nodes due east, west, north and south of the injecting node. The injected load rate was normalized to the bisection bandwidth of a torus network, with values ranging over [0,1]. A network was assumed to be saturated when the average packet latency exceeded 2× the zero-load latency.

We compare the performance of cRing networks with on/off networks based on the West-Last, East-Last (WLEL) routing algorithm proposed in [SP07]. Figure 5.8 shows the performance of the following five 16-node networks: (a) a 2D torus network; (b) a 4-ary 2-cube cRing network with R = {0001,1111}; (c) with R = {0101,1111}; (d) with R = {0111,1111}; and finally (e) a 16-node 2D torus network that uses the West-Last, East-Last routing algorithm proposed in [SP07] with the maximum number of links turned off. The average packet latency of a cRing network with two global rings is about 11% higher than that of a torus under uniform random traffic, 10% higher under perfect shuffle traffic and 16.7% higher under local traffic patterns. Given that only 12.5% of the router segments are turned off in this configuration, the power-performance (latency) trade-off is linear. The results indicate that cRing networks are not well-suited for smaller networks, as the latency penalty of turning off rings is quite high.

Figure 5.9 shows results for a network size of 64 nodes. Three cRing configurations were simulated, with one, two and three global rings, respectively.
With only three global rings active (i.e., with almost 30% of the router segments turned off), the latency overhead is only 10.6% with uniform random traffic, 11.3% with perfect shuffle traffic and 16.5% with local traffic patterns. It is important to note here that the simulator assumes the same router delay in 5-port and 3-port mode. A dimensionally-partitioned router can be designed to have lower router delay in 3-port mode than in 5-port mode. While the focus of this chapter is not on the design of a partitionable router, the results presented here show that even a modest reduction in router delay in the 3-port configuration can make the latency overhead almost negligible for a 64-node network, even with just two or three global rings. The torus network which uses WLEL routing performs well under perfect shuffle traffic because this traffic pattern is well matched to the set of on links. For all other traffic patterns, however, the torus network using WLEL performs worse than the cRing network with three global rings.

Finally, Figure 5.9 also shows the throughput penalty of cRing networks. Under uniform random traffic, for example, while the latency penalty for a cRing network with three global rings is lower than that for a torus with WLEL routing, the cRing network has lower saturation throughput. This, however, is not problematic because, as the demand for network bandwidth increases, the polymorphic cRing network will turn on additional global rings, thus increasing the maximum throughput, until all global rings have been turned on and the network is restored to a fully-connected torus topology. In other words, a polymorphic network only operates in the degraded cRing mode when the throughput needs of the application are well below the throughput available in the torus configuration.

5.7 Summary

An ideal on-chip network would consume power only when it is delivering packets. However, in practice, such efficiency is difficult to achieve, primarily because of the delay and energy overhead of turning network resources on and off. Practical on-chip networks have, therefore, used only fine-grain dynamic power management techniques that can conserve power only during short periods of inactivity on a link-by-link basis. These techniques are not suited to take advantage of significant variations in bandwidth needs across applications. Coarse-grain power management approaches, such as the one enabled by the polymorphic cRing networks described in this work, allow for greater savings, as a significant number of network resources can be turned off for prolonged periods of time. However, in order to be effective, these techniques must maintain low latency even when the network is in the low-power state(s), and they must provide high throughput during normal operation.

This work lays the foundation for such an effective coarse-grain power management approach by presenting a flexible polymorphic topology which can be used to trade off power for bandwidth without significantly increasing average packet latency. The definition of the topology presented in this work is general enough to accommodate 3D die stacks and heterogeneous core sizes. The routing algorithm is also formalized, and proven to be correct and deadlock-free for the general case. Also in this work, through analytical and experimental performance evaluation, key advantages of the proposed cRing topology are shown, and important cost-performance trade-offs are highlighted.
Figure 5.8: Performance results comparing a 16-node two-dimensional torus network, a cRing network, and a torus network that uses West-Last, East-Last routing, under (a) random uniform, (b) perfect shuffle and (c) local traffic with radius 4.

Figure 5.9: Performance results comparing a 64-node two-dimensional torus network, a cRing network and a torus network that uses West-Last, East-Last routing, under (a) random uniform, (b) perfect shuffle and (c) local traffic with radius 4.

Chapter 6
Dynamic Reconfiguration of Cubic Ring Networks

A Polymorphic Cubic Ring network (pcRing) is comprised of a k-ary n-cube torus physical topology. Logically, the pcRing network can operate as a k-ary n-cube torus, or as one of the configurations of the Cubic Ring network (cRing) [ZDP10]. This polymorphic topology provides a simple yet flexible infrastructure for on-demand bandwidth provisioning in on-chip networks. In this chapter, we describe a methodology to dynamically morph a pcRing network from the torus topology to a cRing configuration, from one cRing configuration to a different cRing configuration, and from a cRing configuration back to the torus. The router architecture and network management mechanism necessary to do so are explored, and the power savings are quantified.

A cRing network is obtained from a torus network by hierarchically removing (i.e., turning off) selected rings in all but one dimension. The resulting logical topology is a hierarchical arrangement of torus rings, where each node is connected to at least one ring at the lowest level of the hierarchy, and each ring is connected to at least one ring in the next higher level, if one exists. Stated formally, a torus network is a graph I_T(N_T, C_T), where the vertices N_T represent the nodes (routers) in the network, and the edges C_T represent the communication channels (links) connecting the nodes. A cRing network I_C(N_C, C_C), derived from I_T, has N_C = N_T and C_C ⊂ C_T. Each physical channel c_i ∈ C_T is shared by v virtual channels.

In this chapter, we restrict the discussion to 2D torus and cRing networks, since they offer the simplest and most natural embedding on a 2D die. In the following sections, network nodes that are connected in the Y-dimension are referred to as "global ring nodes," and nodes that are connected only in the X-dimension are referred to as "local ring nodes."

The rest of the chapter is organized as follows. The two routing functions used by the pcRing network are described in Section 6.1. The dynamic reconfiguration protocol that enables polymorphism in pcRing networks is presented in Section 6.2, along with the proof of its deadlock freedom in Section 6.3.
Finally, a 5-port pcRing router was designed and synthesized, and the results are presented in Section 6.4.

6.1 Routing Function

Physically, a pcRing network is comprised of a k-ary n-cube torus topology. Logically (i.e., from the routing standpoint), however, the network operates in two distinct modes: the torus mode, in which all channels (network links) are on, and the cRing mode, in which the on channels form one of the configurations of the cRing topology. The routing function in effect in the torus mode is referred to as R_torus and is defined over the set of channels C_T: R_torus : C_T × P → C_T, where P is the set of end nodes (processing elements) connected to the network nodes (routers). The routing function in effect in the cRing mode is referred to as R_cRing and is defined over the set of channels C_C: R_cRing : C_C × P → C_C. Both R_torus and R_cRing use two sets of virtual channels, namely virtual channel 0 (VC0) and virtual channel 1 (VC1), associated with their respective sets of channels. In the case of torus channels, these sets of virtual channels are denoted C_T^0 for the set of VC0 associated with C_T, and C_T^1 for the set of VC1 associated with C_T. Similarly, in the case of cRing channels, VC0 channels are denoted by C_C^0 and VC1 by C_C^1.

6.1.1 Torus Routing: R_torus

In the torus mode, messages are routed using a hybrid routing function that combines a deterministic e-cube routing subfunction on the escape virtual channels (C_T^1) with fully-adaptive minimal routing on the adaptive virtual channels (C_T^0). Messages are injected into the adaptive virtual channels and, whenever both virtual channels are available, preference is given to the adaptive route. Messages can switch from the adaptive virtual channel to the escape virtual channel at any time, but not the other way around. That is, once a message uses a channel from the set C_T^1, it must continue using resources from the set C_T^1 until it sinks at the destination. The R_torus routing function is similar to the one proposed for the bubble adaptive router in [PBG+99].

6.1.2 cRing Routing: R_cRing

In the cRing mode, messages are routed using the cRing routing function, R_cRing, as discussed in Chapter 5. Messages are created with their direction set to up. The routing unit at each router first determines the correct direction of the message (represented by the Get-Direction(m) procedure in Figure 5.3). If the direction is down, the message is routed using e-cube routing, traversing dimensions in decreasing order. If the direction is up, the message is routed toward the nearest node that connects in a higher dimension. At each node in the network, the output channel that takes a message toward the node that connects in a higher dimension is fixed and independent of the message destination. Therefore, up messages do not undergo routing like the down messages do; the output channel for all up messages is known a priori. Messages going up use VC0, whereas messages going in the down direction use VC1. As a result, the effective routing function for messages using VC1 is always e-cube, regardless of the mode in which the pcRing network is operating.
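As an illustration of the torus-mode policy in Section 6.1.1, the sketch below selects an output and virtual channel for a message, preferring a free adaptive (VC0) minimal output and falling back to the e-cube escape channel (VC1). minimal_outputs, ecube_output and vc_free are assumed stand-ins for the router's route-computation and credit logic, not the dissertation's implementation.

```python
# Sketch of torus-mode output selection (Section 6.1.1): fully adaptive
# minimal routing on VC0 with a deterministic e-cube escape path on VC1.

def select_output_torus(msg, u, d, k, minimal_outputs, ecube_output, vc_free):
    """Return (output_port, vc) for a message, or None if no credit is free."""
    if msg['vc'] == 1:
        # A message that has entered the escape channel keeps using
        # e-cube routing on VC1 until it sinks at its destination.
        return (ecube_output(u, d, k), 1)
    # Prefer any minimal adaptive output whose VC0 can accept the message.
    for port in minimal_outputs(u, d, k):
        if vc_free(port, 0):
            return (port, 0)
    # Otherwise fall back to the escape virtual channel (allowed at any time).
    port = ecube_output(u, d, k)
    if vc_free(port, 1):
        msg['vc'] = 1
        return (port, 1)
    return None   # no credit available this cycle; retry later
```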
6.2 Dynamic Reconfiguration of Cubic Ring Networks

In reconfiguring a network from one routing function to another, there are two possible approaches: static reconfiguration and dynamic reconfiguration. In the static approach, the network is drained of messages before its routing function is updated. This is accomplished by stopping injection and letting all the messages in the network either sink at their destinations or be dropped at their current location (and later retransmitted). If performed during the execution of an application, static reconfiguration can have a significant effect on application performance, and it can add new complexity to networks with hard link-level flow control, which do not require retransmission for any other purpose. Examples of static reconfiguration approaches include [RS91] and [TBG+97].

In the dynamic reconfiguration approach, message injection is not halted and no routable messages are dropped from the network before, during or after the reconfiguration. As a result, dynamic reconfiguration does not significantly affect application performance. However, a network undergoing dynamic reconfiguration is susceptible to reconfiguration-induced deadlocks, even if both the old and the new routing functions are deadlock-free. Reconfiguration-induced deadlocks can occur when messages routed under the influence of the old routing function form a cyclic dependency with messages routed under the influence of the new routing function.

To avoid reconfiguration-induced deadlock, we propose a dynamic reconfiguration protocol for polymorphic cRing networks that exploits the similarities between the two routing functions, R_torus and R_cRing, to provide a deadlock-free path for existing and newly injected messages throughout the reconfiguration process. This path serves as an escape path, allowing blocked messages to escape potential deadlock scenarios and sink at their destination.

6.2.1 Definitions

Dynamic reconfiguration of a routing function can be defined as a change from the old routing function to the new routing function through a sequence of steps that must be completed and conditions that must be fulfilled [PPD03]. Each step comprises a change in the routing function of the network that is assumed to be performed atomically at each router. These steps are sequentialized at each router through the conditions, which must be satisfied before each router can proceed to the next step. The reconfiguration steps are executed asynchronously across the network, unless a condition imposes the requirement of network-wide synchronization before the next step.

6.2.2 R_torus to R_cRing Reconfiguration

Cond 1: A reconfiguration notification is received.

Step 1: The routing function is changed from R_torus to R_torus'. Under R_torus', the adaptive routing of R_torus is turned off and all messages use the deterministic e-cube routing function. R_torus' is active on all C_T channels and only supplies VC1; i.e., R_torus' : C_T × P → C_T^1. Messages in VC0 that have already undergone routing complete their traversal through the router and are subjected to R_torus' at the next hop. This results in a speedy drainage of VC0 along all C_T channels.

Cond 2: VC0 has been drained of all messages across the entire network.

Step 2: The routing function is changed from R_torus' to R_cRing' at each router in the network. For messages queued at ports supplied by R_cRing (i.e., ports that are on in the cRing configuration), R_cRing' supplies the same outputs as R_cRing. For messages queued at ports not supplied by R_cRing, R_cRing' supplies the outputs that were supplied by R_torus'. As a result, messages injected after this routing function update only use channels that are available in the cRing mode, while existing messages queued at channels that were available in the torus mode continue to make progress toward their destination. Newly injected messages use VC0 for the up segment of their route and VC1 for the down segment, as supplied by R_cRing, whereas the existing messages only use VC1, as supplied by R_torus'.

Cond 3: Channels not supplied by R_cRing are drained of all messages.

Step 3: The routing function is changed from R_cRing' to R_cRing. Once Step 3 is complete across the network, physical channels not supplied by R_cRing are turned off, along with the respective input and output ports connected to them. The reconfiguration is complete and the network now operates in the cRing mode.
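The per-router view of this three-step sequence can be summarized as follows. The routing-function names and the condition predicates (supplied by whatever network-management mechanism distributes Cond 1 to Cond 3) are illustrative placeholders, not the actual router implementation.

```python
# Per-router sketch of the R_torus -> R_cRing reconfiguration sequence
# (Section 6.2.2). `router` and `conds` are illustrative abstractions.

def reconfigure_torus_to_cring(router, conds):
    conds.wait('notification')            # Cond 1: reconfiguration notification received

    # Step 1: R_torus -> R_torus'. Adaptive routing is disabled; all traffic
    # uses deterministic e-cube on VC1 over the full torus channel set C_T.
    router.set_routing('R_torus_prime')

    conds.wait('vc0_drained')             # Cond 2: VC0 empty network-wide

    # Step 2: R_torus' -> R_cRing'. Ports that stay on in the cRing
    # configuration route as in R_cRing; ports that will be turned off keep
    # routing as in R_torus' so already-queued messages still make progress.
    router.set_routing('R_cRing_prime')

    conds.wait('off_channels_drained')    # Cond 3: off-going channels empty

    # Step 3: R_cRing' -> R_cRing; then power-gate the off channels and the
    # input/output ports attached to them.
    router.set_routing('R_cRing')
    router.power_off_unused_ports()
```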
6.2.3 R_cRing to R_torus Reconfiguration

Before the routing function is changed from R_cRing to R_torus, all physical channels not supplied by R_cRing are turned on.

Cond 1: A reconfiguration notification is received.

Step 1: The routing function is changed from R_cRing to R_torus across the network. Messages traversing the up segment of their route can use VC0 for fully adaptive routing, or switch to VC1 for e-cube routing, as supplied by R_torus. Messages traveling in the down direction continue to use e-cube routing on VC1 until they reach their destination. The reconfiguration is complete and the network now operates in the torus mode.

6.2.4 R_cRing1 to R_cRing2 Reconfiguration

Reconfiguring the network from one cRing configuration to another involves either turning non-local links on or turning them off. To illustrate both cases, we assume that R_cRing1 has fewer global rings on than R_cRing2. Therefore, R_cRing1 to R_cRing2 reconfiguration requires turning on one or more global rings. Before the routing function is changed from R_cRing1 to R_cRing2, channels not supplied by R_cRing1 that are available in R_cRing2 are turned on.

Cond 1: A reconfiguration notification is received.

Step 1: The routing function is changed from R_cRing1 to R_cRing2 across the network. Both up and down messages can make use of the newly enabled channel(s). The reconfiguration is complete and the network now operates in the cRing2 mode.

6.2.5 R_cRing2 to R_cRing1 Reconfiguration

Reconfiguring the network from a more densely connected cRing configuration (cRing2) to a more sparsely connected one (cRing1) involves turning off one or more non-local rings. However, before these links can be turned off, they have to be drained of all packets to ensure that no packet is discarded.

Cond 1: A reconfiguration notification is received.

Step 1: The routing function is changed from R_cRing2 to R_cRing2' across the network. Under R_cRing2', messages queued at channels supplied by R_cRing1 use the new routing function, whereas messages queued at channels not connected in R_cRing1 continue to use the old routing function until they reach their destination.

Cond 2: Channels not supplied by R_cRing1 are drained of all messages.

Step 2: The routing function is changed from R_cRing2' to R_cRing1. Once Step 2 is complete across the network, physical channels not supplied by R_cRing1 are turned off, along with the respective input and output ports connected to them. The reconfiguration is complete and the network now operates in the cRing1 mode.
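The drain check behind Cond 2 above is the only non-trivial part of this last case; a sketch of it, assuming a per-channel occupancy interface, is shown below.

```python
# Sketch of the drain check behind Cond 2 of Section 6.2.5: a global ring that
# is on in cRing2 but off in cRing1 may be power-gated only once every
# virtual-channel buffer on its channels is empty. `channel.occupancy(vc)`
# is an assumed interface, not the dissertation's.

def ring_drained(ring_channels, num_vcs=2):
    """True when no flit remains buffered on any channel of the ring."""
    return all(ch.occupancy(vc) == 0 for ch in ring_channels for vc in range(num_vcs))

def maybe_power_off(ring_channels, power_gate):
    """Turn the ring off only after the new routing function has stopped
    supplying it and all in-flight messages have left its buffers."""
    if ring_drained(ring_channels):
        for ch in ring_channels:
            power_gate(ch)
        return True
    return False
```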
6.3 Proof of Deadlock Freedom

R_torus and R_cRing have individually been shown to be deadlock-free: in [PBG+99] it was shown that R_torus is deadlock-free when implemented with bubble flow control, and in [ZDP10] the cRing routing function (R_cRing) was proven to be deadlock-free. In this section, we prove that the reconfiguration protocols for R_torus to R_cRing reconfiguration and R_cRing to R_torus reconfiguration are also deadlock-free. To do so, we use the framework for deadlock-free dynamic reconfiguration outlined in [LPD05].

6.3.1 Sufficient Conditions for Deadlock-free Reconfiguration

In [LPD05], four sufficient conditions for deadlock-free dynamic reconfiguration are defined. These conditions are based on the view of a reconfiguration process as a sequence of "adding" phases, in which one or more new routing choices are added to the existing routing function, and "removing" phases, in which one or more routing choices are removed from the existing routing function. The sufficient conditions are:

1. The prevailing routing function is connected and deadlock-free at all times.
2. During an Adding Phase, no routing choice may be added to any switch that closes a cycle of dependencies on escape resources.
3. During a Removing Phase, no routing choice may be removed from any switch before it is known that there will be no packet needing this routing choice, either to proceed or to enter escape resources.
4. All potential ghost dependencies are removed from the network before a transition from a Removing Phase to an Adding Phase.

6.3.2 Deadlock Freedom of R_torus to R_cRing Reconfiguration

The protocol for R_torus to R_cRing reconfiguration defined in Section 6.2.2 involves one Adding Phase and two Removing Phases. In this section, we show that the sufficient conditions are met at each of the steps during R_torus to R_cRing reconfiguration.

Lemma 1: The R_torus' routing function is connected and deadlock-free.
Proof: As described in Section 6.2.2, R_torus' provides e-cube routing using VC1 on the physical channels C_T. e-cube routes messages in decreasing order of dimensions such that, in each dimension i, the message is routed in that dimension until it reaches a node whose address matches the destination address in the i-th position. Since all C_T channels are on when R_torus' is the only routing function in effect, the network is connected. e-cube has been shown to be deadlock-free if cycles within the torus rings can be prevented [DS87]. In R_torus', these cycles are prevented through the use of bubble flow control, which does not allow messages that are trying to enter a ring to block indefinitely the progress of messages already in the ring (Lemma 1, Section 5.4.4). Hence, R_torus' is connected and deadlock-free.

Lemma 2: The R_cRing' routing function is connected and deadlock-free.
Proof: R_cRing' has two routing sub-functions: R_torus' for messages in the C_T − C_C channels and R_cRing for messages in the remaining on channels (i.e., C_C). Messages in the C_T − C_C channels can enter C_C channels, but messages in C_C cannot enter C_T − C_C resources. Therefore, an acyclic dependency exists between the two routing sub-functions. Individually, both routing sub-functions have been shown to be connected and deadlock-free (Section 5.4.4 and Lemma 1).

Lemma 3: R_torus to R_torus' reconfiguration is deadlock-free.
Proof: The change in routing function from R_torus to R_torus' is a Removing Phase because, as a result of this change, routing choices are restricted. Messages can no longer use VC0, so the dependency from VC0 to VC1 is removed.
VC1 continues to provide a connected and deadlock-free escape path to all messages, and every message undergoing routing can access VC1. Therefore, this step satisfies the (third) sufficient condition.

Lemma 4: R_torus' to R_cRing' reconfiguration is deadlock-free.
Proof: R_torus' to R_cRing' reconfiguration is an Adding Phase, as R_cRing' allows paths that were not allowed under R_torus', namely the up routes of the cRing topology. The old routing function in this case, R_torus', is only defined on one set of virtual channels, the escape virtual channels (VC1). After the reconfiguration, messages routed on the escape virtual channel continue to use e-cube routing, according to the definition of R_cRing'. Therefore, VC1 remains deadlock-free. Draining VC0 before this reconfiguration step (Cond 2) ensures that no ghost dependencies exist between packets using the old routing function in VC0 and those using the new one. Therefore, R_torus' to R_cRing' reconfiguration is deadlock-free.

Theorem: R_cRing' to R_cRing reconfiguration is deadlock-free.
Proof: R_cRing' to R_cRing reconfiguration is a Removing Phase because paths for packets using rings in higher dimensions that are off in the cRing configuration are removed from active routing. However, before these routing choices are eliminated, the affected rings are checked for drainage of all existing packets. Therefore, R_cRing' to R_cRing reconfiguration is deadlock-free.

6.3.3 Deadlock Freedom of R_cRing to R_torus Reconfiguration

Theorem: R_cRing to R_torus reconfiguration is deadlock-free.
Proof: R_cRing to R_torus reconfiguration adds new paths to the routing choices. These paths can be used by the messages in the network either through adaptive routing on VC0 (susceptible to deadlocks) or through e-cube routing on VC1, which remains deadlock-free under all conditions. It is possible that messages in VC0 form a cycle of dependencies, but at least one of the messages involved in the cycle will have the option of entering the escape virtual channel, i.e., VC1. Therefore, using Duato's condition for deadlock-free adaptive routing [DP01], the network remains deadlock-free.

6.4 Implementation and Evaluation

In this section, we quantify the savings in network power that can be realized from turning off rings in a cRing topology. Turning off global rings lowers static power in two ways: turned-off links save leakage power in link drivers, and input and output ports associated with these turned-off links can also be power-gated to reduce static power in routers. Since all links are symmetric, link power savings are simply proportional to the percentage of links turned off, so little further investigation is necessary. To quantify static power savings in routers, we implemented a polymorphic router in Verilog HDL and synthesized it using Synopsys Design Compiler, targeting TSMC 90nm technology. The router has 2 virtual channels, a flit size of 64 bits and 8-flit input buffers at each port. In the 3-port configuration, the input and output ports associated with the off global rings are power-gated.

To quantify the upper limit on the potential power saving, we synthesized (non-reconfigurable) 5- and 3-port routers. Table 6.1 shows the area and power consumption of the 5- and 3-port routers.

Parameter            5-Port Config    3-Port Config    Diff
Cell Area (μm²)      183,000          105,344          −42.4%
Static Power (mW)    54.39            33.37            −38.6%

Table 6.1: Comparison of active area and static power of 5-port and 3-port routers in 90nm.
6.4.1 Power-gated Router Design

To quantify the overhead of power-gating in terms of router area and leakage power, we synthesized a 5-port router design with power-gating using Synopsys Design Compiler. Table 6.2 compares the two designs. The power-gated design incurs only a modest overhead: +2.9% in cell area and +0.4% in static power.

Parameter            5-Port Without Power-gating    5-Port With Power-gating    Diff
Cell Area (μm²)      4,666,209                      4,802,262                   +2.9%
Static Power (μW)    121.4059                       121.8879                    +0.4%

Table 6.2: Comparison of active area and static power of a 5-port router without power-gating and with power-gating in 90nm.

Aggregating the per-router power over the entire network highlights the power-performance trade-off afforded by cRing networks. A 4-ary, 2-cube cRing network with two global rings has 19.3% lower static power consumption than the equivalent torus network. This saving, however, comes at the expense of about 9% higher average distance. Increased average distance indicates not only higher average latency but, more importantly in this case, increased dynamic power. As the size of the network increases, however, the trade-off looks very different and more favorable for the cRing network. For example, for an 8-ary, 2-cube cRing network, a 19.3% reduction in static power can be achieved with 3 global rings and only a modest 2.75% increase in the average distance.
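The network-level figure quoted above follows directly from Table 6.1 once the router mix of a configuration is known. The short calculation below illustrates this for 2D cRing networks, under the assumption that nodes on an active global ring keep the full 5-port router while all other nodes operate as 3-port routers; link leakage is not included here.

```python
# Worked estimate of the network-level static-power saving, combining the
# per-router figures of Table 6.1 with the router mix of a configuration.

P_5PORT_MW = 54.39   # Table 6.1, 5-port configuration
P_3PORT_MW = 33.37   # Table 6.1, 3-port configuration

def static_power_saving(k, global_rings):
    """Fractional router static-power saving of a k-ary 2-cube cRing vs. the torus."""
    total = k * k
    five_port = global_rings * k          # nodes that sit on an active global ring
    three_port = total - five_port
    cring = five_port * P_5PORT_MW + three_port * P_3PORT_MW
    torus = total * P_5PORT_MW
    return 1 - cring / torus

# 4-ary 2-cube with two global rings: ~19.3%, matching the figure quoted above.
print(f"{static_power_saving(4, 2):.1%}")
```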
Chapter 7
Conclusions

Energy efficiency and resilience are emerging as the two most challenging problems in high-performance computing systems. Both of these challenges cut across all levels of system design, and unless innovative solutions are brought to bear on both of them, future systems will not be able to provide the performance improvements and availability guarantees that will be required to tackle ever-growing computational problems.

An important component of providing resilience against failure in very large systems is the ability of the system to reconfigure itself when a component fails. Brute-force methods like redundancy have to be used wisely, as they incur significant energy cost. Networks especially must avoid adding more (power) cost to the system to provide resilience through redundancy, since most networks already have redundant paths between end nodes. This inherent redundancy can be exploited by reconfiguring the network in the event of failure in links or nodes. Our implementation of a dynamic reconfiguration scheme for InfiniBand networks demonstrates how a carefully designed reconfiguration scheme can improve resilience without requiring additional features in the network. Given that none of the current IBA-compliant devices allow for dynamic reconfiguration, our work provides useful, practical insights into improving the state of the art.

With regard to energy efficiency, if current technology trends hold, it appears that the energy for data transfer will dwarf the energy for computation in many applications of practical significance in large-scale systems. This observation has implications across system design, and it adds further weight to the growing consensus that power dissipation in the communication fabric, at every level of the system hierarchy, is a first-order design issue. One approach that has been proposed for managing the energy problem is to do away with the design of a "balanced node" with a fixed power budget for each element (FPU, L1/L2 cache, on-chip network, DRAM, etc.) of the node, in favor of an "adaptively balanced node" [Pet08]. In an adaptively balanced node, each level of the node hierarchy is designed to consume almost all of the node power by itself. To prevent over-subscription, total node power is monitored and elements of the hierarchy are throttled (say, by inserting noop instructions) when the power consumption exceeds a threshold. Such a power-adaptive organization has the advantage of meeting performance targets across a wider range of applications without exceeding the system power budget.

Power dissipation in on-chip interconnects will necessarily be a significant component of the overall picture. Therefore, it is important to consider on-chip network architectures that enable adaptively balanced node design by operating at multiple power-performance points. More specifically, dynamically reconfigurable (or polymorphic) on-chip networks that are able to trade off network bandwidth for network power, at low overhead, can offer a viable alternative to existing voltage- and frequency-scaling-based schemes. The Polymorphic Cubic Ring network that we propose in this work is a step in this direction. By designing the topology, routing algorithm and reconfiguration scheme with an eye toward the power-bandwidth trade-off, this work provides insights into a power reduction methodology that can work in conjunction with other power reduction schemes.

7.1 Future Work

A natural extension of our Double Scheme implementation work is to move from simulation to actual hardware or software implementation of the scheme. Open-source IBA subnet management tools and host channel adapter drivers are being developed. These tools can be used to implement the dynamic reconfiguration scheme proposed in this work on real hardware. Furthermore, in order to compare different approaches to improving system robustness, it may be important to develop clear cost, performance and robustness metrics that allow a fair comparison between various offline, static and dynamic approaches.

This work presents an analysis of the overheads and potential benefits of using dynamic reconfiguration in on-chip networks for power reduction and fault tolerance. But dynamic reconfiguration does not have to be limited to Cubic Ring networks. Future work to explore dynamic reconfiguration of other popular topologies, especially indirect topologies, can provide valuable insights into the power-bandwidth trade-off offered by these topologies.

Dynamic reconfiguration of on-chip interconnects is a promising new approach to solving the leakage power and fault tolerance problems using the same basic mechanism. But a dynamic power reduction scheme can only be effective if triggered at the appropriate time. Previous work has looked at triggers for powering down links at a fine (cycle-by-cycle) granularity. The potential for saving power without significant performance penalty at such fine granularity is limited. What is needed, therefore, is a technique for coarse-grain traffic monitoring and prediction that can anticipate periods of low bandwidth demand and notify the network to reconfigure accordingly. Such predictors can potentially be assisted by software because, as we noted in our analysis of application traffic, in at least some applications the periods of heavy communication are well-structured and visible at the algorithmic level. If such knowledge can be passed on to the network power management engine, significant savings in network power can be realized.
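One possible realization of such a coarse-grain trigger is sketched below; the epoch length and thresholds are illustrative assumptions, not values derived in this work.

```python
# One possible realization of a coarse-grain trigger: average link utilization
# is sampled over long epochs and compared against thresholds with hysteresis
# before asking the network to add or remove global rings. All names,
# thresholds and the epoch length are illustrative assumptions.

EPOCH_CYCLES = 100_000   # coarse granularity, unlike cycle-by-cycle link gating
UP_THRESHOLD = 0.60      # sustained demand: request an additional global ring
DOWN_THRESHOLD = 0.25    # sustained slack: request turning a global ring off

def epoch_decision(avg_utilization, rings_on, max_rings):
    """Return the number of global rings requested for the next epoch."""
    if avg_utilization > UP_THRESHOLD and rings_on < max_rings:
        return rings_on + 1      # provision bandwidth before saturation
    if avg_utilization < DOWN_THRESHOLD and rings_on > 1:
        return rings_on - 1      # reclaim static power during quiet phases
    return rings_on              # hysteresis band: hold the current configuration
```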
122 Bibliography [ABB + 07] Anant Agarwal, Liewei Bao, John Brown, Bruce Edwards, Matt Mattina, Chyi-Chang Miao, Carl Ramey, and David Wentzlaff. Tile Processor: Embedded Multicore for Networking and Multimedia. In Proceedings of the Symposium on High Performance Chips (HOT-CHIPS 19), August 2007. [AP07] Thomas William Ainsworth and Timothy Mark Pinkston. Characterizing the Cell EIB On-Chip Network. IEEE Micro, 5(27):6–14, September 2007. [ASW] Katie Antypas, John Shalf, and Harvey Wasserman. Carver: IBM iDataPlex at Lawrence Berkley National Laboratory. www.nersc.gov/nusers/systems/carver/about.php. [BCJ + 03] Aurelio Bermudez, Rafael Casado, Francisco J.Quiles, Timothy M. Pinkston, and Jos´ e Duato. Modeling InfiniBand with OPNET. In Pro- ceedings of the Workshop on Novel Uses of System Area Networks, Febru- ary 2003. [BCQ + 03] Aurelio Bermudez, Rafael Casado, Francisco J. Quiles, Timothy M. Pinkston, and Jos´ e Duato. Evaluation of a Subnet Management Mecha- nism for InfiniBand Networks. In Proceedings of the International Con- ference on Parallel Processing, October 2003. [BD06] James Balfour and William J. Dally. Design Tradeoffs for Tiled CMP On-chip Networks. In Proceedings of the 20th Annual International Con- ference on Supercomputing, pages 187–198, 2006. [BDK + 05] Shekhar Borkar, Pradeep Dubey, Kevin Kahn, David Kuck, Hans Mul- der, Stephen Pawlowski, and Justin Rattner. Platform 2015: Intel Pro- cessor and Platform Evolution for the Next Decade. Technical Report www.intel.com/go/platform2015, Intel Corporation, March 2005. [BG04] Doug Burger and James R. Goodman. Billion-Transistor Architectures: There and Back Again. IEEE Computer, 37(3):22–28, March 2004. 123 [BJ05] Davide Bertozzi and Antoine Jalabert. NoC Synthesis Flow for Cus- tomized Domain Specific Multiprocessor Systems-on-Chip. IEEE Trans- actions on Parallel and Distributed Systems, 16(2):113–129, 2005. [BM02] Luca Benini and Giovanni De Micheli. Networks on Chips: A New SoC Paradigm. IEEE Computer, 35(1):70 –78, January 2002. [Bor07] Shekhar Borkar. Thousand Core Chips: A Technology Perspective. In Proceedings of the 44th ACM/IEEE Design Automation Conference (DAC ’07), pages 746–749, June 2007. [CBGV] C. Carri´ on, R. Beivide, J. A. Gregorio, and F. Vallejo. Necessary and Suf- ficient Conditions for Deadlock-free Networks. Technical report, Depar- tamento de Electr´ onica y Computadores, Universidad de Cantabria. Avail- able at: www.atc.unican.es. [CBGV97] C. Carri´ on, R. Beivide, J. A. Gregorio, and F. Vallejo. A Flow Control Mechanism to Avoid Message Deadlock in k-ary n-cube Networks. In Proceedings of the Fourth International Conference on High-Performance Computing, 1997. [CBJ + 01] Rafael Casado, Aurelio Berm´ udez, Francisco J.Quiles, Jos´ e Sanchez, and Jos´ e Duato. A Protocol for Deadlock-Free Dynamic Reconfiguration in High-Speed Local Area Networks. IEEE Transactions on Parallel and Distributed Systesm, 12(2):115–132, February 2001. [CGB91] D.R. Cheriton, H.A. Goosen, and P.D. Boyle. ParaDiGM: A Highly Scalable Shared-Memory Multicomputer Architecture. IEEE Computer, 24(2):33–46, Feb 1991. [CK94] Andrew A. Chien and Magda Konstantinidou. Workload and Performance Metrics for Evaluating Parallel Interconnects. IEEE Computer Architec- ture Technical Committee Newsletter, Summer-Fall:23 – 27, 1994. [CKR95] L. Cherkasova, V . Kotov, and T. Rokicki. Fibre Channel Fabrics: Evalua- tion and Design. Feb. 1995. [CLKI06] Guangyu Chen, Feihui Li, Mahmut Kandemir, and Mary Jane Irwin. 
Reducing NoC Energy Consumption Through Compiler-Directed Channel V oltage Scaling. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 193–203, 2006. 124 [CP03] Xuning Chen and Li-Shiuan Peh. Leakage Power Modeling and Opti- mization in Interconnection Networks. In Proceedings of the 2003 Inter- national Symposium on Low-Power Electronics and Design(ISLPED ’03), pages 90–95. ACM, 2003. [cra93] Cray T3D System Architecture Overview. Technical report, Cray Research Inc., September 1993. [D. 07] D. Wentzlaff, et al. On-Chip Interconnection Architecture of the Tile Pro- cessor. Micro, IEEE, 27(5):15–31, Sep. 2007. [DNV01] D.Avresky, N.Natchev, and V .Shurbanov. Dynamic Reconfiguration in High-Speed Computer Clusters. In Proceedings of Third IEEE Interna- tional Conference on Cluster Computing (CLUSTER’01), page 380, 2001. [DP01] Jos´ e Duato and Timothy M. Pinkston. A General Theory for Deadlock- Free Adaptive Routing Using a Mixed Set of Resources. IEEE Transac- tions on Parallel and Distributed Systems, 12(12), December 2001. [DS87] William J. Dally and Charles L. Seitz. Deadlock-free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Comput- ers, 36(5):547–553, May 1987. [DSC + 07] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Bra- ganza, S. Meyers, E. Fang, and R. Kumar. An Integrated Quad-Core Opteron Processor. In Proceedings of the International IEEE Solid-State Circuits Conference, Digest of Technical Papers, February 2007. [DT01] William J. Dally and Brian Towels. Route Packets, Not Wires: On-Chip Interconnection Networks. In Proceedings of the Design Automation Con- ference, June 2001. [DT04] William J. Dally and Brian Towels. Principles and Practices of Intercon- nection Networks. Morgan Kaufmann, 2004. [Dua93] Jos´ e Duato. A New Theory of Deadlock-free Adaptive Routing in Worm- hole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(12):1320–1331, December 1993. [Dua95] Jos´ e Duato. A Necessary and Sufficient Condition for Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 6(10):1055–1067, October 1995. [DYN02] Jos´ e Duato, Sudhakar Yalamanchili, and Lionel M. Ni. Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2002. 125 [FVS95] Keith Farkas, Zvonko Vranesic, and Michael Stumm. Scalable Cache Consistency for Hierarchically Structured Multiprocessors. The Journal of Supercomputing, 8(4):345–369, 1995. [GG00] Pierre Guerrier and Alain Greiner. A Generic Architecture for On-Chip Packet Switches Interconnections. In Proceedings of the Design Automa- tion and Testing in Europe Conference, March 2000. [GKM + 06] Paul Gratz, Changkyu Kim, Robert McDonald, Stephen Keckler, and Doug Burger. Implementation and Evaluation of On-Chip Network Archi- tectures. In Proceedings of the International Conference on Computer Design, October 2006. [Han03] Hang-Sheng Wang and Li-Shiuan Peh and Sharad Malik. Power-driven Design of Router Microarchitectures in On-chip Networks. In Proceedings of the International Symposium on Microarchitecture, December 2003. [HGK09] Kyle C. Hale, Boris Grot, and Stephen W. Keckler. Segment Gating for Static Energy Reduction in Networks-On-Chip. In Proceedings of the 2nd International Workshop on Network on Chip Architectures (NoCArc’09), December 2009. [HHS + 00] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, and Kunle Olukotun. 
The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, 2000. [HJ94] V . Carl Hamacher and Hong Jiang. Comparison of Mesh and Hierarchical Networks for Multiprocessors. In Proceedings of the 1994 International Conference on Parallel Processing, pages 67–71, 1994. [HJ01] V . Carl Hamacher and Hong Jiang. Hierarchical Ring Network Config- uration and Performance Modeling. IEEE Transaction on Computers, 50(1):1–12, 2001. [HMH01] Ron Ho, K. W. Mai, and Mark A. Horowitz. The Future of Wires. Pro- ceedings of the IEEE, 89(4):490 –504, April. 2001. [Hor96] R. Horst. ServerNet Deadlock Avoidance and Fractahedral Topologies. In Proceedings of the International Parallel Processing Symposium, pages 275–280. IEEE Computer Society, April 1996. [HP03] Wai Hong Ho and Timothy Mark Pinkston. A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Pat- terns. In Proceedings of the 9th International Symposium on High Per- formance Computer Architecture, February 2003. 126 [HS94] M. Holliday and M. Stumm. Performance Evaluation of Hierarchical Ring- Based Shared Memory Multiprocessors. IEEE Transaction on Computers, 43(1):52–67, 1994. [Inf00] InfiniBand TM Architecture Specification Volume 1, Release 1.0. Infini- Band Trande Association, October 24, 2000. [int94] Intel Paragon Supercomputers: The Scalable High Performance Comput- ers. Technical report, Intel SSD, November 1994. [Itr] The International Technology Roadmap for Semiconductors (ITRS). [J. 05] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer and D. Shippy. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, 49(4):589–604, 2005. [JMBM04] Antoine Jalabert, Srinivasan Murali, Luca Benini, and Giovanni De Micheli. XpipesCompiler: A Tool for Instantiating Application Specific Networks on Chip. In Proceedings of the Design Automation and Test in Europe Conference and Exhibition, 2004. [JZH07] D.N. Jayasimha, Bilal Zafar, and Yatin Hostoke. On-die Interconnec- tion Networks: Why They Are Different and How to Compare Them? Technical report, Microprocessor Technology Lab, Corportate Technology Group, Intel Corporation, 2007. [Kar06] Karthikeyan Sankaralingam, et al. Distributed Microarchitectural Proto- cols in the TRIPS Prototype Processor. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39), December 2006. [KJM + 02] Shashi Kumar, Axel Jantsch, Mikael Millberg, Johny Oberg, Juha-Pekka Soininen, Martti Forsell, Kari Tiensyrja, and Ahmed Hemani. A Net- work on Chip Architecture and Design Methodology. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI. IEEE Computer Society, 2002. [KTMW03] Jason Sungtae Kim, Michael Bedford Taylor, Jason Miller, and David Wentzlaff. Energy Characterization of a Tiled Architecture Processor with On-chip Networks. In Proceedings of the 2003 International Symposium on Low-Power Electronics and Design, pages 424–427. ACM, 2003. [KZT05] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408–419. IEEE Computer Society, 2005. 127 [LAS97] Kathy J. Liszka, John K. Antonio, and Howard Jay Siegel. Problems with Comparing Interconnection Networks: Is an Alligator Better Than an Armadillo? IEEE Concurrency, 5(4):18–28, 1997. [LFD01] Pedro Lopez, Jos´ e Flich, and Jos´ e Duato. 
Deadlock-free Routing in Infini- Band through Destination Renaming. In Proceedings of the International Conference on Parallel Processing, pages 427–434, Sep. 2001. [LPD05] Olav Lysne, Timothy Mark Pinkston, and Jos´ e Duato. Part II: A Method- ology for Developing Deadlock-Free Dynamic Network Reconfigura- tion Processes. IEEE Transactions on Parallel and Distributed Systems, 16(5):428–443, 2005. [Man07] Mani Azimi, Naveen Cherukuri, D. N. Jayasimha, Akhilesh Kumar, Partha Kundu, Seungjoon Park, Ioannis Schoinas, Aniruddha S. Vaidya. Inte- gration Challenges and Tradeoffs for Tera-scale Architectures. Intel Techonolgy Journal, 11(3):173–182, 2007. [Mar04] Marina Alonso, Juan Miguel Martinez, Vicente Santonja and Pedro Lopez. Reducing Power Consumption in Interconnection Networks by Dynami- cally Adjusting Link Width. In Proceedings of 2004 EuroPar Conference. Springer-Verlag, 2004. [MBL + 01] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb. The Alpha 21364 Network Architecture. In Proceedings of the Symposium on High Performance Interconnects. IEEE Computer Society Press, August 2001. [MSSD08] H´ ector Montaner, Federico Silla, Vicente Santonja, and Jos´ e Duato. Net- work Reconfiguration Suitability for Scientific Applications. In ICPP ’08: Proceedings of the 2008 37th International Conference on Parallel Pro- cessing, pages 312–319. IEEE Computer Society, 2008. [NDR + 95] N.J.Boden, D.Cohen, R.E.Felderman, A.E.Dulawik, C.L.Seitz, J.Seizovic, and W.Su. Myrinet-A Gigabit Per Second Local Area Network. IEEE Micro, pages 29–36, February 1995. [NER08] NERSC-6 Workload Analysis and Benchmark Selection Process. Techni- cal Report LBNL-1014E, National Energy Research Scientific Computing Center Division, Lawrence Berkeley National Laboratory. Available at: www.nersc.gov/projects/SDSA/reports/uploaded/NERSCWorkload.pdf, August 2008. 128 [NPK + 06] Chrysostomos A. Nicopoulos, Dongkook Park, Jongman Kim, N. Vijaykr- ishnan, Mazin S. Yousif, and Chita R. Das. ViChaR: A Dynamic Vir- tual Channel Regulator for Network-on-Chip Routers. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitec- ture(MICRO 39), pages 333–346. IEEE Computer Society, 2006. [opn] OPNET Technologies. http://www.opnet.com. [PBG + 99] Valentin Puente, Ram´ on Beivide, Jose Gregorio, J. M. Prellezo, Jos´ e Duato, and Cruz Izu. Adaptive Bubble Router: A Design to Improve Per- formance in Torus Networks. In Proceedings of the International Confer- ence on Parallel Processing, 1999. [PCSV03] A. Pinto, L.P. Carloni, and A.L. Sangiovanni-Vincentelli. Efficient Synthe- sis of Networks on Chip. In 21st International Conference on Computer Design, pages 146–150, Oct. 2003. [PD06] Timothy M. Pinkston and Jos´ e Duato. Appendix E of Computer Architec- ture: A Quantitative Approach. Elsevier Publishers, 4th edition, 2006. [Pet08] Peter Kogge et al. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. www.er.doe.gov/ascr/Research/CS/DARPA exascale - hardware(2008).pdf, September 2008. [Pfi00] Gregory F. Pfister. An Introduction to the InfiniBand Architecture. In Proceedings of the Cluster Computing Conference (Cluster00), November 2000. [PGB02] V . Puente, J. Gregorio, and R. Beivide. SICOSYS: An Integrated Frame- work for Studying Interconnection Network Performance in Multiproces- sor Systems. In Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, 2002. 
[Poo05] Poonacha Kongetira, Kathirgamar Aingaran and Kunle Olukotun. Nia- gara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005. [PPD00] Ruoming Pang, Timothy Mark Pinkston, and Jos´ e Duato. The Double Scheme: Deadlock-free Dynamic Reconfiguration of Cut-Through Net- works. In The 2000 International Conference on Parallel Processing, pages 439–448. IEEE Computer Society, August 2000. [PPD03] Timothy Mark Pinkston, Ruoming Pang, and Jos´ e Duato. Deadlock-Free Dynamic Reconfiguration Schemes for Increased Network Dependability. IEEE Transactions on Parallel and Distributed Systems, 14(8):780–794, 2003. 129 [PV81] Franco P. Preparata and Jean Vuillemin. The Cube-Connected Cycles: A Versatile Network for Parallel Computation. Communications of the ACM, 24(5):300–309, 1981. [PZD03] Timothy M. Pinkston, Bilal Zafar, and Jos´ e Duato. A Method for Applying Double Scheme Dynamic Reconfiguration over InfiniBand. In Proceed- ings of the International Conference on Parallel and Distributed Process- ing Techniques and Applications, June 2003. [Rat09] Justin Rattner. The Dawn of Terascale Computing. IEEE Solid-State Cir- cuits Magazine, 1(1):83 –89, Winter 2009. [RS91] T. L. Rodeheffer and M. D. Schroeder. Automatic Reconfiuration in Autonet. Technial Report 77, SRC Research, September 1991. [RS97] Govindan Ravindran and Michael Stumm. A Performance Comparison of Hierarchical Ring- and Mesh- Connected Multiprocessor Networks. In Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, 1997. [SBB + 91] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker. Autonet: A High- Speed, Self-Configuring Local Area Network Unsing Point-to-Point Links. Technical Report 8, October 1991. [SCG + 08] Blaine Stackhouse, Brian Cherkauer, Mike Gowan, Paul Gronowski, and Chris Lyles. A 65nm 2-Billion-Transistor Quad-Core Itanium Processor. In Proceedings of the International IEEE Solid-State Circuits Conference, Digest of Technical Papers, February 2008. [Sci] Scientific Discovery Through Advanced Computing. www.scidac.gov/. [SK04] Dongkun Shin and Jihong Kim. Power-Aware Communication Opti- mization for Networks-on-Chips with V oltage Scalable Links. In Pro- ceedings of the 2nd IEEE/ACM/IFIP International Conference on Hard- ware/Software Codesign and System Synthesis, pages 170–175, 2004. [SKT + 05] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. Power5 System Microarchitecture. IBM Journal of Research and Devel- opment, 49:505–521, July 2005. [SP03] Vassos Soteriou and Li-Shiuan Peh. Dynamic Power Management for Power Optimization of Interconnection Networks Using On/Off Links. In Proceedings of the 11th Symposium on High Performance Interconnects, August 2003. 130 [SP04] Vassos Soteriou and Li-Shiuan Peh. Design-Space Exploration of Power- Aware On/Off Interconnection Networks. In Proceedings of the 22nd International Conference on Computer Design (ICCD), October 2004. [SP07] Vassos Soteriou and Li-Shiuan Peh. Exploring the Design Space of Self- Regulating Power-Aware On/Off Interconnection Networks. IEEE Trans- actions on Parallel and Distributed Systems, 18(3):393–408, March 2007. [SRD01] Jos´ e Carlos Sancho, Antonio Robles, and Jos´ e Duato. Effective Strategy to Compute Forwarding Tables for InfiniBand Networks. In Proceedings of the International Conference on Parallel Processing, pages 48–57. IEEE Computer Society Press, September 2001. 
[Sri07] Sriram Vangal et al. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC 2007), Digest of Technical Papers, pages 98–589, February 2007.
[SSM+01] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. Sangiovanni-Vincentelli. Addressing the System-on-Chip Interconnect Woes Through Communication-Based Design. In Proceedings of the Design Automation Conference, June 2001.
[ST96] Steven L. Scott and Gregory M. Thorson. The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus. In Proceedings of Hot Interconnects IV, August 1996.
[TBG+97] D. Teodosiu, J. Baxter, K. Govil, J. Chapin, M. Rosenblum, and M. Horowitz. Hardware Fault Containment in Scalable Shared-Memory Multiprocessors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 73–84. IEEE Computer Society Press, June 1997.
[TLAA03] Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, and Anant Agarwal. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures. In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA'03), page 341. IEEE Computer Society, 2003.
[TLM+04] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. SIGARCH Computer Architecture News, 32(2):2–13, 2004.
[War99] Sugath Warnakulasuriya. Characterization of Deadlocks in Interconnection Networks. PhD thesis, University of Southern California, 1999.
[Wil87] A. W. Wilson. Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors. In Proceedings of the 14th Annual International Symposium on Computer Architecture, pages 244–252, 1987.
[WOT+95] S. C. Woo, M. Ohara, E. J. Torrie, J.-P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodology Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36. IEEE Computer Society Press, June 1995.
[ZD10a] Bilal Zafar and Jeff Draper. Cubic Ring: A Polymorphic NoC for Dynamic Power Reduction and Fault Tolerance. Poster, MuSyC-GSRC Joint Review, St. Claire Hotel, San Jose, CA, September 2010.
[ZD10b] Bilal Zafar and Jeff Draper. Polymorphic Networks-On-Chip for Dynamic Power Management and Fault Tolerance. Doctoral Showcase, 23rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), November 2010.
[ZDP10] Bilal Zafar, Jeff Draper, and Timothy Pinkston. Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip. In Proceedings of the 39th International Conference on Parallel Processing (ICPP '10), 2010.
[ZPBD03] Bilal Zafar, Timothy M. Pinkston, Aurelio Bermúdez, and José Duato. Deadlock-free Dynamic Reconfiguration over InfiniBand Networks. International Journal of Parallel, Emergent and Distributed Systems, 19(2–3):127–143, June 2003.
[ZPD03] Bilal Zafar, Timothy M. Pinkston, and José Duato. A Method for Applying Double Scheme Dynamic Reconfiguration over InfiniBand. Technical Report CENG 02-03, University of Southern California, March 2003.
[ZY95] Xiaodong Zhang and Yong Yan. Comparative Modeling and Evaluation of CC-NUMA and COMA on Hierarchical Ring Architectures. IEEE Transactions on Parallel and Distributed Systems, 6(12):1316–1331, Dec. 1995.
Abstract
Computing systems are increasingly becoming communication bound. From megawatt servers in data centers to mobile and embedded systems, the cost, performance, and reliability of the communication fabric have rapidly become first-order design concerns in digital systems at every scale. At the same time, traditional solutions for connecting resources are fast approaching their natural limits, beyond which they will not be able to offer the requisite performance at acceptable area, power, or dollar cost.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Energy-efficient shutdown of circuit components and computing systems
Accelerating scientific computing applications with reconfigurable hardware
High-performance linear algebra on reconfigurable computing systems
Introspective resilience for exascale high-performance computing systems
Performance improvement and power reduction techniques of on-chip networks
Stochastic dynamic power and thermal management techniques for multicore systems
Parallel simulation of chip-multiprocessor
Hardware techniques for efficient communication in transactional systems
Variation-aware circuit and chip level power optimization in digital VLSI systems
Communication mechanisms for processing-in-memory systems
Dynamic graph analytics for cyber systems security applications
Dynamic packet fragmentation for increased virtual channel utilization and fault tolerance in on-chip routers
Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers
Thermal management in microprocessor chips and dynamic backlight control in liquid crystal diaplays
Floating-point unit design using Taylor-series expansion algorithms
Design of low-power and resource-efficient on-chip networks
Intelligent near-optimal resource allocation and sharing for self-reconfigurable robotic and other networks
Efficient techniques for sharing on-chip resources in CMPs
Power efficient design of SRAM arrays and optimal design of signal and power distribution networks in VLSI circuits
Synchronization and timing techniques based on statistical random sampling
Asset Metadata
Creator
Zafar, Bilal (author)
Core Title
Dynamically reconfigurable off- and on-chip networks
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Publication Date
05/02/2011
Defense Date
10/11/2010
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer architecture, Infiniband, interconnection networks, network-on-chip, OAI-PMH Harvest, on-chip networks, power-aware networks, power-gating, reconfiguration, voltage island
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Draper, Jeffrey T. (committee chair), Nakano, Aiichiro (committee member), Prasanna, Viktor K. (committee member)
Creator Email
bilal.zafar@gmail.com,bzafar@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3653
Unique identifier
UC1321714
Identifier
etd-Zafar-4292 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-433431 (legacy record id),usctheses-m3653 (legacy record id)
Legacy Identifier
etd-Zafar-4292.pdf
Dmrecord
433431
Document Type
Dissertation
Rights
Zafar, Bilal
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu