DEADLOCK RECOVERY-BASED ROUTER ARCHITECTURES FOR HIGH PERFORMANCE NETWORKS

by

Yungho Choi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

August 2001

Copyright 2001 Yungho Choi

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-1695

This dissertation, written by Yungho Choi under the direction of the Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: August 7, 2001

Dedication

This dissertation is dedicated to my lovely wife, Minjung, and son, Edward.

Acknowledgments

I thank Professor Timothy Pinkston for the vision, guidance, and generous support he has provided. Without his relentless effort to help me grow as a researcher, the contributions of this work could not have been successfully accomplished. I also thank Professor Monte Ung and Professor Alexander A. Sawchuk for their advice during the qualifying exam, and Professor Jean-Luc Gaudiot and Professor John Heidemann for their participation in the dissertation committee. Many thanks also go to SMART Interconnects members Akash Bansal, Deepesh Chouhan, Joon-Ho Ha, Wai Ho, Mongkol Raksapatcharawong, Sugath Warnakulasuriya, and Anjan Venkatramani for their helpful suggestions and feedback.
I am extremely lucky to have had the support and encouragement of many wonderful friends and family members. Much gratitude is owed especially to my mom and dad and to my lovely wife, Minjung, for their patience, understanding, and loving support.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Research Problems and Issues
1.2 Related Work
1.3 Research Contributions
1.4 Dissertation Outline

2 High Performance Interconnection Networks
2.1 Low Latency and High Throughput
2.1.1 Message Congestion Control
2.1.1.1 Routing Adaptivity
2.1.1.2 Others
2.1.2 Message Routing Speed
2.1.2.1 Crossbar
2.1.2.2 Queue
2.1.2.3 Routing Module
2.1.2.4 Arbitration
2.1.2.5 Others
2.2 Reliability and Fault-tolerance
2.3 Quality of Service

3 Evaluation of De-multiplexed Crossbar Architectures for Deadlock Recovery-based Routers
3.1 Routing Locality
3.2 Router Crossbar Designs
3.2.1 Unified Design
3.2.2 Decoupled Designs
3.2.2.1 Internal Blocking
3.2.2.2 Internal Deadlock
3.2.2.3 Multi-cycle Delay
3.3 Performance of Router Designs
3.3.1 Router Level Performance Evaluation
3.3.2 Network Level Performance Evaluation
3.3.2.1 Existence of Routing Locality
3.3.2.2 Exploitation of Routing Locality
3.3.2.3 Performance Comparison of Router Designs
3.4 Summary

4 Evaluation of Queue Architectures for Deadlock Recovery-based Routers
4.1 Router Queue Designs
4.1.1 Circular Queue
4.1.2 Dynamically Allocated Multi-Queue
4.1.3 Enhanced Dynamically Allocated Multi-Queue
4.2 Queue Implementation
4.3 Incorporation of Queue and Crossbar
4.4 Performance Evaluation
4.4.1 Router Speed
4.4.2 Network Performance Evaluation
4.4.2.1 Queue Performance
4.4.2.2 Architecture Complexity-aware Router Performance Comparison
4.5 Summary

5 Implementation of a True Fully Adaptive Deadlock Recovery-based Router Architecture: WARRP
5.1 Router Architecture
5.1.1 Overview
5.1.2 Virtual Network Module
5.1.3 Input Module
5.1.4 Routing and Arbitration Module
5.1.5 Flow Controller Module
5.1.6 Deadlock Module
5.2 Recovery Eligibility
5.3 Router Implementation
5.4 Performance
5.5 Summary

6 Conclusions
6.1 Future Work

Reference List

List of Tables

3.1 Delay of Router Designs.
3.2 Summary of Results.
4.1 Delay comparison of the various queue designs.
4.2 Crossbar size according to queue designs and network dimensions.
4.3 Delay of router designs.

List of Figures

1.1 An example of multiprocessor systems.
2.1 Virtual channel scheme.
2.2 Generic Router Architecture.
2.3 Various crossbar designs.
2.4 Input queueing and output queueing.
2.5 Two nodes, A and B, in a two-dimensional torus network.
3.1 Examples of routing locality.
3.2 Internal router crossbar designs.
3.3 Internal Blocking and Deadlock.
3.4 Multi-cycle delays for packets P1, P2, and P3.
3.5 Router delay components of the WARRP router.
3.6 Fragmentation of Clock Cycle Time.
3.7 Routing locality under uniform traffic.
3.8 Routing locality under non-uniform traffic.
3.9 Subcrossbar load balance in the C-CB and E-CB router designs.
3.10 Subcrossbar load distribution in the Hierarchical router design.
3.11 Effect of load balance on performance under uniform traffic.
3.12 Effect of new injection on load balance and performance of E-CB.
3.13 Latency and throughput comparisons under Uniform traffic.
3.14 Latency and throughput comparisons under Bit Reversal traffic.
3.15 Latency and throughput comparisons under Perfect Shuffle traffic.
4.1 Circular queue (CQ) and first-in-first-out queue (FIFO).
4.2 Dynamically allocated multi-queue (DAMQ).
4.3 Dynamically allocated multi-queue with recruit registers.
4.4 Implementation of queue designs for fully adaptive routers.
4.5 Port-directed crossbar.
4.6 Effect of Virtual Channels.
4.7 Effect of queue size in a 2D torus network.
4.8 Three-dimensional torus (8x8x8).
4.9 Non-uniform traffic pattern.
4.10 Performance comparison under a 16 x 16 two-dimensional torus.
4.11 Performance comparisons under a three-dimensional torus.
5.1 Organization of the WARRP router.
5.2 Routing Locality in virtual networks (VN).
5.3 Virtual Network Module.
5.4 Block Diagram of the Input Module.
5.5 Header Flit Format.
5.6 External Flow Control Signals and their Timing Diagram.
5.7 Internal Flow Control Signals and their Timing Diagram.
5.8 Block Diagram of the Deadlock Module.
5.9 The VLSI Layout of the WARRP II Router.
5.10 Pin-to-pin latency for the two router implementations.

Abstract

Multiprocessor systems have been developed to efficiently solve complex and large scientific problems. Generally, these systems have a critical component, i.e., the interconnection network, which significantly affects system performance by determining the communication capability of multiprocessor systems. In recent years, with the emergence of bandwidth-hungry applications and multi-GHz processors, the demand for high performance interconnection networks has increased to meet the rapidly growing communication needs of multiprocessor systems.
To satisfy this demand, routing algorithms must fully utilize network resources while efficiently handling message deadlock, which can lead to the halting of an entire system. There are largely two classes of routing algorithms according to the way deadlocks are dealt with: deadlock avoidance-based and deadlock recovery-based routing algorithms. Deadlock avoidance-based networks prevent deadlocks by enforcing routing restrictions, which hampers routing adaptivity and, therefore, limits network performance. To overcome this problem, a number of deadlock recovery-based networks have recently been proposed, which maximize routing adaptivity and, thus, significantly increase network performance. However, the increased routing adaptivity can lead to slower and more complicated router architectures, degrading overall network performance.

In order to minimize the architecture complexity of deadlock recovery-based routers and to maximize network performance, this dissertation optimizes deadlock recovery-based router architectures by proposing two router component design solutions, i.e., partitioned crossbar designs and enhanced dynamically allocated multi-queue designs. These solutions significantly reduce the architecture complexity of deadlock recovery-based routers while fully benefiting from their capability, leading to optimal deadlock recovery-based router architectures.

Through extensive evaluations of various router architectures, this dissertation verifies that the true fully adaptive routing capability of deadlock recovery schemes can be efficiently implemented in routers and, hence, their superior network performance can be realized. Finally, this work demonstrates the feasibility of some of the proposed router architectures by implementing the WARRP router.
Chapter 1

Introduction

To efficiently solve large and complex problems, many massively parallel processing systems (MPPs) and networks of distributed heterogeneous computers (NOWs) have been developed in recent years. Generally, these systems reduce problem-solving time by dividing a large problem into smaller subproblems, mapping them to multiple processing elements, and executing them concurrently. This problem-solving approach requires that information be exchanged and shared among subproblems, which is facilitated by interconnection networks as illustrated in Figure 1.1. An interconnection network consists of routers and links: routers actively route and deliver messages between processing nodes and/or neighboring routers, while links physically but passively connect routers and processing nodes. The communication capability of an interconnection network is, therefore, largely determined by its active network resources, i.e., routers.

Figure 1.1: An example of multiprocessor systems.

With the recent emergence of multi-GHz CPUs and latency-sensitive, bandwidth-hungry applications (e.g., a variety of scientific simulations, multimedia servers, banking transactions, etc.), interconnection networks are becoming a bottleneck in both MPPs and NOWs. In order to accommodate these increasing network performance demands of MPPs and NOWs, significant advancements will have to be made in router architectures, i.e., the active components of interconnection networks.
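To make the router-and-link structure just described concrete, the sketch below models an interconnection network of the kind evaluated later in this dissertation as a set of routers joined by wraparound links. This is an illustrative model only; the k-ary 2-cube (2D torus) topology and the adjacency-map representation are assumptions for this sketch, not an artifact of this chapter.

```python
# Illustrative sketch (not from the dissertation): a 2D torus (k-ary 2-cube)
# interconnection network modeled as routers (nodes) and links (edges).
# Wraparound links give every router exactly four neighboring routers.

def torus_2d(k):
    """Return {router: [neighbor routers]} for a k x k torus."""
    links = {}
    for x in range(k):
        for y in range(k):
            links[(x, y)] = [
                ((x + 1) % k, y), ((x - 1) % k, y),   # +X / -X links (wraparound)
                (x, (y + 1) % k), (x, (y - 1) % k),   # +Y / -Y links (wraparound)
            ]
    return links

net = torus_2d(4)
print(len(net))           # -> 16 routers
print(len(net[(0, 0)]))   # -> 4 neighbors, thanks to the wraparound links
```

In this passive-link model, all routing and delivery decisions would live in the router logic attached to each node, which is why the router is the component this dissertation targets.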
In general, the performance of networks is described by two well-known parameters: network latency and network throughput. Network latency is determined by the average time for message transfers between two arbitrary nodes in the network. Most MPP applications which require high computing power (e.g., weather simulation, nuclear detonation simulation, etc.) are very sensitive to network latency because a long network latency significantly increases communication overhead. Latency also reduces CPU utilization by making computing nodes stall frequently, which increases overall execution time. Therefore, low latency networks are essential for achieving high-performance MPP systems. Besides MPP applications, multimedia applications such as the web, real-time audio/video, etc., also demand a low network latency, since this reduces response time and increases the quality of service.

Generally, network throughput is measured by the maximum number of messages that a network can deliver per unit time. High throughput networks are necessary for efficiently supporting systems that consist of a great number of nodes and communication-intensive applications which generate a great amount of traffic. The recent explosive growth of the Internet has caused a much greater demand for higher throughput networks. Hence, a number of new networks such as ATLAS I [44], the Avici system [20], and the Cisco 12000 GSR [46] have been developed. Besides the Internet, large MPP systems (e.g., the Intel teraflops machine [9] and Intel Paragon [15]) which consist of hundreds or thousands of computing nodes require high throughput networks so as to better support the communication demands of a great number of computing nodes without suffering from network saturation. Clearly, low latency and high network throughput are the key characteristics of such high performance networks.
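The two metrics can be illustrated with a toy calculation. The message log below is hypothetical, invented for this sketch; it is not a measurement from this work, and the cycle-based units are an assumption.

```python
# Hypothetical worked example of the two metrics defined above. Each record is
# (injection_cycle, delivery_cycle) for one delivered message.
deliveries = [(0, 40), (5, 60), (10, 45), (12, 70)]

# Network latency: average time from injection to delivery.
latency = sum(dst - src for src, dst in deliveries) / len(deliveries)

# Network throughput: messages delivered per unit time over the measurement
# window (here, up to the last delivery).
window = max(dst for _, dst in deliveries)
throughput = len(deliveries) / window

print(latency)      # -> 47.0 cycles
print(throughput)   # -> about 0.057 messages per cycle
```

In practice both quantities are measured at a given offered load, and throughput saturates once the network can no longer absorb additional injected traffic.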
1.1 Research Problems and Issues

Low latency and high throughput interconnection networks can be achieved by maximizing the utilization of network resources, e.g., links and routers. One approach for increasing the utilization of network resources is through the use of virtual channels [16, 17, 80]. This technique assigns multiple virtual channels to a physical link so that the link's bandwidth is not wasted by a blocked message. This not only increases the utilization of the physical link bandwidth, but also reduces the communication delay caused by head-of-line blocking, resulting in low latency and high throughput. The virtual channel scheme has also been used to provide better quality of service, since it makes it easier to support various classes of messages with their different communication demands. Despite its benefits, however, previous research has shown that an additional virtual channel increases router delay by 15% due to its complicated
Unfortunately, it is not easy to improve routing adaptivity due to deadlock situations. Deadlock is the hold-and- wait situation in which packets cannot make progress due to cyclic dependencies on resources. To guarantee deadlock-freedom, by and large, two classes of routing algorithms were developed, deadlock avoidance and deadlock recovery. Deadlock avoidance-based algorithms [12, 19. 51. 57] prevent deadlock from occurring in networks by enforcing routing restrictions, e.g., dimension-order restrictions, turn Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. restrictions, etc. Although this guarantees deadlock-freedom, the routing restric tions imposed by these algorithms hamper routing adaptivity. The reduced rout ing adaptivity might aggravate the flow through congested areas in networks by making it difficult for messages to go around the congested spots. To maximize routing adaptivity while guaranteeing deadlock freedom, re cently deadlock recovery-based algorithms [5, 65, 53, 41] have been proposed. These algorithms eliminate the routing restrictions enforced by deadlock avoidance- based schemes, and hence, maximize the routing adaptivity. Instead of avoiding deadlock situations, deadlock recovery-based schemes detect and recover from deadlock situations. The viability of these schemes depends on the assumption that deadlock situations rarely occur. If this is not the case (i.e.. deadlock occurs frequently), the cost of deadlock recovery processing outweighs the performance benefit obtained by increased routing adaptivity, and the overall performance is degraded. Fortunately, research [69, 87] has demonstrated that deadlock is. in deed, rare in networks, especially with higher routing adaptivity, more virtual channels, and higher connection degree. 
Although deadlock recovery schemes have great potential for performance enhancement, ensuring their maximized routing adaptivity significantly complicates the design of router components such as the crossbar, routing and arbitration logic, address decoder, etc. This can considerably reduce router speed and, thus, increase overall network latency. Moreover, this problem is exacerbated when the virtual channel scheme is incorporated into deadlock recovery routing schemes. Consequently, to maximize network performance, deadlock recovery-based router architectures have to be optimized in a way that minimizes the architecture complexity incurred by the increased routing adaptivity while preserving the performance potential of deadlock recovery routing schemes.

To achieve high performance interconnection networks, this dissertation addresses two important issues. The first issue concerns how to efficiently minimize the router architecture complexity incurred by virtual channels and fully adaptive routing schemes. This issue is addressed by proposing partitioned crossbar designs which can minimize the cost of fully adaptive routing schemes and virtual channels, resulting in fast and fully adaptive routers. The key idea behind the partitioned crossbar structure is to exploit a dynamic routing behavior identified as routing locality. Sufficient routing locality enables the internal crossbar of a deadlock recovery-based router to be partitioned into smaller and faster units without sacrificing its true fully adaptive routing capability. This minimizes the delay suffered from implementing the increased routing adaptivity, thus making the common case fast. This research validates the sufficient existence of routing locality by extensively simulating deadlock recovery-based fully adaptive interconnection networks.
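As a toy illustration of the kind of measurement involved, locality can be quantified as the fraction of hops that continue in the same dimension as the preceding hop. This simplified metric is a stand-in invented for this sketch, not the dissertation's definition of routing locality.

```python
# Hypothetical sketch: quantify routing locality as the fraction of hops that
# stay in the same dimension as the preceding hop. High locality suggests a
# packet rarely needs a path that crosses between crossbar partitions.

def routing_locality(path):
    """path: sequence of hop dimensions, e.g. ['X', 'X', 'Y', 'X']."""
    if len(path) < 2:
        return 1.0
    same = sum(1 for a, b in zip(path, path[1:]) if a == b)
    return same / (len(path) - 1)

print(routing_locality(['X', 'X', 'X', 'Y']))   # -> 0.666...
print(routing_locality(['X', 'Y', 'X', 'Y']))   # -> 0.0
```

Averaging such a measure over simulated traffic is one way to judge whether partitions would be crossed rarely enough for a partitioned crossbar to pay off.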
The simulation results will provide insight as to how much the identified behaviors exist, how they can be exploited, and how their exploitation will affect network performance. In order to effectively apply the identified routing behaviors to router architecture optimization, this dissertation presents a variety of crossbar architectures for interconnection network routers. The proposed scheme will increase router speed without sacrificing true fully adaptive routing capability.

For the second issue, this work addresses how to better support the unrestricted routing capability of deadlock recovery-based schemes in router architectures. A main performance capability of the unrestricted routing schemes lies in minimizing message blocking and the related queuing delay by effectively utilizing the given network resources. To minimize message blocking and, thus, to maximize the capability of the unrestricted routing schemes, this dissertation proposes two new dynamically allocated multi-queues (DAMQs): the Dynamically Allocated Multi-Queue with Recruit registers (DAMQWR) and the Virtual Channel Dynamically Allocated Multi-Queue (VCDAMQ). DAMQWR provides the capability to dynamically modify queue linked lists using nominal resources, i.e., recruit registers. This capability enables queues to minimize the message blocking caused by the statically-managed linked lists of generic DAMQs. Hence, this queue is expected to better support the unrestricted routing capability of deadlock recovery-based schemes. VCDAMQ dynamically allocates queue space
This provides the potential to minimize message blocking by efficiently accommodating unbalanced traffic loads among virtual channel networks which might exist in fully adaptive routing networks. There fore. VCDAMQ is also expected to well accommodate the fully adaptive routing capability, resulting in high network performance. Additionally, these queues, i.e., DAMQVVR and VCDAMQ, not only minimize message blocking but also well cooperate with partitioned crossbar structures. This can further reduce net work latency while maximizing network throughput. To validate the proposed queue designs, this dissertation characterizes these queues in-depth and evaluates their performance through extensive simulations. 1.2 R elated W ork Within the last few decades, a great number of router architectures for inter connection networks have been developed toward improving router speed. Such routers include Mosaic Chip [27], Intel iPSC [6], MIT J-Machine [21, 59], Stanford Dash [49], Cray T3D [77], Intel Paragon [15], Stanford FLASH [45], MIT Alewife [1], Cray T3E [78], SGI Origin 2000 [47], and Alpha 21364 [34]. Generally, these router architectures are based on deadlock avoidance routing algorithms which Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. enforce routing restrictions (e.g., dimension order, channel order, etc.) to pre vent deadlocks. These architectures increase router speed by exploiting routing restrictions of the underlying algorithm. Since deadlock avoidance schemes re strict messages from using all alternative routing choices, it is not necessary for the crossbar to provide direct access to all output channels for packets, which makes it possible for the internal crossbar to be partitioned into smaller, faster sub-crossbars. This decreases not only the delay of the crossbar but also the delay of other router components (e.g., header selection and routing arbitration logic), by limiting routing choices. 
For instance, in the Mosaic Chip [27], static dimension-ordered routing is enforced, where packets are not allowed to route in the Y dimension before reaching the X dimension of the destination. This makes it possible to partition the crossbar into smaller, faster units based on dimension. In the partially adaptive Planar Adaptive Router [12], only two dimensions at a time are available for routing to avoid deadlock (routing in all other dimensions is prohibited). By exploiting these routing restrictions, the crossbar can be partitioned into planar sub-crossbar units (as opposed to the dimensional units used in the Mosaic Chip [27]) to improve speed. In the Hierarchical Adaptive Router [51], the crossbar is partitioned into ordered virtual channel network subcrossbars. Here, deadlock is avoided by enforcing routing restrictions in the lowest virtual network, viz., Duato's algorithm [23].

Although the network router architectures addressed above achieve considerable router speed improvement, their routing restrictions inherently reduce the bandwidth utilization of a given network, which degrades overall network performance. Recently, a couple of deadlock recovery algorithms [5, 41] have been proposed to solve this problem. These algorithms do not enforce any routing restrictions. Instead, they detect and recover from deadlocks. Although this restriction-free routing capability may increase overall network performance by providing more routing choices, it increases router design complexity and reduces router speed. A motivation of this dissertation is to achieve the highest possible network performance of deadlock recovery-based routers by optimizing the router designs. The design optimizations aim at minimizing the additional delay caused by the increased adaptive routing capability without compromising the benefits of unrestricted routing.

In addition to the efforts to improve router speed, a number of queue designs for routers have been proposed to efficiently minimize message blocking and, hence, to maximize the capability of the underlying routing schemes. One of
In addition to the efforts to improve router speed, a number of queue designs for routers have been proposed to efficiently minimize message blocking and, hence, to maximize the capability of the underlying routing schemes. One of these queues is the first-in-first-out queue (FIFO), which is the most well-known and straightforward design. This queue simply stores and sends out packets in reception order. Although its design simplicity can lead to high speed, FIFO in general suffers from the head-of-line (HOL) blocking problem, especially when router architectures employ input queueing for scalability. HOL blocking is a situation in which, when the message at the head of a queue is blocked, the subsequent packets are blocked as well even though they could be routed through different, idle output links. HOL blocking has been shown to significantly reduce network throughput [39]. Another problem of FIFO is that an incoming packet wastes clock cycles reaching the head of the queue before it can be routed, even when the queue is empty. To solve this problem, a circular queue design (CQ) [7] was proposed. This design is basically a FIFO except that its head and tail pointers move around the queue cells so that incoming packets can be stored directly at the head of the queue without wasting clock cycles. Although it removes the unnecessary queueing delay of FIFO, the circular queue does not resolve the HOL problem. To overcome the HOL blocking problem of CQ and FIFO, many queue designs have been proposed. Among them are the statically allocated fully connected multi-queue (SAFC) [22, 31, 33] and the dynamically allocated multi-queue (DAMQ) [85]. The key idea behind these queues is to maintain multiple
queues associated with each output link of a router such that a packet at the head of one queue does not block packets heading in different directions. Although these queue designs eliminate the HOL problem, they still suffer from a static linked-list management problem which can hamper fully adaptive routing capability. In these queues, once an incoming packet is assigned to one of the output-associated multi-queues by updating the corresponding linked list, the packet cannot be reassigned to another queue. This limits routing adaptivity by forcing the packet to be routed through the assigned queue and its associated output link even when that link is congested and others become free. Therefore, these queues might not effectively support fully adaptive routing capability and, hence, can degrade network performance, although they might be well suited for deterministic or restricted routing schemes. This problem could be solved by complicated, software-based scheduling with an embedded processor or a dedicated workstation, just as general Internet routers do. However, that solution is too expensive to satisfy the communication demands of latency-sensitive MPP/NOW applications. Another queue design that addresses the HOL blocking problem is the statically allocated multi-queue (SAMQ), whose queue space is statically and evenly divided into multiple subqueues, each associated with one of the virtual channels assigned to a physical link. Since SAMQ implements a set of virtual channels, it minimizes HOL blocking. However, the queue structure of SAMQ limits network performance by providing only one reading port, forcing the subqueues to share it. To overcome this problem, some upgraded SAMQ designs [38, 56] increase the number of reading ports so that each subqueue can have its own reading port.
This enables multiple packet reads, which reduces queueing delay. Nevertheless, the upgraded SAMQ increases crossbar size in order to support the increased number of reading ports. This complicates router architectures and, thus, might make SAMQ fail to minimize the cost of fully adaptive routing capability. Moreover, regardless of the number of reading ports, SAMQ does not adapt well to dynamic network traffic due to its static queue allocation, which degrades overall network performance. As discussed above, although many queue architectures have been developed to improve router performance, they have limitations in efficiently and effectively exploiting the unrestricted routing capability of deadlock recovery schemes. This provides another goal of this dissertation, i.e., the optimization of queue architectures to better exploit unrestricted routing capability. Toward this goal, two enhanced DAMQ designs, expected to minimize message blocking while maximizing router speed, are proposed and evaluated.

1.3 Research Contributions

This work will present optimal deadlock recovery-based, true fully adaptive router architectures for both MPPs and NOWs which can provide low-latency, high-throughput, and scalable networks. The proposed architectures will be designed to accommodate the network demands of distributed multiprocessor systems with a higher connection degree and increased routing adaptivity. These architectures will be realized by two router architecture optimization solutions, i.e., partitioned crossbar designs exploiting the true fully adaptive routing behaviors which exist in networks, and two enhanced DAMQs which better support unrestricted routing capability.
Specifically, this work makes the following contributions:

• This work explores and identifies true fully adaptive, deadlock recovery-based routing behaviors which are essential for optimizing true fully adaptive router architectures. The identified behaviors are exploited in order to achieve low-latency and high-throughput networks in a real application environment.

• A novel framework for applying the identified behaviors to design optimization is provided. The framework provides a concrete and straightforward procedure for router architecture design optimization. Moreover, this framework is believed to be useful for the optimization of other system architectures.

• Extensive evaluations of various router architectures which exploit the identified behaviors not only demonstrate the existence of the routing behaviors but also reveal the effect of exploiting these behaviors on network performance.

• Two new queue architectures which enable an efficient implementation of fully adaptive, deadlock recovery-based routers are proposed and explored. These architectures minimize message blocking and effectively adapt to dynamic network traffic while increasing router speed by cooperating effectively with partitioned crossbar designs.

• A prototype of the WARRP router, a cost-effective and high-speed router architecture, is implemented using FPGA and electronic-optical CMOS technology. This prototype demonstrates the feasibility of the proposed router architectures.

1.4 Dissertation Outline

The next chapter provides an overview of router architectures proposed to improve network and router performance, which directly and/or indirectly relate to this research.
Chapter 3 explores and identifies a true fully adaptive routing behavior which, if appropriately exploited, can minimize the design complexity incurred by increased routing adaptivity. Additionally, this chapter presents a variety of crossbar architectures which exploit the identified behavior. Chapter 4 proposes and extensively evaluates two new DAMQ designs that have the potential to better exploit the unrestricted routing capability of deadlock recovery schemes. Chapter 5 presents an implementation of a deadlock recovery-based router architecture for regular topologies which employs the crossbar architecture proposed in Chapter 3. Finally, Chapter 6 concludes this research and points to future research directions.

Chapter 2
High Performance Interconnection Networks

Over the last few decades, many efforts have been made to develop high performance multiprocessor systems. In developing these systems, one of the most challenging design issues is to efficiently accommodate their communication demands. Such demands, in general, include low latency and high throughput, reliability and fault tolerance, and quality of service (QoS). Among them, the first demand, i.e., low latency and high throughput, is the most fundamental requirement of high performance networks, enabling fast communication while effectively serving many users handling large amounts of data. Furthermore, the importance of this network requirement grows as multi-GHz processors and bandwidth-hungry applications emerge. Additionally, low-latency and high-throughput networks can be an efficient platform on which the other performance requirements, i.e., reliability, QoS, etc., can be better implemented and realized.
Consequently, this dissertation considers low latency and high throughput the first step toward high performance networks and, thus, chooses this requirement as its main research goal. The importance of the second performance requirement, i.e., reliability and fault tolerance, grows as commercial server systems are deployed in industry to handle reliability-critical transactions, e.g., banking, stock trading, etc. Such systems are required to provide reliable services under any circumstance and, thus, necessitate networks that can reliably deliver messages among processing nodes and/or end users regardless of the existence of network faults. The third requirement, i.e., QoS, concerns the capability to provide predictable services, e.g., minimum bandwidth, maximum delay, etc. QoS is becoming a critical design issue because networks are limited resources shared among multiple processing nodes, and users and applications demand different classes of network service. This requires networks to accurately control the quality of service for each user and each application, making QoS one of the important network design issues. The following sections survey the previous work done to improve the performance requirements mentioned above and, as a result, identify research issues that remain in further enhancing network performance.

2.1 Low Latency and High Throughput

In architecting low-latency and high-throughput routers, there are two critical design challenges: (1) message blocking and congestion control and (2) fast message routing and transmission. In other words, to deliver more messages faster, routers must minimize congestion by efficiently utilizing resources and speed up message routing and transmission by optimizing router architectures.
The following sections describe the previous work done on the design challenges mentioned above. Additionally, they identify open research issues for further improving network performance.

2.1.1 Message Congestion Control

To minimize network congestion, many schemes have been proposed and, indeed, employed by commercial systems. These schemes include routing adaptivity, virtual channels, virtual cut-through switching (VCT), and injection limitation. The following sections describe these schemes and also identify their potential problems.

2.1.1.1 Routing Adaptivity

Adaptive routing is the capability to provide alternative routing paths for messages transmitted between processing nodes. This capability helps to spread network traffic evenly throughout the entire network and, thus, minimizes biased and concentrated usage of network resources, i.e., congestion. In general, routing adaptivity is limited by deadlock. Deadlock is a critical network situation in which messages hold cyclic resource dependencies on one another, leading to an infinite blockage of the involved messages and, eventually, halting the entire system. Therefore, deadlock must be handled effectively by networks. This is the responsibility of routing algorithms, which guard networks against deadlock while efficiently allocating network resources, e.g., queues and links, to routed packets. According to the way they handle deadlock, these algorithms are classified into deadlock avoidance and deadlock recovery schemes. Deadlock avoidance-based routing algorithms enforce routing restrictions to break the cyclic resource dependencies of deadlock and, thus, to prevent it. Such algorithms have been developed for two different classes of networks: regular networks and irregular networks.
A regular network is a network whose nodes are deployed in a regular pattern, which makes node addressing and message routing easier. In contrast, an irregular network has an irregular node deployment and irregular link connections among the nodes, thus complicating addressing and routing. Many deadlock avoidance schemes have been developed for regular networks. One such routing algorithm is the dimension order routing (DOR) scheme [19]. In DOR, packets are routed in a pre-specified order of dimensions to avoid the formation of resource dependency cycles. This highly restricts messages from using alternative paths for routing, significantly limiting network performance. To overcome the limitation of DOR, various partially adaptive routing schemes were introduced. These schemes provide limited but multiple alternative routing paths for messages, relaxing the routing restrictions needed to prevent deadlocks. They include the Turn model, Linder and Harden's routing, Planar Adaptive routing, Duato's protocol, and Bubble routing. The Turn model [57] restricts only certain turns in routing to prevent deadlock cycles. Linder and Harden's routing schemes employ virtual channels to make resource dependency chains acyclic. However, these generally require many virtual channels to prevent deadlocks and, thus, can make routers complicated and slow. Planar adaptive routing [12] allows routing adaptivity within two consecutive dimensions at a time, which reduces the number of virtual channels required to prevent deadlocks. Duato's protocol [23] provides a restricted and minimized
set of escape channels in addition to a set of adaptive channels to avoid deadlock, allowing less restrictive deadlock avoidance-based routing with fewer network resources. Bubble routing [70] avoids deadlocks by reserving some network resources, i.e., queue space, thus ensuring the mobility of packets in the network. This scheme allows full routing adaptivity at the cost of low queue utilization. A number of deadlock avoidance-based routing algorithms have been proposed for irregular networks as well. Such algorithms include the up-down routing algorithm [73], which defines a minimum spanning tree within the network and restricts message traversal to a specified order within the tree, and the adaptive-trails algorithm [71], which orders the network nodes using Eulerian trails and restricts routing based on this ordering. Recently proposed extensions to the up-down scheme based on Duato's algorithm [79, 80, 81] provide additional adaptivity through the use of redundant channels. However, these deadlock avoidance schemes not only limit resource utilization through routing restrictions, but also require the use of non-minimal paths, thus increasing message latency and wasting resources. To overcome the performance limitation of deadlock avoidance schemes, deadlock recovery-based routing algorithms have been proposed. Recovery-based routing algorithms allow deadlock cycles to form and then detect and recover from them instead of preventing the cycles. This enables full routing adaptivity across not only directions but also virtual channels. In other words, with these schemes, messages can be routed to their destinations through any direction and any virtual channel. Such schemes include Compressionless [41], Disha-Sequential [5], Disha-Concurrent [4], and others.
Compressionless is a regressive deadlock recovery-based algorithm in which, when deadlock cycles are detected, the involved messages are killed to break the cycles. Although this scheme provides full routing adaptivity, it might waste network resources by killing falsely detected packets that have already consumed considerable network resources, i.e., link bandwidth and queue space, which can limit network performance. As a solution to the drawbacks of regressive deadlock recovery, progressive deadlock recovery schemes were proposed. In these schemes, instead of being killed, suspected deadlocked packets are delivered to their destinations by preempting network resources, eventually breaking the deadlock cycles. These schemes not only provide full routing adaptivity but also better utilize network resources. Although these deadlock recovery schemes were designed for use in regular networks, there are no inherent restrictions that limit their use in irregular networks as well. For instance, a direct extension of [4] has recently been proposed for use in irregular networks [82]. Deadlock recovery-based routing schemes can maximize routing adaptivity by eliminating routing restrictions, significantly minimizing network congestion. However, the increased routing adaptivity of these schemes might complicate router architectures by increasing the complexity of routing, channel selection, arbitration, etc. This makes the router slower and, hence, can degrade overall network performance.

2.1.1.2 Others

In addition to routing adaptivity, many schemes have been proposed to control network congestion and message blocking. One of them is the virtual channel scheme [16]. Although originally proposed to prevent deadlock, this scheme is now also used to reduce message blocking. As shown in Figure 2.1,
the basic idea behind the scheme is to allow messages to bypass blocked messages by allocating multiple queues to a physical link and multiplexing them onto the link. The virtual channel scheme significantly reduces message blocking and increases link utilization. However, despite its performance benefit, it has been shown that increasing the number of virtual channels might reduce overall network performance by complicating router architectures [11]. Although some schemes such as pipelined designs [62] have been proposed to resolve this problem, few of them successfully minimize the cost of virtual channels while efficiently handling both light and heavy network traffic. This motivates one of the goals of this dissertation.

Figure 2.1: Virtual channel scheme.

Another approach to reducing network congestion while minimizing network latency is cut-through switching, e.g., wormhole switching [58] and virtual cut-through switching (VCT) [40]. These cut-through switching schemes can initiate routing as soon as the header flits of a packet are received, which reduces network latency by saving the time spent storing a whole packet in store-and-forward switching (SAF) [30]. However, in router architectures having small queues, a blocked message can stall in place while occupying several links and queues. This hampers network resource utilization and aggravates network congestion. The problem can be resolved by increasing the queue size to n packets (where n > 1). With large queues, when a message is blocked, the whole blocked message can be stored in a single queue so that its routing interference with other messages is minimized. This reduces network congestion by better utilizing network resources. However, because queue size is considerably increased,
there is a need to consider a couple of important design challenges: how to organize queue structures and how to schedule messages. In routers having large queues, a queue can store multiple messages and, thus, the messages in the queue might block one another, which does not occur in routers having small queues. Regarding this issue, much work has been done on router architectures; this will be shown in Section 2.1.2.2. As one of the latest approaches to controlling network congestion, a network injection control scheme was proposed [64]. This scheme limits the rate of message injection into the network so as to prevent the network from going into saturation, just as traffic lights at highway on-ramps do. The success of this scheme depends highly on how the injection rate threshold for network saturation is determined. If the threshold is set too high, the network becomes saturated, while if it is set too low, the utilization and throughput of the network suffer. However, it is not easy to determine an appropriate injection rate threshold based only on the local traffic information available to a router (i.e., link and queue utilization status). Moreover, this problem might be aggravated when dynamic network traffic is considered. As shown above, the proposed schemes face some architectural challenges (i.e., the architectural cost of virtual channels, buffer organization, and injection rate threshold determination), although they all have the potential to efficiently control network congestion. Furthermore, network performance relies not only on congestion control capability but also on message routing speed, which is determined by router architectures.
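The injection limitation idea can be sketched as a simple local throttle. The threshold value, the utilization estimate, and all names below are illustrative assumptions, not the actual mechanism of [64]:

```python
# Toy injection limiter: a node may inject a new packet only while its
# locally observable channel utilization stays below a threshold.
# Threshold and utilization model are illustrative assumptions.

class InjectionLimiter:
    def __init__(self, threshold: float):
        self.threshold = threshold    # fraction of busy output channels

    def may_inject(self, busy_channels: int, total_channels: int) -> bool:
        utilization = busy_channels / total_channels
        return utilization < self.threshold

limiter = InjectionLimiter(threshold=0.75)
print(limiter.may_inject(busy_channels=2, total_channels=4))   # True: 0.5 < 0.75
print(limiter.may_inject(busy_channels=3, total_channels=4))   # False: saturation suspected
```

Set too high, the throttle never engages and the network can still saturate; set too low, it idles links. That is exactly the threshold-tuning difficulty noted above, compounded by the purely local view each router has of network traffic.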
Consequently, there is a strong need to optimize router architectures in order to maximize message routing speed while efficiently supporting the congestion control schemes mentioned above. The next section surveys previous work on router architecture optimization.

2.1.2 Message Routing Speed

Message routing speed is mainly determined by the router architectures that realize routing algorithms and congestion control schemes. Therefore, much effort in router architecture optimization has been made to improve routing speed. Figure 2.2 illustrates the generic router architecture on which this dissertation is based. As shown, the router architecture consists of a crossbar, input and output queues, and a routing and arbitration module. The following sections describe the functions of these modules and discuss the previous efforts made to optimize each router component. Additionally, the sections identify the limitations of these previous efforts in supporting the capability offered by the congestion control schemes mentioned above.

Figure 2.2: Generic router architecture. (LEGEND: IB: input buffer; OB: output buffer; R&A: routing and arbitration.)

2.1.2.1 Crossbar

The function of the crossbar is to provide connections between the input and output ports of a router. Generally, as router connectivity, which is determined by the number of network dimensions and the number of virtual channels, increases, the complexity of the crossbar grows rapidly. The main challenge in designing crossbars is to provide better connectivity while minimizing cost. There are three different classes of crossbars: fully de-multiplexed crossbars [84, 36], multiplexed crossbars [28, 18], and logical crossbars [63].
A fully de-multiplexed crossbar provides physical crossbar ports for all the input and output channels, as shown in Figure 2.3 (a). The key benefit of this crossbar is its non-blocking structure: as long as an output port is available, any input can be connected to it. This allows good support for any routing scheme without causing internal blockage. However, since the crossbar size grows rapidly as the number of dimensions and the number of virtual channels increase, this crossbar might significantly reduce router speed, resulting in low overall network performance. As a solution, the partitioned crossbars shown in Figure 2.3 (b) were proposed for deadlock avoidance-based schemes. Because these schemes impose routing restrictions, the crossbar does not have to connect all pairs of input and output ports, which makes it possible to partition the crossbar into multiple sub-crossbars. This significantly reduces crossbar cost without harming the capability of the underlying routing scheme. However, due to its blocking structure, this crossbar might not be adequate for adaptive routing schemes that enforce few or no routing restrictions. In a multiplexed crossbar, a set of channels shares a crossbar port, as shown in Figure 2.3 (c). This structure reduces crossbar size and cost. However, it is a blocking structure that causes internal blocking and, thus, might hamper the fully adaptive routing capability of deadlock recovery schemes. This raises the need to evaluate the trade-off between reduced crossbar size and internal blocking.

Figure 2.3: Various crossbar designs: (a) fully de-multiplexed crossbar, (b) partitioned crossbar, (c) multiplexed crossbar, (d) logical crossbar.
This evaluation will be provided in Chapter 3. Unlike the other crossbar designs, a logical crossbar consists of a memory unit and a dedicated control unit, as shown in Figure 2.3 (d). Instead of providing physical connections between inputs and outputs, the logical crossbar provides virtual, logical connections by reading messages from the inputs, storing them in the memory, and writing them to the appropriate outputs. Since this structure consists of a memory and a dedicated control unit, the logical crossbar needs no additional queues and enables flexible scheduling and routing. However, despite these benefits, its performance (especially latency) and scalability can be limited because a memory unit generally provides only a small number of input and output ports and, thus, critically limits the concurrency of reads and writes. Therefore, this crossbar has not been widely employed by multiprocessor systems, in which latency and scalability are the most important design issues.

2.1.2.2 Queue

Queues play a critical role in increasing network performance by temporarily storing messages until network resources become available and, thus, minimizing packet losses. This increases network resource utilization by minimizing the waste of network resources already consumed by lost messages. The most popular and simplest queue is the first-in-first-out queue (FIFO), in which packets are stored and routed in reception order. The design simplicity of FIFO makes routers faster, which increases network performance and is why FIFO is employed by many high performance routers [27, 43, 77, 78]. Additionally, increasing the queue size of FIFO further improves network performance by increasing network capacity. Fortunately, as current CMOS technology allows smaller transistors and larger dies, the queue size of commercial routers tends to grow rapidly.
For example, the T3E [78] and SGI Spider [28] routers provide 612 bytes and 1024 bytes of queue space for each input physical link, respectively. Even considering their largest packet sizes (around 16 to 32 bytes), each queue is capable of storing more than twenty packets. This queue design trend, i.e., increased queue size, raises a design challenge regarding how to structure queues and how to schedule messages for routing. This design challenge will be discussed in Chapter 4 in order to better exploit the fully adaptive routing capability of deadlock recovery schemes.

Figure 2.4: Input queueing and output queueing.

There are three main queueing techniques for routers: input queueing [78, 28], output queueing [93], and hybrid queueing [84]. Among them, the most popular queueing technique in multiprocessor system networks is input queueing, due to its scalability, high speed, and low cost. As shown in Figure 2.4 (a), input queueing places queues at the input port side of the router and, hence, each queue is associated with only one input channel. This makes the flow control of each queue simple, inexpensive, and fast, because each queue needs to control only one flow of messages regardless of router connectivity. Moreover, the simple flow control of input queueing makes routers scalable in terms of network connectivity. The main disadvantage of this scheme is head-of-line (HOL) blocking, in which a blocked packet at the head of a queue blocks the following packets in the same queue that are heading in different directions. The HOL blocking problem significantly limits network performance by degrading network resource utilization. To overcome this problem, many alternative queue designs have been proposed [31, 22, 85].
The basic idea behind these queues is to store packets according to their routing directions so that packets cannot block other packets heading in different directions. Although it solves the HOL blocking problem, this approach might make routers slower due to its complicated queue management. Furthermore, these queue designs limit routing adaptivity by forcing a packet to be routed through a pre-defined routing direction and, thus, do not efficiently adapt to dynamic network traffic. Consequently, these schemes might not be suitable for fully adaptive routers such as deadlock recovery-based routers. Output queueing places queues at the output side of the router, as shown in Figure 2.4 (b). In this queueing technique, each queue associated with an output link is capable of receiving messages from all the input links within one clock cycle without losing any message. This resolves the HOL blocking problem, which potentially increases network performance. However, it limits the scalability of the scheme, since each output queue requires n times the bandwidth of an input physical link, where n is the number of physical input links. Additionally, output queueing complicates flow control because each output queue is associated with n input links, whereas in input queueing each queue is connected to only one input link. The complicated flow control might make routers significantly slower. This is why latency-sensitive multiprocessor systems prefer input queueing. Hybrid queueing is devised to benefit from the virtues of both input and output queueing [84]. The idea of this scheme is to store non-blocked packets in input queues in order to maximize router speed, while transferring blocked packets to a central queue implementing output queueing in order to minimize the HOL blocking problem of input queueing.
However, employing both input and output queueing indeed complicates the internal flow control and arbitration logic of routers, resulting in slower routers. This can make hybrid queueing unsuitable for fully adaptive routers by aggravating the router architecture complexity already increased by unrestricted routing capability.

2.1.2.3 Routing Module

The routing module in routers determines the routing paths of transmitted messages based on the underlying routing algorithm. The cost of this module highly depends on the regularity of the network topology. First of all, in regular networks, in which processing and/or routing nodes are deployed in a regular pattern, it is easy to address nodes and to route messages in a systematic way. This is because the regular node deployment of these networks makes it inexpensive to locate network nodes.

Figure 2.5: Two nodes, A and B, in a two-dimensional torus network.

One of the most popular and straightforward ways to address and locate network nodes in regular networks is relative addressing. In relative addressing, a node in a regular network is addressed by its relative distance from the other nodes. For instance, consider Node A and Node B in Figure 2.5. Node A is three hops away in X+ and two hops away in Y+ from Node B, which is not only (1) the relative address of Node A with respect to Node B but also (2) the routing information from Node B to Node A. Since relative addresses include the routing information of messages, routing processes are not needed in the intermediate nodes.
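The relative-addressing example above can be sketched as follows (an illustrative construction, not taken from any particular router; the torus size is assumed):

```python
# Relative addressing in a k-ary 2-cube (2-D torus): the relative
# address of the destination doubles as the routing information, so
# intermediate nodes only choose a dimension and decrement its offset.

def relative_address(src, dst, k):
    """Signed per-dimension hop counts, taking the shorter way
    around each torus ring of size k."""
    offsets = []
    for s, d in zip(src, dst):
        delta = (d - s) % k
        if delta > k // 2:
            delta -= k          # the negative direction is shorter
        offsets.append(delta)
    return offsets

# Node B at (0, 0); Node A three hops in X+ and two in Y+ on an 8x8 torus:
print(relative_address((0, 0), (3, 2), 8))   # [3, 2]

def route_one_hop(offsets, dim):
    """An intermediate node consumes one hop in the chosen dimension
    and forwards the updated relative address with the packet."""
    step = 1 if offsets[dim] > 0 else -1
    offsets[dim] -= step
    return offsets

print(route_one_hop([3, 2], 0))              # [2, 2]
```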
Instead, routing modules in intermediate nodes need only select one of the routing choices obtained from relative addresses and update the relative addresses of the routed messages according to the routing decision. This makes routing modules inexpensive and fast and, thus, results in faster routers. This is why many latency-sensitive multiprocessor systems employ regular network topologies rather than irregular network topologies. Since this dissertation focuses on low latency and high throughput networks, regular networks will be assumed throughout the majority of this work. Unlike regular networks, routing nodes and computing nodes in irregular networks are irregularly and randomly arranged. Although irregular networks provide connection flexibility and incremental expansion for LAN environments, their irregularity makes it more complicated not only to address nodes in a structured way but also to route packets in the network. This forces irregular network nodes to be addressed by their unique node identifiers, i.e., absolute addresses as opposed to relative addresses. In addition, the irregular node deployment of these networks requires routers to employ either source routing or table-based distributed routing. In networks employing source routing (e.g., Myrinet [8], AutoNet [73], etc.), the source node calculates and determines a routing path for a packet based on a recently updated global network configuration. Using the given routing path encoded in the packet header, the packet is delivered to its destination through intermediate nodes. Since routing paths are determined at the source node, packets cannot dynamically adapt themselves to network traffic situations (e.g., hot spots, faulty links, etc.). This restricted adaptivity reduces the utilization of network resources, making source routing problematic for high performance networks.
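Source routing can be sketched as follows; this is a hedged, minimal model (the topology, node names, and port numbering are invented, and real systems such as Myrinet encode paths differently):

```python
from collections import deque

# The source computes a whole path from its (possibly stale) view of
# the irregular topology and encodes it in the packet header as a
# list of output ports; intermediate nodes make no routing decision.

topology = {                 # adjacency: node -> {neighbor: output port}
    "A": {"B": 0, "C": 1},
    "B": {"A": 0, "D": 1},
    "C": {"A": 0, "D": 1},
    "D": {"B": 0, "C": 1},
}

def source_route(src, dst):
    """Breadth-first search at the source; returns the ordered list
    of output ports the packet header must carry."""
    frontier = deque([(src, [])])
    seen = {src}
    while frontier:
        node, ports = frontier.popleft()
        if node == dst:
            return ports
        for nbr, port in topology[node].items():
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, ports + [port]))
    return None

header = source_route("A", "D")
print(header)   # [0, 1] - via B
# Each hop just consumes the next port from the header, which is why
# the path cannot adapt to congestion or faults after injection.
```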
By contrast, table-based distributed routing can allow routing adaptivity by providing multiple routing choices per table-lookup operation. However, table-lookup operations are generally expensive since the routing table is stored in memory and, thus, the access delay is not negligible. Moreover, single-port memory implementations further increase latency by serializing all routing table accesses, which also reduces network throughput. To overcome these problems of routing tables, some router architectures, such as SGI Spider [28] and ServerNet [37, 35, 29], have pipelined table look-up operations. This increases throughput as well as the router clock speed. However, this approach cannot reduce the routing latency of messages since each message has to go through all pipeline stages. Another way to reduce the routing delay incurred by table look-up operations is circuit switching [6]. In this scheme, a physical path from the source to the destination is reserved prior to the transmission of a message so that the message can be transmitted through the path without suffering long and frequent routing table look-ups. In this scheme, however, short messages suffer long path setup overheads, which fails to reduce the overall communication latency. Furthermore, since circuit switching reserves paths prior to message transmission, the reserved network bandwidth might be wasted since other messages cannot use it. Another scheme for reducing delay due to complicated table lookups is the virtual circuit switching of asynchronous transfer mode (ATM) [3, 48]. This scheme minimizes table lookups by replacing complex IP (Internet Protocol) addresses with simple virtual circuit identifiers. In this scheme, a virtual circuit must be established prior to the transmission of packets between the source and the destination.
After the circuit is established, a virtual circuit identifier is assigned to the packets sent between the two nodes. The given identifier enables intermediate nodes to route or forward packets toward the destination without additional complicated routing table lookups. Although this scheme reduces routing latency significantly, it experiences a long circuit-setup latency prior to transmission, especially for short messages. In addition, since this scheme is connection-oriented, it is susceptible to dynamic faults that might occur during data transmission. This requires additional circuits to be set up as faults occur, which degrades overall network performance. Furthermore, most ATM switches employ deadlock avoidance routing schemes which limit routing adaptivity to prevent deadlocks. This significantly reduces network throughput and increases network latency. As another approach to reducing routing latency, some studies [28, 86] have proposed look-ahead routing. This scheme overlaps routing table look-up delay with routing arbitration so that the routing arbitration delay can be hidden by the table look-up. Although succeeding in hiding routing arbitration delay (which reduces routing delay), this scheme still suffers from long routing table look-up delays. One of the most recent approaches to reducing routing latency is the notion of combining wormhole switching and circuit switching [25]. Basically, this scheme is realized by providing two separate networks, one used for wormhole switching and the other for circuit switching. The wormhole switching network is used for short messages so as to avoid long path setup overheads, while the circuit switching network handles long message transmissions in order to minimize long routing table look-up delays.
Although the scheme is successful in reducing communication latency, it could be viewed as overkill since two separate, expensive networks are used. Although some work has been done to improve irregular network router performance, as mentioned above, limited success has been achieved in reducing the routing delay caused by complex routing look-up operations. Furthermore, most of the architectures addressed above are based on deadlock avoidance schemes that limit routing adaptivity and, thus, degrade overall network performance. This raises research issues on deadlock recovery-based router architectures for irregular networks.

2.1.2.4 Arbitration

An arbitration module arbitrates among messages requesting the same output channel. There are, broadly, two approaches to arbitrating effectively and fairly: sequential arbitration and parallel arbitration. The sequential arbitration approach handles routing requests one by one (generally, in a round-robin fashion), regardless of how many concurrent requests are made. This makes arbitration processes simple and fast, increasing router speed. However, sequential arbitration might hamper overall router performance, because multiple packets can request different idle output ports at the same time and yet cannot be routed concurrently. Fortunately, some previous work [13, 43] shows that, unless the network is saturated, there are only one or two packets to be arbitrated and routed in a router at any time, while other packets either were already routed or are being transmitted. This makes sequential arbitration a sufficient and fast arbitration approach for high performance routers. In contrast, parallel arbitration
tries to find a maximum matching between requesting packets and the corresponding output channels. In other words, this approach can allocate resources to multiple packets simultaneously, which can potentially increase network performance. Nevertheless, this scheme significantly complicates router architectures due to its complicated arbitration policy. Although some previous work, such as the iSLIP scheduling algorithm [54], was proposed to mitigate this problem, parallel arbitration is indeed overkill because, as mentioned above, it is rare to have to handle multiple concurrent requests. As a result, this dissertation assumes sequential arbitration for router architectures.

2.1.2.5 Others

Besides the schemes mentioned above, many other efforts have been made to enhance router performance. One of these is the short-circuit path scheme employed by the Cray T3E [78] and Alpha 21364 [34]. In this scheme, packets do not have to go through a crossbar if they are headed straight ahead and the corresponding output link is idle, which is the common case for dimension order routing. This scheme saves not only a crossbar delay but also an arbitration delay. As another scheme to increase router performance, virtual channel load balancing was proposed for deadlock avoidance-based routers [77, 78]. This scheme balances loads among virtual channels that are frequently biased due to the routing restrictions of deadlock avoidance schemes. This increases overall network throughput. Another approach to improve router throughput is a pipelined design of routers [28, 62]. This approach divides a complete routing procedure into smaller jobs and overlaps them on one another, similar to current microprocessor designs.
Although this scheme considerably increases throughput, it does not reduce the network latency of a packet since each packet must go through all pipeline stages. Furthermore, these schemes may not reduce the increased architecture complexity incurred by fully adaptive routing schemes. One of the latest schemes to increase router performance is flit-reservation flow control [50]. This scheme can better utilize network resources by reserving them before routing packets. Although having the potential to increase network performance, the complicated resource reservation of this scheme significantly increases communication latency, especially at low load rates, at which packets could otherwise be routed without suffering conflicts for network resources. Considering that the average network load rate of latency-sensitive parallel processing systems is generally low, this scheme might be too expensive to accommodate their network demands.

2.2 Reliability and Fault-tolerance

To design reliable and fault-tolerant networks, several approaches have been proposed. One of them is adaptive routing [2, 78]. As mentioned before, adaptive routing schemes were originally proposed to increase network performance. However, these schemes also have the capability to enhance network reliability and fault-tolerance by providing multiple routing paths. When network faults occur, packets can go around the faults by utilizing the alternative paths provided by routing adaptivity. Although routing adaptivity cannot ensure message delivery to destinations, since adaptivity can be exhausted, this approach is the easiest and least expensive because it does not require any additional hardware to increase fault-tolerance. Another scheme to improve reliability and fault-tolerance is the point-to-point approach [18, 34, 91].
These approaches guarantee message delivery between two neighboring routers by adding error-detection and error-recovery mechanisms such as CRC error checks and data queueing and retransmission. This scheme might be overkill considering the short distance between routing nodes in multiprocessor systems and the low error rate of state-of-the-art link technology. A solution to the overkill problem of point-to-point schemes is the end-to-end approach [88]. Instead of adding resources to each routing/processing node, this scheme handles network reliability at the end nodes, generally the processing nodes. In other words, when packets get lost along their routing paths or are received incorrectly, the responsibility to detect errors and to recover from them lies with the end nodes. This scheme significantly reduces the cost of hardware required to implement network reliability and fault-tolerance. However, it can add network traffic to recover from errors, degrading network performance. One of the most straightforward and most solid approaches is redundancy. This approach provides redundant resources for networks to increase network reliability. For instance, each SP-2 Vulcan chip [84] has a shadowed router chip to detect its operational defects. Additionally, T-net [36] consists of two identical, independent X and Y networks to ensure network reliability even in the presence of network faults. Although this scheme has a strong capability to handle network faults, its redundancy might be very expensive and can be overkill. Another approach to increase network reliability is reconfigurability [60, 10]. Reconfigurability is the capability to change network topology upon the occurrence of faults. For example,
Cray's GigaRing [75, 76] changes its double-ring topology to a single-ring topology when faulty links are detected. This can efficiently improve network reliability by isolating faulty network resources. However, in this approach, changing topology can also isolate non-faulty network resources, wasting them.

2.3 Quality of Service

The efforts to implement quality of service (QoS) in networks are, broadly, classified into connection-oriented, connectionless, and hybrid approaches. The best and most straightforward way to provide predictable quality of service is to reserve network resources between sources and destinations before routing packets. This is the connection-oriented approach [42, 44, 61, 90, 92]. By reserving network resources, this approach can easily guarantee maximum communication latency and minimum bandwidth for packets, resulting in high QoS. However, despite its powerful control over QoS, the long connection set-up latency might make this approach inadequate for latency-sensitive multiprocessor systems. Indeed, this approach is widely used for computer networks that require high throughput rather than low latency and, moreover, need to accommodate various classes of service. In contrast, the connectionless approach [55, 78, 83] provides QoS without establishing and reserving any channel connection between end nodes. Instead, this approach gives priority to packets that need to be delivered urgently or have been blocked for a long time, such that these packets can be delivered to their destinations faster.
Although this approach is inexpensive and can eliminate the long connection setup latency of the connection-oriented approach, it can neither tightly control nor guarantee quality of service, failing to support latency- or bandwidth-constrained applications. The hybrid approach [26, 52, 72] combines the virtues of the connection-oriented and connectionless approaches, i.e., high QoS and low latency. The basic idea behind this approach is to reserve and establish connections for long messages and real-time messages while routing short messages and best-effort messages without establishing connections between end nodes. This makes it possible to tightly control QoS for real-time traffic and to save the long connection setup latency for short and best-effort traffic. Although many efforts have been made to provide QoS, most of them are based on either restricted or partially adaptive routing schemes that enforce routing restrictions. These routing restrictions limit overall network performance and, moreover, limit the possible connection paths between communicating end nodes that are necessary for realizing quality of service. To overcome these limitations, fully adaptive routing algorithms, e.g., deadlock recovery routing schemes, can be a good platform on which QoS is more efficiently implemented. This raises an open research issue, which can be future work resulting from this dissertation.

Chapter 3

Evaluation of De-multiplexed Crossbar Architectures for Deadlock Recovery-based Routers

In order to achieve a high-speed, true fully adaptive router design, this chapter explores the notion of exploiting dynamic routing behavior identified as routing locality.
Sufficient routing locality enables the internal crossbar of a deadlock recovery-based router to be partitioned into smaller and faster units without sacrificing its true fully adaptive routing capability. This minimizes the delay suffered from implementing the increased routing adaptivity, thus making the common case fast. Additionally, to ascertain the validity of this notion, this chapter presents, characterizes, and evaluates a variety of promising crossbar designs which exploit various forms of routing locality. Through extensive evaluation, this study provides insight into the degree to which routing locality exists under the assumption of true fully adaptive routing and how well each crossbar design exploits this behavior and supports true fully adaptive routing capability. The evaluation results reveal that a high-performance, true fully adaptive router can be implemented by proper exploitation of routing locality. To maximize network performance, this research focuses only on de-multiplexed crossbars, which provide physical crossbar ports for all the input and output links of routers and, thus, minimize internal message blocking. In addition, this work assumes a simple queue design, i.e., FIFO/CQ, to concentrate on crossbar designs. Extensions to other queue designs will be explored in Chapter 4. Although this study is applicable to deadlock recovery schemes in general, this chapter focuses on Disha-based [5] routing.

3.1 Routing Locality

The key factor that makes partitioned crossbar structures beneficial to deadlock avoidance-based router designs is the routing restrictions enforced by the underlying routing algorithm, as explained in Section 1.2. However, the partitioned crossbar
structures might not be ideally suited for unrestricted, deadlock recovery routing schemes. The main reason is that dividing the crossbar into smaller units can increase routing latency by limiting direct access to all the output channels. For example, consider a fully adaptive router which employs a partitioned crossbar. When all profitable output channels are unavailable in a subcrossbar, packets in the subcrossbar have to traverse subcrossbars to acquire non-occupied output channels of the other subcrossbars in the router. Each subcrossbar traversal incurs additional delays which can outweigh the speed benefit of partitioned crossbar structures, decreasing overall router performance.

For partitioned crossbar structures to support true fully adaptive routing without compromising the speed benefit, subcrossbar traversals should be minimized. Figure 3.1 illustrates two different forms of true fully adaptive routing behavior that, if they exist and are properly exploited, could minimize subcrossbar traversals.

Figure 3.1: Examples of routing locality. (a) Example of routing locality within dimension by Packet P2. (b) Example of routing locality within VCN by Packet P3.

In Figure 3.1(a), two packets, P1 and P2, are delivered to the same destination using an unrestricted, minimal routing scheme which permits all minimal paths comprising all virtual channels between a source and a destination. P1 takes alternative dimensions at every hop whereas P2 does not. Note that, in most cases, P2 demands output channels in the dimension to which it currently belongs, which minimizes the need for direct access to output channels of other dimensions. This makes it possible for P2 to route without suffering frequent subcrossbar traversals in a crossbar structure divided based on dimension, as shown in Figure 3.1(a). In contrast, P1 would suffer increased routing latency since its routing behavior, i.e., frequently changing dimensions, causes frequent subcrossbar traversals. The routing behavior of P2 is referred to as routing locality within dimension, which indicates that packets tend not to change dimensions frequently along their routing paths. This is one form of routing behavior which enables the partitioned crossbar to support true fully adaptive routing without suffering frequent subcrossbar traversals. Figure 3.1(b) illustrates another form of routing behavior. In this figure, P3 stays in the first virtual channel network (VCN) until it reaches its destination, although it makes turns at every hop, whereas P4 frequently changes virtual channel networks, although it stays in dimension. Note that routing of P3 requires only direct access to output channels of the first virtual channel network. Therefore, as shown in Figure 3.1(b), subcrossbar traversals of P3 can be minimized in a crossbar structure which is partitioned based on virtual channel network. In contrast, P4 traverses subcrossbars at every node, suffering increased latency. The routing behavior of P3 is referred to as routing locality within virtual channel network.

Two alternative forms of routing locality and their exploitation have been shown to have the potential to enable partitioned crossbar structures to support unrestricted routing without suffering frequent subcrossbar traversals.
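As a side illustration (my construction, not the dissertation's simulation methodology), routing locality within dimension can be quantified by counting how often a minimal path changes dimension, since each such change would cost a subcrossbar traversal in a dimension-partitioned crossbar:

```python
import random

def random_minimal_path(offsets):
    """One minimal path as an unrestricted fully adaptive router might
    realize it: the remaining hops are taken in a random dimension
    order. offsets[i] = hops still needed in dimension i."""
    hops = [dim for dim, need in enumerate(offsets) for _ in range(need)]
    random.shuffle(hops)
    return hops

def dimension_changes(hops):
    """Number of turns along the path: each change of dimension
    between consecutive hops is a potential subcrossbar crossing."""
    return sum(1 for a, b in zip(hops, hops[1:]) if a != b)

# P2-like behavior (stays in dimension) vs P1-like (turns every hop):
print(dimension_changes([0, 0, 0, 1, 1]))   # 1 turn  -> high locality
print(dimension_changes([0, 1, 0, 1, 0]))   # 4 turns -> low locality
print(dimension_changes(random_minimal_path([3, 2])))
```

Averaging the last quantity over many random paths and traffic patterns is essentially what the chapter's evaluation must establish: whether enough locality exists for partitioned designs to pay off.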
The important fact to note here is that only proper exploitation of routing locality can minimize the frequency of subcrossbar crossings. For instance, the crossbar structure of Figure 3.1(b) efficiently routes P3, which has routing locality within virtual channel network, minimizing its subcrossbar traversals. However, in Figure 3.1(a), P1 suffers frequent subcrossbar traversals although it takes the same routing path as P3. The reason is that the crossbar structure of Figure 3.1(a) does not properly exploit and accommodate the routing behavior of packets which have routing locality within VCN, whereas it does for packets which have routing locality within dimension. In summary, the key factors that determine the performance of the partitioned crossbar structure of deadlock recovery-based routers are (1) the sufficient existence of a certain form of routing locality and (2) proper exploitation of that routing locality. To ascertain the validity of these key factors, this chapter explores the notion that there exists a sufficient amount of some form of routing locality resulting from true fully adaptive routing and that this routing locality can be exploited to improve router speed. Alternative crossbar structures designed to exploit routing locality will be presented, characterized, and evaluated in the following sections.

3.2 Router Crossbar Designs

This section presents three alternative decoupled crossbar designs and one unified crossbar design for deadlock recovery routers. Decoupled crossbar designs have partitioned crossbar structures and aim at exploiting some form of routing locality, whereas the unified crossbar design does not.
They are shown in Figure 3.2: one unified-crossbar structure (U-CB) and three decoupled crossbar structures, i.e., the cascade-crossbar (C-CB), the hierarchical-crossbar (H-CB), and the enhanced-hierarchical-crossbar (E-CB). First, a few assumptions are made. A k-ary n-cube network with V virtual channels per physical link is assumed. Messages are assumed to be received from processing nodes through a randomly selected injection virtual channel, i.e., Proc in Figure 3.2. All deadlocks are assumed to be recovered from by a deadlock recovery mechanism; e.g., in Disha [5], the centralized deadlock queue (DB) is used to progressively recover from deadlock.

Figure 3.2: Internal router crossbar designs. (a) U-CB. (b) C-CB. (c) H-CB. (d) E-CB.

3.2.1 Unified Design

The most straightforward of the four router crossbar designs is the unified-crossbar, shown in Figure 3.2(a). Its structure consists of a single crossbar capable of connecting all router input queues to any of the router output queues across all virtual channels. This results in a crossbar size of P = V(2n + 1), where P is the number of crossbar input ports. The cost and speed are functions of both n and V, resulting in increased delay as n and V increase. However, due to U-CB's strictly non-blocking internal structure, any input port can be connected to an available output port in a single cycle regardless of other connections.
This capability supports true fully adaptive routing, making the U-CB worthwhile to evaluate despite its potentially long delay.

3.2.2 Decoupled Designs

The decoupled crossbar designs consist of smaller subcrossbars connected by connect channels. A connect channel is a unidirectional, internal physical channel which connects the ith subcrossbar to the ((i + 1) mod n)th subcrossbar within the same router (except in the hierarchical crossbar, which has no wrap-around connect channel). Increasing the number of connect channels decreases internal blocking (see Section 3.2.2.1) but increases subcrossbar size. In Figure 3.2, C connect channels per router subcrossbar are assumed. A wrap-around connect channel (WC) is the connect channel which connects the nth subcrossbar to the first subcrossbar. This structure reduces the size of the crossbar as well as the complexity of the routing arbitrator, making the router potentially faster. Further, this design structure is intended to exploit routing locality within dimension or within virtual channel network. If most packets tend not to change dimension or virtual channel network frequently due to locality in routing, then it is not necessary for the crossbar to provide packets with direct access to all output channels, even in the case of fully adaptive routing. Instead, changes in dimension or virtual channel network can be supported by simply requiring that indirect access to all output channels through subcrossbar connect channels be provided. Each design is discussed below.

The cascade-crossbar design is derived from avoidance-based dimension-ordered routing [27] but modified to support recovery-based fully adaptive routing. As shown in Figure 3.2(b),
each subcrossbar in C-CB is associated with only one dimension, which consists of V virtual channels for each direction. This results in a subcrossbar input size of P = (2V + C + 1) ports. Whenever packets change dimensions, a connect channel must be used to access the next subcrossbar in sequence. Unlike the subcrossbars in avoidance-based deterministic routers, the subcrossbars and the virtual channels in each subcrossbar may be used by packets in no prescribed order for recovery-based fully adaptive routing. Additionally, wrap-around connections are allowed in C-CB from the lowest dimension to the highest dimension to support adaptive routing. This router design exploits routing locality within dimension and is, therefore, promising if packets being adaptively routed tend to continue in a given dimension the majority of the time before turning (possibly using different virtual channels within that dimension). In fact, many adaptive routers currently implement a preferred channel selection policy [74] of "straight" over "turn" to exploit such locality within routing decisions. This design takes advantage of this fact. In addition, for minimal routing, the subcrossbar size of the cascade-crossbar can be further reduced. By definition, minimal routing does not allow packets to backtrack in the network. This eliminates the need for direct access between opposite directions of a dimension in a router and, thus, enables the crossbar to be split into smaller subcrossbars associated with directions [78] instead of dimensions, e.g., X+, X-, Y+, Y-, etc. This increases router clock speed without hampering routing adaptivity. Nevertheless, this research assumes a cascade-crossbar based on dimension rather than on direction since this applies to the general case of unrestricted routing, which includes both minimal and non-minimal routing.
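Plugging sample parameters into the two port-size expressions given so far shows the size reduction the cascade structure buys (formulas from the text; the parameter values are illustrative):

```python
def ucb_ports(n, V):
    """Unified crossbar: every virtual channel of every link, plus
    injection, needs a dedicated port: P = V(2n + 1)."""
    return V * (2 * n + 1)

def ccb_ports(V, C):
    """Cascade subcrossbar: one dimension's 2V virtual channels plus
    C connect channels and one injection port: P = 2V + C + 1."""
    return 2 * V + C + 1

# e.g., a 3-D torus with 4 virtual channels per link, 1 connect channel:
print(ucb_ports(n=3, V=4))   # 28-port monolithic crossbar
print(ccb_ports(V=4, C=1))   # vs three 10-port subcrossbars
```

Since crossbar delay grows with port count, the smaller subcrossbars can clock faster, which is the speed benefit the partitioned designs trade against subcrossbar-traversal latency.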
The hierarchical-crossbar design is derived from the Hierarchical Adaptive Router [51]. Each subcrossbar in H-CB is associated with one virtual channel network (VCN), which includes all dimensions and directions, as shown in Figure 3.2(c). This results in a subcrossbar input size of P = max[(2n + V), (2n + C)] ports. Its subcrossbars each support connections to all dimensions, which allows packets to change dimensions within each subcrossbar but requires the use of connect channels to change VCNs. This crossbar structure exploits routing locality within VCN, which is expected to better support adaptive routing than locality within dimension by allowing packets to avoid faulty and congested areas without using connect channels. Unlike the deadlock-avoidance routers, no routing restrictions are enforced on the lowest subcrossbar VCN for recovery-based routing. However, inherent to this hierarchical design structure is the restriction that no connect channels exist between the lowest and highest VCNs. Moreover, packets are injected into the highest VCN only and trickle down to the next lowest VCN on blockage, making the lowest virtual network a potential bottleneck. While this ordering between VCNs is required for avoidance-based routing, it is a restriction that can be relaxed in recovery routing.

The enhanced-hierarchical-crossbar design is proposed as an enhancement to the hierarchical crossbar. In addition to the connectivity supported by H-CB, the E-CB design has wrap-around connect channels between the lowest and highest virtual channel networks, since VCN ordering is not necessary in recovery-based routing (see Figure 3.2(d)). This solves the bottleneck problem of H-CB. In addition, injection ports from the processor are distributed over all subcrossbars in E-CB to balance VCN load [77].
Again, this feature is disallowed in deadlock-avoidance routers but is allowed with recovery routing. Therefore, E-CB exploits routing locality within VCN while supporting greater, more distributed access to VCNs. The resulting subcrossbar input size is P = (2n + C + 1) ports.

While the decoupled crossbar structures provide many potential performance benefits, each inherently has internal complications that can degrade performance if not handled correctly, as discussed below.

[Figure 3.3: Internal Blocking and Deadlock. (a) Internal blocking; (b) internal self-deadlock; (c) internal mutual-deadlock. The legend distinguishes inactive packets, active packets, requests for channels, subcrossbars, and connect channels (CC).]

3.2.2.1 Internal Blocking

The function of connect channels is to provide access to decoupled subcrossbars for packets that are currently blocked at the present subcrossbar or have finished routing over the resources supported by the present subcrossbar. The number of connect channels between two neighboring subcrossbars is critical to the performance of the router, as the lack of a sufficient number limits the available routing alternatives of packets, causing internal blocking, e.g., packet P3 in Figure 3.3(a). Internal blocking may be more frequent in some decoupled designs than in others. There is a cost and delay associated with increasing the number of connect channels to prevent internal blocking, which sets an upper bound. Hence, it is important to examine this trade-off to balance performance and cost.

3.2.2.2 Internal Deadlock

Internal deadlock is defined as the infinitely persisting packet blockage resulting from cyclic dependencies on connect channels.
Two types can occur: internal self-deadlock and internal mutual-deadlock, illustrated in Figures 3.3(b) and 3.3(c), respectively, for one connect channel per subcrossbar, minimal routing, and the C-CB design. In Figure 3.3(b), packet P1 tries to obtain available output channels in subcrossbar y or z, as it has finished routing in the x dimension. It subsequently moves through subcrossbars y and z and back to x through the wrap-around connect channel, because the output channels in y and z were not immediately available. As a result, packet P1 will remain blocked forever, because it has used up all the connect channels needed to route in the remaining dimensions to its destination. Obviously, internal self-deadlock can significantly degrade performance as it may cause other packets to block internally. In Figure 3.3(c), packets P1, P2, and P3 are trying to reach subcrossbars z, x, and y, respectively, because they have finished routing in all other dimensions. However, they mutually block one another in a cyclic fashion. These internal deadlock situations must be guarded against in the decoupled crossbar structures.

Theorem 1: Internal deadlocks can occur in a decoupled crossbar structure only if there exist (1) a wrap-around connect channel WC in the crossbar and (2) a possibility that a packet must traverse the next subcrossbar in order to complete its routing, i.e., the destination address is reached in the present dimension implemented by one subcrossbar but is not reached in at least one other dimension implemented by another subcrossbar(s).

Proof: Cyclic dependency among resources is a necessary condition for deadlock [24]. A proof by contradiction follows. (1) Assume deadlock exists in a decoupled crossbar design which has no wrap-around connect channel WC.
Packets in the ith subcrossbar can reach only the (i+1)th and higher subcrossbars, not the (i-1)th and lower subcrossbars, since there is no path from the nth subcrossbar to the first subcrossbar. To form a direct or indirect cyclic resource dependency between packets in a router, a packet in the ith subcrossbar must have at least one path in which the packet can reach itself, traversing the (i+1)th and the (i-1)th subcrossbars. However, such a path does not exist, since there is no path between the nth subcrossbar and the first subcrossbar. Hence, packets in a decoupled crossbar lacking the wrap-around connect channel cannot form any direct or indirect cyclic resource dependency, which is a contradiction. (2) Given a decoupled crossbar structure with a wrap-around connect channel, assume that deadlock exists even though there is no need for a packet to move to other subcrossbars. Then, blocked packets will subsequently be assigned to output channels in the current subcrossbar so long as external deadlocks do not exist, which is a contradiction. Collectively, (1) and (2) are the necessary conditions for internal deadlocks. □

From Theorem 1, internal deadlock can occur only in the cascade-crossbar, the only decoupled design which satisfies all the conditions. Internal deadlock can be solved by the same recovery method used for external deadlock, i.e., the deadlock queue resource in Disha. However, internal deadlocks cause local router congestion, which may lead to overall network performance losses.
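The first condition of Theorem 1 can be illustrated with a small sketch (the graph representation and function names are illustrative assumptions, not from the dissertation): without the wrap-around connect channel, the connect-channel graph is a simple chain and admits no cycle, so no cyclic resource dependency, and hence no internal deadlock, can form.

```python
def connect_graph(n, wrap_around):
    """Connect-channel graph over subcrossbars 0..n-1: each subcrossbar i
    feeds subcrossbar i+1; the wrap-around channel adds the edge n-1 -> 0."""
    edges = {i: [i + 1] for i in range(n - 1)}
    edges[n - 1] = [0] if wrap_around else []
    return edges

def has_cycle(edges):
    # Iterative check via DFS coloring: 0 = unvisited, 1 = on stack, 2 = done.
    color = {v: 0 for v in edges}
    def dfs(v):
        color[v] = 1
        for w in edges[v]:
            if color[w] == 1 or (color[w] == 0 and dfs(w)):
                return True
        color[v] = 2
        return False
    return any(color[v] == 0 and dfs(v) for v in edges)

# A 3-dimensional C-CB with a wrap-around channel can form the cycle
# x -> y -> z -> x; the chain without wrap-around cannot.
print(has_cycle(connect_graph(3, wrap_around=True)))   # True
print(has_cycle(connect_graph(3, wrap_around=False)))  # False
```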
To improve router and network performance, the following restrictions may be applied to avoid internal self-deadlock, although they are not necessary: no packet is allowed to use more than (n-1) connect channels, and no packet in the ith subcrossbar may use the connect channels leading to the (i+1)th subcrossbar dimension when it has finished routing in all of the (i+1)th, (i+2)th, ..., (i+(n-k-1))th subcrossbar dimensions, where k is the number of connect channels which the packet is occupying in the current router.

3.2.2.3 Multi-cycle Delay

In the decoupled crossbar structures, packets may experience different setup delays and pass-thru delays, depending on the number of subcrossbar traversals.

[Figure 3.4: Multi-cycle delays for packets P1, P2, and P3, traversing the 1st through nth subcrossbars.]

This is shown in Figure 3.4. If the clock cycle is bound by the setup (or pass-thru) delay of one subcrossbar, packets requiring subcrossbar traversal would take multiple cycles to pass through the router, which increases their average router delay. Therefore, while the router speed (clock cycle time) may be faster than in the unified design, the average router delay can actually be higher, depending on the dynamic behavior of routes taken through the network. This presents an interesting trade-off that will be evaluated by simulation in Section 3.3 below.

3.3 Performance of router designs

For router design performance evaluation, each router design is assumed to incorporate one of the alternative crossbar designs presented in the deadlock recovery-based WARRP router design [67] (see Chapter 5). Each router design is represented by the name of its crossbar design.
First, at the router level, the speed of each router design is estimated through synthesis, which is more precise than Chien's model [11]. This gives insight into how fast each router design can operate. Second, at the network level, each router design is evaluated by measuring overall network performance. The simulation results provide information on how well each design supports the capability of the underlying routing algorithm, as well as router speed. For all evaluations, a three-dimensional torus network with 3 virtual channels per physical channel is assumed.

3.3.1 Router Level Performance Evaluation

The speed of router designs can be determined by the following metrics: Tset-up, Tpass-thru, and Tefc. Tset-up and Tpass-thru are the minimum delays which a header flit (the first flit) and a data flit (one of the subsequent flits) take passing through a router, respectively. Tefc is the minimum external flow control time for a flit. Tefc determines the maximum bandwidth of an external physical link when the physical link is fully pipelined. These metrics comprise the various router delay components of the deadlock recovery-based WARRP router [67]. Expressions for these metrics are given below:

Tset-up = Tad + Tsel + Tarb + Tfc + Tcb,
Tpass-thru = Tfc + Tcb,
Tefc = Tvc + Tbc,

where
Tad: delay of queue control and address decode,
Tsel: delay of channel selection,
Tarb: delay of arbitration and output channel status update,
Tfc: delay of internal flow control, crossbar setup, operation select, and queue control,
Tcb: crossbar delay,
Tvc: delay of virtual channel control,
Tbc: delay of queue control.

Note that Tad, Tfc, and Tbc are constant delays. Tcb is a function of P, where P is the number of crossbar input or output ports. Tvc is a function of V, where V is the number of virtual channels.
Tsel and Tarb are functions of routing freedom (F), where F is the number of routing choices given by the underlying routing algorithm.

[Figure 3.5: Router delay components of the WARRP router. The variable components are plotted against P (number of ports), F (routing freedom), and V (number of VCs); the constant components are Tad = 1.5 ns, Tfc = 1.6 ns, and Tbc = 0.9 ns.]

Figure 3.5 shows how the various components of router delay vary with router parameters. Values for these components are obtained using an automatic synthesis tool, EPOCH (a behavior-level VHDL tool), and HSPICE. A 0.5 um HP14B CMOS technology is also used.

The main reason why the WARRP router design [67] is employed for router speed evaluation instead of Chien's model [11] is that the WARRP router is designed and optimized for a deadlock recovery routing scheme, Disha [5], and, thus, it can better characterize the attainable speed of deadlock recovery-based routers than Chien's model. There are three distinctive differences between Chien's model and the router speed estimation using the WARRP router design. First, some router delay components of the WARRP router are constants rather than logarithmic functions, due to design optimization. For example, the header selection delay of the WARRP router (a part of Tfc) is constant, while it is a logarithmic function in Chien's model. The WARRP router hides header selection delay by overlapping it with Tsel and Tarb. Second, the WARRP router design relies heavily on behavior-level VHDL and EPOCH (an automatic synthesis tool), in which finely tuned design optimization is not supported. Therefore, as shown in Figure 3.5, the router delay components of the WARRP router look linear within the range of the given configuration, rather than logarithmic, due to the high-level design overhead.
This favors the performance of partitioned crossbar designs by enlarging the delay gap between partitioned and unified crossbar designs. However, it has been shown in [13] that even the logarithmic router component delay assumption does not change the qualitative conclusion of this research. Third, the WARRP router design employs output queues as well as input queues, which makes the clock speed faster by decoupling external flow control delays from the core router delay. In contrast, Chien's model assumes only input queues, resulting in a long clock cycle time: internal router delay includes external flow control delays such as virtual channel control delay and physical link delay.

Additionally, the WARRP router design is also different from pipelined router designs [62]. The main difference is that pipelined router designs provide separate pipeline stages for routing, arbitration, and switch allocation, whereas the WARRP router design considers all these operations one multi-cycle operation. Although pipelined router designs increase router throughput by overlapping the small divided routing tasks of independent packets, they cannot reduce the increased architecture complexity incurred by true fully adaptive routing capability and, hence, suffer from high network latency (especially at low network traffic). A pipelined WARRP router design might be a research opportunity for future work of this dissertation.

Table 3.1 gives the internal router delays, the achievable maximum clock speed, and the maximum external physical link bandwidth for the alternative router designs, i.e., the unified, cascade, hierarchical, and enhanced-hierarchical router designs. For the decoupled crossbar router designs, the number of connect channels (C) is varied from one to three. The achievable maximum clock speed is determined by the inverse of Tpass-thru of each router design.
The reason why Tpass-thru is chosen rather than Tset-up as the determinant of internal router clock speed is that this increases router performance by minimizing the fragmentation of clock cycle time, as shown in Figure 3.6. The maximum external physical link bandwidth is defined by the inverse of Tefc (unit: Mega flits per sec) when physical links are fully pipelined.

[Figure 3.6: Fragmentation of Clock Cycle Time. (a) When the clock cycle time is Tset-up; (b) when the clock cycle time is Tpass-thru.]

Table 3.1: Delay of Router Designs.

Design       Tset-up   Tpass-thru   Max. clock speed   Tefc     Max. link bandwidth
U-CB         4.60 ns   6.50 ns      154 MHz            1.5 ns   667 Mflits/sec
C-CB (C=1)   8.60 ns   3.70 ns      270 MHz            1.5 ns   667 Mflits/sec
C-CB (C=2)   9.00 ns   3.90 ns      256 MHz            1.5 ns   667 Mflits/sec
C-CB (C=3)   9.30 ns   4.00 ns      250 MHz            1.5 ns   667 Mflits/sec
H-CB (C=1)   9.00 ns   3.90 ns      256 MHz            1.5 ns   667 Mflits/sec
H-CB (C=2)   9.00 ns   3.90 ns      256 MHz            1.5 ns   667 Mflits/sec
H-CB (C=3)   9.00 ns   3.90 ns      256 MHz            1.5 ns   667 Mflits/sec
E-CB (C=1)   8.60 ns   3.70 ns      270 MHz            1.5 ns   667 Mflits/sec
E-CB (C=2)   9.00 ns   3.90 ns      256 MHz            1.5 ns   667 Mflits/sec
E-CB (C=3)   9.30 ns   4.00 ns      250 MHz            1.5 ns   667 Mflits/sec

The internal router clock speed of the decoupled router designs is faster than that of the unified router design by up to 76%, as shown in Table 3.1. The partitioned crossbar structure reduces the crossbar size as well as the complexity of the routing and arbitration circuits. These advantages increase as the number of dimensions and virtual channels grows, but diminish as the number of connect channels grows beyond a certain point.
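The clock-speed column of Table 3.1 follows directly from Tpass-thru; a quick sketch reproduces two of its rows and the quoted 76% speedup (the helper name is illustrative):

```python
def max_clock_mhz(t_pass_thru_ns):
    # The achievable maximum clock speed is the inverse of Tpass-thru;
    # 1 / (t ns) expressed in MHz is 1000 / t.
    return 1000.0 / t_pass_thru_ns

u_cb = max_clock_mhz(6.50)   # U-CB row of Table 3.1: ~154 MHz
c_cb = max_clock_mhz(3.70)   # C-CB (C=1) row: ~270 MHz
print(round(u_cb), round(c_cb))
print(round(100 * (c_cb - u_cb) / u_cb))  # ~76% faster, as quoted in the text
```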
Nevertheless, faster routers do not always result in higher network performance. This raises the need to evaluate each router design at the network level to determine how well each design exploits the benefits of unrestricted, recovery-based routing.

3.3.2 Network Level Performance Evaluation

This section evaluates and compares the performance of the various router designs through extensive simulation using FlexSim 1.2 [66], an enhanced version of FLITSIM 2.0. All simulations are run on an 8 x 8 x 8 three-dimensional torus (n = 3) with 3 virtual channels per physical channel and full-duplex links. Messages are 32 flits long. A queue depth of only two is assumed to minimize the effect of buffer configurations on network performance and, thus, to better focus on the crossbar designs. Large queue designs will be explored in Chapter 4. All router designs use one injection and one delivery physical channel, each consisting of three virtual channels, per node. A true fully adaptive minimal deadlock recovery routing scheme (Disha) is assumed, with a default time-out of 25 cycles before deadlock is suspected. Each simulation is run for a duration of 50,000 simulation cycles beyond the initial transient period (the first 10,000 cycles) so that data is collected during steady state only. Other particular assumptions for each simulation will be given in the corresponding section where necessary.

The following subsections examine whether true fully adaptive routing provides sufficient routing locality and measure how well each decoupled router design exploits routing locality. The examination results help not only to analyze the potential performance of decoupled router designs but also to understand the effects of their partitioned crossbar structure on network performance.
All alternative router designs are then compared to determine which best provides true fully adaptive routing capability.

3.3.2.1 Existence of Routing Locality

Sufficient routing locality is one of the key factors that empower decoupled crossbar designs to make the best use of the benefits of unrestricted routing without suffering from delays caused by subcrossbar traversals. Under the assumption of true fully adaptive routing, two forms of routing locality are examined: locality within dimension and locality within virtual channel network (VCN). The amount of each routing locality is defined by R0 = Nr0 x 100 / Nall, where, for a whole simulation period, Nr0 is the number of routing trials in which packets successfully achieve profitable output channels in the current subnetwork (i.e., the dimension or VCN in which a routing packet enters the router) without blocking, and Nall is the total number of routing trials. Rb represents the percentage of routing trials in which packets cannot obtain output channels in any subnetwork. To estimate the average cost of locality misses, R1 and R2 are measured through simulation, where R1 and R2 are the percentages of packets routed to the subnetwork one and two hops away from the current subnetwork, respectively, without obtaining output channels in the current subnetwork. In counting the number of hops between subnetworks, the following order of subnetworks is assumed: X -> Y -> Z -> X for locality within dimension, and 1VCN -> 2VCN -> 3VCN -> 1VCN for locality within VCN.

To eliminate all unnecessary effects of router designs on the results, the non-blocking crossbar structure, i.e., the unified crossbar design, is assumed in these simulations.
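As a concrete reading of these definitions, a short sketch with hypothetical trial counts (the tallies below are made up for illustration, not simulation data from the dissertation):

```python
def locality_metrics(n_r0, n_r1, n_r2, n_blocked):
    """R0, R1, R2, and Rb as percentages of all routing trials, following
    the definition R0 = Nr0 x 100 / Nall above; R1/R2 count one- and
    two-hop locality misses, Rb fully blocked trials."""
    n_all = n_r0 + n_r1 + n_r2 + n_blocked
    return {name: 100.0 * n / n_all
            for name, n in [("R0", n_r0), ("R1", n_r1),
                            ("R2", n_r2), ("Rb", n_blocked)]}

# Hypothetical tally: 10,000 routing trials, most satisfied in the
# current subnetwork.
print(locality_metrics(8200, 1000, 400, 400))
```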
Since we employ different channel selection policies according to the type of routing locality being measured, a direct comparison among results for different types of routing locality might not be fair. To verify the fairness of this comparison, we measure the maximum network throughput of each simulation. The variance in performance is within 3% of the highest value. This is because we assume the same crossbar design and the same routing algorithm for all types of routing locality. Consequently, the direct comparison among types of routing locality is fair.

For an analysis of the sensitivity of routing locality to traffic pattern, simulations are run under both a uniform and a non-uniform traffic pattern. In addition, each simulation is run under three different network load rates: below saturation (50% of the saturation load rate), near saturation (90% of the saturation load rate), and at saturation (the load rate at which messages are injected into the network as long as injection links are idle).

[Figure 3.7: Routing locality under uniform traffic. (a) Routing locality in VCN; (b) routing locality in dimension. Bars show R0, R1, R2, and Rb below saturation, near saturation, and at saturation.]

Uniform traffic pattern: Figure 3.7(a) presents the amount of routing locality within VCN under uniform traffic and the cost of locality misses. In more than 80% of routing trials, packets are shown to successfully obtain output channels in the current VCN, up to the load rate near saturation. Even when packets fail to obtain output channels in the current VCN, their average travel distance, i.e., the average cost of locality misses, is one hop.
This indicates that routing locality within VCN sufficiently exists in true fully adaptive routing and, thus, decoupled router designs making use of routing locality within VCN have the potential to route packets without suffering the internal complications, i.e., internal blocking, internal deadlock, and multi-cycle delay.

Figure 3.7(b) shows the results for routing locality within dimension. In over 65% of the routing trials, packets are shown to route successfully without requesting output channels of other dimensions, up to the load rate near saturation. The average cost of locality misses within dimension is similar to that of locality misses within VCN. The important fact obtained from this result is that the amount of locality within dimension is less than that of locality within VCN. This means that router designs exploiting locality within VCN have the potential to outperform router designs exploiting locality within dimension.

[Figure 3.8: Routing locality under non-uniform traffic. (a) Routing locality in VCN; (b) routing locality in dimension. Bars show R0, R1, R2, and Rb below saturation, near saturation, and at saturation.]

In Figures 3.7(a) and 3.7(b), at the saturation load rate, over 90% of packets are blocked due to heavy congestion. However, this does not impact the performance of decoupled router designs since, in a deeply saturated network, few output channels are available anyway, regardless of router design.

Non-uniform traffic pattern (bit-reversal): In Figure 3.8, the simulation results under a non-uniform traffic pattern confirm the results derived from the uniform traffic pattern. Both forms of locality sufficiently exist in fully adaptive routing, and locality within VCN is higher than that within dimension.
Although not presented here, this simulation was extended to other non-uniform traffic patterns, i.e., the perfect shuffle and transpose traffic patterns. The results were similar to the case of bit-reversal traffic.

In summary, both forms of routing locality sufficiently exist in true fully adaptive routing. Additionally, assuming 100% exploitation of the existing routing locality, router designs for locality within VCN, e.g., hierarchical and enhanced-hierarchical, have the potential to outperform router designs for locality within dimension, e.g., cascade, owing to higher routing locality. However, the performance of decoupled router designs depends not only on the amount of routing locality but also on the appropriate exploitation of that locality. Hence, the following section examines how well each decoupled router design exploits the corresponding locality.

3.3.2.2 Exploitation of Routing Locality

The degree to which each crossbar design exploits routing locality can be determined by how well each design preserves routing locality. Consider a decoupled crossbar design which makes the relative loads among subcrossbars extremely unbalanced. In this case, blocked packets in a heavily loaded subcrossbar tend to move to lightly loaded subcrossbars, increasing the subcrossbar traversals of packets. This can reduce routing locality and network performance. Therefore, the load distribution among subcrossbars can be a good indicator of how well a decoupled router design preserves and exploits routing locality. To evaluate load balancing, the average load of each subcrossbar is measured through simulation. The average load of a subcrossbar is defined as (Noc / Nac) x 100, where Noc is the average number
of occupied output channels in the subcrossbar and Nac is the aggregate number of output channels assigned to the subcrossbar. Simulations are run under three different network load rates: below saturation (50% of the saturation load rate), near saturation (90% of the saturation load rate), and at saturation (the load rate at which messages are injected into the network as long as injection links are idle). All proposed decoupled crossbar designs with one to three connect channels are simulated under uniform traffic. Messages are assumed to be injected into routers using injection link virtual channels selected by a simple round-robin scheduling policy.

[Figure 3.9: Subcrossbar load balance in the C-CB and E-CB router designs. Average load of the 1st, 2nd, and 3rd subcrossbars for C = 1, 2, 3 below saturation, near saturation, and at saturation.]

In Figure 3.9, the loads among the subcrossbars of the cascade and the enhanced-hierarchical routers are equally balanced, regardless of network load rate. Moreover, load balancing among the subcrossbars does not change as the number of connect channels (C) increases. The main reason is that packets are injected into each subcrossbar with equal probability and the movement of packets among subcrossbars is well balanced.

In contrast, the hierarchical router design has unbalanced loads among its subcrossbars, as shown in Figure 3.10. In particular, the first subcrossbar always has the highest load among the subcrossbars. The reason is that this design injects packets into only the first virtual channel network subcrossbar, and the routing locality within VCN makes packets stay in the first VCN until the first VCN is heavily congested. The key fact that should be noted is that the lowest subcrossbar of the hierarchical router design is not a performance bottleneck, unlike what was expected in Section 3.2.2.
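The average-load metric can be sketched as follows; the occupancy numbers are hypothetical, chosen only to mimic the skewed H-CB distribution of Figure 3.10, and the per-subcrossbar channel count is an assumption:

```python
def subcrossbar_load(noc, nac):
    """Average subcrossbar load, (Noc / Nac) x 100, where Noc is the average
    number of occupied output channels and Nac is the aggregate number of
    output channels assigned to the subcrossbar."""
    return 100.0 * noc / nac

# Hypothetical H-CB snapshot with 8 output channels per subcrossbar:
# the first VCN subcrossbar carries most of the traffic.
print([subcrossbar_load(noc, 8) for noc in (6.0, 2.0, 1.0)])  # [75.0, 25.0, 12.5]
```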
Additionally, as the number of connect channels increases, the loads among subcrossbars become balanced at high load rates. This indicates that more connect channels help packets in congested subcrossbars move to other subcrossbars by providing more paths between subcrossbars.

[Figure 3.10: Subcrossbar load distribution in the hierarchical router design. Average load of the 1st, 2nd, and 3rd subcrossbars for C = 1, 2, 3 below saturation, near saturation, and at saturation.]

The previous results indicate that the cascade and the enhanced-hierarchical router designs have equally balanced loads among their subcrossbars while the hierarchical router design does not. Now, taking into account the effect of the load distribution among subcrossbars on the performance of each router design, Figure 3.11 compares the performance of the decoupled router designs. As the message injection rate increases, network throughput (in flits/node/cycle) and average latency (in cycles) are measured for each decoupled router design. For simplicity, two connect channels are assumed for all decoupled router designs.

In Figure 3.11, the enhanced-hierarchical router design achieves 17% higher throughput and 7% lower latency than the cascade router design. This result is to be expected, since routing locality within VCN is higher than that within dimension and both designs have load-balanced subcrossbars. Comparing the performance of the hierarchical and enhanced-hierarchical router designs, both exploit the same routing locality but have different relative load distributions among the subcrossbars. The enhanced-hierarchical router design has higher maximum throughput than the hierarchical router design.
This indicates that, at high load rates, balanced loads among subcrossbars minimize the subcrossbar traversals of packets, as well as the congestion in each subcrossbar, by equally distributing load over all the subcrossbars, which increases maximum throughput. However, at low load rates (below saturation), the hierarchical router design is shown to have 8% lower latency than the enhanced-hierarchical router design despite the unbalanced loads among its subcrossbars.

This behavior is explained by the following. First, at low load rates, even the first VCN subcrossbar of the hierarchical design, i.e., the subnetwork having the highest load, utilizes only 20% of its output channels. This allows packets in the most congested subcrossbar to acquire profitable output channels without suffering long blocking times. Second, extremely unbalanced loads among subcrossbars minimize the physical bandwidth sharing of virtual channels, increasing the actual bandwidth assigned to packets. For instance, assuming that a physical channel consists of three virtual channels and all virtual channels have data to send, only one-third of the physical bandwidth is allocated to each virtual channel due to bandwidth sharing. However, this bandwidth sharing is minimized in the hierarchical router design since, at low load rates, only the first subcrossbar, i.e., the first virtual channel network, has data to send most of the time.

It follows that equally balanced loads among subcrossbars preserve routing locality well, regardless of load rate, resulting in higher maximum throughput. However, in the enhanced-hierarchical router, the balanced loads increase communication latency at low load rates due to physical bandwidth sharing. In contrast, unbalanced loads among subcrossbars yield low latency at low load rates, while reducing maximum throughput by poorly preserving routing locality at high load rates.
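The bandwidth-sharing arithmetic above can be made concrete with a short sketch; even round-robin sharing of the 667 Mflits/sec fully pipelined link rate from Table 3.1 is assumed:

```python
def per_vc_bandwidth(link_bw, active_vcs):
    """Bandwidth seen by one virtual channel when active_vcs virtual channels
    on the same physical link all have data to send (even sharing assumed)."""
    return link_bw / max(active_vcs, 1)

link_bw = 667.0  # Mflits/sec, the fully pipelined link rate from Table 3.1
print(per_vc_bandwidth(link_bw, 3))  # all 3 VCs busy: one-third of the link each
print(per_vc_bandwidth(link_bw, 1))  # only the first VCN busy: the full link rate
```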
Now, consider a packet injection policy which can maximize the performance of the enhanced-hierarchical router design. Note that the enhanced-hierarchical crossbar design is a variant of the hierarchical crossbar design. Therefore, if the packet injection policy for the enhanced-hierarchical router design is modified in a way that makes the loads of the subcrossbars unbalanced at low load rates and balanced at high load rates, then the enhanced-hierarchical router design can achieve as low a latency as the hierarchical design at low load rates while maintaining its maximum throughput.

Figure 3.11: Effect of load balance on performance under uniform traffic. (Average latency in cycles versus throughput in flits/node/cycle for C-CB, H-CB, and E-CB, each with C=2.)

The modified injection scheduling for the enhanced-hierarchical router design is as follows. Unless the injection channel associated with the first VCN subcrossbar is busy, new packets from processor nodes are forced to be injected into the first subcrossbar. Only when the injection channel of the first or the second subcrossbar is busy are packets allowed to be injected into the second or third subcrossbar, respectively. By using this new injection scheduling policy, at low load rates most packets are injected into the first subcrossbar, resulting in load-unbalanced subcrossbars; at high load rates, i.e., when the first subcrossbar becomes saturated, packets get injected into the second and the third subcrossbars, equally balancing the load across the subcrossbars. This is shown in Figure 3.12(a).

Figure 3.12: Effect of new injection on load balance and performance of E-CB. (Average subcrossbar load for C=1, C=2, and C=3, below saturation, near saturation, and at saturation.)
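The modified injection scheduling just described can be captured in a few lines (a behavioral sketch; the function and argument names are ours, not from the simulator): a new packet always tries the first VCN subcrossbar's injection channel, falling back to the second and then the third only when the preceding channels are busy.

```python
def select_injection_subcrossbar(busy):
    """Modified injection scheduling for the enhanced-hierarchical
    router.  busy[i] is True when the injection channel of VCN
    subcrossbar i is occupied.  Prefer the first subcrossbar; spill
    to the second or third only when the earlier channels are busy."""
    for vcn in range(len(busy)):
        if not busy[vcn]:
            return vcn
    return None  # every injection channel busy: the packet stalls

# Low load: the first channel is usually free, so loads stay unbalanced.
assert select_injection_subcrossbar([False, False, False]) == 0
# Near saturation: packets spill into the second and third subcrossbars.
assert select_injection_subcrossbar([True, False, False]) == 1
assert select_injection_subcrossbar([True, True, False]) == 2
```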
Figure 3.12(b) verifies that the enhanced-hierarchical router design with the new injection scheduling can achieve as low a latency as the hierarchical design at low load rates while maintaining its maximum throughput at high load rates. In what follows, the enhanced-hierarchical router design uses the modified injection scheduling.

3.3.2.3 Performance comparison of router designs

This section compares the performance of the various router designs to achieve a high-speed, true fully adaptive router design. Maximum network throughput (in flits/node/μsec) and average latency (in μsec) are measured, which take into account the multi-cycle and router delay penalties of the different designs. The clock speed of each router design achieved in Section 3.3.1 is used for this performance comparison. All router designs are simulated under one uniform and two non-uniform traffic patterns.

Uniform Traffic Results: Router designs using C-CB with 1 to 3 connect channels are compared against the U-CB router design in Figure 3.13(a). As shown, the throughput of routers with C-CB improves drastically as the number of connect channels increases. This result indicates that connect channels in C-CB are critical resources and that locality within dimension is not high enough to exploit the flexibility provided by adaptive routing. To have throughput comparable to U-CB, C-CB requires at least 2 connect channels. Additionally, the C-CB router designs have up to 58% lower latency than the U-CB router.

Router designs using H-CB with 1 to 3 connect channels are compared to the U-CB router design in Figure 3.13(b). The H-CB router has higher maximum throughput and lower latency than the U-CB router by up to 25% and 65%, respectively. Furthermore, unlike that of the C-CB router, the performance of the H-CB router is not affected significantly by the number of connect channels.
This indicates that the frequency of packets using connect channels is much smaller in H-CB than in C-CB, which means that locality within virtual channel network is better able to exploit adaptive routing, unlike locality within dimension. However, the H-CB router has slightly lower maximum throughput than the C-CB router with three connect channels, although the locality within VCN is higher than that within dimension. This indicates that the load-unbalanced subcrossbars of H-CB limit its maximum throughput.

Figure 3.13: Latency and throughput comparisons under Uniform traffic.

As shown in Figure 3.13(c), the performance of the E-CB router exceeds that of U-CB and all other crossbar designs. The E-CB router with only 2 connect channels has up to 64% lower latency and 51% higher maximum throughput than the U-CB router. Moreover, there is not a wide performance gap in going from one connect channel to three, which indicates that they are not critical resources. One reason why the E-CB router shows such good performance is that its dynamic message injection controls the load distribution among the subcrossbars of E-CB to achieve low latency at low load rates and high maximum throughput at high load rates, as explained in Section 3.3.2.2. Another reason is that locality within virtual channel network is high and the design still exploits the full capabilities of adaptive routing. Moreover, the possible performance losses of the decoupled crossbar structure are negligible compared to the overall advantages.

Non-uniform Traffic Results: The performance of these router designs is also characterized using bit-reversal and perfect shuffle non-uniform traffic patterns. As shown in Figure 3.14(a),
each additional connect channel increases the maximum throughput of the C-CB router by roughly 8 units. However, unlike the case with uniform traffic, two connect channels are not enough to obtain maximum throughput equal to the U-CB router. This shows not only that connect channels are critically limiting resources but also that locality within dimension under the bit-reversal traffic pattern is worse than that under uniform traffic.

Figure 3.14: Latency and throughput comparisons under Bit Reversal traffic.

In Figure 3.14(b), the H-CB router has lower latency and higher maximum throughput than the U-CB router by up to 58% and 41%, respectively. Furthermore, as the number of connect channels varies, the performance of the H-CB router is shown to change negligibly. This confirms the conclusion drawn from the results with uniform traffic, i.e., in the H-CB router design, connect channels are not critical resources to performance, due to high routing locality within VCN. In Figure 3.14(c), the E-CB router shows a noticeable performance improvement compared to the U-CB router (up to 58% lower latency and 51% higher maximum throughput). This result indicates that exploitation of locality within virtual channel network is profitable under non-uniform as well as uniform traffic. Further, the dynamic message injection and the sufficient locality within virtual channel network help to mitigate the potential problems associated with its decoupled crossbar structure, i.e., internal blocking and multiple cycle delay, so as to minimize their effects.
Figure 3.15: Latency and throughput comparisons under Perfect Shuffle traffic.

Table 3.2: Summary of Results (normalized to the U-CB router at saturation; RAN = uniform/random traffic, BR = bit reversal, PS = perfect shuffle).

Router design   Max tput  Avg lat  Max tput  Avg lat  Max tput  Avg lat
                (RAN)     (RAN)    (BR)      (BR)     (PS)      (PS)
U-CB            1.00      1.00     1.00      1.00     1.00      1.00
C-CB (C=1)      0.65      0.56     0.59      0.68     0.89      0.68
C-CB (C=2)      1.21      0.42     0.85      0.62     1.15      0.44
C-CB (C=3)      1.43      0.42     1.11      0.52     1.37      0.44
H-CB (C=1)      1.21      0.41     1.41      0.42     1.19      0.46
H-CB (C=2)      1.24      0.35     1.26      0.43     1.22      0.37
H-CB (C=3)      1.24      0.35     1.15      0.43     1.26      0.35
E-CB (C=1)      1.43      0.35     1.48      0.42     1.37      0.35
E-CB (C=2)      1.51      0.36     1.51      0.43     1.45      0.39
E-CB (C=3)      1.48      0.36     1.47      0.44     1.41      0.41

Simulation results under perfect shuffle traffic (Figure 3.15(a), (b), and (c)) further confirm the conclusions derived from the results under uniform and bit-reversal traffic patterns. The E-CB router outperforms all other routers, including the U-CB router, by up to 45% higher maximum throughput and 65% lower latency.

Table 3.2 summarizes our results for the four alternative router designs presented. The average message latency and maximum throughput of all router designs are normalized to those of the U-CB router at saturation. Due to insufficient locality within dimension, the C-CB design requires at least two connect channels to outperform the U-CB design. The H-CB design, overall, has higher performance than the U-CB design (up to 24% higher maximum throughput and 65% lower latency). However, the unbalanced load among virtual channel networks limits its maximum throughput (up to 20% lower maximum throughput than the E-CB design). The E-CB (C=2) design gives the highest throughput and lowest latency under uniform and non-uniform traffic. We,
therefore, conclude that the enhanced-hierarchical router design with its dynamic message injection scheduling yields the greatest performance for fully adaptive deadlock recovery routers.

3.4 Summary

This work explores the notion of exploiting dynamic routing behavior, identified as routing locality, in order to achieve high-speed, true fully adaptive router designs. Proper exploitation of routing locality makes it possible for the internal crossbar of a deadlock recovery-based router to be partitioned into smaller and faster subcrossbars. This minimizes the delay caused by the increased adaptivity of deadlock recovery-based routing and enables decoupled crossbar designs to support unrestricted routing capability without compromising router speed. An extensive evaluation of alternative crossbar designs verifies that there exists sufficient routing locality in true fully adaptive deadlock recovery-based routing and that routing locality can be exploited to support unrestricted routing without sacrificing router speed.

Results reveal that the cost of the higher delay of the unified-crossbar design overwhelms the benefits of non-blocking connectability for high dimensional, large virtual channel networks. Subcrossbar connect channels in the cascaded crossbar design are limiting resources because routing locality "within dimension" minimally exists in true fully adaptive routing, particularly for non-uniform traffic. Increasing the number of connect channels has an overall effect of improving performance up to the point where the subcrossbar size becomes prohibitively large. Connect channels in the hierarchical crossbar design are not critical resources. Instead, unbalanced loads among subcrossbars limit the maximum throughput of the design.
The enhanced-hierarchical crossbar design outperforms the unified-crossbar design considerably: it is almost three times faster and achieves 50% higher throughput. Of the decoupled crossbar designs, it requires the fewest connect channels as it is able to exploit routing locality "within virtual channel network".

Chapter 4

Evaluation of Queue Architectures for Deadlock Recovery-based Routers

Chapter 3 proposed and evaluated decoupled crossbar designs that can efficiently reduce the cost of deadlock recovery schemes without hampering their true fully adaptive routing capability. As a direct extension of the crossbar research, this chapter presents, characterizes, and evaluates various queue structures that better exploit the true fully adaptive routing capability of deadlock recovery schemes and, thus, maximize network performance.

The key benefit of true fully adaptive routing capability is to minimize message blocking in networks by maximizing network resource utilization. In a router architecture, one of the major components associated with message blocking is the queues that temporarily store messages until network resources are available. Therefore, to efficiently exploit the true fully adaptive routing capability of deadlock recovery schemes, queue architectures must be optimized so that messages do not suffer unnecessary blocking and, thus, can be routed faster. Furthermore, these optimized queues need to efficiently and effectively cooperate with crossbar designs so as to minimize the cost of true fully adaptive routing capability and, thus, to maximize router speed.

First, to better exploit the true fully adaptive routing capability of deadlock recovery schemes, this chapter presents and characterizes various queue designs.
Secondly, this research identifies the crossbar designs which best match each queue design. One of the alternative queue designs and its best matching crossbar defines each router design to be evaluated. By extensively simulating these router designs, this work gives an optimal fully adaptive router design that can efficiently minimize message blocking while maximizing routing speed. For this research, input queueing is assumed due to its high speed and high scalability, as mentioned in Chapter 2.

4.1 Router Queue Designs

This section presents four queue designs: the circular queue (CQ), the dynamically allocated multi-queue (DAMQ), the dynamically allocated multi-queue with recruit registers (DAMQWR), and the virtual channel-based dynamically allocated multi-queue (VCDAMQ). CQ and DAMQ are well-known queue designs that have been widely employed in commercial routers, while DAMQWR and VCDAMQ are the queue designs proposed in this work in order to resolve the limitations of CQ and DAMQ and, thus, to better accommodate the true fully adaptive routing capability of deadlock recovery schemes.

Figure 4.1: Circular queue (CQ) and first-in-first-out queue (FIFO).

4.1.1 Circular Queue

Figure 4.1 shows the circular queue (CQ), which is one of the most well-known queue architectures for routers. CQ is basically a first-in-first-out queue (FIFO) that stores incoming packets and schedules their routing in arrival order. The main difference between CQ and FIFO is the registers that point to the head and the tail, as shown in the figure. The head and the tail are the reading and the writing ports of CQ, respectively.
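A minimal software model of this pointer arrangement (class name and capacity are illustrative assumptions, not from the dissertation) shows why a write into an empty CQ lands at the head and is readable immediately, with no FIFO fall-through delay:

```python
class CircularQueue:
    """Sketch of the CQ: a fixed array of packet cells with a head
    (read port) and a tail (write port) pointer.  When the queue is
    empty, head == tail, so a newly written packet sits at the head
    and can be read out right away."""

    def __init__(self, capacity):
        self.cells = [None] * capacity
        self.head = 0      # next cell to read
        self.tail = 0      # next cell to write
        self.count = 0

    def write(self, pkt):
        if self.count == len(self.cells):
            raise OverflowError("queue full")
        self.cells[self.tail] = pkt
        self.tail = (self.tail + 1) % len(self.cells)
        self.count += 1

    def read(self):
        if self.count == 0:
            return None
        pkt = self.cells[self.head]
        self.head = (self.head + 1) % len(self.cells)
        self.count -= 1
        return pkt

q = CircularQueue(4)
q.write("p0")                     # queue was empty: p0 lands at the head
assert q.cells[q.head] == "p0"    # immediately available for routing
```

Note that packets still leave strictly in arrival order, which is exactly what gives rise to the head-of-line blocking discussed next.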
When CQ is empty, these pointers enable an incoming packet to be directly stored in the head such that the packet can be immediately routed. This saves the unnecessary queueing delay that packets incur reaching the head in FIFO and, thus, makes CQ more suitable for fully adaptive routers than FIFO.

The key benefit of the circular queue is design simplicity due to its simple and straightforward scheduling policy, i.e., first-in first-out. This can reduce the architecture complexity of fully adaptive routers, making this queue worthwhile to evaluate.

The main problem of CQ is the head-of-line (HOL) blocking problem. HOL blocking is the situation in which, when the packet at the head of a queue is blocked due to traffic congestion, the following packets heading for idle output channels are also blocked unnecessarily. This significantly hampers network performance, which might make CQ inadequate for fully adaptive routers. A general solution to this problem is virtual channels, which alleviate head-of-line blocking by allowing packets to bypass blocked packets. Nevertheless, virtual channels complicate router architectures by increasing crossbar size and routing and arbitration complexity [11, 62], making routers slow. Therefore, the trade-off between the architecture complexity and the performance benefit of virtual channels needs to be taken into account in evaluating CQ.

4.1.2 Dynamically Allocated Multi-Queue

Figure 4.2 (a) illustrates the well-known dynamically allocated multi-queue (DAMQ). As shown, DAMQ comprises K + 1 subqueues whose space is dynamically maintained in the form of linked lists, where K is the number of output links in a router. Each of the K subqueues is associated with one of the K output links and, thus, all the packets in a subqueue are destined for the corresponding output link.
Besides the K subqueues, DAMQ has one additional subqueue to collect empty memory cells for incoming packets to be stored into. The queue structure of DAMQ resolves the HOL blocking of CQ by eliminating routing interference between packets heading for different output links. Therefore, routers with DAMQ do not require virtual channels to resolve HOL blocking and to improve network performance. This simplifies router architectures, making routers faster. As another benefit, DAMQ can adapt well to dynamic network traffic by dynamically allocating queue space based on network traffic demand.

Figure 4.2: Dynamically allocated multi-queue (DAMQ). (a) DAMQ with one reading port. (b) DAMQ with multiple reading ports for a 2-dimensional network router. (Legend: HP:K / TP:K, head and tail pointers for subqueue K; EL, linked list for empty memory cells; PRC, out-link for the processor.)

Despite the performance merits mentioned above, DAMQ suffers high design complexity caused by its linked lists and dynamic queue management, resulting in slow queues and routers. Additionally, DAMQ generally provides only one reading port and, thus, the port needs to be shared by its subqueues, becoming a bottleneck. To solve this problem, this work refers to DAMQ as the DAMQ in which each subqueue has its own reading port. However, this assumption might significantly increase crossbar size, complicating router architectures. For example, suppose a two-dimensional network router with one virtual channel per physical link.
In the case of DAMQ with one reading port, the router requires a 5 x 5 crossbar because the five router input ports need five DAMQs (i.e., X+, X-, Y+, Y-, Proc) and each DAMQ has one reading port. In contrast, a router utilizing the DAMQ assumed in this work needs a 20 x 5 crossbar due to five router input ports and four reading ports per DAMQ, as shown in Figure 4.2 (b). However, the increased architecture complexity can be minimized by decoupled crossbar designs. This is because all the packets in each subqueue of DAMQ are destined for a pre-defined output link and, thus, the crossbar does not have to provide the subqueue with connections to all the output links. This will be further discussed in Section 4.3.

The second problem of DAMQ is its limitation of true fully adaptive routing capability. In DAMQ, when a packet arrives, a routing decision for the packet is made in order to assign the packet to one of the subqueues and to route the packet through the output link associated with the subqueue. Therefore, after being stored in a subqueue, the packet loses routing adaptivity. This might degrade network performance and, thus, make DAMQ unsuitable for fully adaptive routers.

As seen, despite the performance benefits of CQ and DAMQ, i.e., design simplicity, no HOL blocking, and dynamic queue management, CQ and DAMQ have some limitations in supporting true fully adaptive routing capability. Therefore, there is a need to develop queue designs that better support fully adaptive routers by efficiently resolving the problems of DAMQ and CQ.
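For reference, the DAMQ bookkeeping described above (a shared pool of packet cells, a free list, and one linked list per output link) can be modeled in software as follows; the class and field names are our own, and the hardware would hold the same state in pointer registers rather than Python lists.

```python
class DAMQ:
    """Behavioral sketch of a dynamically allocated multi-queue:
    N shared packet cells partitioned among per-output-link
    subqueues plus a free list, all kept as linked lists
    (nxt[i] is packet cell i's down-link)."""

    def __init__(self, n_cells, out_links):
        self.pkt = [None] * n_cells
        self.nxt = [i + 1 for i in range(n_cells - 1)] + [None]
        self.free = 0                       # head of the empty-cell list
        self.head = {k: None for k in out_links}
        self.tail = {k: None for k in out_links}

    def write(self, out_link, packet):
        cell = self.free
        if cell is None:
            raise OverflowError("no empty cells")
        self.free = self.nxt[cell]          # pop a cell off the free list
        self.pkt[cell], self.nxt[cell] = packet, None
        if self.head[out_link] is None:
            self.head[out_link] = cell
        else:
            self.nxt[self.tail[out_link]] = cell
        self.tail[out_link] = cell

    def read(self, out_link):
        cell = self.head[out_link]
        if cell is None:
            return None
        self.head[out_link] = self.nxt[cell]
        if self.head[out_link] is None:
            self.tail[out_link] = None
        packet, self.pkt[cell] = self.pkt[cell], None
        self.nxt[cell] = self.free          # return the cell to the free list
        self.free = cell
        return packet

d = DAMQ(4, ["X+", "Y+"])
d.write("X+", "a"); d.write("Y+", "b"); d.write("X+", "c")
assert d.read("Y+") == "b"   # Y+ traffic is unaffected by queued X+ packets
```

The last line illustrates why DAMQ has no HOL blocking across output links: each link drains its own list.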
4.1.3 Enhanced Dynamically Allocated Multi-Queue

To overcome the problems of the traditional queue designs, i.e., DAMQ and CQ, and, thus, to better exploit the true fully adaptive routing capability of deadlock recovery schemes, this work proposes two enhanced dynamically allocated multi-queues: the dynamically allocated multi-queue with recruit registers (DAMQWR) and the virtual channel dynamically allocated multi-queue (VCDAMQ).

As shown in Figure 4.3 (a), DAMQWR is basically a DAMQ that consists of K size-varying subqueues associated with the K output links and one additional subqueue for empty memory cells.

Figure 4.3: Dynamically allocated multi-queue with recruit registers. (a) DAMQ with recruit registers (DAMQWR). (b) DAMQWR for the X+ direction in a two-dimensional torus network router.

The architectural difference between DAMQ and DAMQWR is that each subqueue of DAMQWR (except for the PRC subqueue) has K - 2 additional registers called recruit registers. Each of these registers in a subqueue is associated with one of the output links, excluding the output link assigned to the subqueue and the opposite directional output link.
For instance, Figure 4.3 (b) represents the DAMQWR associated with the X+ input link in a two-dimensional torus network router having five output links, X+, X-, Y+, Y-, and PRC, where PRC stands for the output link toward the processing node. In this figure, there is no X- subqueue because this work assumes no backtracking of packets in routing. As shown, the X+ subqueue has three recruit registers, which are associated with Y+, Y-, and PRC. Each of these recruit registers points to a packet that can be routed to the output link that the register is associated with. Namely, the Y+ recruit register of the X+ subqueue points to a packet in the X+ subqueue that can be routed toward the Y+ direction.

The main function of these recruit registers is to help an empty and idle subqueue to actively recruit a packet from the other subqueues and, thus, to transmit the recruited packet through the associated idle output link. Namely, when a subqueue has no packet to send, the subqueue scans its associated recruit registers, chooses one of the packets that can be routed through the empty subqueue, recruits the packet by updating the linked lists, and transmits the packet through the idle subqueue and its coupled output link. This resolves the problem of DAMQ, i.e., poor accommodation of routing adaptivity, by allowing blocked packets in a
Consequently, this queue design is believed to efficiently exploit true fully adaptive routing capability by minimizing message blocking while maximizing router speed. The main disadvantage of DAMQWR is the additional delay incurred by its recruit register updates and packet recruit operations. To minimize this prob lem. this work makes a couple of assumptions. First, regarding recruit register updates, each empty recruit register can be updated by scanning all the pack ets in its associated subqueue, which is straightforward, but very expensive. To simplify this process and, thus, to make DAMQWR faster, in this work, empty recruit registers of a subqueue are assumed to be updated only when a new in coming packet is stored into the subqueue. For example, suppose a packet that has two routing choices, e.g., X + and Y + , and is about to be stored into the X-F subqueue according to the routing decision logic. If the Y-t- recruit regis ter of the X-F subqueue is empty, the value of the recruit register is updated to 105 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. point to that packet, which eliminates the necessity to scan all the packets in a subqueue. Although this can leave some recruit registers empty for a while, this does not hamper network performance because there are n - 2 recruit registers associated with a subqueue and, thus, the possibility that an empty subqueue can find packets to recruit is sufficiently high. Secondly, packet recruit operations are also expensive compared to normal queue operations such as reads and writes. To minimize the negative impact of this expensive operation on network perfor mance, this work assumes these operations to be multi-cycle operations without elongating clock cycle time. Another enhanced dynamically allocated multi-queue proposed in this work is Virtual Channel-related DAMQ (VCDAMQ). 
This queue combines the virtues of DAMQ and CQ, i.e., dynamic queue management and full utilization of all routing choices given. First, from an architectural perspective, VCDAMQ is similar to the DAMQ shown in Figure 4.2 (b). The only difference is that the subqueues of VCDAMQ are associated with virtual channels, while those of DAMQ are associated with output links. Hence, like DAMQ, VCDAMQ can efficiently adapt to unbalanced traffic loads among virtual channels by dynamically allocating queue space to virtual channels. Secondly, VCDAMQ can also be viewed as a set of size-varying CQs assigned to a physical link. Therefore, each subqueue of VCDAMQ is not coupled to any specific output link and, thus, can route packets in any direction, better accommodating true fully adaptive routing capability.

As another benefit, VCDAMQ is designed to support well the hierarchical and the enhanced-hierarchical crossbar structures that have subcrossbars associated with virtual networks. In these crossbar structures, although routing locality has been shown to exist in Section 3.3.2.1, load balancing among different virtual networks can easily be broken due to the decoupled crossbar structure. VCDAMQ can efficiently handle the unbalanced loads by dynamically allocating queue space to virtual networks, which is expected to improve network performance.

The problem of VCDAMQ is the complicated queue architecture incurred by its dynamic queue space management, similar to DAMQ. This might degrade overall network performance by increasing queue latency. Additionally, VCDAMQ has the HOL blocking problem just like CQ. However, this problem might be mitigated since VCDAMQ operates as a set of virtual channels that can reduce HOL blocking.
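To make the recruit mechanism of DAMQWR concrete, the following behavioral sketch models subqueues as Python lists rather than hardware linked lists; all names are ours, and the lazy register update on packet arrival follows the assumption stated in Section 4.1.3.

```python
class DAMQWR:
    """Simplified model of DAMQWR's recruit mechanism.  A recruit
    register per (home subqueue, alternative output link) remembers
    one resident packet that could also be routed via that link; an
    idle, empty subqueue may claim such a packet and forward it
    through its own output link."""

    def __init__(self, out_links):
        self.sub = {k: [] for k in out_links}
        self.recruit = {}   # (home_link, alt_link) -> packet

    def write(self, out_link, packet, alt_links=()):
        self.sub[out_link].append(packet)
        for alt in alt_links:
            # lazy update: fill an empty register only on packet arrival
            self.recruit.setdefault((out_link, alt), packet)

    def read(self, out_link):
        if self.sub[out_link]:
            pkt = self.sub[out_link].pop(0)
        else:
            pkt = self._try_recruit(out_link)    # subqueue idle: recruit
        if pkt is not None:
            self._drop_registers(pkt)
        return pkt

    def _try_recruit(self, idle_link):
        for (home, alt), pkt in self.recruit.items():
            if alt == idle_link and pkt in self.sub[home]:
                self.sub[home].remove(pkt)       # "update the linked lists"
                return pkt
        return None

    def _drop_registers(self, pkt):
        for key in [k for k, v in self.recruit.items() if v == pkt]:
            del self.recruit[key]

q = DAMQWR(["X+", "Y+", "PRC"])
q.write("X+", "p0", alt_links=["Y+"])   # p0 may leave via X+ or Y+
assert q.read("Y+") == "p0"             # idle Y+ subqueue recruits p0
```

The last two lines show the payoff: a packet parked behind a congested X+ link still departs through the idle Y+ link, which a plain DAMQ cannot do.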
4.2 Queue Implementation

Figure 4.4 presents an implementation example of the four queues presented, i.e., CQ, DAMQ, DAMQWR, and VCDAMQ. Based on this implementation example, this section provides insight on the relative implementation costs of the proposed queue designs. Note that the main purpose of this section does not lie in queue design optimization. Rather, it makes an effort to identify implementation differences among the presented queues and, thus, to evaluate their relative costs. To reflect the latest implementation technology trends, i.e., large die size and small transistor feature size, the following assumptions are made for queues and routers: virtual cut-through switching, large queue size (i.e., capable of storing more than 8 packets), and input queueing.

As shown in the figure, these queues commonly include packet cells, demultiplexers, and multiplexers. A packet cell is a circular queue that can store a whole packet. The multiplexer (MX) and the demultiplexer (DM) select one of the packet cells to read and write, respectively. The reason why each queue design comprises packet cells rather than phit cells is to reasonably reduce the high cost of the large queues assumed in this work. A phit cell is a set of registers that can store a phit, where a phit is the data unit delivered within one clock cycle through a physical link. Generally, a phit is a subset of a packet, meaning a
smaller data unit.

Figure 4.4: Implementation of queue designs for fully adaptive routers: (a) circular queue (CQ); (b) DAMQ and VCDAMQ; (c) DAMQWR.

Therefore, when a large queue consists of phit cells, the size of the MX and DM required must be large. For instance, consider a 128 phit-deep queue with a packet size of 8 phits. In the case of a queue comprising phit cells, a 128-to-1 demultiplexer/multiplexer is required to select one out of 128 phit cells, which makes queues considerably slow. In contrast, if the queue consists of packet cells, a 16-to-1 demultiplexer/multiplexer is sufficient. Of course, in this case, multiple hierarchical accesses are required to read and write a phit from/to the queue. Namely, to write or read a phit, this hierarchical queue design first selects a packet cell and then, subsequently, selects a phit cell to read or write within the pre-selected packet cell, which can increase the average access time of queues. However, this increased access time can be minimized by exploiting the queue access behavior of virtual cut-through switching, where data flowing between routers are controlled on a packet basis instead of a phit basis. With this switching, once a packet cell is selected, many subsequent queue accesses for reads and writes are done within the selected packet cell, which makes for the common case of queue accesses. This reduces the average queue access time almost to the average time to access a phit cell without changing packet cells. Furthermore, the control process to select a packet cell and the control process to select a phit cell can be overlapped instead of being serialized. This is because the time when the next packet cell is selected can be predicted, i.e., right after the last phit of a
Consequently, this hierarchical queue design reasonably reduces the cost of large queues, resulting in fast queues.

The main implementation difference between CQ and the DAMQs (i.e., DAMQ, VCDAMQ, and DAMQWR) lies in the control of packet cell accesses. By design, CQ always accesses packet cells sequentially. Therefore, as shown in Figure 4.4 (a), its input and output controllers, i.e., DMC and MXC, can be realized by two (log N)-bit incremental counters, where N is the number of packet cells in CQ. In contrast, packet cell accesses in the DAMQs are controlled by linked lists. For these linked lists, each DAMQ in Figure 4.4 (b) and (c) requires (2K + N) (log N)-bit pointer registers, where K and N are the number of subqueues and the number of packet cells in a DAMQ, respectively. Each of the N pointers, named packet cell link pointers (Pnt), is coupled to one of the N packet cells and down-links the coupled packet cell to another packet cell. The 2K pointers are the head (H) and tail (T) pointers of the K subqueues, which point to the packet cells located at the head and the tail of the subqueues. For example, consider the linked list for Subqueue 1 given in Figure 4.4 (b). The head pointer of Subqueue 1, i.e., the leftmost box in Queue Control containing "H", points to Packet Cell 1, which means that Packet Cell 1 is at the head of Subqueue 1 and, thus, is the next to be read. The packet cell link pointer coupled with Packet Cell 1, i.e., Pnt-PC1, points to Packet Cell 2, which down-links Packet Cell 1 to Packet Cell 2 in the linked list of Subqueue 1. In this way, packet cells are linked and form the linked list for Subqueue 1 as shown in the figure. Besides the pointers mentioned above, DAMQWR requires an additional 2 x (K - 1) x (K - 2) (log N)-bit recruit registers, as shown in Figure 4.4 (c).
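The head/tail and link-pointer bookkeeping described above can be illustrated with a short Python sketch. This is a simplified model, not the dissertation's hardware: the names (`Damq`, `enqueue`, `dequeue`) are invented, and a plain free list stands in for the Empty Cell Queue of Figure 4.4.

```python
class Damq:
    """Simplified DAMQ bookkeeping: N packet cells shared by K subqueues.

    Each subqueue is a singly linked list threaded through the per-cell
    link pointers (the 'Pnt' registers); 2K head/tail pointers mark its ends.
    """

    def __init__(self, n_cells, k_subqueues):
        self.link = [None] * n_cells          # packet cell link pointers (Pnt)
        self.free = list(range(n_cells))      # stands in for the Empty Cell Queue
        self.head = [None] * k_subqueues      # H pointer per subqueue
        self.tail = [None] * k_subqueues      # T pointer per subqueue

    def enqueue(self, sq):
        """Allocate a free packet cell and append it to subqueue sq."""
        cell = self.free.pop(0)               # index loaded into DMC
        self.link[cell] = None
        if self.head[sq] is None:             # subqueue was empty
            self.head[sq] = cell
        else:                                 # down-link the old tail to the new cell
            self.link[self.tail[sq]] = cell
        self.tail[sq] = cell
        return cell

    def dequeue(self, sq):
        """Release the packet cell at the head of subqueue sq."""
        cell = self.head[sq]
        self.head[sq] = self.link[cell]       # advance H to the next linked cell
        if self.head[sq] is None:
            self.tail[sq] = None
        self.free.append(cell)
        return cell
```

In this sketch a write touches only the tail pointer and one link pointer, and a read only the head pointer, matching the small, independent pointer updates described in the text.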
Each subqueue of DAMQWR (except for the PRC subqueue) has K - 2 recruit registers pointing to the packet cells that can be routed to the other K - 2 output links (excluding the output link associated with the subqueue and the opposite directional output link). Each recruit register in a subqueue is accompanied by one pointer providing the up-link of the packet cell pointed to by the recruit register in the linked list. This pointer is required so that, when a packet cell is recruited, the associated linked list can be updated.

These linked lists and recruit registers increase the cost of DAMQ and DAMQWR. However, the increased cost is negligible compared to the total cost dedicated to packet cells in a queue. For instance, consider a 128 phit-deep DAMQ and DAMQWR with a packet size of 8 phits and a phit size of two bytes, assuming five subqueues (e.g., a 2D torus). The queue space required for packet cells is 256 bytes in both DAMQ and DAMQWR, while the queue space for pointers is only 13 bytes for DAMQ and 28 bytes for DAMQWR (about 10% of the queue space dedicated to packet cells). This makes queue speed a more important issue than queue cost (overhead space).

The speed of queues can be evaluated in terms of the two main queue operations: read and write. Due to the hierarchical access structure of the queues, these two operations are further classified into intra-packet cell (intra-PC) and inter-packet cell (inter-PC) operations. An intra-PC operation reads or writes a phit from/to a queue without changing packet cells, while an inter-PC operation does so with a change of packet cells. Note that, in the queue designs presented here, the packet cell designs of all the queues are identical, i.e., circular queues. This makes the delays of all intra-PC operations equivalent across the queue designs.
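The pointer-overhead arithmetic above can be reproduced directly. A short sketch, using the configuration stated in the text (128-phit queue, 8-phit packets, 2-byte phits, five subqueues); variable names are illustrative:

```python
import math

phits, packet_phits, phit_bytes, K = 128, 8, 2, 5   # configuration from the text
N = phits // packet_phits                  # 16 packet cells
ptr_bits = int(math.log2(N))               # 4-bit pointers address 16 cells

cell_bytes = phits * phit_bytes            # storage dedicated to packet cells
damq_ptr_bytes = (2 * K + N) * ptr_bits // 8   # head/tail + link pointers
recruit_regs = 2 * (K - 1) * (K - 2)       # extra DAMQWR recruit/up-link registers

print(cell_bytes, damq_ptr_bytes, recruit_regs)   # 256 13 24
```

The script reproduces the 256-byte packet-cell figure and the 13-byte DAMQ pointer figure; for DAMQWR it shows the 24 additional recruit/up-link registers that raise the pointer overhead toward the 28-byte figure quoted above.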
In contrast, the queue designs presented have different delays for inter-PC operations due to their different queue management. First, consider the inter-PC write operation. To find and select an empty packet cell for an incoming phit, CQ simply increases the value of the counter controlling the demultiplexer, i.e., DMC, by one, according to its sequential access policy. In contrast, the DAMQs, i.e., DAMQ, VCDAMQ, and DAMQWR, load a pointer value from the Empty Cell Queue into DMC to select an empty packet cell, as shown in Figure 4.4 (b) and (c). While loading the incoming packet into the selected packet cell, the DAMQs update a linked list in order to assign the incoming packet to a subqueue. Updating a linked list on a write requires the DAMQs to modify two pointers, i.e., the tail pointer and a packet cell link pointer, in order to add the incoming packet to the tail of the linked list. Beyond these pointer updates, if the incoming packet has additional routing choices, DAMQWR updates two more pointers, i.e., the recruit register and the associated up-link pointer. Although the DAMQs need to update multiple pointers, these updates can be done independently and simultaneously, which does not significantly increase queue delay. Moreover, note that the inter-PC write delay of DAMQ and DAMQWR includes the delay of the routing decision logic, because DAMQ and DAMQWR make routing decisions before storing packets; this delay is not included in the inter-PC write delay of CQ and VCDAMQ. Namely, although the inter-PC write delay of DAMQ and DAMQWR is higher than that of CQ and VCDAMQ, when overall routing delay is considered, the inter-PC write delay added by DAMQ and DAMQWR is negligible.
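The recruit machinery that these extra pointers enable can be sketched as follows. This is a simplified, hypothetical model (function and variable names invented): real DAMQWR hardware would also have to fix the old subqueue's tail pointer when the recruited cell was that subqueue's tail, which is omitted here for brevity.

```python
def recruit_read(sq, link, recruit, uplink):
    """DAMQWR recruit step (sketch): subqueue sq has nothing to send, so
    steal a packet waiting in another subqueue that can also be routed
    toward sq's output link.  recruit[sq] lists candidate packet cell
    indices; uplink[c] is cell c's predecessor in its current subqueue
    (None if c is that subqueue's head)."""
    for i, cell in enumerate(recruit[sq]):
        if cell is None or uplink[cell] is None:
            continue                      # recruit only non-head, idle packets
        prev = uplink[cell]
        link[prev] = link[cell]           # unlink the cell from its old list
        recruit[sq][i] = None             # clear the consumed recruit register
        return cell                       # cell is served on sq's output link
    return None                           # nothing recruitable right now
```

The up-link pointer is what makes the unlink step a single pointer write instead of a list traversal, which is why DAMQWR maintains it alongside each recruit register.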
Regarding the inter-PC read operation, CQ increments the value of the counter (MXC) by one to control the packet cell multiplexer and, thus, to select a packet cell to read from. In contrast, DAMQ, VCDAMQ, and DAMQWR all modify the head pointer of the subqueue such that the old packet cell at the head of the subqueue is replaced by the next packet cell in the associated linked list. Beyond this pointer update, when a subqueue is empty, DAMQWR scans its recruit registers to choose a packet to recruit and then updates the associated linked lists to recruit the packet. This operation is complicated and expensive compared to the other operations, resulting in high latency. However, considering that (1) this recruit operation is executed only when a subqueue has nothing to send and (2) DAMQWR recruits only packets that are not at the head of subqueues and are waiting idle, it is not difficult to see that the increased latency of the recruit operation does not hamper the performance of DAMQWR.

To provide insight into the speed of the queue designs presented, Table 4.1 gives the delays of five queue operations for six queue designs. Based on the implementation technology of the WARRP router [67] (see next chapter), these queue delays are estimated by identifying the critical path of each queue operation and adding up the delay components along the critical path. For this evaluation, a queue depth of 128 phits, a packet size of 8 phits, and a two-dimensional torus network are assumed.

Table 4.1: Delay comparison of the various queue designs.

                   CQ (1VC)   CQ (2/4VC)   DAMQ     VCDAMQ   DAMQWR
Write (Intra-PC)   2.2 ns     2.2 ns       2.2 ns   2.2 ns   2.2 ns
Write (Inter-PC)   3.1 ns     2.2 ns       3.7 ns   3.5 ns   3.9 ns
Read (Intra-PC)    2.0 ns     2.0 ns       2.0 ns   2.0 ns   2.0 ns
Read (Inter-PC)    2.9 ns     2.0 ns       3.3 ns   3.3 ns   3.3 ns
Recruit-Read       N/A        N/A          N/A      N/A      6.2 ns
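As a sanity check on Table 4.1, the relative inter-PC write penalties can be recovered from the table values with a short script (delays transcribed from the table; everything else is illustrative):

```python
# Inter-PC write delays transcribed from Table 4.1, in nanoseconds
inter_pc_write = {"CQ (2/4VC)": 2.2, "VCDAMQ": 3.5, "DAMQ": 3.7, "DAMQWR": 3.9}

base = inter_pc_write["CQ (2/4VC)"]
for name in ("VCDAMQ", "DAMQ", "DAMQWR"):
    extra = round(100 * (inter_pc_write[name] - base) / base)
    print(f"{name}: {extra}% longer inter-PC write than CQ")
```

This yields 59%, 68%, and 77% for VCDAMQ, DAMQ, and DAMQWR, respectively, matching the percentages quoted in the discussion of the table.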
Additionally, CQ is assumed to implement one, two, or four virtual channels per physical link by dividing a large queue into as many small CQs as there are virtual channels. Namely, in the case of one virtual channel per physical link, the size of the CQ allocated to each virtual channel is 128 phits, while in the case of four, the size of the CQ allocated to each virtual channel is 32 phits. The size of the packet cell multiplexers and demultiplexers decreases as the number of virtual channels increases, which makes the delay to select a phit cell dominant rather than the delay to select a packet cell. This is why CQ with two and four virtual channels has smaller inter-PC operation delays than CQ with one virtual channel in Table 4.1.

As shown in the table, the delay of each intra-PC operation is the same for all the queue designs because their packet cell designs are identical. The reason why the intra-PC write takes longer than the intra-PC read is that, with the implementation technology of the WARRP router, a write into a register has a longer delay than a read from a register. Regarding inter-PC operations, DAMQWR, DAMQ, and VCDAMQ have up to 77%, 68%, and 59% longer delays than CQ, respectively. This indicates that the dynamic queue management of the DAMQs is more expensive than the static queue management of CQ and, moreover, that the complicated recruit function makes DAMQWR the slowest. However, the increased delay of the DAMQs, i.e., DAMQ, VCDAMQ, and DAMQWR, does not significantly hamper router speed. This is because (1) the clock cycle time of a router is determined by the data pass-through delay rather than the path set-up delay, as explained in Section 3.3.1, and (2) in virtual cut-through
switching, inter-PC operation delays are related to the path set-up delay while intra-PC operation delays are part of the data pass-through delay. In other words, although the path set-up latency of routers with DAMQs is much longer than that of routers with CQ, router clock speed is not really affected by queue design. Rather, the clock speed is predominantly affected by crossbar design, which is taken into account in the following section.

4.3 Incorporation of Queue and Crossbar

In Chapter 3, partitioned crossbar designs were shown to effectively reduce the cost of true fully adaptive routing capability and to increase performance. Therefore, to maximize network performance, queue designs must cooperate efficiently with the crossbar designs so that routers can minimize network congestion while maximizing routing speed. In this regard, this section considers how to effectively incorporate queue designs and crossbar designs in routers by identifying the crossbar design that best matches each queue design. For this work, three crossbar designs, i.e., the unified (U-CB), the cascade (C-CB), and the enhanced hierarchical crossbar (E-CB), and four queue designs, i.e., CQ, DAMQ, VCDAMQ, and DAMQWR, are considered. The hierarchical crossbar is excluded because of its inferiority to the enhanced hierarchical crossbar, as shown in the last chapter.

First of all, the circular queue (CQ) can be combined with any type of crossbar. However, in the last chapter, true fully adaptive routers assuming CQ were shown to achieve the highest network performance with the enhanced hierarchical crossbar (E-CB) design as opposed to the other crossbar designs. This is because E-CB efficiently exploits routing locality within virtual channel networks and, thus, successfully minimizes the cost of the unrestricted routing capability of deadlock recovery schemes.
Therefore, this research assumes E-CB to be the crossbar best matching CQ for fully adaptive router designs.

From the functional view, VCDAMQ is simply a set of size-varying CQs implementing virtual channels. Therefore, VCDAMQ is also expected to perform best with the enhanced hierarchical crossbar (E-CB). Moreover, due to its dynamic queue management, VCDAMQ is expected to better accommodate the decoupled crossbar structure of E-CB than the circular queue, as mentioned before. As a result, in this work, E-CB is assumed to be the best crossbar design for VCDAMQ.

Unlike CQ and VCDAMQ, each subqueue of DAMQ and DAMQWR is coupled with a pre-defined output link. This eliminates the need for a crossbar

[Figure 4.5 artwork not legible in this scan; it shows the sub-crossbars and a deadlock buffer.]

Figure 4.5: Port-directed crossbar.

to provide connections between a subqueue and the output links unrelated to that subqueue, which enables decoupled crossbars. Since the subqueues of DAMQ and DAMQWR are associated with output link directions, the decoupled crossbar expected to work best with DAMQ and DAMQWR is a variant of the cascade crossbar, shown in Figure 4.5. This crossbar is named the port-directed crossbar (PD-CB). Note that this crossbar does not have connect channels between sub-crossbars. This is because there is no possibility that packets in a subqueue head

Table 4.2: Crossbar size according to queue designs and network dimensions.
                        Crossbar Size (2D)   Crossbar Size (3D)
CQ & U-CB (1VC)         5 x 6                7 x 8
CQ & U-CB (2VC)         10 x 11              14 x 15
CQ & U-CB (4VC)         20 x 21              28 x 29
CQ & E-CB (2/4VC)       7 x 8                9 x 10
VCDAMQ & E-CB (2/4VC)   7 x 8                9 x 10
DAMQ & PD-CB            4 x 2                6 x 2
DAMQWR & PD-CB          4 x 2                6 x 2

for the output links that the subqueue is not associated with. This can further reduce crossbar size while completely resolving the internal complications of the decoupled crossbars. Therefore, this work identifies PD-CB as the best crossbar for DAMQ and DAMQWR.

To provide insight into how queue designs affect crossbar size, Table 4.2 gives the size of the crossbars required by routers with the queue designs presented. In this table, two-dimensional and three-dimensional torus networks are assumed. Routers incorporating CQ or VCDAMQ are assumed to have one, two, or four virtual channels per physical channel in order to reduce the HOL blocking problem. In contrast, for the router designs having DAMQ or DAMQWR, no virtual channels are assumed because these queues do not suffer from HOL blocking at all. Additionally, two connect channels are assumed for E-CB because E-CB with this configuration was shown to provide the best performance in the last chapter.

Note that the table does not include the crossbar size of routers with E-CB and one virtual channel per physical link. This is because the E-CB router with one virtual channel is identical to the U-CB router with one virtual channel. As shown, the routers incorporating PD-CB with DAMQ or DAMQWR require the smallest crossbars (2D: 4 x 2; 3D: 6 x 2) in both 2D and 3D torus networks. This indicates that the queue structure of DAMQ and DAMQWR minimizes crossbar size and, thus, potentially leads to the fastest fully adaptive router designs. In addition, note that the size of E-CB is not affected by the number of virtual channels.
This enables the routers with CQ and VCDAMQ to efficiently resolve HOL blocking without sacrificing routing speed, making the router design with E-CB and CQ or VCDAMQ an optimal router design candidate for fully adaptive networks. Finally, the router with U-CB and four virtual channels requires the largest crossbar, which can make routers significantly slower and, thus, degrade network performance.

Indeed, this table identifies the queue designs that lead to smaller crossbars and faster routers. However, the queue designs leading to fast routers might not be the optimal queue designs for fully adaptive deadlock recovery networks. This is because network performance is determined not only by router speed but also by the capability to minimize message blocking. Thus, to find an optimal router design for fully adaptive networks, the following section evaluates, through extensive simulations, various router designs, each incorporating one of the alternative queue designs and its best-matching crossbar design.

4.4 Performance Evaluation

To evaluate the performance of router designs, each router design is assumed to incorporate one of the four alternative queue designs presented and its best-matching crossbar design, i.e., CQ & E-CB, VCDAMQ & E-CB, DAMQ & PD-CB, and DAMQWR & PD-CB, in the deadlock recovery-based WARRP router design [67]. In addition to these routers, this evaluation includes the primitive router designs of CQ and U-CB assuming 1, 2, or 4 virtual channels per physical link in order to estimate the performance improvement of the proposed router designs. Among the router designs to be evaluated and compared, the router designs with CQ/VCDAMQ and E-CB assume four virtual channels per physical link so as to resolve their HOL blocking problem and, thus, to maximize router performance.
This does not hamper router speed because the decoupled structure of E-CB minimizes the cost of virtual channels, as mentioned in the last chapter. The reason why four virtual channels are chosen for these routers is justified in Section 4.4.2.1. In contrast, the routers with DAMQ and DAMQWR do not employ virtual channels since they do not suffer from HOL blocking. Each router design is named by its queue and crossbar design.

First, this section estimates the speed of each router through synthesis in order to provide insight into the effect of each queue and its crossbar on router speed. Second, the performance of each queue is evaluated to examine how well the queue design accommodates the true fully adaptive routing capability of the underlying routing scheme. Finally, the network performance of each router design is evaluated to determine which design is optimal for true fully adaptive deadlock recovery-based routers.

4.4.1 Router Speed

The speed of router designs is determined by two main metrics, i.e., Tset-up and Tpass-thru. As described in Section 3.3.1, Tset-up and Tpass-thru are the minimum delays that a header phit (the first phit) and a data phit (one of the subsequent phits) take passing through a router, respectively. Because this work assumes the same router delay model, i.e., the WARRP router delay model, and the same implementation technology as the last chapter, the expressions for the metrics and the values of their delay components are identical to those given in Section 3.3.1, except for Tad and Tfc, which are affected by queue design complexity. By identifying the critical paths of the associated modules and adding

Table 4.3: Delay of router designs.
"Tsei-up Tpass-tfiru Max Clock Speed (2D/3D) (2D/3D) (2D/3D) CQ & U-CB (1VC) 8.5 / 9.7 ns 4.2 / 4.6 ns 238/217 MHz CQ & U-CB (2VC) 11.8/16.0 ns 5.4 / 6.2 ns 185/161 MHz CQ & U-CB (4VC) 18.1/23.4 ns 7.4 / 9.0 ns 135/132 MHz CQ & E-CB (4VC) 9.7/10.9 ns 4.6/5.1 ns 217/196 MHz VCDAMQ & E-CB (4VC) 9.7/10.9 ns 4.6 / 5.1 ns 217/196 MHz DAMQ & PD-CB 7.9/9.1 ns 4.0 / 4.4 ns 250 / 227 MHz DAMQWR & PD-CB 7.9/9.1 ns 4.0 / 4.4 ns 250 / 227 MHz up the delay of the basic logic components (e.g., register, and, or, etc.) that the critical paths comprise and are achieved based on the implementation technology of the VVARRP router, Tad and T/c are re-estimated as follows: Tad = 2.6 ns and Tfc = 2.7 ns. Table 4.3 gives internal router delays and the achievable maximum clock speed for the seven alternative router designs. The achievable maximum clock speed is determined by the inverse of Tp a a j of each router design in order to minimize the fragmentation of clock cycle time. Additionally, the speed of all the router designs is estimated for two dimensional and three dimensional torus networks since most up-to-date commercial network routers are based on these networks, e.g.. Cray T3D. Cray T3E, Alpha 21364 and, thus, router performance evaluation under these networks is interesting. 124 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. As shown in Table 4.3, regardless of network dimension, the router designs with decoupled crossbars, i.e., E-CB and PD-CB, have up to 85% higher clock speed than those with the unified crossbar (U-CB). This is because the decou pled structure of E-CB and PD-CB significantly reduces routing, arbitration, and crossbar complexity as shown in the last chapter. Moreover, regardless of the number of network dimensions, the routers with PD-CB are up to 16% faster than those with E-CB. 
This indicates that the queue structure of DAMQ and DAMQWR enables further reduction of crossbar size by eliminating the connect channels of E-CB, which further simplifies router architectures and results in faster routers. In addition, as shown, the routers for the 3D torus network are up to 15% slower than those for the 2D torus network. This indicates that the increased network connectivity of the 3D torus network complicates router architectures, although it might reduce network message congestion. Overall, the router designs with PD-CB and either DAMQ or DAMQWR are shown to be the fastest routers for both 2D and 3D torus networks. However, fast routers do not always result in high network performance, which raises the need to evaluate the network performance of each router design. That evaluation will help to determine which router design is optimal for fully adaptive deadlock recovery-based networks.

4.4.2 Network Performance Evaluation

This section evaluates and compares the performance of various router designs through extensive simulations using FlexSimNA, an enhanced version of FlexSim 1.2 in which a queue is capable of storing multiple packets. The reason why simulations are run using FlexSimNA instead of the FlexSim 1.2 used in Chapter 3 is to handle the large buffers assumed in this work. For all simulations, the following assumptions are made. Packets are 8 phits long. A queue depth of either 64 or 128 phits is assumed, capable of storing 8 or 16 packets, respectively. All router designs use one injection and one delivery physical channel. A true fully adaptive minimal deadlock recovery routing scheme (Disha) is assumed, with a default time-out of 25 cycles before deadlock is suspected.
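The time-out criterion can be sketched as a per-packet counter of consecutive blocked cycles. This is a schematic illustration, not the exact Disha detection hardware; the function and names are invented:

```python
TIME_OUT = 25  # cycles of continuous blocking before deadlock is suspected

def update_suspicion(blocked_cycles, advanced):
    """Advance or reset a packet's blocked-cycle counter each router cycle.
    Returns (new_counter, suspected)."""
    if advanced:                 # the packet moved, so it cannot be deadlocked
        return 0, False
    blocked_cycles += 1
    return blocked_cycles, blocked_cycles >= TIME_OUT
```

A packet that fails to advance for TIME_OUT consecutive cycles becomes eligible for the recovery path (in Disha, routing through the deadlock buffer).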
Each simulation is run for a duration of 50,000 simulation cycles beyond the initial transient period (the first 10,000 cycles) so that data is collected during steady state only. Other assumptions particular to each simulation are given in the corresponding section where necessary.

First, the following section analyzes how well each queue design can accommodate the unrestricted routing capability of deadlock recovery schemes. Second, the performance of each router design is evaluated in order to examine how efficiently and effectively each queue design cooperates with its crossbar and, thus, can minimize message congestion while maximizing router speed. This will yield an optimal router design for fully adaptive networks.

4.4.2.1 Queue Performance

The performance of each router design incorporating one of the queue designs presented, i.e., CQ, VCDAMQ, DAMQ, and DAMQWR, is evaluated by measuring maximum network throughput (in phits/node/cycle) and network latency (in cycles). For this evaluation, each router design is assumed to incorporate the unified crossbar so as to eliminate the side effects of crossbars on this queue performance evaluation.

Before evaluating queue performance, an optimal number of virtual channels for the router designs with CQ and VCDAMQ is chosen to maximize router performance. This number is obtained by simulating router designs with CQ and VCDAMQ under a 16 x 16 two-dimensional torus while varying the number of virtual channels from one to four. For this simulation, the size of each queue assigned to a physical link is assumed to be 128 phits. Figure 4.6 (a) and (b) provide the simulation results under a uniform traffic pattern (random) and a non-uniform traffic pattern (bit-reversal), respectively.
As shown in Figure 4.6, the routers having four virtual channels per physical link have the highest performance under both traffic patterns.

[Figure 4.6: Effect of Virtual Channels. (a) Random traffic; (b) bit-reversal traffic. Latency (cycles) versus throughput (phits/node/cycle) for CQ and VCDAMQ with one, two, or four virtual channels. Plot artwork not legible in this scan.]

Therefore, this section
Figure 4.7 compares the performance of the router designs with one of the four alternative queue designs presented. As shown, in case of small queues (64 phit-deep queues), CQ, VCDAMQ, and DAMQWR have the highest maximum throughput (15% higher maximum 129 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. — CQ(4VC)_64 VCDAMQ(4VC)_64 - A — DAMQ_64 DAMQWFL64 — * —CQ(4VC)_128 — • — VCD AMQ(4VC)_128 —h ~DAMQ_128 — ■ — D AMQWR_128 3 0 0 S 2 5 0 > 200 > 1 5 0 Z 100 50 0 . 3 0 . 5 0 0.2 0 . 4 0.1 T h r o u g h p u t ( p h i t / n o d e / c y c l e ) Figure 4.7: Effect of queue size in a 2D torus network. throughput than DAMQ). This indicates that CQ, VCDAMQ and DAMQWR better support the given routing adaptivity by allowing packets to be routed through any direction while the queue structure of DAMQ somewhat limit routing adaptivity by storing packets in one of the subqueues and. thus, tying packets to the associated routing direction. Furthermore, the problem of DAMQ becomes more serious when queue size is small because, in this case. DAMQ can often allocate only one packet cell to subqueues and, thus, packets heading for the subqueues might significantly suffer. In contrast, CQ, V'CDAMQ, and DAMQWR 130 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. performs best by allowing packets to be routed through any directions and better utilizing small queue space. As the queue size increases to 128 phits, the maximum throughput of DAMQ and DAMQWR increases by up to 7% while those of CQ and VCDAMQ in crease by less than 2%. This means that, DAMQ and DAMQWR necessitate large queue space in order to fully benefit from their advantages, i.e., no HOL blocking problem and dynamic queue allocation. Additionally, this helps DAMQ to minimize its problem mentioned above by allocating more packet cells to each subqueue. 
Indeed, with 128 phit-deep queues, the router with DAMQWR has the highest performance (up to 4% higher maximum throughput than the other routers). This result indicates that DAMQWR best supports the given routing adaptivity of deadlock recovery schemes while efficiently utilizing queue space. Additionally, the routers with DAMQ has the lowest maximum throughput and the highest latency in both cases of 64 and 128 phit-deep queues (up to 17% lower maximum throughput than the router with DAMQWR). This indicates that the queue structure of DAMQ considerably limits network performance by not efficiently exploiting routing adaptivity. The recruit capability of DAMQWR indeed resolve the problem of DAMQ, which improves network performance. As another observation, regardless of queue size, the router with VCDAMQ has slightly lower performance than that with CQ (1% lower throughput). This 131 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. shows that VCDAMQ often makes traffic loads among virtual channels unbal anced by dynamically and not equally allocating queue space among virtual chan nel networks. Indeed, as shown in [77, 14], network performance is maximized when loads among virtual channels is balanced. Violating this, V'CDAMQ slightly reduces its maximum throughput as shown. In contrast, CQ statically and evenly allocates queue space to virtual channels, which helps to balance loads among vir tual channels and increases network throughput. Overall, when sufficient queue space is provided. DAMQWR best exploits the routing adaptivity given by the underlying routing scheme while effectively and dynamically utilizing queue space, which results in the highest router per formance. To better justify this conclusion and to analyze the effect of network dimension on queue performance, the simulation of each router design is repeated under a three dimensional torus network. 
3D torus: Each router design is simulated under an 8 x 8 x 8 three-dimensional torus network and random traffic. The depth of the queue assigned to each physical link is assumed to be 128 phits. Although giving a performance preference to DAMQ and DAMQWR, this assumption, i.e., 128 phits rather than 64 phits, can be justified by the trend toward large router queues and the trend of CMOS technology, i.e., small transistors and large dies.

[Figure 4.8: Three-dimensional torus (8 x 8 x 8). Latency versus throughput for CQ (4VC), VCDAMQ (4VC), DAMQ, and DAMQWR with 128 phit-deep queues. Plot artwork not legible in this scan.]

Figure 4.8 compares the performance of the router designs. First, the performance of CQ is almost comparable to that of DAMQWR under the three-dimensional torus network, which is the highest. This is because the three-dimensional network improves the degree of routing adaptivity (two routing choices in 2D versus three in 3D) and, thus, the relative benefits of DAMQWR, i.e., dynamic queue allocation and freedom from HOL blocking, diminish. However, this does not mean that routers with CQ and DAMQWR have comparable network performance under high-dimensional networks, because this evaluation does not take into account crossbar designs and router architecture complexity. This raises the need to evaluate each router design taking crossbar design and router speed into account, which is done in Section 4.4.2.2. Additionally, DAMQ shows up to 17% lower maximum throughput than the others. This confirms the conclusion derived from the two-dimensional network, i.e., that DAMQ degrades network performance by not well supporting the given routing adaptivity.
Moreover, VCDAMQ is shown to have slightly lower throughput than CQ and DAMQWR. This also verifies that the dynamic queue allocation of VCDAMQ produces load imbalance among virtual channel networks and, thus, reduces network throughput. Nevertheless, when combined with the enhanced hierarchical crossbar (E-CB), VCDAMQ is expected to alleviate the problems associated with the decoupled crossbar structure by dynamically allocating queue space among virtual channel networks, improving overall network performance.

Non-uniform traffic pattern: To evaluate the capability of each queue to handle hot spots and biased traffic, each router design is evaluated under a non-uniform traffic pattern, i.e., bit-reversal. All simulations are executed under a 16 x 16 two-dimensional torus network with duplex links. Queue size is set to 128 phits.

[Figure 4.9: Non-uniform traffic pattern. Average latency (cycles) versus throughput (phits/node/cycle) for CQ (4VC), VCDAMQ (4VC), DAMQ, and DAMQWR, all with 128-phit queues.]

The simulation results are shown in Figure 4.9. First, DAMQWR and VCDAMQ show the highest performance under the non-uniform traffic pattern (up to 7% higher maximum throughput than CQ). This means that the dynamic queue allocation capability of DAMQWR and VCDAMQ helps to efficiently handle the biased and congested traffic generated by the bit-reversal pattern. Second, DAMQ has 20% lower throughput than DAMQWR, which is the lowest performance. Moreover, the performance gap between DAMQ and DAMQWR becomes wider under the non-uniform traffic pattern than under the uniform pattern (17% versus 20%). This indicates that, indeed,
DAMQ does not well support true fully adaptive routing capability and suffers more under the non-uniform traffic pattern, where routing adaptivity is more advantageous. Overall, regardless of network configuration, the router design with DAMQWR is shown to best exploit the true fully adaptive routing capability of deadlock recovery schemes while efficiently utilizing queue space, resulting in the highest performance. However, this is not sufficient to claim that the router with DAMQWR is an optimal router design for fully adaptive deadlock recovery-based networks, because this evaluation does not take into account crossbar designs and router architecture complexity, both of which significantly affect network performance.

4.4.2.2 Architecture Complexity-aware Router Performance Comparison

This section evaluates and compares the performance of the router designs in order to arrive at a high-speed, true fully adaptive router design. For this evaluation, the maximum network throughput (in phits/node/ns) and the average latency (in ns) of each router design are measured taking router clock speed and crossbar designs into account. The clock speed of each router design obtained in Section 4.4.1 is used for this evaluation. Additionally, the simulation of each router design is run under a 16 x 16 two-dimensional torus network with a queue depth of
This indicates that (1) the recruit function of DAMQVVR indeed increases queue utilization and minimizes message blocking and (2) the queue structure of DAMQVVR enables the ability to minimize crossbar size while completely resolving the internal com plications of decoupled crossbars (see Section 3.2.2). Consequently, the router of DAMQVVR and PD-CB best exploits the true fully adaptive routing capability of deadlock recovery schemes under uniform traffic. The router of DAMQ and PD-CB shows the second highest performance (14% lower maximum throughput and up to 71% higher network latency than the router of DAMQVVR and PD-CB). This result is interesting because, in the last section, the router of DAMQ was shown not to exploit true fully adaptive routing ca pability well, resulting in the lowest performance. This indicates that, although somewhat hampering true fully adaptive routing capability of deadlock recovery 137 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. -CQ 4 D-CB (1VC) -CQ 4 U-CB (4VC) -VCDAMQ 4 E-CB (4VC) -DAMQWR 4 PD-CB -CQ 4 U-CB (2VC) -CQ 4 E-CB (4VC) -DAMQ 4 PD-CB 1800 1600 1400 - 1200 > 1000 800 600 400 200 0.04 0.06 0.08 0.1 0.02 0 Throughput (phit/node/ns) (a) Random traffic -C Q & U -C B (1V C ) -C Q & U -C B (4V C ) -V C D A M Q 4 E-CB (4V C ) -D A M Q W R & PD -C B -C Q & U -C B (2V C ) -C Q Sc E-CB (4V C ) -D A M Q Sc PD -C B 1800 1600 1400 e 1200 >• 1000 800 600 400 200 0.1 0.06 0.08 0 0.02 0.04 Throughput (phit/node/ns) (b) Bit-reversal traffic Figure 4.10: Performance comparison under a 16 x 16 two dimensional torus. 138 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. schemes, the queue structure of DAMQ enables tje ability to minimize cross bar size, which maximizes router speed and, thus, results in high performance. Namely, the benefit of DAMQ outweighs its problem. 
The router of VCDAMQ and E-CB (4VC) and the router of CQ and E-CB (4VC) show the third and fourth highest performance under uniform traffic, respectively (19% and 22% lower throughput and up to 121% and 173% higher network latency than the router of DAMQWR and PD-CB, respectively). However, in the last section the queue performance of VCDAMQ and CQ was comparable with that of DAMQWR and higher than that of DAMQ, while in this section it is lower than both. There are two reasons. First, although the E-CB used with VCDAMQ and CQ can reduce router cost, its cost is still higher than that of the PD-CB used with DAMQ or DAMQWR. Second, even though routing locality minimizes the internal complications of E-CB, the decoupled crossbar structure of E-CB still hampers network performance, while PD-CB with DAMQ or DAMQWR does not. Additionally, as expected, the most primitive routers, i.e., the routers of CQ and U-CB (1, 2, and 4VCs), show the lowest performance under the uniform traffic pattern due to either their high router architecture complexity (4VCs) or their poor exploitation of routing adaptivity (1VC). However, counter to intuition, the number of virtual channels is shown to adversely affect router performance (4VCs: lowest performance; 1VC: highest performance among these routers). This is because this work assumes large queues (> 64 phits), and large queues diminish the benefit of virtual channels by increasing network capacity while not reducing the cost of virtual channels, thus making the cost of virtual channels outweigh their performance benefits.

Non-uniform traffic pattern: Figure 4.10 (b) compares the performance of the router designs under non-uniform traffic.
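The bit-reversal permutation used as the non-uniform pattern sends each source to the node whose address is the bit-reversed source address; a sketch for a 16 x 16 torus (the x-then-y coordinate packing order is an assumption, not specified by the dissertation):

```python
def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def bit_reversal_dest(x, y, bits_per_dim=4):
    """Destination of node (x, y) in a 16 x 16 torus under bit-reversal,
    packing the node address as the x bits followed by the y bits."""
    addr = (x << bits_per_dim) | y
    rev = bit_reverse(addr, 2 * bits_per_dim)
    return rev >> bits_per_dim, rev & ((1 << bits_per_dim) - 1)

# Node (1, 0) -> address 00010000 -> reversed 00001000 -> node (0, 8).
dest = bit_reversal_dest(1, 0)
```

The pattern is its own inverse, and it concentrates traffic on a few paths, which is what produces the hot spots and biased loads discussed here.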
The simulation results under non-uniform traffic verify the results obtained under uniform traffic: (1) DAMQWR and PD-CB best exploit true fully adaptive routing capability by minimizing message blocking and maximizing router speed (up to 103% higher maximum throughput and up to 91% lower network latency than the other routers); (2) despite its limited exploitation of routing adaptivity, the queue structure of DAMQ can maximize router speed by employing PD-CB, resulting in the second highest performance (13% lower maximum throughput and 50% higher network latency than the router of DAMQWR and PD-CB); (3) although E-CB with VCDAMQ or CQ efficiently reduces the cost of the unrestricted routing capability by exploiting routing locality, E-CB still suffers from its internal routing complications; and (4) when U-CB is employed, the cost of virtual channels outweighs their performance benefits, resulting in the lowest performance.

A new observation from the non-uniform traffic results is that the performance gap between the router of CQ and U-CB (1VC) and the router of CQ and U-CB (2VCs) is smaller than under uniform traffic (the router with 2VCs has 7% lower throughput than that with 1VC under uniform traffic but only 2% lower under non-uniform traffic). This indicates that more virtual channels are needed under non-uniform traffic than under uniform traffic, because non-uniform traffic patterns generally cause hot spots and congested areas where both virtual channels and routing adaptivity are required to allow messages to bypass the congestion. In addition, the performance gap between the router of DAMQ and PD-CB and the router of DAMQWR and PD-CB increases under non-uniform traffic as compared to uniform traffic (14% versus 20%).
This indicates that DAMQ suffers more under the non-uniform traffic pattern by not well accommodating the given routing adaptivity, and that DAMQWR indeed resolves this problem of DAMQ.

The validity of all the conclusions derived above is further confirmed by repeating the simulations of the routers under an 8 x 8 x 8 three-dimensional torus network, as shown in Figure 4.11 (a) and (b). The router of DAMQWR and PD-CB has up to 102% and 106% higher maximum throughput and up to 51% and 54% lower network latency than the other routers under the uniform and non-uniform patterns, respectively. This verifies that DAMQWR and PD-CB best exploit the true fully adaptive routing capability of deadlock recovery schemes while minimizing the cost of that capability, regardless of network dimension and traffic pattern.

[Figure 4.11: Performance comparison under an 8 x 8 x 8 three-dimensional torus: (a) random traffic, (b) bit-reversal traffic. Average latency (ns) versus throughput (phits/node/ns) for the same router designs as in Figure 4.10.]

4.5 Summary

This chapter proposes and characterizes two enhanced dynamically allocated multi-queue designs: the dynamically allocated multi-queue with recruit registers (DAMQWR) and the virtual channel-based dynamically allocated multi-queue (VCDAMQ).
These queues are designed to overcome the limitations of traditional queues and are thus better-suited queue designs for fully adaptive deadlock recovery-based routers. Through extensive simulations, this work verifies that DAMQWR best exploits the true fully adaptive routing capability of deadlock recovery schemes. The recruit function of DAMQWR resolves the structural problem of DAMQ, i.e., limited exploitation of routing adaptivity, while fully inheriting the benefits of DAMQ. This maximizes network resource utilization while efficiently adapting to dynamic network traffic. Furthermore, the queue structure of DAMQWR minimizes crossbar size while completely eliminating the internal complications of decoupled crossbars, resulting in the port-directed crossbar design (PD-CB). This crossbar minimizes the cost of fully adaptive routers and maximizes routing speed. Consequently, the router employing DAMQWR and PD-CB maximizes network performance and is thus chosen as an optimal router design for fully adaptive networks.

In addition, simulation results show that VCDAMQ exploits true fully adaptive routing capability well, efficiently handling the hot spots of non-uniform traffic patterns. Moreover, by employing the enhanced hierarchical crossbar (E-CB) and efficiently exploiting routing locality, VCDAMQ successfully reduces the cost of true fully adaptive routing capability, leading to fast routers. Nevertheless, since the E-CB required by VCDAMQ is larger than the PD-CB needed by either DAMQ or DAMQWR, the router of VCDAMQ is slower than that of DAMQ or DAMQWR. Furthermore, although there exists sufficient routing locality in fully adaptive networks, the router of VCDAMQ and E-CB still suffers from the internal complications of decoupled crossbars, degrading overall network performance.
Therefore, the router of VCDAMQ and E-CB shows lower performance than that of DAMQ/DAMQWR and PD-CB.

A commonly used queue, DAMQ, is shown to limit the routing adaptivity allowed by deadlock recovery schemes, degrading network performance. However, the queue structure of DAMQ makes it possible to eliminate the connect channels of decoupled crossbars, resulting in a smaller crossbar, i.e., PD-CB. This simplifies router architectures and, thus, maximizes routing speed without suffering from the internal complications of decoupled crossbars. Furthermore, its dynamic queue management helps the router to efficiently handle the network congestion caused by non-uniform traffic patterns. Consequently, simulation results show that the benefits of DAMQ outweigh its problems, resulting in the second best network performance.

Additionally, the results show that the circular queue (CQ) needs virtual channels in order to minimize the HOL blocking problem. However, virtual channels complicate router architectures and can thus lead to slower, lower-performance routers. This is verified by the observation that the router with CQ and U-CB, assuming four virtual channels per physical link, has the lowest network performance under both uniform and non-uniform traffic patterns. This problem is shown to be resolved by employing a decoupled crossbar, i.e., E-CB, which minimizes the cost of virtual channels by efficiently exploiting routing locality. Nevertheless, as mentioned above, the E-CB required by CQ is larger than PD-CB due to its connect channels, which leads to slower routers. Moreover, the router with E-CB suffers from the internal complications of decoupled crossbars, degrading network performance. Additionally, the static queue management of CQ is shown not to deal well with hot spots and network congestion.
Overall, the router with CQ might not be adequate to support the fully adaptive routing capability of deadlock recovery schemes.

Chapter 5

Implementation of a True Fully Adaptive Deadlock Recovery-based Router Architecture: WARRP

To demonstrate the feasibility of some of the proposed router architecture components, this chapter designs and implements a fully adaptive deadlock recovery-based router: the WARRP router (Wormhole Adaptive Recovery-based Routing via Preemption). The WARRP router is implemented using a bipartite structure that separates deadlock handling resources [5] from normal (non-deadlock) routing resources. Each can then be optimized separately to achieve maximum flexibility and efficient recovery. The internal router architecture is further optimized by employing the enhanced hierarchical crossbar proposed in Chapter 3. This allows the WARRP router to implement the highest degree of adaptivity without sacrificing router speed, regardless of the number of virtual channels per physical link. Further, the WARRP II version of the router addresses the pin-out problem by integrating dense, high-bandwidth optoelectronic transceivers onto the router chip, capable of providing over 300 GBytes/sec of optical I/O per square centimeter [32].

5.1 Router Architecture

5.1.1 Overview

The WARRP router separates normal routing resources from deadlock handling resources to allow minimum coupling between the normal routing process and the deadlock handling process. This bipartite structure also allows for maximum design flexibility and exploitation of unrestricted routing using the normal routing resources while still ensuring deadlock freedom.
The internal structure of the normal routing section is partitioned into three virtual network (VN) modules, as shown in Figure 5.1. Incoming packets are directed to one of the virtual networks in the normal routing section or to the deadlock module (DRN) in the deadlock handling section, depending on the status of the control signals sent along with the packet (signifying deadlock or normal).

A normal packet entering the router is received by the input module of the corresponding virtual network module. Routing information is extracted from the header flit (the first flit of the packet) by the address decoder, which sequentially makes requests to the routing and arbitration module for a permissible output link by cycling through the following request types until one is granted: (1) an output module in the same dimension of the current virtual network, (2) an output module in another dimension of the same virtual network, (3) a connect channel to the next virtual network, and (4) the recovery lane in the deadlock module (only eligible packets are allowed to make this request). If the first request type is granted, the input-queued packet is routed over the current virtual network's crossbar to the allocated output module after the header is updated. If a connect channel is allocated to the packet because all the available output links in the current virtual network are occupied by other packets, the packet is transferred to the next virtual network's input module, where the same routing actions are performed. If the recovery lane request is granted because the permissible output modules and connect channel of the virtual network are busy, the crossbar is configured to redirect the suspected deadlock packet to the deadlock module to invoke the recovery process once the token is captured. The status of the packet is then changed from "normal" to "deadlock".
Deadlocks are recovered using the Disha-Sequential scheme [5]. As a final step, packets are time-multiplexed onto the physical link from the output modules of the three virtual networks in a round-robin fashion, or given priority over the link if coming from the deadlock module. The input, output, and deadlock modules are deallocated from the packet once the tail flit passes.

A deadlock packet entering the router is latched directly into the central deadlock module in the deadlock handling section of the router. The header flit of the deadlock packet is decoded by the address decoder in the deadlock module. This routing information is used to decide which physical link will be allocated to the deadlock packet, preempting other normal packets that may be routing over the link. The flow of normal and deadlock packets between routers is regulated by the flow control module to ensure no overflow. The WARRP router implements these functions using the major components described below.

5.1.2 Virtual Network Module

The WARRP router implements a decoupled internal router architecture partitioned into separate virtual networks linked by intra-router connect channels to maximize router speed and minimize implementation cost. This organization enforces routing locality within each virtual network, in which routing options that direct packets to output links within the same virtual channel network are selected over those that change virtual network class, as illustrated in Figure 5.2. By partitioning the internal architecture into smaller units based on virtual networks, routing arbitration complexity and crossbar size are considerably reduced without sacrificing routing freedom, as shown in Chapter 3.
The virtual network modules are identical, and each has complete routing functionality implemented by the input modules, output modules, crossbar, and routing and arbitration module, as shown in Figure 5.3. Each virtual network has an injection input module connected to the processor node to distribute traffic injection over all virtual networks and balance the load. In addition, there are four input modules from the X+, X-, Y+, and Y- directions and an input module from the (i - 1) mod 3 virtual network connect channel. The seven output modules connect to the four directions, the processor node port, the (i + 1) mod 3 virtual network, and the deadlock module. Also, each virtual network has its own crossbar and associated crossbar routing and arbitration module so that routing decisions in all three virtual networks can occur independently and simultaneously. The 6 x 7 crossbar allows non-blocking connections between all input modules and output ports.

5.1.3 Input Module

The input module queues packet flits, decodes header flits, generates requests for permissible output links and header updates, and regulates internal flow of
The following functions are encoded: 00 = normal packet, 01 = deadlock packet eligibility, 10 = reserved. 11 = reserved. As mentioned in Section 5.2. the WARRP router provides selectivity in determining which packets may use the recovery lane. This is implemented by setting the first reserved bit if a packet makes a westward turn, a dateline route, or is injected as a priority packet. True fully adaptive routing permits a broad number of routing choices. How ever, previous studies have indicated that allowing the input module to make multiple output link requests does not significantly improve performance (see Section 2.1.2.4). This is primarily due to the fact that the probability of multiple headers residing in the same router and in the same virtual network is usually 152 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. very low. obviating the need for this parallelism. Hence, each input module is allowed to request at most one of several permissible output links each cycle, al though multiple input modules may make requests to the routing and arbitration module in parallel. This design choice reduces the path set-up delay and simpli fies the address decoder by removing the components needed to select one from a number of possible acknowledged requests. Moreover, to avoid unnecessary' requests for currently occupied output ports, status information is fed back from the routing and arbitration module to the address decoder. A routing decision is made within the same clock cycle of the request. If the request is rejected, the address decoder regenerates a request signal for another appropriate output link the next successive cycle until a request is granted. 5.1.4 R outing and A rbitration M odule In the WARRP router, each virtual network has an associated crossbar routing and arbitration module. 
This allows routing and arbitration decisions to be made in parallel across virtual networks, and it simplifies the arbitration logic. Also, at most a single request can be made by each input module. Hence, in the worst case, arbitration among six requests must be handled by each routing and arbitration module, instead of as many as thirteen requests if the internal router architecture were not partitioned.

Arbitration is based on round-robin priority to guarantee freedom from starvation. An arbitration decision for resources in the normal routing section can be made within the same cycle as the request. However, access to resources in the deadlock routing section requires an additional cycle, as explained below. In the first cycle, arbitration among the requesting input modules within each virtual network is performed so that one input module in each network is selected. In the second cycle, the deadlock module's arbitration unit arbitrates among the possible requests from the three virtual networks so that only one input module in the router is allowed access to the recovery lane. The arbitration module in the selected virtual network configures the crossbar to connect the granted input module to the deadlock module port. This two-cycle arbitration delay for the recovery lane does not degrade router performance, since recovery lane access delay is dominated by token capture (see Section 5.1.6), which serves as a "time-out" mechanism.

5.1.5 Flow Controller Module

The WARRP router uses a source-synchronous signaling scheme [12, 9] to control asynchronous flit flow externally between routers. Two external flow control signals are required, as shown in Figure 5.6. The SendStrobe pulse is sent along with each packet flit to indicate that the receiving input module is to latch in the
data. Queues in the input module are used to re-synchronize incoming data with the internal clock of the router, obviating the need for an additional PLL/DLL to maintain phase-synchronization between routers. To fully utilize link bandwidth, an Almost-Full handshaking signal controls the flow of flits between routers. This signal reflects the status of the input module's input queue and is asserted once the queue nears capacity. Enough queue space is provided in the WARRP router to allow for the round-trip propagation delay of flow control signals, as multiple flits may be in flight before these signals are detected.

Two additional external control signals are required to establish the path between the output queue of one router and the corresponding input queue of the next router, as shown in Figure 5.6. The VC_ID signal identifies to which virtual network the flit should be transmitted, and the DB-Path signal designates that the flit should be transmitted to the recovery lane. This signal is asserted by the router's external flow controller, possibly preempting physical channel bandwidth from normal packets. These signals are encoded in two bits (three virtual channel queues and one deadlock queue). The tail and token are additional control lines implemented by the router.

Internal flow between input queues and output queues within the WARRP router is controlled by the in_send and in_full control signals. As shown in Figure 5.7, these signals represent the status of the virtual network queues to send and receive data. Internal flow is synchronized with the router clock to simplify router design and to reduce synchronization overhead. The in_full signal is also used to generate backpressure to stop internal data flow when the output queue is preempted by the deadlock queue.
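The queue-space requirement implied by the Almost-Full scheme can be made explicit: the Almost-Full threshold must leave slack for every flit that can still arrive after it is asserted, one link traversal for the flits already in flight plus one for the signal to reach the sender. A sketch (the one-cycle assertion overhead and one-flit-per-cycle launch rate are assumptions):

```python
def min_slack_flits(link_delay_cycles, assert_overhead_cycles=1):
    """Flits that can still arrive after Almost-Full is asserted:
    round-trip propagation of the signal plus the cycle spent asserting
    it, assuming the sender launches one flit per cycle until it sees
    the signal."""
    return 2 * link_delay_cycles + assert_overhead_cycles

# With a 3-cycle link, the input queue needs at least 7 flits of slack
# beyond the Almost-Full threshold to guarantee no overflow.
slack = min_slack_flits(3)
```

This is why Almost-Full is asserted when the queue "nears capacity" rather than when it is actually full.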
5.1.6 Deadlock Module

The deadlock module resolves impending deadlock by providing a recovery lane for potential deadlock packets [5]. A block diagram of this module is shown in Figure 5.8. For the recovery lane to be free from deadlock, mutually exclusive access to the deadlock module is enforced. This is maintained by the DB-Arbitrator in the deadlock module and an external token that circulates among all the routers. When an eligible packet tries to access the deadlock module due to excessive blockage on normal queue resources, two different types of arbitration for the deadlock queue resource occur: arbitration among eligible packets in the current router and arbitration among other routers that may have eligible deadlock packets. The first is performed by the DB-Arbitrator, as described in Section 5.1.3. The second is accomplished by having the eligible packet capture a circulating token. Once a packet is granted access to the recovery lane, the token stops circulating until the deadlock packet reaches its destination, where the token is regenerated.

A deadlock packet partially queued in a router's deadlock module is transmitted to the deadlock module of a neighboring router by preempting the bandwidth of the physical link along any minimal path toward the destination. The preemption operation is initiated by asserting a control signal, DB-Preempt, to the external flow controller of the preempted physical link. To maximize bandwidth utilization of links, preemption is suspended until more data flits are ready to be transmitted by the deadlock module if a bubble occurs in deadlock packet transmission.
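The token-based mutual exclusion described above can be modeled behaviorally. This is a sketch only: the class and method names are hypothetical, and in hardware the token is a circulating signal, not a method call:

```python
class RecoveryToken:
    """One token circulates among n routers; a router may place a packet
    on the recovery lane only while holding it.  The token stops while
    held and is regenerated (released) when the packet is delivered."""
    def __init__(self, n_routers):
        self.n = n_routers
        self.position = 0          # router the token is currently visiting
        self.holder = None         # router holding the token, if any

    def circulate(self):
        if self.holder is None:    # a captured token stops circulating
            self.position = (self.position + 1) % self.n

    def try_capture(self, router):
        if self.holder is None and self.position == router:
            self.holder = router
            return True
        return False

    def release(self):             # packet delivered: regenerate the token
        self.holder = None

token = RecoveryToken(4)
got = token.try_capture(0)         # router 0 captures the token
blocked = token.try_capture(1)     # router 1 must wait: mutual exclusion
```

Because only the token holder can use the recovery lane, the lane itself can never hold more than one recovering packet at a time, which is what makes it deadlock-free.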
5.2 Recovery Eligibility

Since only one of the packets involved in an impending deadlock needs to be removed to resolve the deadlock, the WARRP router implements some amount of selectivity by allowing only eligible packets to access deadlock buffers and route on the recovery lane. In addition to blockage and token capture, packet eligibility is also determined by two routing criteria: westward turns [57] and dateline [78] crossings. Previous work [57, 78] has shown that restricting the turns made in routing and restricting routes over the dateline (e.g., wrap-around links in tori) can prevent the formation of the cyclic dependencies necessary for deadlock. It follows that, for deadlock to form, at least one packet involved must make such a turn or dateline route. The WARRP router uses this property to minimize the number of packets that might be unnecessarily mistaken for being deadlocked. Only those packets that make a westward turn or a dateline route are given eligibility to use the recovery lane.

These selectivity criteria also allow the recovery lane to be used for special circumstances beyond deadlock handling that can benefit from deadlock-free preemption of network bandwidth for fast delivery. For instance, priority packet routing can be achieved by making packets eligible to use the recovery lane upon injection into the network. Relief from network congestion can also be provided by allowing packets meeting the selectivity criteria (which may not actually be involved in deadlock) to be eligible to recover.

5.3 Router Implementation

The WARRP router architecture described in this chapter was designed at the behavioral level using the VHDL hardware description language and implemented in
This allowed us to quickly simulate and verify our design at the gate and architecture levels to further optimize the design and estimate performance prior to fabrication in optoelectronic technology. Preliminary simulations indicate that the router can be clocked at approximately 35 MHz using this technology.

The WARRP router was also implemented in CMOS-SEED optoelectronic technology, which allows the integration of CMOS VLSI logic circuitry with dense surface-normal, high-speed optoelectronic transceivers (AlGaAs-based multiple-quantum-well self-electrooptic-effect devices, or SEEDs) via flip-chip bonding. SEEDs are passive diode devices which change their light absorption when an applied electric field is varied across the device terminals. This results in modulation of an off-chip source controlled by an electrical signal, for use as an optical transmitter. The same device can also be biased to work as a receiver that generates a leakage current upon receiving light, thus converting an optical signal back to an electrical signal. These devices, organized as an array of 20 x 10 pixels, are flip-chip bonded on top of the 2mm x 2mm router circuitry to allow a total of 200 optical I/Os (only some of which are used). Each diode is 20 x 60 μm² with a separation of 62.5 μm and 125 μm from neighboring devices. Hence, this technology is capable of supporting approximately 5,000 optical I/Os per cm², with each I/O capable of modulating/receiving data at more than 2 Gb/s [89] using only 300 μW of optical power, for an aggregate capacity on the order of 100s of GBytes per second per cm² (even with differential signal mode assumed).
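The aggregate-capacity figure quoted above follows directly from the stated device numbers. A quick check (our arithmetic, using the 5,000 I/Os per cm² and 2 Gb/s per I/O from the text):

```python
# Sanity-checking the aggregate optical capacity claimed above.
ios_per_cm2 = 5000                       # optical I/Os per square centimeter
gbits_per_io = 2                         # each I/O modulates/receives > 2 Gb/s
pairs = ios_per_cm2 // 2                 # dual-rail (differential) signaling halves the count
aggregate_gbits = pairs * gbits_per_io   # 5,000 Gb/s per cm^2
aggregate_gbytes = aggregate_gbits / 8   # 625 GB/s per cm^2
assert aggregate_gbytes == 625.0
# i.e., on the order of 100s of GBytes per second per cm^2, as stated.
```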
Network routers can benefit from this high-bandwidth I/O capability in several ways, including realizing implementations with wider channels (data and control), higher node degree, full duplex links, redundant channels (for fault-tolerance), etc., to reduce latency and increase throughput.

As described in [68], WARRP's core deadlock handling mechanisms were implemented in monolithic GaAs MESFET VLSI technology integrated with GaAs OPFET photodetectors and GaAs/InGaP double heterostructure LEDs. We also implemented a fully-functional router. However, due to limited chip real-estate (only 2mm x 2mm), a scaled-down CMOS-SEED optoelectronic version of the WARRP router is implemented. This version, referred to as WARRP II, implements a unidirectional 4-bit wide 1-D torus topology with one virtual channel network and associated deadlock recovery mechanisms. It is implemented using 0.5 μm (with 0.7 μm drawn gate length), 3-metal layer, 3.3V CMOS, and it contains approximately 26,000 transistors, of which 3,000 are used for transceiver, testing, and pad driver circuits. Only 60% of the chip area (1.2mm x 1.2mm) is allocated for the router datapath logic, which consists of 4-flit-deep input queues, 3-flit-deep output queues, an address decoder, a 2 x 3 subcrossbar, a crossbar arbitrator, and the deadlock module (i.e., the deadlock queue and its associated flow controller and channel preemption logic). The remaining 40% of the chip area is allocated for optical I/O transceiver circuits, testing circuits, bonding pads, etc. It is packaged in a 44-pin PGA which uses only 23 electrical signals and only 32 optical signals (18 dual-rail). A layout of the chip is shown in Figure 5.9. To our knowledge, WARRP II was the first fully-functional optoelectronic network router implemented.
5.4 Performance

Figure 5.10 gives the pin-to-pin latency for the two router implementations and the various components composing the latency. The router clock cycle time is determined by the maximum of the three major router (pipeline) stages upon which flits may be operated: routing (header flits only), internal flow-thru, and sync/VCC. The first stage consists of address decode, header update, routing arbitration, and crossbar set-up time. The second stage consists of internal router flow-thru time (each virtual network crossbar traversal adds one clock cycle for header flits only). The third stage consists of synchronization and virtual channel control delay. For both implementations, the internal flow-thru dominates.

The FPGA implementation (the 2-D, 3-virtual-channel version of the WARRP router) happens to be faster than the optoelectronic WARRP II implementation. This is due to design constraints in WARRP II rather than technological limitations. For one, only two metal layers were available for gate interconnects, as the third metal layer was reserved for wiring between drivers and optical I/Os. This significantly increased the average net length between standard cells as well as the size of their drivers. Second, an automated synthesis tool (EPOCH) was used to generate the VLSI layout of the WARRP II router as opposed to using a customized layout design tool. Furthermore, many of the optical I/Os were not used due to the scaled-down router design.

Altogether, these design constraints limit the WARRP II router to operating at 30 MHz and yield an aggregate bandwidth of 30 MB/s, well below the capabilities of this technology.
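The clock-cycle and latency model described above (cycle time set by the slowest of the three pipeline stages, with a header flit paying one cycle per stage pin-to-pin) can be written down directly. The function below is our paraphrase of that model, not code from the dissertation; the sync/VCC delays passed in are assumed values, since Figure 5.10 lists only the routing and internal-data-flow stage delays:

```python
def router_cycle_ns(routing_ns, flow_thru_ns, sync_vcc_ns):
    """Clock cycle time is determined by the slowest of the three stages."""
    return max(routing_ns, flow_thru_ns, sync_vcc_ns)

def pin_to_pin_latency_ns(routing_ns, flow_thru_ns, sync_vcc_ns, stages=3):
    """A header flit pays one cycle per pipeline stage, pin to pin."""
    return stages * router_cycle_ns(routing_ns, flow_thru_ns, sync_vcc_ns)

# With the internal flow-thru stage dominating (as in both implementations),
# latency is three times that stage's delay.  The sync/VCC delays (20, 25)
# here are illustrative assumptions:
assert pin_to_pin_latency_ns(25, 28, 20) == 84   # FPGA-like stage delays
assert pin_to_pin_latency_ns(29, 33, 25) == 99   # WARRP II-like stage delays
```

Figure 5.10 quotes 85.5 ns and 100 ns using average cycle times (3 x 28.5 ns and 3 x 33.3 ns), which differ slightly from this max-stage model.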
Nevertheless, WARRP II is the first fully-functional optoelectronic network router chip to demonstrate that near-term optoelectronic technology is feasible for the design and implementation of advanced network routers requiring dense high-bandwidth connectivity.

5.5 Summary

This chapter demonstrates the feasibility of some of the proposed router architectures by implementing the WARRP router. The WARRP router efficiently realizes progressive deadlock recovery-based true fully adaptive routing by employing the enhanced hierarchical crossbar design (E-CB), which maximizes flexibility as well as speed. The WARRP II version addresses the I/O pin-out problem by integrating dense high-bandwidth optoelectronic transceivers onto the router chip, capable of providing over 300 GBytes/sec per square centimeter [32]. To our knowledge, it is the first fully-functional optoelectronic network router implemented to date.

Figure 5.1: Organization of the WARRP router.

Figure 5.2: Routing Locality in virtual network (VN).
Figure 5.3: Virtual Network Module.

Figure 5.4: Block Diagram of the Input Module.

Figure 5.5: Header Flit Format (2 reserved bits, 3-bit X offset, 3-bit Y offset).

Figure 5.6: External Flow Control Signals and their Timing Diagram.

Figure 5.7: Internal Flow Control Signals and their Timing Diagram.
Figure 5.8: Block Diagram of Deadlock Module.

Figure 5.9: The VLSI Layout of the WARRP II Router.

Figure 5.10: Pin-to-pin latency for the two router implementations.
(a) Component delays and clock speed in the two implementations:

                                FPGA (2D, 3 VCs)   OPTO (1D, 1 VC)
Stage for routing               25 ns              29 ns
Stage for internal data flow    28 ns              33 ns
Max clock speed                 approx. 35 MHz     approx. 30 MHz

(b) Pin-to-pin latency for a header flit (sync, routing, internal data flow, VCC): FPGA: 3 x 28.5 ns = 85.5 ns; Optoelectronic: 3 x 33.3 ns = 100 ns.

Chapter 6

Conclusions

To accommodate the rapidly growing communication demands of high performance multiprocessor systems, this dissertation optimizes true fully adaptive deadlock recovery-based router architectures by proposing decoupled crossbars and enhanced dynamically allocated multi-queues. Extensive simulations verify that decoupled crossbars (C-CB, E-CB, and PD-CB) successfully minimize the cost of the true fully adaptive routing capability of deadlock recovery schemes, thus maximizing router speed without hampering the capability. C-CB and E-CB efficiently exploit the dynamic routing behavior of fully adaptive networks, i.e., routing locality, such that the cost of true fully adaptive routing capability can be considerably decreased while the internal complications of decoupled crossbars (i.e., blockage, deadlock, multi-cycle delay) are minimized.
Moreover, PD-CB further reduces crossbar size by eliminating the connect channels of C-CB and E-CB while completely resolving the internal complications of decoupled crossbars by exploiting the queue structure of DAMQ and DAMQWR. This maximizes router speed without compromising the true fully adaptive routing capability of deadlock recovery schemes.

In addition, simulation results show that enhanced dynamically allocated multi-queues, i.e., VCDAMQ and DAMQWR, effectively support true fully adaptive routing capability by overcoming the problems of CQ and DAMQ while fully benefiting from the dynamic queue management of the traditional DAMQ design. VCDAMQ dynamically allocates queue space to virtual channel networks and, thus, efficiently handles load imbalance among virtual channel networks caused by the decoupled crossbar structure of E-CB. Moreover, VCDAMQ resolves the problem of DAMQ, i.e., limited support of routing adaptivity, by allowing packets to be routed through any direction and any virtual channel network, which improves network performance. In contrast, DAMQWR recruits packets from highly congested subqueues and routes them through idle subqueues and links, which better supports true fully adaptive routing capability. Furthermore, the queue structure of DAMQWR enables the crossbar size to be minimized and, thus, leads to the smallest crossbar, i.e., PD-CB, while completely resolving the internal complications of decoupled crossbars.

Finally, through empirical evaluations, this dissertation verifies that the router incorporating DAMQWR and PD-CB best accommodates true fully adaptive routing capability by minimizing message blocking while maximizing routing speed, making it an optimal deadlock recovery-based router architecture.
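The dynamically allocated multi-queue organization underlying VCDAMQ and DAMQWR can be illustrated with a toy model. This is our simplification (one shared buffer pool backing per-output subqueues); the real designs add the VC-allocation and recruiting policies described above:

```python
from collections import deque

class DAMQ:
    """Toy dynamically allocated multi-queue: every subqueue draws slots
    from one shared buffer pool instead of a static per-queue partition."""

    def __init__(self, total_slots, outputs):
        self.free_slots = total_slots
        self.subqueues = {out: deque() for out in outputs}

    def enqueue(self, out, flit):
        if self.free_slots == 0:
            return False                 # shared pool exhausted
        self.subqueues[out].append(flit)
        self.free_slots -= 1
        return True

    def dequeue(self, out):
        if not self.subqueues[out]:
            return None
        self.free_slots += 1             # slot returns to the shared pool
        return self.subqueues[out].popleft()


q = DAMQ(total_slots=4, outputs=["X+", "X-"])
assert all(q.enqueue("X+", f) for f in "abc")   # one busy subqueue may take
assert q.enqueue("X-", "d")                     # most of the pool on demand
assert not q.enqueue("X-", "e")                 # pool exhausted
assert q.dequeue("X+") == "a"
assert q.enqueue("X-", "e")                     # freed slot is usable anywhere
```

The DAMQ limitation discussed above (packets committed to a predetermined output subqueue) is visible here: a flit enqueued under "X+" can only leave toward "X+". DAMQWR's recruiting relaxes exactly this restriction.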
In addition, to demonstrate the feasibility of the proposed router architectures, this dissertation implements a router of E-CB and CQ utilizing both CMOS and FPGA technologies. Other conclusions derived from this research are summarized as follows.

• Routing locality sufficiently exists in fully adaptive deadlock recovery-based networks that decoupled crossbars, i.e., C-CB, H-CB, and E-CB, successfully minimize their internal complications while increasing router speed. This abundance of routing locality enables a large crossbar to be partitioned into smaller crossbars without compromising performance, which simplifies true fully adaptive deadlock recovery-based router architectures and hence increases router speed.

• Regardless of network traffic patterns, routing locality within a virtual channel network (VCN) is higher than that within a dimension. Therefore, H-CB and E-CB, which exploit routing locality within a VCN, show higher performance than C-CB, which exploits routing locality within a dimension. Due to its low routing locality, C-CB needs more connect channels to support subcrossbar traversals of packets, resulting in complicated and slow routers.

• H-CB has lower performance than E-CB although they exploit the same routing locality, i.e., routing locality within a VCN. This indicates that the lowest virtual channel network of H-CB is a performance bottleneck which hampers efficient exploitation of routing locality.

• When U-CB and a large CQ are assumed, the cost of virtual channels easily outweighs their benefit, degrading overall network performance. This is because, as queue size increases, the head-of-line blocking problem of CQ is alleviated by the increased network capacity; thus the performance benefit of virtual channels diminishes while their cost does not change.
Consequently, to fully benefit from virtual channels, the router of CQ necessitates E-CB, which minimizes the cost of virtual channels.

• The queue structure of DAMQ hampers the unrestricted routing capability of deadlock recovery schemes by forcing packets to be routed through predetermined output links. Nevertheless, the router of DAMQ provides high network performance by enabling PD-CB, the smallest crossbar, and, hence, minimizing the cost of the unrestricted routing capability.

• Although E-CB efficiently reduces the cost of virtual channels by exploiting routing locality, the crossbar size of E-CB is larger than that of PD-CB due to connect channels. This makes routers employing E-CB slower than those incorporating PD-CB.

6.1 Future Work

Although this dissertation has conducted considerable research in optimizing true fully adaptive deadlock recovery-based router architectures and achieving high performance networks, many research issues still remain unexplored. These issues are summarized as follows.

• In this dissertation, the WARRP router design provides a novel platform for router speed and performance evaluations. However, this router design has some limitations (e.g., a non-logarithmic delay scale and a non-pipelined design) in reflecting the most up-to-date design technologies. As mentioned in Chapter 3, although these limitations do not change the qualitative conclusions of this dissertation, they somewhat affect the quantitative conclusions. Therefore, the validity of this dissertation can be further improved by incorporating the latest design technologies in the WARRP router.

• Although this dissertation derives many meaningful conclusions from extensive simulations, the scope of these simulations is confined to the network level.
Therefore, it might be interesting to extend the scope of the simulations to the system level (including processor and memory modules). This would give wider insight into the effect of the proposed router architectures on system performance.

• This dissertation shows that the true fully adaptive routing capability of deadlock recovery schemes can be efficiently implemented in router architectures so that high performance networks (i.e., low latency and high throughput) can be achieved. However, as mentioned in Chapter 2, the unrestricted routing capability also has the potential to enhance network fault-tolerance by providing alternative routing paths. Therefore, it would be interesting to explore router architectures which efficiently and effectively exploit this potential and, thus, maximize network fault-tolerance without compromising network latency and throughput.

• Quality of service is one of the important requirements of high performance networks. This requirement can be enhanced by the true fully adaptive routing capability of deadlock recovery schemes, because the routing capability increases the amount of network resources that can be reserved for quality of service. Therefore, efficient exploitation of this potential in router architectures can be an interesting research issue for future work of this dissertation.

• Although this dissertation demonstrates the feasibility of one proposed architecture, i.e., a router with E-CB and CQ, the feasibility of the other router architectures remains undemonstrated. This can be another opportunity for future work of this dissertation.

Reference List

[1] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Ken Mackenzie, and Donald Yeung.
The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 2-13, June 1995.

[2] J. Allen, P. Gaughan, D. Schimmel, and S. Yalamanchili. Ariadne—An Adaptive Router for Fault-tolerant Multicomputers. In Proceedings of the 21st International Symposium on Computer Architecture, pages 278-288. IEEE Computer Society, April 1994.

[3] N. Anerousis and A.A. Lazar. An architecture for managing virtual circuit and virtual path services in ATM networks. Journal of Network and Systems Management, 4(4), December 1996.

[4] K.V. Anjan, Timothy M. Pinkston, and Jose Duato. Generalized Theory for Deadlock-Free Adaptive Wormhole Routing and its Application to Disha Concurrent. In Proceedings of the 10th International Parallel Processing Symposium, pages 815-821. IEEE Computer Society Press, April 1996.

[5] K.V. Anjan and Timothy Mark Pinkston. An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 201-210. IEEE Computer Society, June 1995.

[6] R. Arlauskas. iPSC/2 System: A Second Generation Hypercube. Technical report, Intel Corporation, 1988.

[7] A. E. Barbour and I. Alhayek. A parallel, high speed circular queue structure. In Proceedings of the 32nd Midwest Symposium on Circuits and Systems, volume 2, pages 1089-1092, 1990.

[8] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J. Seizovic, and W. Su. Myrinet: A gigabit per second local area network. In IEEE Micro, pages 29-36. IEEE Computer Society, February 1995.

[9] J. Carbonaro and F. Verhoorn. Cavallino: The Teraflops Router and NIC. In Proceedings of the Symposium on Hot Interconnects IV, pages 157-160. IEEE Computer Society, August 1996.

[10] R. Casado, A. Bermudez, J. Duato, F.J. Quiles, and J.L. Sanchez.
A protocol for deadlock-free dynamic reconfiguration in high-speed local area networks. IEEE Transactions on Parallel and Distributed Systems, 12(2), February 2001.

[11] Andrew A. Chien. A Cost and Speed Model for k-ary n-Cube Wormhole Routers. IEEE Transactions on Parallel and Distributed Systems, 9(2):150-162, February 1998.

[12] Andrew A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. In Proceedings of the 19th Symposium on Computer Architecture, pages 268-277, May 1992.

[13] Yungho Choi and Timothy Mark Pinkston. Crossbar Analysis for Optimal Deadlock Recovery Router Architectures. In Proceedings of the 10th International Parallel Processing Symposium. IEEE Computer Society, April 1997.

[14] Yungho Choi and Timothy Mark Pinkston. Evaluation of Crossbar Architectures for Deadlock Recovery Routers. Journal of Parallel and Distributed Computing, 61(1):49-78, January 2001.

[15] Intel Corp. Intel Paragon Supercomputers: The Scalable High Performance Computers. Technical report, Intel SSD, November 1994.

[16] W. Dally. Virtual Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194-205, March 1992.

[17] W. Dally and H. Aoki. Deadlock-free Adaptive Routing in Multicomputer Networks using Virtual Channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466-475, April 1993.

[18] W. Dally, L. Dennison, D. Harris, K. Kan, and T. Xanthopoulos. Architecture and Implementation of the Reliable Router. In Proceedings of the Hot Interconnects II Symposium, August 1994.

[19] W. Dally and C. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, C-36(5):547-553, May 1987.

[20] William J. Dally, Philip P. Carvey, and Larry R. Dennison. The Avici Terabit Switch/Router. In Proceedings of the 6th Symposium on Hot Interconnects.
IEEE Computer Society, August 1998.

[21] W.J. Dally, J.A.S. Fiske, J.S. Keen, R.A. Lethin, M.D. Noakes, P.R. Nuth, R.E. Davidson, and G.A. Flyer. The Message Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro.

[22] J. Ding and L. N. Bhuyan. Performance Evaluation of Multistage Interconnection Networks with Finite Buffers. In Proceedings of the 1991 Int'l Conf. on Parallel Processing, volume 1, pages 592-599, 1991.

[23] J. Duato. A New Theory of Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(12):1320-1331, December 1993.

[24] J. Duato. A Necessary and Sufficient Condition for Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 6(10):1055-1067, October 1995.

[25] J. Duato, P. Lopez, F. Silla, and S. Yalamanchili. A high performance router architecture for interconnection networks. In Proceedings of the 25th International Conference on Parallel Processing, pages 61-68. IEEE Computer Society, August 1996.

[26] J. Duato, S. Yalamanchili, M.B. Caminero, D. Love, and F.J. Quiles. MMR: a high-performance MultiMedia Router-architecture and design trade-offs. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 300-309, 1999.

[27] Charles M. Flaig. VLSI mesh routing systems. Master's thesis, California Institute of Technology.

[28] M. Galles. Spider: A High Speed Network Interconnect. IEEE Micro, pages 34-39, February 1997.

[29] David Garcia and William Watson. ServerNet II. In Proceedings of the 2nd PCRCW, page 109. Springer-Verlag, June 1997.

[30] D. Gelernter. A DAG-based algorithm for prevention of store-and-forward deadlock in packet networks. IEEE Transactions on Computers, C-30:709-715, October 1981.

[31] P. Goli and V. Kumar.
Performance of a Crosspoint Buffered ATM Switch Fabric. In Proceedings of INFOCOM '92, pages 3D.1.1-3D.1.10, 1992.

[32] K.W. Goossen et al. GaAs MQW modulators integrated with silicon CMOS. IEEE Photonics Technology Letters, 7:360-362, 1995.

[33] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer-Designing an MIMD shared memory parallel computer. IEEE Transactions on Computers, C-32:175-189, February 1983.

[34] Linley Gwennap. Alpha 21364 to Ease Memory Bottleneck. In Microprocessor Report, pages 12-15. Microdesign Resources, October 1998.

[35] R. Horst. ServerNet deadlock avoidance and fractahedral topologies. In Proceedings of the International Parallel Processing Symposium, pages 275-280, April 1996.

[36] Robert Horst. TNet: A Reliable System Area Network for I/O and IPC. In Proceedings of Hot Interconnects II, pages 96-105, August 1994.

[37] Robert W. Horst and David Garcia. ServerNet SAN I/O Architecture. In Proceedings of the 5th Symposium on Hot Interconnects. IEEE Computer Society, August 1997.

[38] C.F. Joerg and A. Boughton. The Monsoon Interconnection Network. In Proceedings of the International Conference on Computer Design, pages 156-159, October 1991.

[39] M. Karol, M. Hluchyj, and S. Morgan. Input versus Output Queuing on a Space-Division Packet Switch. IEEE Transactions on Communications, 35(12):1347-1356, 1987.

[40] Parviz Kermani and Leonard Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, pages 267-286, 1979.

[41] J. Kim, Z. Liu, and A. Chien. Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing. IEEE Transactions on Parallel and Distributed Systems, 8(3):229-244, March 1997.

[42] J. H. Kim. Bandwidth and latency guarantees in low-cost high performance networks. Ph.D.
Thesis, University of Illinois, Urbana-Champaign, January 1997.

[43] S. Konstantinidou and L. Snyder. Chaos Router: Architecture and Performance. In Proceedings of the 18th International Symposium on Computer Architecture, pages 212-221. IEEE Computer Society, May 1991.

[44] Georgios Kornaros, Dionisios Pnevmatikatos, Panagiota Vatsolaki, Georgios Kalokerinos, Chara Xanthaki, Dimitrios Mavroidis, Dimitrios Serpanos, and Manolis Katevenis. Implementation of ATLAS I: a single-chip ATM switch with backpressure. In Proceedings of the IEEE Hot Interconnects VI Symposium, Stanford University, Stanford, California, USA, August 1998.

[45] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302-313, April 1994.

[46] Michael Laor. The 12000 GSR switch fabric architecture. In Proceedings of the 6th Symposium on Hot Interconnects. IEEE Computer Society, August 1998.

[47] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture, pages 241-251, June 1997.

[48] Dongwook Lee and Kiseon Kim. Virtual circuit connection method for RSVP multicasting supporting heterogeneous receivers on the ATM network. Electronics Letters, 34(15):1474-1476, July 1998.

[49] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford Dash Multiprocessor. IEEE Computer.

[50] Li-Shiuan Peh and W.J. Dally. Flit-reservation flow control. In Sixth International Symposium on High-Performance Computer Architecture, pages 73-84, 1999.

[51] Ziqiang Liu and Andrew A. Chien. Hierarchical Adaptive Routing. In Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, pages 688-695, October 1994.
[52] D. Love, S. Yalamanchili, J. Duato, M. B. Caminero, and F. J. Quiles. Switch scheduling in the multimedia router (MMR). In Proceedings of the 14th International Parallel and Distributed Processing Symposium, pages 5-11, 2000.

[53] J.M. Martinez, P. Lopez, J. Duato, and T.M. Pinkston. Software-Based Deadlock Recovery Technique for True Fully Adaptive Routing in Wormhole Networks. In Proceedings of the 1997 International Conference on Parallel Processing, pages 182-189. IEEE Computer Society Press, August 1997.

[54] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz. The Tiny Tera: A Packet Switch Core. In Proceedings of the Symposium on Hot Interconnects IV, pages 161-173. IEEE Computer Society, August 1996.

[55] S.-W. Moon, J. Rexford, and K. G. Shin. Scalable Hardware Priority Queue Architectures for High-Speed Packet Switches. In Proceedings of the IEEE Real-Time Technology and Applications Symposium (RTAS), pages 203-212, 1997.

[56] A. Mu, J. Larson, R. Sastry, T. Wicki, and W. W. Wilcke. A 9.6 GigaByte/s throughput plesiochronous routing chip. In Technologies for the Information Superhighway, Compcon '96, pages 261-266. Digest of Papers, 1996.

[57] L. Ni and C. Glass. The Turn Model for Adaptive Routing. In Proceedings of the 19th Symposium on Computer Architecture, pages 278-287, May 1992.

[58] L. Ni and P.K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE Computer, pages 62-76, January 1993.

[59] Michael D. Noakes, Deborah A. Wallach, and William J. Dally. The J-Machine Multicomputer: An Architectural Evaluation. In Proceedings of the 20th International Symposium on Computer Architecture, pages 224-235, 1993.

[60] Ruoming Pang, Timothy M. Pinkston, and Jose Duato. The double scheme: deadlock-free dynamic reconfiguration of cut-through networks.
In Proceedings of the 2000 International Conference on Parallel Processing, pages 439-448, 2000.

[61] Young-Keun Park and Young-Keun Lee. Parallel iterative matching-based cell scheduling algorithm for high-performance ATM switches. IEEE Transactions on Consumer Electronics, 47(1):134-137, February 2001.

[62] Li-Shiuan Peh and W.J. Dally. A Delay Model for Router Microarchitectures. IEEE Micro, 21(1):26-34, Jan/Feb 2001.

[63] C. Peterson, J. Sutton, and P. Wiley. iWarp: a 100-MOPS, LIW microprocessor for multicomputers. IEEE Micro, 11(3):26-29, June 1991.

[64] F. Petrini, J. Duato, P. Lopez, and J. Martinez. LIFE: a Limited Injection, Fully adaptivE, Recovery-Based Routing Algorithm. In Proceedings of the 4th International Conference on High-Performance Computing, pages 316-321, December 1997.

[65] F. Petrini and M. Vanneschi. Performance Analysis of Minimal Adaptive Wormhole Routing with Time-Dependent Deadlock Recovery. In Proceedings of the 11th International Parallel Processing Symposium, pages 589-595. IEEE Computer Society, April 1997.

[66] Timothy M. Pinkston. Flexsim 1.2. SMART Interconnects Group, 1997.

[67] Timothy M. Pinkston, Yungho Choi, and Mongkol Raksapatcharawong. Architecture and Optoelectronic Implementation of the WARRP Router. In Proceedings of the 5th Symposium on Hot Interconnects, pages 181-189. IEEE Computer Society, August 1997.

[68] Timothy Mark Pinkston, Mongkol Raksapatcharawong, and Yungho Choi. Smart-Pixel Implementation of Network Router Deadlock Handling Mechanisms. In Topical Meeting on Optics in Computing, pages 159-161. Optical Society of America, March 1997.

[69] Timothy Mark Pinkston and Sugath Warnakulasuriya. On Deadlocks in Interconnection Networks. In Proceedings of the 24th International Symposium on Computer Architecture. IEEE Computer Society, June 1997.

[70] Puente, Beivide, Gregorio, Prellezo, J.
Duato, and C. Izu. Adaptive bubble router: a design to improve performance in torus networks. In Proceedings of 1999 International Conference on Parallel Processing, pages 58-67. 1999. [71] Wenjian Qiao and Lionel M. Ni. Adaptive Routing in Irregular Networks Us ing Cut-Through Switches. In Proceedings of 1996 International Conference on Parallel Processing, pages 52-60. IEEE Computer Society Press. August 1996. 185 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [72] J. Rexford, J. Hall, and K. G. Shin. A router architecture for real-time point- to-point networks. In Proceedings of International Symposium on Computer Architecture, pages 237-246, May 1996. [73] M.D. Schroeder and et al. Autonet: A high-speed, self-configuring local area network using point to point links. Technical Report SRC research report 59. DEC, April 1990. [74] Loren Schwiebert and Renelius Bell. The Impact of Output Selection Func tion Choice on the Performance of Adaptive Wormhole Routing. In Pro ceedings of International Conference on Parallel and Distributed Computing Systems, pages 539-544, October 1997. [75] S. Scott. The SCX channel: A new, supercomputer-class system intercon nect. In Proceedings of the third Symposium on Hot Interconnects. IEEE Computer Society, August 1995. [76] S. Scott. The GigaRing channel. IEEE Micro. 16(l):27-34, February 1996. [77] Steven L. Scott and Gregory M. Thorson. Optimized Routing in the Cray T3D. In Proceedings of the Workshop on Parallel Computer Routing and Communication, pages 281-294, May 1994. [78] Steven L. Scott and Gregory M. Thorson. The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus. In Proceedings of the Symposium on Hot Interconnects, pages 147-156. IEEE Computer Society, August 1996. [79] F. Silla and J. Duato. Improving the Efficiency of Adaptive Routing in Networks of Workstations with Irregular Topology. 
In Proceedings of 1997 International Conference on High Performance Computing, pages 330-335, 1997. [80] F. Silla and J. Duato. On the Use of Virtual Channels in Networks of Workstations with Irregular Topology. In Proceedings of the 2nd Parallel Computer Routing and Communication Workshop, June 1997. [81] F. Silla, M.P. Malumbres, A. Robles, P. Lopez, and J. Duato. Efficient Adaptive Routing in Networks of Workstations with Irregular Topology. In Proceedings of Computer Architectures and Communication Protocols, Apr 1997. 186 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [82] F. Silla, A. Robles, and J. Duato. Improving Performance of Networks of Workstations by using Disha Concurrent. In Proceedings of the International Conference on Parallel Processing, pages 80-87. IEEE Computer Society Press, August 1998. [83] A. H. Smai, D. K. Panda, and L. E. Thorelli. Prioritized demand multi plexing (PDM): a low-lateney virtual channel flow control framework for prioritized traffic. In Proceedings of the Fourth International Conference on High-Performance Computing, pages 449-459, 1997. [84] C.B. Stunkel and et al. The SP2 high-performance switch. IBM Systems Journal, 34(2):185-204, 1995. [85] Y. Tamir and L. Frazier. Dynamically-allocated Multi-Queue Buffers for VLSI Communication Switches. IEEE Transactions on Computer, 41(6):725-737, June 1992. [86] A. S. Vaidya, A. Sivasubramaniam, and C. R. Das. LAPSES: A Recipe for High Performance Adaptive Router Design. In Technical Report CSE-98- 010, Department of Computer Science and Engineering, The Pennsylvania State University, July 1998. [87] Sugath Wamakulasuriya and Timothy Mark Pinkston. A Formal Model of Message Blocking and Deadlock Resolution in Interconnection Net works. to appear in IEEE Transactions on Parallel and Distributed Systems, 11 (3) :212— 229, March 2000. [88] VV-D. Weber, S. Gold, P. Helland, T. Shimizu, T. Wicki, and W. Wilcke. 
The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High- performance Servers. In Proceedings of the 24th International Symposium on Computer Architecture, pages 98-107. IEEE Computer Society, June 1997. [89] T. K. Woodward, A. L. Lentine, K. W. Goossen, J. A. Walker, B. T. Tseng, S. P. Hui, J. Lothian, and R. E. Leibenguth. Demultiplexing 2.48 Gb/s Optical Signals with a Lower-Speed Clocked-Sense-Amplifier-Based Hybrid CMOS/MQW Receiver Array. In Spatial Light Modulators Technical Digest. pages 52-54. Spring Topical Meeting, March 1997. [90] K. Yamakoshi, K. Nakai, N. Matsuura, E. Oki, and N. Yamanaka. 5-Tb/s frame-based ATM switching system using 2.5-Gb/s x 8-/spl Lambda/ optical switching technique. In IEEE Workshop on High Performance Switching and Routing, pages 88-92, 2001. 187 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [91] T. Yokota. H. Matsuoka, K. Okamoto, H. Hirono, and S. Sakai. A High- Performance Router Design for the Massively Parallel Computer RWC-1. In Proceedings of the Hot Interconnects Symposium, August 1995. [92] Zhenwei Yu, Yao Wang, Yucai Ping, Hongrong Zhai, and Mingrun Li. AMS:a new associated memory’ based ATM switch. In International Conference on Communication Technology Proceedings, volume 2, pages 1065-1068, 2000. [93] K.Y. Yun and V.L. Do. A scalable priority queue manager architecture for output-buffered ATM switches. In Proceedings. Eight International Confer ence on Computer Communications and Networks, pages 154-160, 1999. 188 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Asset Metadata
Creator: Choi, Yungho (author)
Core Title: Deadlock recovery-based router architectures for high performance networks
School: Graduate School
Degree: Doctor of Philosophy
Degree Program: Computer Engineering
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: engineering, electronics and electrical, OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Pinkston, Timothy M. (committee chair), Gaudiot, Jean-Luc (committee member), Heidemann, John (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-145719
Unique Identifier: UC11329252
Identifier: 3054723.pdf (filename), usctheses-c16-145719 (legacy record id)
Legacy Identifier: 3054723-0.pdf
Dmrecord: 145719
Document Type: Dissertation
Rights: Choi, Yungho
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA