DISHA: A TRUE FULLY ADAPTIVE ROUTING SCHEME

by

Anjan Kumar Venkatramani

A Thesis Presented to the
FACULTY OF THE SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA

In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE IN COMPUTER ENGINEERING

May 1995

Copyright 1995 Anjan Kumar Venkatramani

This thesis, written by Anjan Kumar Venkatramani under the guidance of his Faculty Committee and approved by all its members, has been presented to and accepted by the School of Engineering in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering.

Abstract

This thesis proposes a new deadlock-free routing strategy. Disha is a deadlock recovery routing scheme. It is premised on the notion that deadlocks are rare and is designed to maximize performance in the common case. Routing is true fully adaptive, may even be non-minimal, and is designed without regard for deadlocks. Deadlock cycles, upon forming, are efficiently broken by progressively routing blocked packets through a deadlock-free lane. This lane is implemented using a central "floating" Deadlock Buffer resource in routers that is accessible to all neighboring routers. With this alternate recovery lane, Disha successfully decouples flow resources from deadlock avoidance resources. Recovery from deadlocks can be either sequential or concurrent. Sequential recovery requires exclusive access to the escape lane; mutual exclusion can be implemented with a Token. Concurrent recovery requires structured routing on the deadlock-free lane. It allows simultaneous recovery from cycles and eliminates the Token hardware. This thesis describes both of these alternate design approaches with formal proofs that they are deadlock-free. Exhaustive performance studies show that this novel recovery scheme is far superior to previously proposed preventive schemes.

Dedication

To my mom, Dr.
Gayatri Ramani, and my dad, Mr. Venkatramani.

Acknowledgments

I thank my advisor, Professor Timothy Pinkston. He has contributed in a large way to both my professional and personal development. He has been not only a great advisor but a good friend as well, easy to access and communicate with. I now know how useful organization and planning are. If I can now write a good paper or give an inspiring talk, it is solely due to his efforts.

I would also like to thank Jose Duato. I have profited greatly from the large correspondence I had with him. He took great pains to answer my smallest doubt. His work and accomplishments have been and will always be a major source of inspiration to me.

Co-members of the SMART Interconnects group went out of their way to share their computing resources every time "hope" was lost. This work could not have been done in such a short time without their cooperation. I specially thank Seelan for his friendship and support.

I have to acknowledge my friends who have, knowingly or not, shaped my life in the past years. Their continued support, friendship and love will always be a part of my life. Jaya has always been there for me, to share my successes and failures, happiness and sorrows. I picked up my happy-go-lucky attitude from Ashok. We had fun climbing mango trees; we had fun sleeping at bus stops. Nigam introduced me to the world of music and merry-making. He took me from Vag to Krec. Hiral and I have been together for the past 19 years, ever since kindergarten. Together we built confidence in ourselves and our abilities. Together we learned that nothing was impossible for us. I can always count on them, and they surely are the best friends anyone could ever have.

Most of all, I thank my parents for their love and bottomless support. They gave me everything I could possibly want; I never had a chance to ask. They were there when I was sick and hopeless; they were there when I was lost and dejected.
They always believed that I would accomplish all that I hoped to. My dad was ever patient and motivated me to set higher goals. My mom wanted me to be healthy and happy, nothing more. She did everything she possibly could to keep me that way, and much, much more. Her efficiency is unmatched. If I am ready to take over the world today, it is due to their pain and effort. I dedicate this work to them.

I will always remain indebted to my grandparents for all that they did when Mom and Dad weren't around. I love my sister Neeru more than anyone else in the world. I will strive to be the ideal brother she can always look up to.

Table of Contents

1 Introduction
2 Background
3 Implementing Disha
4 Sequential Recovery
5 Concurrent Recovery
6 Performance Evaluation
7 Conclusions and Future Work
References

1 Introduction

Adaptive wormhole routing in multiprocessor interconnection networks is susceptible to deadlocks. Prevention has been the traditional solution to deadlocks. Preventive schemes suffer from high hardware costs and/or losses in adaptivity. To prevent deadlocks, virtual channels are grouped into classes, and not all of them are available to packets at a router. Packets are generally restricted to belong to certain classes, and this leads to reduced utilization of network resources and a subsequent loss in performance.
To enhance performance without sacrificing adaptivity, preventive schemes require a large number of virtual channels, significantly more than the minimum needed for prevention. Increasing virtual channels adversely affects machine performance: virtual channels increase both the crossbar size and the Virtual Channel Controller (VCC) complexity, slowing down the router. It has been shown that every additional virtual channel slows down the router by about 20-30%. A cost-performance trade-off ensues, and for optimum performance the number of virtual channels should be appropriately chosen. Other preventive schemes that do not require a large number of virtual channels are only able to accomplish this by restricting the routing algorithm to being partially adaptive. Performance in such schemes is also drastically affected.

Disha* is a simple, efficient and cost-effective routing strategy that considers deadlock recovery as opposed to prevention. (*Disha means "direction" in Hindi.) It tries to extract advantage from the fact that deadlocks are rare. If deadlock occurrences are extremely rare, then it does not make sense to limit the adaptivity of the routing scheme to solve an infrequent event. Limiting adaptivity also reduces the fault tolerance capabilities of the system. Nor does it make sense to devote some virtual channels specifically to prevent deadlocks; virtual channels should be used only as a means of improving flow, such as with adaptive routing. Performance in Disha is optimized in the absence of deadlocks by allowing maximum flexibility in routing. Routing is allowed to be fully adaptive in the truest sense, and packets have a choice of all virtual channels at a router without any classification amongst them. Cycles, upon forming, are broken by progressively routing blocked packets through a deadlock-free lane.
This lane is implemented using a central "floating" Deadlock Buffer resource in routers which is accessible to all neighboring routers along the path. The router design is thus kept extremely simple, with the bare minimum of hardware being devoted to the recovery path. Virtual channels in Disha serve the purpose of improving flow, while Deadlock Buffers are devoted to recovery. Thus Disha successfully decouples deadlock avoidance resources from flow control resources. The goal here is to make the common case fast, and Disha is a solution to this.

To illustrate how a potential deadlock situation can be recovered from, consider the following example. Figure 1 shows a deadlock situation in a mesh-type interconnection network, with a dependency cycle formed by four packets waiting on one another, allowing none of them to proceed. As shown, P1 -> P2 -> P3 -> P4 -> P1, where "a -> b" means "a" is waiting on "b". The packets are deadlocked.

Fig. 1: A deadlock scenario in a mesh-type interconnection network.

Figure 2 shows how the recovery scheme works. The special buffer path can be viewed as an alternate network interleaved with the original. Using this alternate path, a packet, upon detecting that it is deadlocked, can reach its destination, where it is sunk. This breaks the deadlock cycle, and the other packets are able to proceed. In the figure, P1 is able to use the alternate network to reach its destination. P4 is now able to go through, followed by P3 and P2. Hence, recovery is achieved.

Fig. 2: Deadlock recovery with Disha.

Disha is reminiscent of Duato's algorithm of having two virtual networks: one susceptible to deadlocks (possibly adaptive) and the other deadlock-free (possibly deterministic). However, there are significant differences. Escape paths in Duato's scheme use edge buffers. Depending on the topology (mesh or torus), more than one virtual channel might be required.
However, the escape channel in Disha is a single Deadlock Buffer central to the router. This buffer is shared between neighboring nodes and, unlike edge buffers, is not dedicated to any path. In Disha, the cost complexity of the VCC remains unchanged, as the Deadlock Buffer is a central input resource as opposed to an edge resource. The crossbar complexity is increased by just one at the input and is unchanged at the output. Note that the additional virtual channels needed for escape paths in preventive schemes can increase the crossbar complexity by the node degree and similarly increase the complexity of the VCC.

Recovery through the Deadlock Buffers in Disha is, however, not as simple as recovery via the escape channels in Duato's scheme. Deadlocks could arise even under deterministic routing on the special buffer path, as shown in Figure 3. Even though P1 is being routed along the y-dimension, it still waits on P2, which is being routed along the x-dimension, because P2 has occupied the Deadlock Buffer of that particular router. Likewise P2 -> P3 -> P4 -> P1, leading to a dependency cycle on the special buffers. Hence care must be taken to ensure that there are no deadlocks on the special Deadlock Buffer path.

Fig. 3: Deadlocks could occur on the special buffer path under x-y routing.

Recovery from deadlocks can be done either sequentially [2] or concurrently [3]. Under sequential recovery, routers need to obtain exclusive access to the Deadlock Buffer path before they are allowed to place a packet on it. Mutual exclusion could be implemented with a Token. Once exclusive access is obtained (i.e., by capturing the Token), a router can send out a deadlocked packet with the corresponding Status line asserted to indicate that this packet should be placed on the Deadlock Buffer at the next router. The packet continues to use Deadlock Buffers at subsequent nodes until it is delivered at its destination.
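The Token discipline amounts to mutual exclusion on a ring of routers. The following sketch models it in software at the event level; the class and method names are illustrative assumptions, not the thesis's hardware protocol (the actual asynchronous Token-passing circuit is described in Chapter 4).

```python
class TokenRing:
    """Software model of Token-based mutual exclusion along a ring of routers.

    Assumption: the Token path visits routers 0..n-1 cyclically; only the
    router currently holding the Token may place a packet on the Deadlock
    Buffer path.
    """

    def __init__(self, num_routers):
        self.num_routers = num_routers
        self.holder = 0  # router that currently holds the Token

    def try_capture(self, router):
        """A router may begin sequential recovery only if it holds the Token."""
        return self.holder == router

    def pass_token(self):
        """Assert the Token line: the Token moves to the next router."""
        self.holder = (self.holder + 1) % self.num_routers


ring = TokenRing(4)
print(ring.try_capture(0))  # True: router 0 starts with the Token
print(ring.try_capture(2))  # False: router 2 must wait for the Token
ring.pass_token()
ring.pass_token()
print(ring.try_capture(2))  # True: the Token has circulated to router 2
```

Because only one router can hold the Token at a time, at most one deadlocked packet occupies the Deadlock Buffer path, which is why sequential recovery needs no structure on that path to remain deadlock-free.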
Concurrent recovery does not require routers to obtain exclusive access to the Deadlock Buffer path before they are allowed to route deadlocked packets. By appropriately structuring routing on the Deadlock Buffers, concurrent recovery is possible. This allows parallel recovery from single deadlock cycles and simultaneous parallel recovery from multiple cycles. This, too, requires just a single central Deadlock Buffer, and it completely eliminates the Token, which could be a possible single point of failure.

Exhaustive simulations show that this design philosophy is extremely efficient and results in superior performance. Hardware devoted to the recovery path is the bare minimum and slows down the router by only about 1%. This makes the scheme extremely simple and cost-effective, thereby enabling the design of fast routers. The scheme is flexible as well; options for statically or dynamically varying the degree of misrouting and the time-out (before resorting to recovery) are just a few examples of the freedom this scheme offers. Disha thereby provides the framework for designing and testing a wide variety of alternate routing approaches.

In summary, Disha provides the framework for supporting deadlock-free adaptive wormhole routing with the following advantages: 1) true fully adaptive wormhole routing with no virtual channels, 2) applicability to any interconnection network topology, 3) simple design of fast routers, and 4) progressive routing with no abort-and-retry of deadlocked packets.

The remainder of this thesis is organized as follows. Chapter 2 gives the background for this work. Chapter 3 describes the implementation aspects of the Disha recovery scheme. Chapter 4 is devoted to Sequential Recovery in Disha and a formal proof that this scheme is deadlock-free. Hardware support for a fast asynchronous Token passing protocol is also provided.
Chapter 5 discusses Concurrent Recovery and a proof that a single Deadlock Buffer is sufficient for safe recovery in Disha without mutual exclusion. Exhaustive simulations are provided in Chapter 6. Chapter 7 gives the conclusions and scope for future work.

2 Background

The interconnection network is the backbone for communication in a multiprocessor/multicomputer environment. System performance is determined not only by the effective utilization of multiple processor nodes but also, to a large extent, by efficient communication amongst them. This has been a topic of much research in recent years, and progress has been made on different fronts.

Let us first consider switching options, which determine how packets occupy channels along the path from the source to the destination. The earliest interconnection networks employed store-and-forward routing. The complete packet is stored at a node and is forwarded in its entirety from one node to the next. Figure 4 shows an example of this switching technique.

Fig. 4: Store-and-forward switching.

This scheme suffers from large packet delays. Latency is proportional to (Packet Size x Distance). Also, buffer depths at a node should be at least as large as the maximum packet size. In order to reduce this delay for more efficient flow of packets in the network, pipelined versions of this scheme such as virtual-cut-through [20] and wormhole switching [26] were introduced. An example of pipelined packet transmission is shown in Figure 5.

Fig. 5: Pipelined packet transmission - virtual cut-through / wormhole.

These two schemes differ in the action taken when packets block. Virtual-cut-through buffers the entire packet at a node when it blocks, while wormhole allows the flits of a packet to occupy buffer resources at different routers. Packet delays in both these schemes are drastically reduced due to pipelining. Latency is proportional to (Packet Size + Distance).
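The two latency proportionalities can be made concrete with a simple cycle count. The sketch below assumes one flit transferred per hop per cycle and ignores routing and arbitration delay; the function names are ours, not the thesis's.

```python
def store_and_forward_cycles(packet_flits, hops):
    """Each node receives the whole packet before forwarding: ~ size * distance."""
    return packet_flits * hops


def pipelined_cycles(packet_flits, hops):
    """Virtual cut-through / wormhole: the header pipelines ahead while the
    body streams one hop behind it: ~ size + distance."""
    return hops + (packet_flits - 1)


# A 32-flit packet crossing 8 hops in an unloaded network:
print(store_and_forward_cycles(32, 8))  # 256
print(pipelined_cycles(32, 8))          # 39
```

The gap widens with distance, which is why pipelined switching dominates in large networks even though it makes blocked packets hold resources along the path.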
Because virtual-cut-through buffers all the flits at a node, buffer depths should be at least the largest packet size. Conversely, packet sizes are limited to the maximum possible buffer depth in a router. Larger sizes will require packetization and the necessary support at the network interface; at the destination, regrouping of packets would also be required. In wormhole, on the other hand, buffer depths can be as small as one. With small buffer depths the router can be run at a faster rate. Blocked packets, however, continue to occupy resources along the path. This makes wormhole more susceptible to deadlocks.

Another area where progress has been made is in the routing algorithm itself. Adaptive schemes have been proposed [18] as alternatives to deterministic routing schemes [10] to increase the efficiency of routing, which determines the actual path taken by messages in the network. Routing algorithms can be partially or fully adaptive depending on the flexibility they provide in selecting minimal paths. Under deterministic routing (also known as oblivious routing), packets follow a fixed path depending entirely on the source and destination addresses. This scheme is highly restrictive, does not allow packets to adjust to varying traffic conditions, and is not fault tolerant. It does, however, have the advantage of being easy to implement, leading to simple and fast routers. In adaptive schemes, message paths are selected based on local network traffic. Adaptivity allows increased utilization of the available paths from source to destination, providing improved communication efficiency.

Figure 6 shows an example of how adaptive routing can decrease packet latencies and increase network throughput. The figure shows a matrix transpose traffic pattern where nodes at (j, i) have destinations at (i, j). With oblivious routing (Figure 6a), only one of these packets is able to proceed; the others get blocked and have to wait.
Adaptive routing (Figure 6b), on the other hand, offers alternate paths, and every packet can now make progress.

Fig. 6a: Oblivious routing. Fig. 6b: Adaptive routing.

Adaptive schemes also offer additional flexibility to bypass congestion or faulty nodes. Figure 7 shows an example of a network with a faulty channel. Under oblivious routing (Figure 7a), the fault partitions the network and some packets can never reach their destination. Adaptive non-minimal routing (Figure 7b) or backtracking [1] can be used to route around the fault.

Fig. 7a: Fault partitions the network under oblivious routing. Fig. 7b: Routing around the fault adaptively.

From the above discussion, it would appear that the scheme to adopt would combine wormhole flow with fully adaptive routing capabilities. This potent combination, however, poses a few problems. Under wormhole movement, messages continue to remain in the network, blocking other packets that require the use of resources they occupy. Packets could end up waiting for resources endlessly, in which case they would be deadlocked, as shown in Figure 8. In the example shown, all packets have destinations two hops away. Packet P1 waits on packet P2 at router N2, packet P2 waits on P3 at N3, and P3 waits on P4, which in turn waits back on P1, leading to a cyclic dependency. Also, adaptive non-minimal schemes allow packets to be routed away from the destination and, if continuously routed in this manner, they may never reach the desired node, leading to livelock.

Fig. 8: A deadlock scenario wherein P1 -> P2 -> P3 -> P4 -> P1.

Most wormhole adaptive schemes focus on deadlock prevention. These schemes use the concept of virtual channels [10] to prevent deadlocks. Virtual channels are a logical abstraction of physical channels. Providing separate physical networks is costly in dollars and greatly increases the chip pin-out.
To overcome this, each router is provided with a number of buffer classes that are multiplexed onto the physical channel by the Virtual Channel Controller (VCC), as shown in Figure 9. These buffer classes serve as the abstraction of physical channels.

Fig. 9: The concept of virtual channels.

Figure 10 shows how virtual channels can be used for deadlock prevention. Most preventive schemes group virtual channels into classes, and packets are restricted to belong to only certain distinct classes. In the example, the interconnection network in Figure 10a is susceptible to deadlocks. In Figure 10b, two virtual channels are provided, and they are split into High and Low class channels. Packets going from lower-numbered nodes to higher destinations must use the High channels, and packets from higher to lower destinations must use the Low channels. Thus the two virtual channels form two distinct logical networks. It is easy to see that there can be no cyclic dependencies and consequently no deadlocks.

Fig. 10a: Network susceptible to deadlocks. Fig. 10b: Breaking cycles by adding virtual channels.

In addition to preventing deadlocks, virtual channels serve an additional purpose: they can be used for improving flow in the network [11]. An example of this is shown in Figure 11. In the absence of virtual channels, packet P1 waits on packet P2, which is in turn blocked, as shown in Figure 11a. With virtual channels, packet P1 can route around packet P2 to make progress, as shown in Figure 11b. Virtual channels thereby decrease packet latencies and increase network throughput.

Fig. 11a: P1 waits on P2. Fig. 11b: P1 gets around P2 using virtual channels.

Preventive schemes suffer from high hardware costs and/or losses in adaptivity. Consider first how adaptivity can suffer.
The Turn Model [25], which does not require virtual channels, prevents deadlock by prohibiting those turns that could result in the formation of cycles in the channel dependency graph [10]. As a result, the adaptivity of routing algorithms based on the Turn Model is limited. The West-first algorithm, for example, demands that all packets be routed non-adaptively west first before they can be routed adaptively in other directions. Hence, if packets encounter congestion along the west direction, they must block, resulting in poor performance. Simulation studies show that such schemes produce unbalanced traffic conditions and are often outperformed even by non-adaptive algorithms [9]. Such partially adaptive schemes also cannot be fault tolerant.

Another partially adaptive deadlock avoidance scheme is Planar-Adaptive Routing [5], which requires six virtual channels for n-dimensional toroidal networks. It ensures deadlock freedom by restricting adaptivity to at most two dimensions at a time and structuring the passage of packets from one adaptive plane to another. Again, adaptivity suffers because idle channels along minimal paths in the n - 2 other dimensions are automatically excluded by the routing algorithm, since they lie outside of the adaptive plane. Also, only one of the six virtual channels is available to a packet.

To increase adaptivity, Linder and Harden [24] proposed a fully adaptive deadlock avoidance routing algorithm that requires an exponential growth (based on dimensionality) in the number of virtual channels to eliminate cycles in the channel dependency graph. Even with this large number of virtual channels, only a small subset of those channels is available to each packet, as they are grouped into ordered levels or classes.

Consider next how avoidance schemes can increase hardware complexity.
Increasing adaptivity (for deadlock avoidance routing) by adding additional virtual channels generally results in increased router switch and Virtual Channel Controller (VCC) complexity. A study by Chien [6] shows that the increased complexity adversely affects machine performance by increasing the router clock cycle time. In fact, deterministic schemes can outperform certain adaptive schemes which require many virtual channels. This is a motivation behind recent adaptive avoidance schemes which minimize the required number of virtual channels.

Dally and Aoki's Dynamic Routing Algorithm [12] is a more efficient fully adaptive scheme that reduces the number of virtual channels required. Deadlocks are avoided by not allowing cycles in the packet wait-for graph. Each packet keeps track of the number of dimension reversals it suffers. A packet is routed adaptively on any of the channels in the adaptive class until it is blocked with all suitable output channels being used by packets with equal or lower values of dimension reversals. The packet is then forced onto the deterministic class of channels, which is deadlock-free, and must remain there until routed to its destination. Here, a packet's dimension reversals (relative to other packets at a given router) ultimately place an upper bound on adaptivity.

Duato's fully adaptive deadlock avoidance algorithm [13, 15] provides additional flexibility in routing. Similar to the previous scheme, virtual channels are divided into adaptive and deterministic classes. Packets can be routed with full adaptivity using any available adaptive channel and can be routed with partial adaptivity using the deterministic channels, which form escape paths. Only on blockage must a packet use an escape channel at the congested router; at subsequent routers, it is free to go back onto the adaptive channels (if free).
The existence of a connected escape path which has no cycles in its extended channel dependency graph is sufficient for deadlock avoidance [15].

The last two schemes require few additional virtual channels above that required for deterministic routing, but inefficiencies still exist. For example, from Chien's model [6], if one determines that the optimum number of virtual channels for the expected load conditions is three for a toroidal n-cube, these schemes can use only one virtual channel for fully adaptive routing. Chien [6] shows that every additional virtual channel requires a 30-50% increase in load to justify its decrease in router speed. This being the case, it seems unjustified to allow fully adaptive routing on only a fraction of the total number of virtual channels and to restrict the choice of virtual channels along a dimension. Likewise, if one determines that the optimum number of virtual channels for flow control purposes is two for a toroidal network, fully adaptive routing is not permitted by either scheme.

Deadlock recovery is a viable alternative to avoidance. Simulation studies have shown that deadlocks generally are rare [21, 27]. Because deadlocks are infrequent events, recovery seems to make more sense than prevention. In [27], an abort-and-retry mechanism is proposed as a technique for preventing and/or recovering from deadlocks. Compressionless Routing [21] is another fully adaptive deadlock recovery scheme based on abort-and-retry that simply kills deadlocked packets. Although these schemes are simple and do not require virtual channels, they do have some implementation disadvantages. For instance, in [21], padding decreases effective channel utilization; when packet sizes are small compared to the network diameter or the buffers at routers are deep, the overhead due to padding is large. Packets have to be stored to allow retransmission if necessary, and killed packets suffer increased latencies.
Moreover, injector and receiver interfaces, additional counters, and status lines increase hardware complexity.

We believe routing should be fully adaptive in its truest sense and that the router design should be extremely simple. The goal here is to make the common case fast, and Disha is a solution to this. Figure 12 shows a simulation study indicating that less than 2% of packets deadlock. This being the case, it is not advantageous to limit adaptivity at the expense of an infrequent event. Doing so negatively impacts performance and the fault-tolerance capabilities of the system. Likewise, it is not advantageous to devote virtual channels specifically to prevent deadlocks; virtual channels should be used only as a means of improving flow control [11], such as with adaptive routing.

Fig. 12: Frequency of deadlocks.

Disha is a deadlock recovery strategy that provides a framework for supporting deadlock-free fully adaptive wormhole routing. It effectively decouples routing flexibility from flow control resources by dedicating nominal hardware towards efficient deadlock recovery. Fully adaptive routing is permitted with no virtual channels, irrespective of the network topology. Virtual channels limit only the flow capabilities of the network, not routing adaptivity.

Figure 13 shows a simulation study of the effect of virtual channels on network performance. This simulation was run for uniform traffic without regard for deadlocks: deadlocked packets were simply killed. This confirms results obtained in [11].

Fig. 13: Effect of virtual channels.

Figure 14 shows the same results taking into account the cost of virtual channels based on Chien's model [6].
This graph demonstrates that three to four virtual channels are sufficient for flow, assuming all are available for use at a node (four virtual channels were found necessary for non-uniform traffic patterns). If more virtual channels are used, then the cost of providing them surpasses the performance improvement obtained, and net performance is in fact reduced.

Fig. 14: Effect of virtual channels taking cost into account.

The following conclusions can be drawn:
- Flow requirements should decide the number of virtual channels.
- Existing virtual channels should be used effectively and efficiently.

The classification of algorithms as non-adaptive, partially adaptive or fully adaptive has been based on the choice of minimal physical paths from source to destination [9, 25]. This classification does not account for the increased flow additional virtual channels provide. Drawing comparisons among schemes that are fully adaptive but require differing numbers of virtual channels is therefore difficult. A more accurate classification could be based on the number of minimal virtual channel paths from source to destination. Figure 15 shows the distinction between the two definitions. For the source-destination pair (N1, N3), there are only two physical channel paths but sixteen virtual channel paths. Preventive schemes may not allow a choice of all virtual channels, and some of them cannot be used even if they happen to be free. This reduces performance, and more virtual channels will have to be provided to increase flow. This then slows down the router, and increasing virtual channels may not even be beneficial.

Fig. 15: Two physical paths and sixteen virtual channel paths are possible between N1 and N3.

The Degree of Adaptivity (DoA) is formally defined as the ratio of the number of minimal virtual channel paths from source to destination offered by the routing algorithm to the maximum possible number of virtual channel paths from source to destination.
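The DoA numerator for a true fully adaptive scheme can be counted directly: in a 2D network, a minimal route of x + y hops has (x+y)!/(x!y!) interleavings of its hops, and with v virtual channels per physical channel each hop independently multiplies the count by v. A sketch of this counting argument (the function name is ours):

```python
from math import factorial


def minimal_vc_paths(hops_x, hops_y, vcs_per_channel):
    """Count minimal virtual-channel paths in a 2D mesh/torus:
    hop interleavings times the VC choice made at every hop."""
    hops = hops_x + hops_y
    interleavings = factorial(hops) // (factorial(hops_x) * factorial(hops_y))
    return interleavings * vcs_per_channel ** hops


# The worked example that follows: 8-hop average distance split 4 + 4,
# four virtual channels per physical channel.
print(minimal_vc_paths(4, 4, 4))  # 4587520
```

A preventive scheme's numerator is smaller because its routing function forbids some of these interleaving/VC combinations, which is exactly what the DoA ratio captures.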
To demonstrate the flexibility of choice offered by Disha, consider the following example. Assume a 256-node system connected by a 2D torus with four virtual channels per physical channel. For this configuration, the average distance between source and destination is 8 hops. With minimal routing, packets can choose up to a maximum of 8 virtual channels (when the destination is not along the same row or column as the current node). On average, there are 4^8 x 8! / (4! x 4!) = 4,587,520 VC paths to the destination. Table 1 compares the different schemes based on this new definition. PAR requires a constant six virtual channels independent of dimensionality. Linder and Harden requires six virtual channels for the 2D torus; however, this number grows exponentially with the number of dimensions. The Dally and Aoki scheme offers a choice of four virtual channels to packets on the adaptive path but only one when they are on the escape path. Duato allows packets on escape channels to come back to adaptive channels if they happen to be free at a subsequent router. Disha has a DoA of 100% as long as there are no deadlocks and 0% for packets that are deadlocked. Simulations, however, show that only 2% of the packets are susceptible to deadlocks.

Table 1: Comparison between different routing schemes.

Routing Scheme          | Adaptivity | Min. VCs        | Choice/8 | DoA
Deterministic (DOR)     | None       | 2               | 2        | ~0%
Turn Model              | Partial    | 2               | 3 (Ave.) | 0.2% (Ave.)
Planar Adaptive Routing | Partial    | 6               | 2        | ~0%
Linder & Harden         | Full       | 6 (exp. growth) | 2        | ~0%
Dally & Aoki (Dynamic)  | Full       | 3               | 4-A; 1-E | 0.4% (Max)
Duato (Sufficient)      | Full       | 3               | 5        | 3%
Disha                   | True Full  | 1               | 8        | 100% / 0%

3 Implementing Disha

In comparison to many adaptive schemes, Disha's hardware requirements are significantly less for a much higher degree of adaptivity, resulting in faster router operation.
This chapter examines these important implementation issues and explains the recovery mechanism in Disha.

Channel Specifics:

Fig. 16: Channel details (Data, Ready, Send, Status and Token lines).

Figure 16 shows the implementation details for a unidirectional channel. Ready and Send are flow control signals used for handshaking between routers Ra and Rb. Router Ra asserts the Ready line after it has placed data on the channel. Rb uses this to latch onto the data on the channel and asserts Send when it is ready to accept more. The Status line is used to differentiate between incoming flits intended for input buffers or the Deadlock Buffer. Sequential recovery, explained in the next chapter, requires exclusive access to the Deadlock Buffers. The Token line is used to implement mutual exclusion by circulating a Token. Concurrent recovery (Chapter 5) does not require mutual exclusion. A router asserts the Token line to pass the Token on to the next node along the Token path. The Token line is indicated by a dotted line as it is not required with every unidirectional channel; only those channels along the Token path have it. Token is additional to what is found in other routers, while Status would be additional only if the number of virtual channels is exactly 2^m, where 'm' is the number of status lines required to distinguish between the virtual channels. For example, if the number of virtual channels is 3, 5, 6, 7, 9, ..., we would not need an additional Status line because this signal can be encoded on the existing lines used for differentiating between the virtual channels.

Router Architecture:

Figure 17 illustrates, at a conceptual level*, the router architecture for a mesh-type interconnection network. DB is the Deadlock Buffer which constitutes the escape path. This buffer is central to the router and is on the input side alone.
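Returning briefly to the Status encoding described under Channel Specifics, the rule (a separate Status line is needed only when the VC count is exactly a power of two, leaving no spare code) can be checked with a short helper. The sketch and its name are ours, not the thesis's:

```python
def needs_extra_status_line(num_vcs: int) -> bool:
    """True when the m VC-select lines are fully used (num_vcs == 2^m),
    leaving no spare code to mark Deadlock Buffer flits."""
    return num_vcs & (num_vcs - 1) == 0  # power-of-two test

print([v for v in range(2, 10) if not needs_extra_status_line(v)])
# -> [3, 5, 6, 7, 9], matching the VC counts listed above
```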
The concept of this central Deadlock Buffer is similar to the central queue proposed in the Chaos router [23] for virtual cut-through switching. The queue in the Chaos router receives messages through a bus from all the input frames (the Deadlock Buffer could be implemented similarly). The Si's are the Status lines that enable or disable the tri-state buffers. In the absence of deadlocks, these lines are unasserted and the packet arriving on a channel is placed into the input buffer. When a flit comes in with the corresponding Status line asserted, data on the channel by-passes the input buffer to be placed directly into the Deadlock Buffer.

* Note: Not all components of the router are shown (i.e., flow controller, address decoder, etc.). Only those components which make Disha distinctive from basic router designs are featured.

The Decision and Control Logic arbitrates amongst the signals it receives from the address decoders and, based on this, generates control signals to make connections in the Crossbar Switch. The central Deadlock Buffer at the crossbar input constitutes the recovery path. This increases the crossbar complexity by just one input. The complexity of the Virtual Channel Controller (VCC) that multiplexes virtual channels onto the physical channel remains unchanged, as the DB is at the crossbar input alone.

Fig. 17: Deadlock Recovery Router Block Design.

Deadlock Detection:

Deadlocks can be detected using a time-dependent selection function. The idea is similar to those suggested in [17] and [21], and Figure 18 shows the deadlock detection circuitry. Telapsed and Tout are counters; Telapsed keeps track of the number of clock cycles for which the router is unable to send out the header, and Tout is loaded with a constant threshold time-out selected to optimize system performance.
"Deadlock" indicates whether or not there is a deadlock for that packet. This deadlock detection circuitry is not required by deadlock preventive schemes.

Fig. 18: The circuitry for deadlock detection.

The router waits for the arrival of a packet. On receiving one, it resets Telapsed. It then attempts to send out the header. If unable to do so, Telapsed is incremented. The router continues to increment Telapsed on every cycle until it is successful in sending out the header or until the time-out (Tout) interval for that packet has been reached. When Telapsed > Tout, the router changes its state to "Deadlocked". The selection of a proper time-out interval is important to obtaining optimum performance. Deadlock feeds upon itself in that if cycles are not broken quickly, more and more routers can get engulfed in cycles, and the entire network could come to a standstill. A large Tout can limit peak obtainable performance. On the other hand, small Tout's trigger false deadlock detections. The choice of Tout thus becomes critical.

Crossbar Allocation:

Crossbar inputs and outputs can either be separate or multiplexed. Multiplexing the virtual channels at the crossbar input can cause internal data blocking and a subsequent loss in bandwidth, and is therefore not done. Crossbar outputs, on the other hand, can be separate or multiplexed. With separate outputs, the crossbar connection is allocated for the entire duration of a packet. Multiplexed outputs require the crossbar to be reconfigured anew at every flit transmission. Data through delay can be reduced by not multiplexing output ports [7]; arbitration in such a case is also simpler. Figure 19 shows the different crossbar allocation policies. Multiplexed outputs require a flit-by-flit allocation policy and do not tie up the crossbar port, so a deadlocked packet at the input could easily gain access to the crossbar output.
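The time-out detection described above can be sketched in a few lines. This is an illustrative trace of ours, not the thesis's circuit; the time-out of 32 cycles is an arbitrary choice for the example:

```python
def check_deadlock(blocked_cycles: int, timeout: int) -> bool:
    """Time-out detection: a header blocked for more than `timeout`
    router cycles is presumed deadlocked (Telapsed > Tout)."""
    return blocked_cycles > timeout

t_elapsed, deadlocked = 0, False
for cycle in range(40):
    header_sent = False          # assume the output channel stays busy
    if header_sent:
        t_elapsed = 0            # any progress resets the counter
    else:
        t_elapsed += 1
    if check_deadlock(t_elapsed, timeout=32):
        deadlocked = True
        break
print(deadlocked, t_elapsed)  # -> True 33
```

As the text notes, the threshold trades false detections (too small) against recovery latency and engulfed cycles (too large).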
A packet-by-packet allocation, however, requires reconfiguration of the crossbar to break a cycle.

Fig. 19: Different crossbar allocation policies (packet-by-packet vs. flit-by-flit).

Crossbar Reconfiguration:

Crossbar reconfiguration is required only for a packet-by-packet allocation scheme, as explained above. In the absence of data in the Deadlock Buffer, the crossbar is configured as specified by the Decision and Control Unit's arbitration logic. Under a deadlock situation, however, it could so happen that the Deadlock Buffer input to the crossbar needs to be connected to the same output as is currently being used by some other packet in a normal input buffer. The crossbar would then have to be reconfigured to connect the Deadlock Buffer to the required output terminal. The decision logic thus needs to remember the state of the crossbar before it was reconfigured so that it can be reconnected once deadlock has cleared. To illustrate this point, Figure 20a shows the initial crossbar configuration before a deadlock situation. With deadlock and data requiring to be put out on X+, the crossbar is reconfigured as shown in Figure 20b. The reconfiguration buffer stores the state of the crossbar before change-over. This buffer does not need to store the entire crossbar state, but only that of the input that was disconnected. Data in the Deadlock Buffer is guaranteed to reach its destination, and the crossbar is then reconnected using the information stored in the reconfiguration buffer, as shown in Figure 20c.

Fig. 20: Reconfiguration of the crossbar. (The stored notation implies that input 2 needs to be reconnected to output 1 (X+).)

Routers can be implemented with internal flow control so that the edge output buffers are guaranteed to clear. In doing so, the VCC and the internal flow controller cooperatively control intranode data flow.
It accepts only those flits that can be delivered based on internode flow control. Such an implementation could in fact reduce the critical path through the router [7]. Hence, the output buffer will eventually clear and the crossbar can be reconfigured. Keep in mind that the flow is wormhole and that the path is not released until the tail has gone by. Thus, although there might be discontinuities (for example, the previous flits of packet P1 could have advanced while the flits at this router are held back), the Decision Logic remembering the crossbar state before reconfiguration is sufficient guarantee that all flits will ultimately reach the correct destination. Simulations show that less than 2% of the total packets injected seize the Token. We can then expect the number of packets likely to suffer such discontinuities to be around 4 to 5% in the average case, which should not be a problem. We cannot do without reconfiguring because, with the Token held at this router, packet P1 could be blocked upstream, leading to a deadlock situation. The Deadlock Buffer could be given priority over the other input buffers, and data in this buffer allowed to bypass normal data on the output channel; this ensures quick recovery. It is also possible to delay reconfiguration until the packet using that output (P1) either is completely routed or becomes blocked.

Cost Model:

Here, we compare the additional cost of implementing Disha with the *-Channels Router [8], which prevents deadlocks based on Duato's theory [14]. In drawing this comparison, we use typical router parameters and the cost model given in [6]. This model does not, however, take into account the external channel delay. The network is assumed to be a 2D mesh with three virtual channels per physical channel. The routing latency for path set-up is greater than that for flow control (data through). To maximize performance, the router clock is assumed to run at the data through cycle time.
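The delay model from [6] can be put into a short script. The mapping of 'P' to one crossbar input per virtual channel (5 ports x 3 VCs = 15 inputs for the *-Channels router, plus one Deadlock Buffer input for Disha) is our assumption; with it, the script reproduces the 7 ns figure quoted in the text:

```python
from math import log2

def data_through_ns(p_inputs: int, v_chans: int) -> float:
    """Data-through delay for a 0.8 micron CMOS process, per the model
    in [6]: Tfc + Tch + Tvcc."""
    t_fc  = 2.2                          # flow controller
    t_ch  = 0.4 + 0.6 * log2(p_inputs)   # crossbar
    t_vcc = 1.24 + 0.6 * log2(v_chans)   # virtual channel controller
    return t_fc + t_ch + t_vcc

star_channels = data_through_ns(15, 3)   # assumed: 5 ports x 3 VCs
disha         = data_through_ns(16, 3)   # one extra Deadlock Buffer input
print(round(star_channels, 2), round(disha, 2))  # -> 7.14 7.19
```

Under these assumptions, Disha's single extra crossbar input costs well under 0.1 ns, consistent with the text's claim that the increase in data through latency is negligible.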
Path set-up for every packet is required only once at each router and is implemented in multiple clock cycles [7]. Any additional latency that may arise in implementing path set-up in Disha can therefore be hidden. The data through cycle time based on the model in [6] is given as Tdata-through = Tfc + Tch + Tvcc, where Tfc is the flow controller delay, Tch is the delay through the crossbar and Tvcc is the virtual channel controller delay. For a 0.8 micron CMOS process, these module delays are given by Tfc = 2.2 ns, Tch = 0.4 + 0.6 log P ns, and Tvcc = 1.24 + 0.6 log V ns, where P is the number of inputs to the crossbar and V is the number of virtual channels that the VCC must multiplex onto a physical channel. Substituting typical router parameters into this model, the data through delay for the *-Channels router is T*-Channels = 7 ns. The central Deadlock Buffer in Disha increases the size of the crossbar by one input; the size of the VCC remains unchanged. The data through delay for Disha computed from the same model is therefore only negligibly larger. This negligible increase in data through latency is overridden by the advantages that Disha provides in routing flexibility, as shown by simulations in the next section. This is because Disha provides fully adaptive routing on all buffers. The *-Channels Router, on the other hand, can provide fully adaptive routing on only one of the virtual channels for a torus and on only two virtual channels if the topology is mesh-type.

Example of Recovery in Disha:

The complete operation of Disha is explained below assuming routers with output buffers. Figure 21a shows a possible initial state of routers Ra, Rb and Rc. Only one virtual channel is shown for clarity of illustration. Here, packet P1 is blocked by P2 and is unable to proceed. Router Ra is unable to send out the header of P1 for Tout clock cycles (see next section) and assumes packet P1 to be deadlocked (P1 -> P2).
Ra now sends out the header of P1 on X+ with the corresponding Status line asserted. Router Rb places the incoming header of P1 in the Deadlock Buffer, by-passing the normal input buffer currently occupied by packet P2. Now assume that at router Rb, packet P1 requires the same output channel as is currently being used by some other packet P3. For a packet-by-packet allocation scheme, as explained earlier, router Rb reconfigures the crossbar. The output buffer at Rb occupied by P3 is bound to clear, allowing packet P1 to proceed. The state of the crossbar is stored in the reconfiguration buffer for later reconnection. Packet P1 now follows the Deadlock Buffer path through subsequent routers to reach its destination. These actions are indicated in Figure 21b. The flits of packet P3 in router Rc could have moved forward as shown in Figure 21b. Although this might lead to discontinuities in flit reception, all flits of packet P3 are bound to reach the correct destination.

Fig. 21a shows a possible scenario; Fig. 21b shows how deadlocks are resolved.

It is not sufficient to reconfigure the crossbar only once the tail flit has gone by, because the tail flit may never come by. The sole case which presents such a possibility is when a packet deadlocks on itself. This situation is analyzed in depth below.

Fig. 22 shows the path traced by packet P1 (solid lines: normal adaptive path; dotted lines: Deadlock Buffer path).

Figure 22 shows the path traced by a packet P1 in trying to reach from its source node 'S' to its destination 'D'. The solid lines represent the initial adaptive routing path taken by the packet. Packet P1 is forced to misroute at node Rj and continues to misroute until it reaches node Ri.
At this node Ri, packet P1 finds that it can no longer misroute due to the finite upper bound placed on the misroute count to avoid livelock. It is now forced to take a minimal path. The corresponding channel could be busy, following which P1 times out and seizes the Token. It now follows the Deadlock Buffer path shown in dotted lines in Figure 22.

Fig. 23 shows the crossbar states in Router R2.

Figure 23a shows the initial state of the crossbar at router R2. When packet P1 arrives in the Deadlock Buffer at R2, the crossbar will have to be reconfigured as shown in Figure 23b. The leading flits of P1 (from router R2 to R1) will drain along the Deadlock Buffer path to reach the desired destination 'D', where they will be sunk. If router R2 now waits for the tail flit before reconfiguring the crossbar, the tail will never arrive: the trailing flits from 'S' to R2 will never be able to reach the destination, and the buffers once occupied by P1 will never be released (in wormhole routing the tail flit releases the buffers). This problem could be handled in software by restricting routing to only minimal paths. This solution, however, would provide less fault tolerance than is possible. A fully adaptive minimal scheme cannot route around a faulty channel when this channel happens to be on the last unmatched dimension before the destination. To keep the fault tolerant capabilities of Disha intact and to provide the option of misrouting if necessary, we propose a hardware solution. In wormhole routing, if the header flit advances, all succeeding flits can advance as they follow a reserved path. The end of the packet is determined by the tail, upon which the path is released. Also, a router asserts its Ready line (Figure 16) as long as it has data to send. Going back to Figure 22, when all the leading flits at router R2 have drained, router R1 no longer has any data to send. Consequently, its Ready line is pulled down.
Router R2, meanwhile, has not received the tail flit. It now reconfigures the crossbar to allow the trailing flits of packet P1 to proceed, as shown in Figure 23c. The trailing flits need only take the Deadlock Buffer path, not the entire path traced by the header. This maintains the continuity of packet P1. Routers on the trace (R3, Ri, etc.) release the path the same way that R2 detected the need for reconfiguration (Ready not asserted with the tail not yet gone by). With the hardware solution suggested above, Disha allows fully adaptive routing inclusive of all non-minimal paths. This provides the maximum possible fault tolerance. When the misroute count has reached zero, only minimal paths are allowed (to keep from becoming livelocked). Hence, a situation could arise where the channel is faulty and misrouting is not allowed. Disha handles this case as an exception; the misroute count is restored to its original value, allowing routes around the faulty channel (or node). In summary, hardware support for recovery via Deadlock Buffers is minimal, and the effect of this additional hardware on the router speed is negligible.

4 Sequential Recovery

Deadlock Buffers form escape paths for packets that are deadlocked. Consequently, routing on these escape channels is required to be deadlock-free. Recovery from deadlocks via the Deadlock Buffers can either be sequential or concurrent. The basic concept of sequential recovery, its hardware requirements and a formal proof that this form of recovery is deadlock-free are described in detail in this chapter. The next chapter is devoted to concurrent recovery. As mentioned in earlier chapters, Deadlock Buffers, unlike edge buffers, are central to a router. Ensuring deadlock-free routing on these buffers is therefore more complex than ensuring freedom from deadlocks with the escape channels in Duato's scheme, which are formed by edge buffers.
A deadlock situation could arise even under deterministic routing on the Deadlock Buffers, as explained in Figure 3. In sequential recovery [2], packets gain exclusive access to the Deadlock Buffers. Once a packet has been granted the right to use the escape channel, it can be placed and routed continually using this resource until it reaches its destination to be sunk there. This breaks the deadlock, and other packets involved in the cycle can make progress sequentially. The packet using the Deadlock Buffers cannot deadlock with other packets, as it has exclusive access to the escape lane. Minimal routing on the Deadlock Buffers ensures that this packet will not deadlock with itself. Thus, safe recovery is achieved.

Implementing Mutual Exclusion:

Capturing a single Token propagating in the network can give packets the required exclusive access to the escape lane formed by the Deadlock Buffers. This requires additional hardware both at the router and in the network. These details are discussed here.

Token Propagation:

Fig. 25: Hardwired Token path.

Figure 25 shows one possible hardwired Token path (a Hamiltonian cycle) for a mesh/torus. As explained in Chapter 3, the router, if unable to send out the header, checks to see if it is deadlocked. This is implemented by comparing the blocked time of the packet with the selected time-out threshold. If blocked longer than the time-out interval, the router prepares to seize the Token. Once the Token is captured, the deadlocked packet is sent out with the Status line asserted, conveying to the next node that the incoming flits are to be placed into the Deadlock Buffer. This packet will be routed to its destination, breaking one dependency cycle. The Token is then regenerated by the destination. Table 2 gives details of the simple logic needed for implementing the Token. The cost of this line is about 1/k-th of a regular status line as it does not include all paths, where 'k' is the node degree.
The status of the DBit indicates whether or not the Token should be captured. As will be explained later in this chapter, the DBit is not necessarily set as soon as deadlock is detected.

Table 2: Token Logic

  Input Token   DBit   Output Token
  0             0      0
  0             1      0
  1             0      1
  1             1      0

The Token can be clocked in synchronism with the router clock. This, however, restricts the Token to propagate at the router speed. On average, a node would then have to wait N/2 clock cycles to capture the Token, where 'N' is the network size (number of nodes). For a 256-node network, the average time to capture the Token would be an unacceptable 128 clock cycles. From the table above we see that the Token implementation minimally requires just one gate on the critical path. Hence, it would be profitable to implement the Token as an asynchronous signal (with respect to the router clock). The Token is now defined as a propagating pulse as shown in Figure 26.

Fig. 26: Token propagation between routers R1 and R2.

Token Capture:

In the figure, PW is the pulse width and Tp the propagation delay from router R1 to R2. Tp includes the gate delay at a router and the wire delay between routers. To ensure that the Token can be safely latched in when required, the pulse width should minimally be approximately two and a half times the latch clock period (router clock). If this were not the case, a situation could arise where the edge of the clock misses the Token. Since the Token is a propagating wide-band pulse, the Token could be simultaneously present in two or more routers at the same time, thus violating the exclusion requirement. With the Token running asynchronously with respect to the router clock, it is possible that router R1 changes the state of the DBit at an instant when the Token is passing through it.
Meanwhile, the initial part of the Token has propagated to router R2, which could already be in a deadlocked state with its DBit set even before the Token arrives. Now both routers R1 and R2 could potentially seize the Token, as shown in the timing diagram in Figure 27.

Fig. 27 illustrates how two routers could potentially seize the Token at the same time and, hence, violate mutual exclusion.

To solve this problem, a router is prohibited from changing the state of the DBit while the Token is passing through it. The routers follow a protocol by which they are allowed to set the DBit only at the falling edge of the Token. When the Token comes around the next time, it is captured and sunk by exactly one router. The DBit is reset by the router after it has placed the packet on the Deadlock Buffer. This protocol ensures mutual exclusion, and the proof of this is presented by analyzing all possible cases.

Case 1: Routers R1 and R2 enter a deadlocked state at an instant when the Token is passing through them. Neither router can seize the Token; both, however, set their DBits at the falling edge of the Token.

Case 2: Router R1 has already set its DBit. Now the Token does not propagate, irrespective of the state of R2. In effect, router R1 seizes it.

Case 3: Router R1 enters a deadlocked state while the Token is passing through it, while router R2 is already in a deadlocked state, having set its DBit in the previous Token cycle. Now R1 will be unable to get the Token, but R2 seizes it.

Fig. 28: A possible Token implementation (deadlock detection, Token capture, Token propagation and Token regeneration logic; the propagation gates form the critical path that contributes to Tp).
This circuitry is included with the Decision and Control Logic (DCL) of Figure 17. Shown in Figure 28 is one possible implementation of the Token as a propagating pulse with mutual exclusion. The state of the latch indicates whether the Token has been captured. The Token logic has just two gates on the critical path, including a gate to propagate a regenerated Token pulse. The Token can therefore be clocked much faster than the router to enable quick deadlock recovery initiation. The Token seizure time is thus reduced considerably from the case where the Token and the router clock were in synchronism. A disadvantage of this protocol is that the routers have to wait a full Token cycle before they can seize it. The average Token capture time is now given by Tp*N/2 + Tp*N = 3*Tp*N/2. For N = 256 and assuming that Tp is 1/5 of the router clock period, the average Token capture time is about 75 clock cycles.

The Token capture time can be further reduced by another factor of three with a clever improvisation. The Token is modified to have two pulses, as shown in Figure 29 below.

Fig. 29: A two pulse-train Token (a router sees pulse P1 before pulse P2, the two separated by interval T2).

The DBit is always set at the falling edge of pulse P1; P1 thus effectively acts as the synchronizer. Pulse P1 then goes on to the next router R2. If router R1 is in a deadlocked state, pulse P2 does not make it to R2. Router R2 could synchronize with pulse P1, setting its DBit. However, if it does not receive pulse P2 in the specified time interval T2, it resets its DBit. The router regenerating the Token issues both pulses P1 and P2. The average Token seizure time is now reduced to Tp*N/2 + T1 + T2, which is approximately 25 clock cycles. This reduction in the Token capture time from 128 clock cycles justifies any increase in Token implementation complexity.
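The Token gating of Table 2 and the capture-time estimates above can be reproduced with a short sketch. Treating the pulse spacings T1 and T2 as negligible is our simplification:

```python
def token_out(token_in: int, dbit: int) -> int:
    """Table 2: the Token propagates only through routers whose DBit is clear."""
    return token_in & (~dbit & 1)

# Average Token capture latency in router clock cycles
# (N = 256 nodes, Tp = 1/5 of a router clock period, as in the text).
N, Tp = 256, 1 / 5
synchronous  = N / 2            # Token clocked with the router: 128
single_pulse = Tp * 3 * N / 2   # wait a full Token cycle: 76.8 (~75)
two_pulse    = Tp * N / 2       # two-pulse Token: 25.6 (~25), T1 + T2 ignored
print(synchronous, round(single_pulse, 1), round(two_pulse, 1))
# -> 128.0 76.8 25.6
```

The roughly fivefold reduction from 128 to about 25 cycles matches the figures given in the text.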
Given that more than a single packet is usually involved in deadlock cycles, the seizure time for deadlock recovery initiation will be further reduced. Moreover, this latency for capturing the Token can be hidden by selecting an appropriate time-out. If, for example, we determine that a time-out of 32 provides optimum performance, and the average time to capture the Token is 25 clock cycles, then the capture latency is hidden by selecting Tout = 32 - 25 = 7.

There is a reliability issue with trying to latch a Token which is asynchronous with the router clock. When the Token is being sampled, if it is not stable for the set-up and hold times of the latch, the flip-flop may go into a metastable state. In such a state, the output is unpredictable, and the latch could theoretically take an indeterminate amount of time to recover. In practice, however, the probability that the flip-flop will stay in the metastable state decreases exponentially with time. This problem is not impossible to solve. Many solutions exist, such as having the Token input itself act as a strobe: the latch is then clocked with the AND of the strobe and the router clock. This is not a general solution to the metastability problem but should work well here.

Token Regeneration:

The Token cannot simply be released by the capturing node once the tail flit has gone by. The tail could have gone by long before the packet reaches its destination. This would be particularly true with small packets in deep networks (large buffer sizes) or networks with large diameters. Another node could then potentially seize the Token, and mutual exclusion on the Deadlock Buffers would be violated. To avoid this, we instead allow the destination to release the Token once it receives the deadlocked packet. The Token information need not even be put in the header.
A packet, once placed on the Deadlock Buffer, continues to follow the Deadlock Buffer path until it reaches its destination. The destination node releases (regenerates) a new Token if it receives a packet on this path, after the packet has completely drained. This ensures that there is always at most one packet on the Deadlock Buffers.

Instead of waiting for the entire packet to drain, the Token could also be released once the header reaches the destination. The header having reached the destination is sufficient guarantee that there will be no cycles on the Deadlock Buffer path, as the packet will be sunk there. This allows the Token to propagate earlier, resulting in even faster recovery from other potential deadlocks. It should be noted that a once-deadlocked router might, at a later time, resume normal packet transmission without ever receiving the Token. This can come about when some other router also involved in the dependency cycle seizes the Token and initiates deadlock recovery to break the cycle. In this case, Telapsed is reset and deadlock detection starts all over again.

Definitions and Theorem for Sequential Deadlock Recovery:

The Disha technique can be applied to any interconnection network. The definitions presented here are exactly the same as those found in [14]; they are included for completeness of the proof of our theorem. Strictly speaking, the theory developed in [14] cannot be directly applied to Disha because Disha uses both edge buffers (traditionally referred to as virtual channels) and central buffers. Additionally, the routing function is different for edge and central buffers. Thus, the routing function must consider the buffer where the packet is stored at the current node. A new formal theory, which is a modification of [14, 15], has been proposed in [3].

Definition 1: An interconnection network I is a strongly connected directed multigraph, I = G(N, C). The vertices of the multigraph N represent the set of processing nodes.
The arcs of the multigraph C represent the set of communication channels connecting the nodes. More than a single channel is allowed to connect a given pair of nodes. Each channel ci has an associated queue. The source and destination nodes of channel ci are denoted si and di respectively.

Definition 2: An adaptive routing function R : N x N -> P(C), where P(C) is the power set of C, supplies a set of alternative output channels from the current node nc to the destination node nd: R(nc, nd) = {c1, c2, ..., cp}. In particular, R(n, n) = {} for all n in N. For a fully adaptive routing function in which all non-minimal paths are allowed, 'p' will be equal to the number of output channels of a node. At the other extreme, for a static routing algorithm 'p' could be just one.

Definition 3: The routing function R for a given interconnection network I is connected iff for all x, y in N with x != y, there exist c1, c2, ..., ck in C such that c1 in R(x, y), cm+1 in R(dm, y) for all m in {1, ..., k - 1}, and dk = y. That is, it is possible to establish a path P(x, y) between them using the channels supplied by R.

Definition 4: A routing subfunction R1 for a given routing function R is one that supplies a subset of the channels supplied by R. Thus R1 restricts the routing options supplied by R. The set of channels supplied by R1 is C1 = union over all (x, y) of R1(x, y). More precisely, R1(nc, nd) = {c1, c2, ..., cq}, where 0 <= q <= p and c1, c2, ..., cq in C1. Needless to say, C1 is a subset of C. The routing subfunction can also be expressed as R1(nc, nd) = R(nc, nd) intersected with C1, for all nc, nd in N.

Definition 5: Given an interconnection network I, a routing function R and a pair of channels ci, cj in C, there is a direct dependency from ci to cj iff ci in R(si, n) and cj in R(di, n) for some n in N. That is, cj can be used immediately after ci by messages destined to some node n.
Definition 6: Given an interconnection network I, a routing function R, a channel subset C1 of C which defines a routing subfunction R1, and a pair of channels ci, cj in C1, there is an indirect dependency from ci to cj iff there exist c1, c2, ..., ck in C - C1 such that ci in R1(si, n), c1 in R(di, n), cm+1 in R(dm, n) for all m in {1, ..., k - 1}, dk = sj, and cj in R1(sj, n) for some n in N. That is, it is possible to establish a path from si to dj for messages destined to some node n. The first and last channels in that path are ci and cj, and they are the only ones supplied by R1. As ci and cj are not adjacent, some other channels belonging to C - C1 are used between them.

Definition 7: A channel dependency graph D for a given interconnection network I and a routing function R is a directed graph D = G(C, E). The vertices of D are the channels of I, and the arcs of D are the pairs of channels (ci, cj) such that there is a direct dependency from ci to cj.

Definition 8: The extended channel dependency graph DE for a given interconnection network I and a routing subfunction R1 is a directed graph DE = G(C1, EE). The vertices of DE are the channels supplied by the routing subfunction R1. The arcs of DE are the pairs of channels (ci, cj) such that there is either a direct dependency or an indirect dependency from ci to cj.

Definition 9: A configuration is an assignment of a list of flits to each queue, all of them belonging to the same packet. The number of flits in the queue for a channel ci will be denoted size(ci). The destination node for a flit fj will be denoted dest(fj). If the first flit in the queue for channel ci is a header flit destined for node nd, then head(ci) = nd. If the first flit is not a header and the next channel reserved by its header is cj, then next(ci) = cj; that is, each flit must follow the same path as its header. A configuration is legal iff for all ci in C, size(ci) <= cap(ci) and ci in R(si, dest(fj)) for all fj in queue(ci).
That is, the queue capacity cannot be exceeded and all the flits stored in the queue have been sent there by the routing function.

Definition 10: A deadlocked configuration for a given interconnection network I and routing function R is a non-empty legal configuration verifying the following conditions:
1) ∀ c_i ∈ C such that head(c_i) ∈ N: head(c_i) ≠ d_i ∧ size(c_j) = cap(c_j) ∀ c_j ∈ R(d_i, head(c_i)).
2) ∀ c_i ∈ C such that next(c_i) ∈ C: size(next(c_i)) = cap(next(c_i)).
In this configuration, no header flit is one step from its destination and no header flit can advance because the queues for all the alternative output channels supplied by the routing function are full. Data and tail flits cannot advance because the next channel reserved by their packet header has a full queue. Also, in a deadlocked configuration, there is no packet whose header flit has already reached its destination.

We make the following assumptions about our implementation.

Assumption 1: The router is implemented with internal flow control such that output buffers clear.

Assumption 2: All buffers have access to the Deadlock Buffer at neighboring nodes.

Assumption 3: A packet that has been placed on the Deadlock Buffer at a node can use only Deadlock Buffers at subsequent nodes until it is delivered at its destination. The channels formed using the Deadlock Buffers constitute the set C_1. In Disha, routing on the Deadlock Buffers is restricted to be minimal.

Assumption 4: Mutual exclusion on the Deadlock Buffers is implemented with a Token [1]. There can be at most one Token travelling across all nodes in the network at any given instant of time.

Assumption 5: Deadlock Buffers can be used by packets only to recover from deadlocks. A packet can be routed on the Deadlock Buffers iff the propagating Token has been captured by that packet. Instantaneous with this action, the Token is inhibited from further propagation until deadlock is resolved.
Assumption 6: The captured Token is released by the destination node that receives a packet header on the Deadlock Buffer. This packet is the one that was originally deadlocked and initiated the need for a Token capture.

Lemma 1: The routing subfunction R_1 is connected.

Proof: Every node is provided with a Deadlock Buffer. From Assumptions 2 and 4, it is obvious that given any pair of nodes (x, y) it is possible to establish a path from x to y using the Deadlock Buffers. Thus R_1 is connected.

Lemma 2: There are no cycles formed by indirect dependencies in the extended channel dependency graph.

Proof: Assumption 3 implies that we cannot find a pair of channels (c_i, c_j) supplied by R_1 such that the channels between them are not supplied by R_1. If c_i is supplied by R_1, then all subsequent channels must be supplied by R_1. Consequently, there are no indirect dependencies in the extended channel dependency graph.

Lemma 3: The routing subfunction R_1 has no cycles in its extended channel dependency graph.

Proof: Deadlock Buffers constitute the set C_1. From Assumption 6, at any given instant of time, there can be at most one packet P_i on the Deadlock Buffers that does not have its header delivered at its destination. Other packets on the Deadlock Buffers have at least one flit delivered at their respective destinations. These packets cannot be involved in any deadlocked configuration (Definition 10). From Assumption 3, routing on the Deadlock Buffers is restricted to be minimal. Hence, packet P_i cannot be deadlocked with itself. Thus none of the packets that are currently on Deadlock Buffers are involved in any deadlocked configuration. Lemma 2 shows that there are no cycles formed by indirect dependencies. It follows that a deadlocked configuration does not involve any of the Deadlock Buffers. Therefore, there exists a subset of channels C_1 that defines a routing subfunction R_1 which has no cycles in its extended channel dependency graph.
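Lemmas 2 and 3 reduce deadlock freedom to the acyclicity of the extended channel dependency graph. As a hedged sketch, acyclicity of any such dependency graph can be checked with a depth-first search; the adjacency-dictionary representation (every channel appears as a key) is our assumption:

```python
# Sketch: checking a channel dependency graph for cycles by DFS.
# graph maps each channel to the set of channels it depends on;
# every vertex is assumed to appear as a key of the dictionary.
def is_acyclic(graph):
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on stack / finished
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY:   # back edge: a dependency cycle exists
                return False
            if color[w] == WHITE and not visit(w):
                return False
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or visit(v) for v in graph)

# An escape lane used in strictly increasing label order is acyclic;
# adding a back edge introduces a cycle.
lane = {1: {2}, 2: {3}, 3: set()}
print(is_acyclic(lane))                # True
print(is_acyclic({1: {2}, 2: {1}}))    # False
```

A lane whose channels can only be used in one fixed order, as the Deadlock Buffer lane is, trivially passes this check.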
Theorem 1: An adaptive routing function R for an interconnection network I is deadlock-free if there exists a subset of channels C_1 ⊆ C that defines a routing subfunction R_1 which is connected and has no cycles in its extended channel dependency graph D_E.

Proof: The reader is referred to Duato [14] for a formal proof.

Theorem 2: "The network can safely recover from deadlocks when deadlocked packets are sequentially routed on the Deadlock Buffers."

Proof: Lemmas 1 and 3 prove that the routing subfunction R_1 is connected and has no cycles in its extended channel dependency graph. The theorem is proved by simply inserting these results into Theorem 1.

To get an intuitive feel for the proof, assume that the network is deadlocked. There may be one or more dependency cycles; assume that there are k of them. A packet can be placed on the cycle-free Deadlock Buffer path. Following this path, the packet is guaranteed to reach its destination, where it will be sunk. This eliminates one cycle, reducing the number of cycles to k - 1. By induction, it follows that the network safely recovers from deadlocks.

5 Concurrent Recovery

Sequential recovery, described in the previous chapter, requires mutually exclusive access to the deadlock-free lane. The major drawback of this implementation is the Token logic. The Token forms the basis through which recovery is achieved and thereby presents a single point of failure. A less significant drawback is the fact that recovery from cycles is done sequentially. This could affect performance if deadlocks become frequent (though this is unlikely, as simulations have shown) or clustered in time. In this chapter, a selection-based concurrent recovery method is developed. Concurrent recovery relaxes the mutual-exclusion requirement and allows simultaneous recovery from deadlocks. For k-ary n-cube meshes, the proposed scheme does not necessitate any additional resource cost; in fact, it eliminates the Token logic completely.
For k-ary n-cube toroidal networks, an additional central buffer is required. With concurrent recovery, Disha is no longer susceptible to single-point failures due to Token propagation, capture and release. Parallel recovery from single deadlock cycles and simultaneous recovery from multiple deadlock cycles are now possible with Disha.

Concurrent Recovery Methodology for Disha:

The simple methodology we use for designing Disha free from the requirement of mutually exclusive access to the deadlock-free lane is presented here. For every deadlock cycle, eliminating one single dependency is sufficient to break that cycle. The first step towards achieving this goal is to lay out a criterion for selecting the packet(s) that will break the dependency cycle. Second, care should be taken to ensure that at least one packet in every dependency cycle satisfies this criterion. To achieve this, we make use of inherent information in deadlocked cycles. No global knowledge is necessary, and packets are chosen in a distributed manner to break cycles. It can be shown that under minimal routing there is at least one packet in every deadlock cycle with a neighboring node label less than its destination label, and at least another packet with a neighboring node label greater than its destination label. Now, for example, only the packet(s) that has a neighboring node label less than its destination could be selected to break the cycle. To ensure that these selected packets are routed in a deadlock-free manner, we construct a Hamiltonian path on the k-ary n-cube mesh using the Deadlock Buffers similar to the one shown for the 2D mesh in Figure 30. The Deadlock Buffer at a node can be used only by a deadlocked packet at a neighboring node for which the packet's destination has a higher label than the node to which the buffer belongs. A packet, once placed on a Deadlock Buffer, is restricted to using only Deadlock Buffers at subsequent nodes until it is delivered.
Successive Deadlock Buffers can be used only in increasing label order. Hence, this path allows any node to reach any other node whose label is higher than its own. It is easy to see that this path is acyclic and routing on the Deadlock Buffers is deadlock-free. If every cycle has at least one packet that satisfies this condition, then it is ensured that every cycle can be broken.

Fig. 30: Hamiltonian path for a 5 by 5 mesh.

A different algorithm has been developed in [3] using the basic idea described above. As packets on the Deadlock Buffer lane are only able to reach higher-labelled destinations, a routing subfunction defined on the Deadlock Buffers is not connected. With subtle changes, a connected routing subfunction is formed that remains deadlock-free. With these changes, every packet can be selected to break the cycle, giving the scheme improved flexibility. Modifications to the theory in [13, 15] are also made to provide the framework to verify a recovery scheme like Disha with both central and edge buffers. This theory is then applied to formally show that the proposed algorithm is deadlock-free. In this chapter, only the selection-based concurrent recovery scheme is explained.

For k-ary n-cube toroids, it is more complex to prove that there is at least one packet in every deadlocked cycle with a neighboring node label less than its destination. This is left as future work. As of now, two Deadlock Buffers are required at a router. Two separate Hamiltonian paths are constructed using these buffers, one to be used if the current node label is greater than the destination label, and the other if it is less. Routing on these paths is strictly in increasing or decreasing order. Recovery is safely achieved. It should be noted, however, that deadlocked packets no longer follow minimal paths. Since deadlocks are extremely rare, very few packets are subjected to increased path lengths.
Since routing is wormhole, the effect on message latencies will be negligible. The proof of the fact that there is at least one packet with a neighboring node label less than its destination for a 2D mesh under minimal routing is given below. The idea can be applied to any k-ary n-cube mesh topology, even in the presence of limited misrouting. Some examples of these cases are presented later.

Characterizing a Deadlocked Configuration: Definitions and Theorem:

Consider a 2D mesh with m rows and n columns. For the following definitions and theorem, the coordinates of the source node, current node and destination node are denoted by (sx, sy), (cx, cy) and (dx, dy), respectively.

Definition 1: The label of a node i with coordinates (ix, iy), ∀ x ∈ [1, m] and ∀ y ∈ [1, n], is given by the following Hamiltonian assignment:

π[i] = π[x, y] = n(x - 1) + y   if x is odd
                 nx - y + 1     if x is even

Row numbers increase from bottom to top and column numbers increase from left to right. This labelling for a 5 x 5 mesh is illustrated in Figure 31.

Fig. 31: Node labelling in a mesh.

Definition 2: An interconnection network I is a strongly connected directed multigraph, I = G(N, C). The vertices of this multigraph N represent the set of processing nodes and the edges C represent the set of communication channels connecting the nodes.

Definition 3: A dependency cycle is formed by a circular wait of packets requesting a resource while at the same time holding onto others. A dependency cycle leads to a deadlocked configuration in which no flit is one step away from its destination and none can advance because the queue on the next channel is full.
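The Hamiltonian labelling of Definition 1 is easy to compute. A minimal sketch (the function name is ours):

```python
# pi[x, y] of Definition 1 for a mesh with n columns: odd rows are
# labelled left to right, even rows right to left (boustrophedon).
def label(x, y, n):
    if x % 2 == 1:
        return n * (x - 1) + y
    return n * x - y + 1

# Row 1 of a 5 x 5 mesh is labelled 1..5; row 2 runs back from 10 to 6,
# matching the snake-like path of Figure 31.
print([label(1, y, 5) for y in range(1, 6)])  # [1, 2, 3, 4, 5]
print([label(2, y, 5) for y in range(1, 6)])  # [10, 9, 8, 7, 6]
```

Successive labels are always adjacent nodes, which is what makes the labelling a Hamiltonian path.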
Definition 4: Minimal routing between source node s and destination node d specifies that the next node n(c) belongs to the set N(c) given by the following:

if cx > dx, then
N(c) = {(cx - 1, cy), (cx, cy + 1)}   for cy < dy
       {(cx - 1, cy)}                 for cy = dy
       {(cx - 1, cy), (cx, cy - 1)}   for cy > dy

if cx = dx, then
N(c) = {(cx, cy + 1)}   for cy < dy
       {(cx, cy - 1)}   for cy > dy

if cx < dx, then
N(c) = {(cx, cy + 1), (cx + 1, cy)}   for cy < dy
       {(cx + 1, cy)}                 for cy = dy
       {(cx, cy - 1), (cx + 1, cy)}   for cy > dy

Definition 5: The neighboring node nh(c) of some current node c belongs to the set Nh(c) given by Nh(c) = {(cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)}.

Before proposing our theorem, we first make some interesting observations. In a 2D mesh there exists a total of eight 90-degree turns (see Figure 32) which can form different types of dependency cycles leading to a deadlocked configuration.

Fig. 32: (a) Clockwise and (b) counter clockwise cycles.

We classify these dependency cycles into three basic categories, depending on the turns used: clockwise dependency cycles (Figure 32a), counter clockwise dependency cycles (Figure 32b) and abstract dependency cycles. Abstract cycles are formed when a prohibited turn in a clockwise (counter clockwise) cycle is emulated by three counter clockwise (clockwise) turns. We classify abstract cycles into four subclasses: clockwise left, clockwise right, counter clockwise left and counter clockwise right. These cycles are shown in Figure 33. Other cycles are supersets of these basic categories.

Fig. 33: Abstract cycle types (clockwise left, clockwise right, counter clockwise left, counter clockwise right).

In previous work [25], it has been shown that a routing algorithm is deadlock-free if a single turn is eliminated from each of these three classes of cycles. In the theorem to follow, we simply make certain assertions about the packets involved in cycles.
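Definition 4's case analysis collapses to a simple rule: move one hop toward the destination in any dimension where the coordinates still differ. A sketch of the minimal next-node set (the function name is ours), equivalent to the cases above:

```python
# Minimal next-node set N(c) of Definition 4 for a 2D mesh,
# with coordinates written (row, column) as in the text.
def next_nodes(c, d):
    (cx, cy), (dx, dy) = c, d
    n = set()
    if cx > dx:
        n.add((cx - 1, cy))
    if cx < dx:
        n.add((cx + 1, cy))
    if cy > dy:
        n.add((cx, cy - 1))
    if cy < dy:
        n.add((cx, cy + 1))
    return n

# cx > dx and cy < dy: two minimal choices, as in the first case above.
print(sorted(next_nodes((3, 3), (1, 5))))  # [(2, 3), (3, 4)]
# cx = dx: only the column dimension can advance.
print(sorted(next_nodes((2, 2), (2, 5))))  # [(2, 3)]
```

The theorem that follows reasons about exactly these sets, together with the full neighbor set Nh(c) of Definition 5.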
The next section deals with how cycles can be broken using results from the theorem.

Theorem: With minimal routing, there is at least one packet in a deadlocked configuration with a neighboring node label less than its destination node label (in most cases the neighboring node is also the next node on the minimal path).

Proof: The proof is presented by analyzing all cases.

Case 1: Clockwise Cycle:

We first consider a clockwise deadlocked configuration similar to the one shown in Figure 32a and prove our theorem for this case. The existence of a clockwise deadlocked configuration implies that packets are involved in making turns T1, T2, T3 and T4. Consider turn T1, which is a dependency formed by a packet travelling in the upward direction waiting to make a right turn. The absence of this turn or dependency would contradict the assumption that a clockwise deadlocked configuration exists. From the definition of minimal paths it can be seen that sx < dx and sy < dy.

For the case of cx < dx and cy < dy, the set of next nodes is given by N(c) = {(cx, cy + 1), (cx + 1, cy)}. Consider next node n(c) = (cx, cy + 1). The label of this node is given by

π[n(c)] = n(cx - 1) + cy + 1   if cx is odd
          ncx - cy             if cx is even.

The label of the destination node is given by

π[d] = n(dx - 1) + dy   if dx is odd
       ndx - dy + 1     if dx is even.

Since row and column numbers are integers and cy(max) = n, it is easy to see that π[n(c)] < π[d]. Since the next node is also a neighboring node, there is a packet with a neighboring node label less than its destination.

For the case when cx = dx and cy < dy, or cx < dx and cy = dy, π[n(c)] could be greater than π[d] and there could be no packet with a next node along a minimal path less than the destination. The neighboring nodes of c belong to the set Nh(c) = {(cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)}. Consider the neighboring node nh(c) = (cx - 1, cy). Since sx < dx and cx = dx, cx ≠ 1.
The label of the neighboring node is given by

π[nh(c)] = n(cx - 2) + cy       if cx - 1 is odd (cx even)
           n(cx - 1) - cy + 1   if cx - 1 is even (cx odd).

In any case, π[nh(c)] < π[d]. Therefore there always exists a packet with a neighboring node label less than its destination for the case of a clockwise dependency cycle. An illustration of this case is shown in Figure 34 below.

Fig. 34: An example of Case 1 (reserved and requested channels are distinguished in the legend).

Packet P1 is involved in a turn T1. It is currently at node 18. Since routing is along minimal paths, its destination cannot be less than 15, nor can it be node 21 or 22. If its destination is 23 or greater, it has next nodes 19 and 23 less than its destination node. If, on the other hand, the destination is on the same row, then there is no next node on the minimal path less than the destination. Now, however, neighboring node 13 is less than the destination.

Case 2: Counter Clockwise Cycle:

Assume a counter clockwise deadlocked configuration like the one shown in Figure 32b exists. By analyzing turn or dependency T7 along similar lines, it can be shown that a counter clockwise dependency cycle has at least one packet with a neighboring node label less than its destination.

Case 3: Abstract Cycles:

As mentioned earlier, these cycles are formed when a turn, either in the clockwise or the counter clockwise cycle, is emulated by three appropriate turns in the other cycle. Only the counter clockwise left abstract cycle (shown in Figure 35) is analyzed here. The proofs for the other three abstract cycles follow along similar lines.

Fig. 35: A counter clockwise left abstract cycle.

Assume a counter clockwise left abstract cycle exists. This implies that none of the above turns are prohibited. Consider turn or dependency T4 (T8). Again, from the definition of minimal paths it can be seen that sx < dx and sy > dy. For the case of cx < dx and cy > dy, the next node belongs to the set N(c) = {(cx, cy - 1), (cx + 1, cy)}. Consider n(c) = (cx, cy - 1).
The label of this node is given by

π[n(c)] = n(cx - 1) + cy - 1   if cx is odd
          ncx - cy + 2         if cx is even.

The label of the destination node is given by

π[d] = n(dx - 1) + dy   if dx is odd
       ndx - dy + 1     if dx is even.

It is easy to see that π[n(c)] < π[d]. Therefore there exists a packet for which the next node along the minimal path has a label less than its destination. Since the next node is also a neighboring node, there is a packet with a neighboring node label less than its destination.

For the case when cx = dx and cy > dy, or cx < dx and cy = dy, π[n(c)] could be greater than π[d] and there could be no packet with a next node along a minimal path less than the destination. Consider the neighboring nodes of c, which belong to the set Nh(c) = {(cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)}. Consider the neighboring node nh(c) = (cx - 1, cy). Since sx < dx and cx = dx, cx ≠ 1. The label of the neighboring node is given by

π[nh(c)] = n(cx - 2) + cy       if cx - 1 is odd (cx even)
           n(cx - 1) - cy + 1   if cx - 1 is even (cx odd).

Since cy(max) = n, π[nh(c)] < π[d] in all cases. Therefore, a counter clockwise left abstract cycle has at least one packet with a neighboring node label less than its destination node. An example of this case is shown in Figure 36 below. Packet P3 is involved in a turn T8. The packet is currently at node 22. It has a neighboring node 19 less than its destination.

Fig. 36: An example of Case 3 (reserved and requested channels are distinguished in the legend).

The other three abstract cycles can be analyzed similarly to obtain the same result. This completes the proof of the theorem. Note that only when cx = dx or cy = dy could the neighboring node not be on the minimal path (as it is not a next node).

Concurrent Recovery Methodology for Disha with a Single Deadlock Buffer:

The theorem derived in the previous section allows us to relax the mutual exclusion requirement in Disha without adding any additional resources (e.g., Deadlock Buffers). Presented below is the recovery methodology.
1) Construct a Hamiltonian path using the Deadlock Buffers as shown in Figure 37. This path allows any node to reach other nodes whose label is higher than its own.
2) The Deadlock Buffer at a node can only be used by a packet at a neighboring node for which the destination node has a higher label than the node to which the buffer belongs.
3) A packet once placed on a Deadlock Buffer is restricted to using only Deadlock Buffers at subsequent nodes until it is delivered.
4) Successive Deadlock Buffers can only be used in increasing label order.

Fig. 37: Hamiltonian path for a 5 x 5 mesh.

It is easy to see that this path is acyclic and routing on the Deadlock Buffers is deadlock-free. The theorem presented previously shows that every deadlocked cycle has at least one packet with a neighboring node label less than its destination. This packet, after timing out, can be routed to its destination using the Deadlock Buffers, thus breaking the cycle. Since every cycle can be broken, recovery can be achieved.

Extensions to other topologies:

It is difficult to find a selection criterion for packets on tori. Wraparounds introduce cycles that are no longer formed only by 90-degree turns. It is simple, however, to extend this scheme to k-ary n-cube meshes. For toroids we use two Deadlock Buffers and adopt the following methodology.
1) Divide the buffers into two groups: one to be used if the current node label is greater than the destination node label, and the other to be used if it is less.
2) Form two separate Hamiltonian paths in opposing directions to one another as shown in Figure 37.
3) Number all the nodes on the Hamiltonian path in sequence.
4) Packets follow buffers strictly according to increasing (or decreasing) numbers.

Fig. 37: Hamiltonian paths for a 5 by 5 mesh.

As an optimization, short-cuts could be allowed if possible. Figure 38 below shows a Hamiltonian path; the dotted lines show short-cuts. Movement against the correct flow is never allowed, though.
Fig. 38: Short-cuts are shown in dotted lines.

The proposed implementation of Disha with two Deadlock Buffers is deadlock-free. There are no cycles and there can be no reconfiguration deadlocks. This is because at a router there can be two packets on the Deadlock Buffers only if they are using differing networks (higher and lower). Since each network is individually deadlock-free, one of the packets has to clear, and then the crossbar can be reconfigured to allow the second packet to clear. There is no restriction that says that the crossbar has to be immediately reconfigured. In its simplest form, however, the two-Deadlock-Buffer scheme is fault-intolerant. The assumption here is that packets use either the low-dimension network or the high-dimension network alone and there are no dependencies between the two. This is illustrated in the example below in Figure 39.

Fig. 39: Network with a faulty channel.

If the link shown is faulty, then destination D1 is unreachable. Destinations D2 and Dn (higher up the ladder) have alternate paths. It could be possible to provide improved fault tolerance by adopting a more flexible routing scheme (though not deviating from the basic idea of two Deadlock Buffers). This is done by allowing interdependencies between the two buffer classes. The basic Hamiltonian paths (without short-cuts) are chosen to be exactly opposite to one another. Numbering of buffers (as opposed to nodes) is in one continuous stretch. In the example below in Figure 40, it would be 1 to 72, with end-points shown. Packets always use buffers in increasing order of magnitude. Labels 1 to 36 would form the high-dimension network and 37 to 72 the low-dimension network. For the example, the buffers used would be 8, 56, 57, 58 (to reach D1) and 8, 56, 57, 58, 59, 62 (to reach D3). This gives increased flexibility in routing around faults. Resources (Deadlock Buffers) are always used in increasing sequence of magnitude and routing is deadlock-free.

Fig. 40: Increasing fault tolerance with appropriate node labelling.

In summary, it is possible to have concurrent recovery in Disha. This eliminates the Token required for implementing mutual exclusion in sequential recovery. It also allows recovery from cycles simultaneously. K-ary n-cube meshes require just a single Deadlock Buffer, while toroidal k-ary n-cubes require two Deadlock Buffers.

6 Performance Evaluation

This chapter evaluates the performance of Disha under sequential recovery using a modified version of FLITSIM 2.0*. All simulations are run on a 16 by 16 two-dimensional torus with four virtual channels per physical channel. Equal clock cycle time was assumed for all schemes simulated. Messages are 32 flits long. A buffer depth of two was selected (shallow buffers keep the routers simple and reduce clock cycle time). All algorithms simulated used one injection and one reception channel per node.

This thesis is premised on the notion that deadlocks are rare. Figure 41a verifies this for two widely varying time-out thresholds, 4 and 64. Load-Rate here and in the remainder of the graphs is a fraction of full load, defined as the load at which all channels in the network are used simultaneously (maximum network capacity). The figure shows the number of packets that seized the Token, normalized with respect to the number of packets that were delivered by the network. The number of deadlocks is closely related to the number of packets that seize the Token. Based on these results, it is safe to infer that deadlocks are extremely rare up to the point at which the network saturates. This confirms earlier measurements by Kim et al. [21], which showed that deadlock situations are generally rare.

* FLITSIM was developed by Patrick Gaughan (currently at the University of Alabama) et al., at Georgia Institute of Technology.

Fig. 41a: Frequency of deadlocks.
Fig. 41b: Selection of time-out.

A proper choice of the time-out interval is crucial to Disha's performance. A value of T_out that produces the least number of Token captures is not necessarily the optimum value from a performance point of view. The performance of Disha for two different time-outs is presented in Figure 41b. We find that small time-outs trigger false deadlock detection while large ones unduly delay deadlock detection. Time-outs of 8 and 16 are found to be appropriate, although simulations show that the optimum threshold value depends on the traffic pattern, the message length and the topology. To adapt to different network conditions and topologies, T_out could be made programmable to vary dynamically. This is an area of future research. The default time-out assumed for the remainder of our simulations is 8 clock cycles.

A few words regarding the routing algorithm assumed in our simulations are in order. The selection function selects a free output channel from the set supplied by the routing function. When more than one permissible channel is free, there are several criteria on which to base the selection of the next channel. Random (any free channel) is the easiest to implement but does not always produce the best results. Minimum congestion, wherein the channel in the direction in which the most virtual channels are free is selected, provides a substantial improvement over a random choice. In all our simulations, Dally (Dally & Aoki) was the only scheme simulated with a minimum-congestion selection function. We allowed a random choice (first free lane) for all other schemes. For Turn (Turn Model), we used the Negative-First algorithm (which supposedly gives the best results among those derived using this model [25]).
Duato is based on the conditions derived in [14]. Even with the relaxed set of conditions in [15], we do not expect the resulting performance to reach Disha's. The performance of both Duato and Disha can be substantially improved by using a minimum-congestion selection function, as was done in Dally's.

Uniform Traffic:

Fig. 42: Comparison for uniform traffic (latency vs. load).

Under uniform traffic, each node sends packets to all other nodes with equal probability. Figure 42 compares Disha with preventive schemes under uniform traffic. Disha outperforms all schemes and more than doubles the saturation load. With no misrouting (M=0), latency increases linearly with load. It will be interesting to evaluate the performance with multiple injection channels to see when the network becomes a bottleneck. With a misroute of up to three hops (M=3), Disha saturates at 0.65, and Duato is a distant second at 0.35. The fact that dimension-order outperforms partially adaptive schemes is not unexpected, as it preserves the traffic's uniformity (others [22] have recorded similar results). However, the performance of dimension-order relative to the other schemes is expected to deteriorate as the number of dimensions increases.

Figure 43 evaluates the throughput of the various schemes. The peak throughput for Disha (M=0) is 35% greater than Duato's. Moreover, the network can sustain the peak throughput. For other schemes, messages block faster than they drain and can remain blocked for long periods of time; throughput drops off accordingly.

Fig. 43: Comparison for uniform traffic (latency vs. throughput).

Non-Uniform Traffic:

We next compare Disha with these schemes for various other (regular) traffic patterns. Although the default time-out (8 clock cycles) was assumed for Disha, the performance would only improve if we fine-tuned the time-out interval. Simulations show that with certain non-uniform traffic patterns, the number of Token seizures increases, especially close to the point at which the network saturates. We assume that there exist mechanisms that do not allow the load to go beyond the saturation point, such as throttling, proposed by Dally [12] and Duato [16]. An injection limitation scheme like the one proposed by Duato [16] could be used to fine-tune the performance of Disha so that the probability of deadlock is extremely low. We could also use more reception channels per node. Even without such extensive modifications, Disha consistently outperforms all other schemes for the different traffic patterns, as can be seen from the simulation results presented in this chapter. Some simulations were also done by varying the network load: the load was raised to a point above saturation for some time and then reduced (useful to simulate locality). Even under such cases, Disha performed well, indicating that recovery from the increased number of deadlocks after saturation is quick.

The first of these non-uniform traffic patterns is bit-reversal, compared in Figure 44. Here, a node with coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 sends messages to the node with coordinates a_0, a_1, ..., a_{n-2}, a_{n-1}. With this traffic, nodes along a given row send messages to nodes along a given column. This traffic pattern is ideally suited for fully adaptive algorithms. Disha, as expected, significantly outperforms all other algorithms. Disha (M=0) saturates at around 0.7 and with M=3 at around 0.45. Even after saturation, latency increases at a much slower pace.

Fig. 44: Comparison for bit-reversal (latency vs. load).

Fig. 45: Comparison for bit-reversal (latency vs. throughput).

The peak throughput for Disha (Figure 45) is 50% greater than Duato's and can be sustained.

For the flip-bit traffic pattern, a node with coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 sends messages to the node with coordinates ~a_{n-1}, ~a_{n-2}, ..., ~a_1, ~a_0, where ~a_i = 1 - a_i. Here, nodes along the same row send packets to nodes along different columns within a different row. The plot of latency versus load for this traffic pattern is shown in Figure 46. It appears that dimension-order is best suited for message patterns of this nature. DOR saturates at around 0.3 and Disha (M=3) at around 0.2. Dally and Duato saturate at around 0.15. Turn performs very poorly and saturates at around 0.05.

Fig. 46: Comparison for flip-bit (latency vs. load).

Peak throughput in Disha is measured to be about 15% greater than DOR's but is not sustained. The plot of latency versus throughput is shown in Figure 47.

Fig. 47: Comparison for flip-bit (latency vs. throughput).

For the matrix transpose traffic pattern, a node with coordinates (x, y) sends to node (y, x). The 2D graph can be visualized as a number of concentric squares on which the source and destination pairs lie. The comparison among the various schemes with varying load rate is shown in Figure 48. Disha with no misrouting offers the best results, saturating at around 0.7. Duato and Dally saturate at about 0.3, which is half the load at which Disha saturates. Turn and DOR fare very poorly, with saturation loads less than 0.15.
load). « to o o JJ u 9 0 0 8 b o o C . = 7 0 0 g 6 0 0 3 5 0 0 D^lly 400 DOR 300 200 Disha (M =0) 100 0.06 0.12 Throughput Fig. 45: Com parison for hit-reversa! (latency vs. throughput) The peak throughput for Disha (Figure 45) is 50% greater than Duato and can be sustained. 72 For flip-bit traffic pattern, a node with coordinates an.|, a , ^ ..... a t , ag sends messages to node (-an. (), (-an_ 2 ).....(~a(), (-ag), where (-aj) = 1 - a;. Here, nodes along the same row send packets to nodes along different columns within a different row. The plot of latency versus load for this traffic pattern is shown in Figure 46. It appears that for message patterns of this nature dimension-order is best suited. DOR saturates at around 0.3 and Disha (M =3) at around 0.2. Dally and Duato saturate at around 0.15. Turn performs very poorly and saturates at around 0.05. 1000 JJ 900 -X jj eo o u = 700 £ 600 3 500 Turn 300 DOR zo o 10 0 : 0.4 0.5 Load-Rate 0 .3 Fig. 46: Com parison for flip - bit (latency vs. load) Peak throughput in Disha is measured to be about 15% greater than DOR but is not sustained. The plot of latency versus throughput is shown in Figure 47. 73 ^ 1000 _ 4 > u 300 -* g 0 0 0 G .5 700 >. y g B O O 5 5 0 0 40(3 300 2 0 0 100 0.01 0.03 0.04 0.05 0.02 0.06 0.07 008 0.09 T h ro u g h p u t Fig. 47: C o m p a riso n fo r flip - bit (la te n c y vs. th ro u g h p u t) For the matrix transpose traffic pattern, a node with coordinates (x,y) sends to node (y,x). The 2D graph can be visualized as a number of concentric squares on which the source and destination pairs lie. The comparisons among various schemes with varying load rate is shown in Figure 48. Disha with no misrouting offers the best results, saturating at around 0.7. Duato and Dally saturate at about 0.3 which is half the load at which Disha saturates. Turn and DOR fair very poorly with saturation loads less that 0.15. 
Here too, as for the traffic patterns compared so far, misrouting was found to be detrimental in Disha. Figure 49 compares the schemes for latency versus throughput. The peak throughput for Disha is 50% greater than Duato's, but this peak value is not sustained.

[Fig. 48: Comparison for matrix transpose (latency vs. load)]

[Fig. 49: Comparison for matrix transpose (latency vs. throughput)]

For the perfect shuffle traffic pattern, a node with coordinates (a_{n-1}, a_{n-2}, ..., a_1, a_0) sends messages to the node with coordinates (a_{n-2}, a_{n-3}, ..., a_0, a_{n-1}). For such patterns, it is expected that the different schemes should converge with respect to the saturation point. Figure 50 has the comparison for varying load. Disha saturates at 0.25. The other schemes, with the exception of Turn, which follows Disha asymptotically, saturate at around 0.15; DOR fares poorly, saturating at 0.1. For this traffic pattern only, Turn provides results comparable to the best; it is significantly outperformed for all other traffic patterns.

[Fig. 50: Comparison for perfect shuffle (latency vs. load)]

Latency versus throughput for the different schemes is shown in Figure 51. Peak throughput in Disha is about 20% greater than that in Duato and is not sustained.

[Fig. 51: Comparison for perfect shuffle (latency vs. throughput)]

Finally, a comparison is made for hot spot traffic conditions in Figures 52 and 53. In these simulations, we assumed that up to 5% of the network traffic is hot spot in nature and that there is a single hot spot location.
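The perfect-shuffle destination amounts to rotating the n-bit source address left by one position. A minimal illustrative sketch (again ours, not the simulator's code):

```python
def perfect_shuffle(addr, n):
    """Perfect-shuffle destination: rotate the n-bit address left by one,
    i.e. a_{n-1} a_{n-2} .. a_0  ->  a_{n-2} .. a_0 a_{n-1}."""
    msb = (addr >> (n - 1)) & 1              # bit that wraps around
    return ((addr << 1) & ((1 << n) - 1)) | msb
```

With n = 4, address 0b1010 maps to 0b0101; rotating n times returns the original address, so repeated shuffles cycle through at most n distinct destinations.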
Hot spot traffic causes early saturation for all schemes (Figure 52). Nevertheless, Disha provides the best results. The performance of Disha with M=3 is only slightly better than that of Duato, and with M=0, Disha is outperformed by all other schemes. The result that misrouting is beneficial for hot spots is interesting: with hot spots and no misrouting, the deadlock count is observed to increase exponentially. Misrouting therefore aids our recovery scheme by routing packets around deadlock congestion.

[Fig. 52: Comparison for hot spot (latency vs. load)]

[Fig. 53: Comparison for hot spot (latency vs. throughput)]

Peak throughput in Duato is 25% greater than in Disha with M=3, but it comes at the expense of very high latencies (Figure 53). Peak throughput is not sustained in any scheme.

Summary: Disha provides excellent performance. On average, it doubles the saturation load and improves throughput by 50% over the best deadlock avoidance scheme. Even after saturation, latency increases at a much slower rate than in the other schemes. The performance increase is largely due to the fact that in Disha, more virtual channels are available to packets at each node than other schemes allow. There is no classification of virtual channels, nor is there any ordering among them. The Deadlock Buffer is the escape path, and because deadlocks are rare it is not a bottleneck. These results are obtained without any fine-tuning of Disha. The frequency of deadlocks can be reduced by increasing the number of reception channels at nodes to drain packets quickly. Injection limitation schemes such as throttling, which keep the load below the saturation point, could reduce deadlocks even further.
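The hot spot workload can be modeled by redirecting a small fraction of packets to a single fixed node. The sketch below is illustrative: the 5% fraction and single hot location follow the simulation setup above, but the function name and interface are our own invention.

```python
import random

def pick_destination(src, nodes, hot_node, hot_fraction=0.05, rng=random):
    """With probability hot_fraction, send to the hot node; otherwise
    pick a uniformly random destination other than the source."""
    if rng.random() < hot_fraction:
        return hot_node
    dest = rng.randrange(len(nodes))
    while nodes[dest] == src:                # never send to yourself
        dest = rng.randrange(len(nodes))
    return nodes[dest]
```

Note that the hot node's total arrival rate is hot_fraction plus its uniform share, which is what concentrates contention there and drives the early saturation seen in Figure 52.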
Simulation results demonstrate that with misrouting, the performance of Disha deteriorates except when there are hot spots in the network. The performance degradation with misrouting can be attributed to blocked packets holding more network resources under wormhole switching, which both increases the probability that packets deadlock and consumes valuable network bandwidth. Because recovery from deadlocks is sequential in our simulations, a significant drop in performance results. We therefore conclude that misrouting is not beneficial and should be restricted to bypassing faults or hot spot congestion. Intelligent routers could adjust routing flexibility (degree of adaptivity, misrouting, etc.) based on local conditions or perhaps user-defined traffic hints; this remains future work.

The performance of Dally and Duato is very similar. Duato eliminates cycles from the extended channel dependency graph, whereas Dally eliminates cycles in the packet wait-for graph. Duato allows packets on escape channels to come back onto the adaptive channels whereas Dally does not, giving Duato more flexibility in how escape paths are used. On the other hand, in Dally, blocked packets resort to the heavily congested escape channels only when absolutely necessary, giving it more flexibility in when escape paths are used. Also, Dally was simulated with minimum congestion. These factors may account for the differences and similarities in their performance results.

7 Conclusions and Future Work

The performance of interconnection networks can be improved by pipelining packet transmission and using adaptive routing. Wormhole switching, an efficient pipelined switching technique, is unfortunately prone to deadlock because each packet usually holds several channels simultaneously. Until now, most proposals have focused on deadlock avoidance. As deadlocks are relatively rare, deadlock recovery techniques are a viable alternative to deadlock avoidance.
The main advantage of recovery techniques is that no resources are "wasted" on avoiding situations that occur only infrequently. Previously proposed recovery techniques are based on abort-and-retry, which increases latency considerably. Disha is unique among recovery techniques in that it does not kill packets on deadlock; instead, it supplies alternative paths to deliver deadlocked packets. Unlike in deadlock avoidance techniques, these paths are not dedicated. The result is increased sharing of network resources for enhanced efficiency and improved performance, as confirmed in simulations.

Disha provides the means for safely incorporating fully adaptive wormhole routing to support efficient and fault-tolerant communication in a parallel processing environment. The scheme is universal in that it can be applied to any arbitrary network topology. It employs deadlock recovery as opposed to prevention, with the objective of making the common case faster. Consequently, it does not require virtual channels for deadlock freedom. If, on the other hand, virtual channels are implemented to improve flow control, they are used efficiently for both adaptive routing and deadlock avoidance. Hence, hardware devoted to recovery is minimal, enabling the routers to be fast. This novel idea proves to be very effective, as confirmed by exhaustive simulation. Simulations show that the number of potential deadlock situations is very small up to the point at which the network saturates. Simulations also show that our deadlock recovery scheme, which uses all of its virtual channels for fully adaptive routing, performs significantly better than deadlock preventive schemes, which must devote some virtual channels to deterministic routing. We also find that if enough virtual channels are provided to match the traffic flow demands of the network, then Disha provides excellent performance.
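The recovery decision at a router can be sketched in a few lines: a packet blocked for longer than the time-out is presumed deadlocked, and under sequential recovery the router must also hold the circulating Token before switching the packet onto the deadlock-buffer lane. The following Python stub is a simplified illustration under our own naming (`Router`, `try_capture_token`, `TIMEOUT`, etc.); it is not the hardware design described in the thesis.

```python
TIMEOUT = 8  # default time-out from the simulations, in clock cycles

class Packet:
    def __init__(self):
        self.blocked_cycles = 0

class Router:
    """Minimal stub capturing only the recovery decision logic."""
    def __init__(self, token_holder=True):
        self.token_holder = token_holder  # sequential recovery: one Token network-wide
        self.recovered = []
        self.forwarded = []
        self.blocked = True               # stub: all adaptive channels busy

    def can_advance(self, packet):
        return not self.blocked

    def forward(self, packet):
        self.forwarded.append(packet)     # normal true fully adaptive routing

    def try_capture_token(self):
        return self.token_holder          # mutual exclusion on the escape lane

    def route_via_deadlock_buffer(self, packet):
        # progressively route through the central deadlock buffers
        self.recovered.append(packet)

def on_cycle(packet, router):
    """Per-cycle decision for a blocked packet header (illustrative)."""
    if router.can_advance(packet):
        packet.blocked_cycles = 0
        router.forward(packet)
    else:
        packet.blocked_cycles += 1
        if packet.blocked_cycles >= TIMEOUT and router.try_capture_token():
            router.route_via_deadlock_buffer(packet)
```

The key design point visible even in this sketch is that the common case (the `can_advance` branch) involves no deadlock-related bookkeeping at all; the escape lane is consulted only after a sustained stall.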
This is because all virtual channels are used for adaptive routing rather than deadlock prevention, and the routing complexity is minimal, resulting in a fast router clock. Disha also performs well under bursty and hot spot traffic loads and provides maximum fault tolerance capability.

There are many directions in which this work can be extended. With two central buffers, it is possible to extend this technique to toroidal n-cubes; it might also be possible to accomplish the same with no additional buffer resources. Another interesting area to explore is increasing the fault-tolerance capabilities of Disha so that, in the presence of faults, there is at least one alternate path on the deadlock-free lane from any node to any other node. (Note: this is already true for non-deadlock-free lanes, and for deadlock-free lanes when the current node and destination node are not along the same line.) It would also be interesting to see by how much performance is enhanced with concurrent recovery [3], which allows recovery from a number of simultaneous deadlock occurrences. Dynamically varying the degree of adaptivity, the time-out, and the misroute count depending on the source and destination addresses and local traffic conditions could also be an interesting topic. Reducing the probability of deadlocks is another related area of interest: it has been observed that the number of deadlocks depends on the topology, and some routing algorithms are more sensitive to deadlocks than others. Varying the adaptivity depending on the source and destination addresses could also prove very beneficial; the source router, based on the destination address (the type of traffic is not important), could appropriately set the misroute count. A full analysis of these issues will be explored in future work.
Finally, a complete study characterizing deadlocks is necessary to extract maximum performance from a recovery scheme such as Disha. As has been shown in this thesis, the performance of Disha depends on the selection of a suitable time-out interval, and the optimum time-out value differs with message length, buffer depth, traffic pattern, topology, etc. Such a study would help in fine-tuning the system to obtain peak performance.

References

[1] J. D. Allen et al. Ariadne - An Adaptive Router for Fault-tolerant Multicomputers. In Proceedings of the 21st Annual International Symposium on Computer Architecture, IEEE Computer Society, pages 278-288, April 1994.
[2] Anjan K. V. and Timothy Mark Pinkston. DISHA: A Deadlock Recovery Scheme for Fully Adaptive Routing. In Proceedings of the 9th International Parallel Processing Symposium, IEEE Computer Society, April 1995.
[3] Anjan K. V., Timothy Mark Pinkston, and Jose Duato. Concurrent Deadlock Recovery in Disha. Submitted to the International Conference on Computer Design, October 1995.
[4] Anjan K. V. and Timothy Mark Pinkston. An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, IEEE Computer Society, June 1995.
[5] Andrew A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. In Proceedings of the 19th International Symposium on Computer Architecture, IEEE Computer Society, pages 268-277, May 1992.
[6] Andrew A. Chien. A Cost and Speed Model for k-ary n-cube Wormhole Routers. In Proceedings of the Symposium on Hot Interconnects, IEEE Computer Society, August 1993.
[7] K. Aoyama. Design Issues in Implementing an Adaptive Router. Master's Thesis, University of Illinois, Department of Computer Science, 1304 W. Springfield Avenue, Urbana, Illinois, January 1993.
[8] P. Berman, L. Gravano, G. Pifarre, and J. Sanz. Adaptive Deadlock- and Livelock-Free Routing with All Minimal Paths in Torus Networks. In Proceedings of the Symposium on Parallel Algorithms and Architectures, 1992.
[9] R. V. Boppana and S. Chalasani. A Comparison of Adaptive Wormhole Routing Algorithms. In Proceedings of the 20th International Symposium on Computer Architecture, IEEE Computer Society, pages 351-360, May 1993.
[10] W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, 36(5):547-553, May 1987.
[11] W. Dally. Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194-205, March 1992.
[12] W. Dally and H. Aoki. Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466-475, April 1993.
[13] J. Duato. On the Design of Deadlock-Free Adaptive Routing Algorithms for Multicomputers: Design Methodologies. In Proceedings of Parallel Architectures and Languages Europe, pages 390-405, June 1991.
[14] J. Duato. A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(12):1320-1331, December 1993.
[15] J. Duato. A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks. In Proceedings of the International Conference on Parallel Processing, CRC Press, pages 1142-1149, August 1994.
[16] J. Duato. Deadlock-Free Adaptive Routing Algorithms for the 3D-Torus: Limitations and Solutions. In Proceedings of Parallel Architectures and Languages Europe 93, June 1993.
[17] J. Duato. Improving the Efficiency of Virtual Channels with Time-Dependent Selection Functions. In Proceedings of Parallel Architectures and Languages Europe 92, June 1992.
[18] Patrick T. Gaughan and Sudhakar Yalamanchili. Adaptive Routing Protocols for Hypercube Interconnection Networks. IEEE Computer, pages 12-22, May 1993.
[19] K. D. Gunther. Prevention of Deadlocks in Packet-Switched Data Transport Systems. IEEE Transactions on Communications, COM-29(4):512-524, April 1981.
[20] Parviz Kermani and Leonard Kleinrock. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks 3, North-Holland Publishing Company, pages 267-286, 1979.
[21] J. Kim, Z. Liu, and A. Chien. Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing. In Proceedings of the 21st International Symposium on Computer Architecture, IEEE Computer Society, pages 289-300, April 1994.
[22] Z. Liu and A. Chien. Hierarchical Adaptive Routing: A Framework for Fully Adaptive and Deadlock-Free Wormhole Routing. To appear in the Symposium on Parallel and Distributed Processing, 1994.
[23] S. Konstantinidou and L. Snyder. Chaos Router: Architecture and Performance. In Proceedings of the 18th International Symposium on Computer Architecture, pages 212-221, May 1991.
[24] D. Linder and J. Harden. An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes. IEEE Transactions on Computers, 40(1):2-12, January 1991.
[25] L. Ni and C. Glass. The Turn Model for Adaptive Routing. In Proceedings of the 19th International Symposium on Computer Architecture, IEEE Computer Society, pages 278-287, May 1992.
[26] L. Ni and P. K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE Computer, pages 62-76, February 1993.
[27] Douglas S. Reeves, Edward F. Gehringer, and Anil Chandiramani. Adaptive Routing and Deadlock Recovery: A Simulation Study. In Proceedings of the 4th Conference on Hypercube Concurrent Computers and Applications, March 1989.