DYNAMIC PACKET FRAGMENTATION FOR INCREASED VIRTUAL CHANNEL
UTILIZATION AND FAULT TOLERANCE IN ON-CHIP ROUTERS
by
Young Hoon Kang
______________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2011
Copyright 2011 Young Hoon Kang
Acknowledgements
This dissertation could not have been written without Dr. Jeff Draper who not only
served as my advisor but also encouraged and challenged me throughout my academic
program. The other dissertation committee members, Dr. Aiichiro Nakano and Dr.
Timothy Pinkston, also guided me and gave me precious feedback throughout the
dissertation process. Additionally, Dr. Sandeep K. Gupta and Dr. Alice C. Parker served
on my guidance committee, and I appreciate their invaluable advice.
I would like to acknowledge and show my appreciation to the past/present group
members at ISI who gave me valuable comments on my qualifier and defense dry runs:
Bilal Zafar, Riaz Naseer, Rashed Z. Bhatti, Sumit Mediratta, Mahta Haghi, Spundun
Bhatt, Lihang Zhao, Aditya Deshpande, Fatemeh Kashfi, Gopi Neela, Jeff Sondeen, and
Tim Barrett.
I would like to thank all my USC/ISI friends who supported and encouraged me
through the duration of my studies: TJ Kwon, Woojin Choi, Yongjin Cho, Junsoo Kim,
Kyujin Oh, Hyunmin Cho, Jaehan Koh, Jaejun Lee, Wooyung Lee, Taewoo Kim,
Sooyong Park, Seokchul Kwon, Chris Lee, and many others, for being good friends.
Most especially, I express my sincere gratitude to my family: my mother, father, and
younger brother, who have always been a tremendous source of encouragement. Their
patient love enabled me to complete this dissertation. I dedicate this dissertation to them.
Table of Contents
Acknowledgements ............................................................................................................. ii
List of Tables ..................................................................................................................... vi
List of Figures ................................................................................................................... vii
Abstract ........................................................................................................................ ix
Chapter 1 Introduction ..................................................................................................... 1
1.1 Overview 1
1.2 Introduction 4
1.3 Research contribution and impact 7
1.4 Organization 8
Chapter 2 Related Work ................................................................................................... 9
2.1 Introduction 9
2.2 Flow control basics 9
2.2.1 Wormhole router 9
2.2.2 Virtual-channel router 10
2.3 Recently proposed schemes 10
2.3.1 Schemes for enhanced resource utilization 10
2.3.2 Schemes for hardware-based multicast routing 13
2.3.3 Schemes for soft-error handling in NoC routers 17
Chapter 3 Dynamic Packet Fragmentation for Increased VC Utilization in
On-Chip Routers ............................................................................................ 20
3.1 Overview 20
3.2 Introduction 20
3.3 Related work 23
3.3.1 Flit-reservation flow control 23
3.3.2 Flow aware allocation 23
3.3.3 Unified buffer structure 24
3.3.4 Adaptive routing 24
3.4 Baseline router 25
3.4.1 Pipeline stages 26
3.5 Dynamic packet fragmentation 28
3.5.1 Fragmentation at credit stall 31
3.5.2 Fragmentation at buffer empty stall 33
3.5.3 Proposed router 36
3.6 Analytical model 38
3.7 Simulation results 41
3.7.1 Synthetic traffic 42
3.7.2 Performance 43
3.7.3 Place and route 51
3.8 Conclusions 55
Chapter 4 Multicast Routing with Dynamic Packet Fragmentation .............................. 57
4.1 Overview 57
4.2 Introduction 57
4.3 Multicast routing schemes 59
4.4 Implementing tree-based multicast routing for write invalidation messages
in networks-on-chip 61
4.4.1 Invalidation schemes 64
4.4.2 Proposed router architecture 65
4.4.3 Evaluation 70
4.4.4 Conclusions 74
4.5 Router architecture and operation 74
4.5.1 Deadlock avoidance with dynamic packet fragmentation 76
4.6 Evaluation 80
4.7 Conclusions 83
Chapter 5 Fault-Tolerant Flow Control in On-Chip Networks ...................................... 84
5.1 Overview 84
5.2 Introduction 85
5.3 Soft error handling 87
5.3.1 Proposed router architecture 89
5.3.2 Fault-tolerant flow control 91
5.4 Evaluation 97
5.4.1 Performance 98
5.4.2 Link error analysis 100
5.4.3 Intra-router error analysis 104
5.4.4 Place & route 107
5.5 Related work 111
5.6 Conclusions 112
Chapter 6 Conclusions ................................................................................................. 114
References ..................................................................................................................... 118
List of Tables
Table 3-1 Design evaluation parameters ........................................................................... 42
Table 3-2 Router synthesis results .................................................................................... 52
Table 4-1 Design evaluation parameters ........................................................................... 70
Table 4-2 Area and timing analysis .................................................................................. 71
Table 4-3 Latency and energy for synthetic workload ..................................................... 72
Table 4-4 Design evaluation parameters ........................................................................... 81
Table 5-1 Design evaluation parameters ........................................................................... 98
Table 5-2 Net percentages of datapath components ....................................................... 105
Table 5-3 Toggle coverages ............................................................................................ 105
List of Figures
Figure 2-1 Multicast routing schemes............................................................................... 15
Figure 3-1 Pipeline stages of the baseline router .............................................................. 26
Figure 3-2 Bypassing in Flip-Flop based buffers ............................................................. 28
Figure 3-3 Example of flit stalls ....................................................................................... 29
Figure 3-4 Packet fragmentation at credit stall ................................................................. 32
Figure 3-5 Packet fragmentation at buffer empty stall ..................................................... 34
Figure 3-6 After the fragmentation of Figure 3-3(b) ........................................................ 35
Figure 3-7 Pipeline stages of the proposed router ............................................................ 38
Figure 3-8 Average latency ............................................................................................... 44
Figure 3-9 Fragmentation rate .......................................................................................... 46
Figure 3-10 VC utilization ................................................................................................ 47
Figure 3-11 Performance results of hotspot traffic ........................................................... 51
Figure 3-12 Fragmentation router layout .......................................................................... 52
Figure 3-13 Average power consumption......................................................................... 54
Figure 3-14 Energy consumption...................................................................................... 56
Figure 4-1 Three invalidation routing schemes: (a) unicast (base), (b) dual-path,
(c) tree-based (Figure format is brought from [43]) ................................................... 65
Figure 4-2 Head flit encoding formats with 128-bit flit size. A type field indicates
the flit as head, body, tail, and head-tail .................................................................... 66
Figure 4-3 Proposed router pipeline (BW/RC: buffer write & routing
computation, SA/VA: switch allocation & virtual channel allocation, ST:
switch traverse, LT: link traverse) ............................................................................. 67
Figure 4-4 Describes how the multidestination packet is copied and forwarded to
different directions, considering a multicast message from router 5 to router 3,
4, 10 and 15 in 4x4 mesh network ............................................................................. 69
Figure 4-5 Implemented router layout .............................................................................. 73
Figure 4-6 Input unit ......................................................................................................... 75
Figure 4-7 Deadlock ......................................................................................................... 77
Figure 4-8 Deadlock avoidance in asynchronous replication ........................................... 79
Figure 4-9 Deadlock avoidance in synchronous replication ............................................. 80
Figure 4-10 (a) Performance (b) Multicast router layout (c) Relative energy
consumption ............................................................................................................... 83
Figure 5-1 Proposed router microarchitecture and pipeline stages (preRC: pre
routing computation, SAVA: switch & VC allocation, ST: switch traversal) ........... 90
Figure 5-2 Flow control schemes ...................................................................................... 93
Figure 5-3 Performance .................................................................................................... 99
Figure 5-4 Performance in link errors ............................................................................. 101
Figure 5-5 Fragmented packets ....................................................................................... 103
Figure 5-6 Error coverage in datapath components ........................................................ 107
Figure 5-7 Router layouts ............................................................................................... 108
Figure 5-8 Power consumption ....................................................................................... 110
Figure 5-9 Energy per packet .......................................................................................... 111
Abstract
Networks-on-Chip (NoCs) have been suggested as a scalable communication solution for
many-core architectures. As the number of System-on-Chip (SoC) cores increases, power
and latency limitations make conventional buses increasingly unsuitable. Buses are
appropriate for small-scale designs but cannot support scaled performance as the number
of on-chip cores increases. In contrast, NoCs offer fundamental benefits of high
bandwidth, low latency, low power and scalability.
NoCs have evolved providing high performance routers with good resource sharing,
multicast routing, and fault tolerance through various techniques. Although many prior
research efforts have suggested viable techniques for tackling challenges in NoC design,
none have proposed a simple underlying technique that addresses resource sharing,
multicast routing, and fault tolerance. This Ph.D. dissertation proposes dynamic packet
fragmentation, a technique that spans multiple NoC research domains and enables viable
solutions to challenging issues in on-chip interconnection networks with minimal
hardware overhead. Dynamic packet fragmentation addresses a broad
range of subjects from performance to fault handling. A proposed router using this
technique is shown to increase virtual channel (VC) utilization for performance
improvement, provide deadlock avoidance in tree-based multicast routing, and support
fault-tolerant flow control for fault handling.
Using this technique, a packet is fragmented when certain blocking scenarios are
encountered, and the VC allocated to the blocked packet is then released for use by other
packets. The resulting efficient VC utilization provides more flexible flow control,
preventing a blocked VC from propagating congestion to adjacent routers. In tree-based
multicast routing, fragmentation enables deadlock-free tree-based multicast routing since
it resolves cyclical dependencies in resource allocation through packet fragmentation.
Fragmentation frees resources that may be required by blocked branches of other
multicast packets. In fault-tolerant flow control, packet fragmentation helps to recover
faulty flits through a link-level retransmission. The proposed fault-tolerant scheme
ensures error-free transmission on a flit basis, using dynamic packet
fragmentation upon error detection. Fragmentation renews the state information in control
planes through a VC reallocation, preventing corrupted states from affecting the rest of
the flits. Thus, the proposed router disengages flits from the faulty flit and safeguards the
flits following a faulty flit.
The implemented fragmentation router is evaluated through various simulation
experiments with synthetic workloads. Performance benefits are demonstrated compared
to a baseline router, and accurate power and area measurements are analyzed from a
placed & routed layout. The results demonstrate that the fragmentation router improves
latency by up to 30% and throughput by up to 75%, efficiently utilizing VCs while also
saving energy. In error-sensitive environments, the
fragmentation router provides a remarkable level of reliability and is observed to perform
well, gracefully degrading while exhibiting 97% error coverage in datapath elements.
Thus, the reduced packet latency and increased throughput justify the fragmentation
router as a suitable choice for future NoC designs.
Chapter 1 Introduction
1.1 Overview
Modern microprocessor designs have continued to increase performance, overcoming
limitations of uniprocessor designs. Uniprocessor design was popular for decades,
exploiting frequency scaling, instruction level parallelism, out-of-order issue, and
aggressive branch prediction [58] to accomplish incredible performance improvements
with each technology generation. However, the uniprocessor design field has recently
experienced major challenges in terms of power dissipation, limited amount of
parallelism, and design complexity. Chip multiprocessors have emerged as solutions to
these challenges, exploiting more parallelism as the number of processing cores
increases.
Transistor feature sizes, which shrink with each technology generation, enable the
integration of multiprocessors and a corresponding communication fabric on a single
chip. Bus-based communication structures have been widely adopted in System-on-Chip
(SoC) and multiprocessor environments, but they quickly become a communication
bottleneck, as future CMPs are projected to contain hundreds of processor cores on a
single chip [11], [12]. Arbitrating among multiple agents is not trivial as more units are
attached to the bus, and buses are unable to deliver the required bandwidth, thereby
increasing communication latency.
Networks-on-Chip (NoCs), which have been suggested as a scalable communication
solution, provide connectivity to all processor cores and deliver messages with lower
latency in many-core architectures. NoCs satisfy the high demand for on-chip bandwidth
because messages can be routed along different paths. Thus, NoCs promise improved
latency and capacity, the two common measures used for network comparison, as the
future CMP interconnect fabric.
NoCs have evolved providing high performance routers with good resource sharing,
multicast routing, and fault tolerance through various techniques. Although many prior
research efforts have suggested viable techniques for tackling challenges in NoC design,
none have proposed a simple underlying technique that addresses resource sharing,
multicast routing, and fault tolerance. This Ph.D. dissertation proposes dynamic packet
fragmentation, a technique, that handles a broad range of subjects from performance to
fault handling. A proposed router using this technique is shown to increase VC utilization
for performance improvement, provide deadlock avoidance in tree-based multicast
routing, and support fault-tolerant flow control for fault handling.
In conventional packet-switched on-chip routers, a virtual channel (VC) is allocated on a
per-packet basis and held until the entire packet exits the VC buffer. This sometimes
leads to inefficient use of VCs at high network loads. A blocked packet can affect
adjacent routers, resulting in a congestion propagation effect. Dynamic packet
fragmentation releases the held VC buffers by fragmenting packets and allows other
packets to use the freed VC buffers. Thus, fragmentation increases VC utilization which
improves load balancing and provides more flexible flow control. Simulation
experiments show latency and throughput improvements of up to 30% and 75%,
respectively.
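The mechanism above can be sketched in a few lines of Python. This is an illustrative model of the idea, not the dissertation's hardware: the `VirtualChannel` class and flit tags are our own simplification. On a stall, the already-forwarded stream is closed off with a tail flit, the remaining buffered flits are re-tagged as a new packet, and the VC is released for other packets.

```python
from collections import deque

HEAD, BODY, TAIL = "head", "body", "tail"

class VirtualChannel:
    def __init__(self):
        self.buffer = deque()   # flits of the resident packet awaiting the switch
        self.allocated = False  # is this VC held by a packet?

    def enqueue(self, flit_type):
        self.allocated = True
        self.buffer.append(flit_type)

    def fragment(self):
        """On a blocking stall, terminate the already-forwarded stream with a
        tail flit and re-tag the first buffered flit as a new head, so the VC
        can be released and re-allocated to other packets."""
        if not self.buffer:
            return None
        self.buffer[0] = HEAD      # remainder becomes a fresh packet
        self.allocated = False     # VC freed for other packets
        return TAIL                # tail emitted to close the sent fragment

vc = VirtualChannel()
for t in (HEAD, BODY, BODY, TAIL):
    vc.enqueue(t)
vc.buffer.popleft()                # head flit already forwarded downstream
closing = vc.fragment()            # e.g. a credit stall triggers fragmentation
assert closing == TAIL
assert list(vc.buffer) == [HEAD, BODY, TAIL]
assert vc.allocated is False
```

In this toy model the freed VC can immediately be claimed by another packet, which is the localization effect the text describes.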
Adaptive routing schemes [96], [40] are also considered as solutions which provide
increased resource utilization. By re-routing around congestion, adaptive routing can
balance the load and usually provides better performance than oblivious routing.
However, adaptive routing alone does not solve the fundamental problem of inefficiently
used VCs. Rather, it moves the problem to a new set of links [10]. Since adaptive routing
is bounded by the available path diversity, VC utilization is independent of the dynamic
routing characteristic of adaptive routing. Thus, dynamic fragmentation can be applied to
adaptive routing schemes, and the combined techniques can provide synergistic
performance enhancements through efficient utilization of both VCs and links.
In a multicast routing environment, deadlock free multicast routing is achieved by
fragmenting packets in particular blocking situations. Normally, deadlock avoidance in
multicast routing is accomplished through the provision of extra VCs, but given the strict
area and power constraints of on-chip routers, an increased number of VCs is not
necessarily an ideal solution. The proposed dynamic packet fragmentation offers deadlock avoidance without
additional VCs and increases VC utilization. From circuit simulation with synthetic
workloads, the results show that a fragmentation multicast router is superior to a baseline
unicast design, reducing latency by 38.6% and energy consumption by 9% and also
providing 30% more throughput.
Moreover, dynamic packet fragmentation is exploited for fault-tolerant flow control in
soft-error handling for on-chip networks. The fault-tolerant flow control recovers errors at
the link level by requesting retransmission and, with the incorporation of dynamic packet
fragmentation, ensures error-free transmission on a flit basis. The fault-tolerant flow
control disengages subsequent flits from the fault containment and recovers the faulty flit
transmission. Thus, the proposed router provides a high level of dependability at the link
level for both datapath and control planes. In simulation with injected faults, the proposed
router is observed to perform well, gracefully degrading while exhibiting 97% error
coverage in datapath elements. The proposed router has been implemented using a TSMC
45nm standard cell library. As compared to a router which employs triple modular
redundancy (TMR) in datapath elements, the proposed router occupies 58% less area and
consumes 40% less energy per packet on average.
1.2 Introduction
Networks-on-Chip (NoCs) have been suggested as a scalable communication solution for
many-core architectures. As the number of System-on-Chip (SoC) cores increases, power
and latency limitations make conventional buses increasingly unsuitable. Buses are
appropriate for small-scale designs but cannot support scaled performance as the number
of on-chip cores increases. In contrast, NoCs offer fundamental benefits of high
bandwidth, low latency, low power and scalability.
Prior literature [86], [77], [78], [63] proposed highly performance-driven router
architectures with dynamic channel sharing. The packet-switched virtual channel (VC)
router exploits multiple VCs, achieving high throughput with dynamic allocation of
resources. A high utilization of the physical link is accomplished through multiple VCs
by allowing blocked packets to be bypassed by other packets, but little effort has
addressed VC utilization. Especially given the limited number of VCs in on-chip routers,
better VC utilization techniques should be considered.
As adopted in most VC routers, a VC is allocated on a per-packet basis and de-allocated
at the time that the entire packet exits the VC buffer. This characteristic sometimes leads
to inefficient use of VCs at high traffic loads. With a trend of shallow flit buffers per VC,
a blocked packet easily propagates congestion to neighboring routers and can cause
congestion to spread to the overall network. This exacerbates back-pressure for traffic
flow, while preventing other flows from utilizing idle VCs on different routes. Several
studies [93], [17], [72], [90] characterized buffer utilization with various traffic models
and proposed power optimization by placing idle buffers in sleep mode. Buffer and VC
usage have been mostly optimized in terms of power savings in prior work. This
dissertation, however, proposes a technique to efficiently utilize VCs to increase
throughput and reduce latency.
We propose dynamic packet fragmentation for efficient VC utilization. By fragmenting a
packet and reallocating the VC, dynamic fragmentation avoids inefficient VC holding
and prevents VC blocking from propagating congestion to adjacent routers. Once a
packet is fragmented, the blocking is localized by releasing the hold of empty VC buffers,
thereby allowing other packets to use the freed VC. Thus, congested upstream routers do
not force inputs of downstream routers to throttle to a reduced rate. The fragmentation
router, rather, provides more flexible flow control and VC utilization at high network
loads.
In addition to the flexible flow control for enhanced VC utilization, deadlock free
multicast routing is achieved by fragmenting packets in particular blocking situations.
Normally, deadlock avoidance in multicast routing is accomplished through the provision
of extra VCs, but given the strict area and power constraints, an increased number of
VCs is not necessarily an ideal solution [15], [12]. Disha [102] supports deadlock
recovery and efficiently breaks deadlock cycles through a deadlock free lane in fully
adaptive routing. The lane is implemented with overhead of a central deadlock buffer in
routers which is accessible to all neighboring routers along the path. The proposed
dynamic packet fragmentation offers deadlock avoidance without additional VCs and
increases VC utilization.
This dissertation also addresses a fault-tolerant flow control scheme for soft error
handling in on-chip interconnection networks. Dynamic packet fragmentation is adopted
as a part of fault-tolerant flow control to disengage flits from the fault-containment and
recover the faulty flit transmission at error detection. The proposed router protects
packets at a link-level for both datapath and control planes with little hardware overhead
and the proposed fault-tolerant scheme ensures an error-free transmission on a flit-basis.
The implemented fragmentation router is evaluated through various simulation
experiments with synthetic workloads. Performance benefits are demonstrated compared
to a baseline router, and accurate power and area measurements are analyzed from a
placed & routed layout. The result demonstrates that the fragmentation router
outperforms the baseline while it consumes less energy.
1.3 Research contribution and impact
The research contributions of this PhD program are as follows:
Concept of dynamic packet fragmentation
This dissertation introduces a new technique of dynamic flow control which
fragments a packet according to network traffic conditions and develops the
concept of dynamic packet fragmentation. Analyzing various blocking scenarios,
we address how the packets are fragmented and the benefits of fragmentation.
Increased VC utilization and verified performance impact through various
simulations
Dynamic packet fragmentation prevents VC blocking scenarios from
propagating congestion to adjacent routers, thereby allowing other packets to use
freed VCs. The fragmentation router provides flexible flow control and increases
VC utilization at high network loads. Performance benefits are demonstrated
compared to a baseline router, and accurate power and area measurements are
analyzed from a layout.
Deadlock avoidance in tree-based multicast routing
The proposed research offers deadlock avoidance without additional VCs in tree-
based multicast routing environments. Deadlock conditions are resolved by
fragmenting packets in blocking situations.
Fault-tolerant flow control for soft error handling
Fault-tolerant flow control ensures an error-free transmission on a flit-basis,
while using dynamic packet fragmentation upon error detection. After the
fragmentation, the faulty flit is retransmitted, thus, the proposed router
disengages flits from the faulty flit and safeguards the flits following a faulty flit.
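The retransmission behavior described in this contribution can be illustrated with a small model. This is a hedged sketch with hypothetical names; the single-corruption error model is an assumption of ours, not the dissertation's fault model. The receiver checks each flit, a corrupted flit is retransmitted, and the flits that follow arrive intact.

```python
# Toy model of flit-level retransmission: a flit corrupted on its first
# crossing of the link is NACKed and resent, shielding subsequent flits.
def transmit(flits, corrupt_once_at):
    """Deliver `flits` over a lossy link where the flit at index
    `corrupt_once_at` is corrupted on its first transmission only.
    Returns (received flits, total transmission attempts)."""
    received, attempts, i = [], 0, 0
    while i < len(flits):
        attempts += 1
        # corrupt the flit the first time it crosses the link
        if i == corrupt_once_at and attempts == i + 1:
            continue            # NACK: retransmit the same flit
        received.append(flits[i])
        i += 1
    return received, attempts

rx, tries = transmit(["h", "b", "t"], corrupt_once_at=1)
assert rx == ["h", "b", "t"]    # all flits recovered in order
assert tries == 4               # one extra attempt for the corrupted flit
```

The point of the sketch is that recovery is local to the faulty flit; the flits behind it are delivered unchanged, which is what "safeguards the flits following a faulty flit" means here.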
1.4 Organization
The remainder of this dissertation is organized as follows. Chapter 2 presents
background, and Chapter 3 describes dynamic packet fragmentation for increased VC
utilization. Chapter 4 elaborates on tree-based multicast routing using dynamic packet
fragmentation. Chapter 5 discusses the use of dynamic packet fragmentation in the fault-
tolerant flow control for soft error handling. Finally, Chapter 6 concludes the
dissertation.
Chapter 2 Related Work
2.1 Introduction
This chapter describes related work, from commonly used buffered flow control
schemes (wormhole and VC routers) to recently proposed schemes that address
flow-control inefficiencies, multicast routing, and fault tolerance. Conventional flow
control schemes are covered first, and then recent work is outlined with a brief
comparison to dynamic packet fragmentation.
2.2 Flow control basics
Since the fragmentation router takes advantage of wormhole and VC router concepts, this
section briefly overviews these two major types of routers.
2.2.1 Wormhole router
A wormhole router has a similar operating mechanism to virtual cut-through except that
the buffer is allocated on a flit basis. Once a header flit sets up a path through the
crossbar, the subsequent body flits just follow the same path until the tail flit de-allocates
the channel. This makes channel state easy to manage: the channel is arbitrated on a
per-packet basis and held for the duration of the packet. The channel assigned to a
packet is never interrupted by other packets, reducing the number of pipeline stages
traversed by body and tail flits. Thus, a wormhole router attains minimal packet latency
under low traffic conditions [86], [12]. However, packet stalling is expensive because
channels are held on a per-packet basis. Given limited buffer space, a stalled packet
spans multiple routers, blocking all links that the packet occupies.
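The per-packet channel holding described above can be illustrated as follows. This is a minimal Python model with hypothetical names, not an RTL description: a head flit claims the channel, the tail flit releases it, and any other packet is blocked in between.

```python
# Toy model of wormhole channel allocation: the channel is held from the
# head flit of a packet until its tail flit de-allocates it.
class WormholeChannel:
    def __init__(self):
        self.owner = None  # packet id currently holding the channel

    def try_send(self, packet_id, flit_type):
        """Attempt to send one flit; returns False if another packet holds
        the channel (the sender must stall in place)."""
        if self.owner is None and flit_type == "head":
            self.owner = packet_id            # head flit sets up the path
        if self.owner != packet_id:
            return False                      # channel held by another packet
        if flit_type == "tail":
            self.owner = None                 # tail flit de-allocates
        return True

ch = WormholeChannel()
assert ch.try_send("A", "head")
assert not ch.try_send("B", "head")   # B blocks while A holds the channel
assert ch.try_send("A", "body")
assert ch.try_send("A", "tail")       # tail releases the channel
assert ch.try_send("B", "head")       # now B can proceed
```

The blocked `try_send` for packet B is exactly the expensive stall the text describes: with real routers in a chain, B's flits would back up across multiple links until A's tail drains.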
2.2.2 Virtual-channel router
A virtual-channel router [23] introduces multiple VCs to overcome the blocking problem
of the wormhole router. Unlike the wormhole router, the VC router arbitrates the crossbar
passage on a flit-by-flit basis allowing blocked packets to be bypassed by non-blocked
packets. So, the routed flit streams may be interleaved with other packets according to the
switch arbitration. The costs of such a mechanism are that flits must compete for access
to the physical channel through another level of arbitration (the switch arbitration), and
the flexible VC sharing of the physical channel incurs latency and power overheads.
However, these costs are easily justified by the improved throughput that the scheme
enables.
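The flit-by-flit switch arbitration can be sketched as follows. This is illustrative only; the round-robin policy and the data layout are our assumptions, not details of [23]. A blocked packet's VC simply loses every arbitration round, so other VCs keep the physical channel busy.

```python
from collections import deque
from itertools import cycle

def interleave(vcs):
    """Round-robin switch arbitration: pop one flit at a time from non-blocked,
    non-empty VCs, producing the flit order seen on the physical channel."""
    order = []
    rr = cycle(range(len(vcs)))
    while any(vc["flits"] for vc in vcs if not vc["blocked"]):
        vc = vcs[next(rr)]
        if vc["blocked"] or not vc["flits"]:
            continue               # blocked or drained VC loses this round
        order.append(vc["flits"].popleft())
    return order

vcs = [
    {"flits": deque(["A0", "A1"]), "blocked": False},
    {"flits": deque(["B0", "B1"]), "blocked": False},
    {"flits": deque(["C0"]), "blocked": True},   # blocked packet is bypassed
]
assert interleave(vcs) == ["A0", "B0", "A1", "B1"]
```

The interleaved order on the link is the "routed flit streams may be interleaved with other packets" behavior: packet C holds its VC but never stalls A or B.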
2.3 Recently proposed schemes
2.3.1 Schemes for enhanced resource utilization
This section covers a few examples of work which have shown performance
enhancement based on efficient use of resources and flow control. Previous work has
addressed increasing buffer utilization [85], controlling flow by limiting VC allocation
[103], [10], maximizing the number of VCs available [80], and dynamically routing
around congested paths [96], [40], but none of the papers focuses on utilizing limited
VCs. This dissertation proposes efficient use of VCs and thoroughly evaluates dynamic
fragmentation in terms of throughput improvement in NoCs.
Flit-reservation flow control
Peh and Dally proposed flit-reservation flow control [85], in which control flits race
ahead and reserve buffers for data flits ahead of time. This advance scheduling yields
efficient use of buffers, allowing buffers to be reused immediately rather than waiting
until a credit is returned. For a fixed amount of buffer space, the saturation throughput is
increased because of the immediate buffer turnaround.
Flow-aware allocation
Flow-aware allocation [10] allocates VCs to flows rather than to individual packets.
Once a packet of a flow has been allocated a VC within a router, other packets of the
same flow are not allocated parallel VCs. This limits the number of VCs required by a
congested flow and attempts to prevent regular traffic from being affected by the
congested flow's traffic. As flows are identified by their destination, however, the VC
requirements exceed a conventional configuration of 4 VCs: the authors specified 8 VCs
per link to cover nearly 80% of traffic in an 8x8 2D mesh network.
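The allocation rule can be sketched in a few lines. This is a minimal sketch assuming flows are keyed by destination as in [10]; the function and data layout are hypothetical, not the paper's implementation. At most one VC per port is granted to a given flow, so a congested flow cannot claim parallel VCs.

```python
# Toy model of flow-aware VC allocation: one VC per flow per input port.
def allocate_vc(vc_owner_flow, flow):
    """vc_owner_flow maps VC index -> owning flow (or None if free).
    Returns the VC index granted, or None if the request is denied."""
    if flow in vc_owner_flow:
        return None                    # flow already holds a VC: no parallel VC
    for i, owner in enumerate(vc_owner_flow):
        if owner is None:
            vc_owner_flow[i] = flow    # grant a free VC to the new flow
            return i
    return None                        # all VCs busy

vcs = [None, None]
assert allocate_vc(vcs, "dst3") == 0
assert allocate_vc(vcs, "dst3") is None   # second packet of same flow denied
assert allocate_vc(vcs, "dst7") == 1      # a different flow gets the other VC
```

The denial on the second `dst3` request is the whole point: the congested flow's later packets queue behind its single VC instead of spreading across the port.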
Unified buffer structure
ViChaR [80] is a unified buffer structure, called the dynamic Virtual Channel Regulator,
which dynamically allocates VCs and buffer resources according to network traffic
conditions. ViChaR is an on-chip implementation of the unified buffer organization
that was originally presented as the Dynamically Allocated Multi-Queue (DAMQ)
[101] buffer in macro-networks. Unlike other implementations, ViChaR can dispense a
variable number of VCs at each input port, between v VCs (when each VC occupies the
maximum of k flits) and vk VCs (when each VC occupies the minimum of 1 flit).
Dynamic packet fragmentation in this dissertation can closely emulate the unified buffer
structure in that a fragmented packet, originally a single packet, can
be stored in multiple VC buffers. Therefore, the fragmentation router provides efficient
buffer usage with fewer VCs. As VCs are efficiently re-allocated, many VCs are not
necessarily required, and the control logic can also be simplified.
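A much-simplified DAMQ-style model conveys the idea of dispensing VCs from one shared pool. This is our own sketch in the spirit of ViChaR, not the published microarchitecture; class and method names are hypothetical. One VC may grow deep while another takes the last slot, so the port's division between VCs is decided by traffic, not fixed at design time.

```python
# Toy unified buffer: a shared pool of flit slots per input port from which
# VCs are dispensed on demand (anywhere from few deep VCs to many 1-flit VCs).
class UnifiedBuffer:
    def __init__(self, total_slots):
        self.free_slots = total_slots
        self.vcs = {}                      # vc_id -> number of occupied slots

    def open_vc(self, vc_id):
        if self.free_slots == 0:
            return False                   # no capacity left to back a new VC
        self.vcs[vc_id] = 0                # VC dispensed from the shared pool
        return True

    def push_flit(self, vc_id):
        if self.free_slots == 0:
            return False                   # pool exhausted
        self.vcs[vc_id] += 1
        self.free_slots -= 1
        return True

buf = UnifiedBuffer(total_slots=4)
assert buf.open_vc("vc0") and buf.open_vc("vc1")
for _ in range(3):
    assert buf.push_flit("vc0")        # one VC may grow deep...
assert buf.push_flit("vc1")            # ...while another takes the last slot
assert not buf.push_flit("vc1")        # shared pool exhausted
```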
Adaptive routing
Adaptive routing schemes [96], [40] can be an alternative solution to flow-based control.
By re-routing around congestion, adaptive routing can balance the load and usually
provides better performance than oblivious routing. However, adaptive routing alone
does not solve the fundamental problem of inefficiently used VCs. Rather, it moves the
problem to a new set of links [10]. Since adaptive routing is bounded by the available path
diversity, VC utilization remains independent of adaptive routing's dynamic routing
characteristic. Thus, dynamic fragmentation can be applied to adaptive routing schemes, and the
combined techniques can provide synergistic performance enhancements through efficient
resource utilization of both VCs and links.
Bufferless routing
Bufferless routing [74] exploits packet fragmentation, but fragmentation is used only to
avoid packet dropping in a deflecting bufferless router. To avoid dropping a packet in a
bufferless router, a packet can be injected only when not all 4 input ports are busy. If new
packets arrive on all 4 input ports while the source node is injecting a packet, the injected
packet must be truncated to avoid dropping a packet at any of the 4 input ports. The
truncated packet is then re-injected when one of the input ports becomes free. Bufferless
routing depends on the availability of an output port when a packet or flit arrives at an
input port. Since a packet must be routed in some other direction if the desired port is
unavailable, deflection can have a negative impact on packet latency. Moreover, no
results show how packet fragmentation affects performance. Fragmentation is used
merely as a means of avoiding packet dropping, not as a flow control method as in this
dissertation.
2.3.2 Schemes for hardware-based multicast routing
This section reviews prior hardware-based multicast routing schemes in off-chip and on-
chip routers. Several recent proposals have suggested circuit-switched multicast routing
schemes for NoCs. In Virtual Circuit Tree Multicasting (VCTM) [35], multicast data flits
follow paths set up by multiple unicast+setup packets. VCTM outperforms a unicast
packet-switched baseline in various environments, but multicast routing is achieved with
the cost of a CAM-based virtual circuit tree. Lu et al. suggested connection-oriented
multicasting in wormhole switched networks [69]. This scheme exhibits better
throughput than unicast routing, but the multicast group establishment is a significant
overhead.
In packet-switched multicast routing, two major techniques are suggested: path-based and
tree-based. In path-based routing, the source node partitions the destinations into several
ordered lists. The multiple packet worms then visit destinations in a predefined order
through disjoint paths. Figure 2-1(a) shows dual-path based routing. The home node
sends the multicast packets in two groups along directed Hamiltonian paths. Each path
has destinations that can be reached from the source node without cyclic directed
dependency. However, the multicast packet visits all intermediate nodes until it reaches
the last destination; therefore, path-based routing typically results in high latency. Figure
2-1(b) shows tree-based multicast routing. In tree-based routing, the multidestination
packet is routed along a common path, and the packet is copied into different channels at
branches. However, tree-based multicast routing is susceptible to packet blocking at
branch nodes, and such blocking increases the probability of deadlock in wormhole
routed networks.
Figure 2-1 Multicast routing schemes: (a) dual-path, (b) tree-based
Prior literature suggested viable solutions to avoid deadlock by partitioning and
systematically allocating virtual channels [15]. Kumar proposed a hardware tree-based
multicast routing algorithm (HTA) with deadlock detection and a recovery mechanism
[60]. When a potential deadlock is detected, the packet is routed to the deadlock queue
and reinjected into the network after a predetermined amount of time. Tree-based
multicast with branch pruning [70] solved deadlocks by controlling multicast through a
pruning mechanism. An auxiliary buffer associated with each input channel holds data flits
until the tail flit exits a node. The auxiliary buffer makes it possible for multidestination
worms to branch since it prunes off the tree when any branch is blocked. The tree-based
multicast with branch pruning performs well, but the auxiliary buffer size is a limitation.
Tree-based multicast routing for write invalidation messages
This section covers tree-based multicast routing schemes for write invalidation messages.
In distributed shared memory (DSM) systems, directory-based protocols are often used as
a scalable design due to their point-to-point communication nature. The directory
protocol maintains the possible sharers associated with an entry of its home node. When a
node issues a write request to the home node for a write miss, the home node sends
invalidation request messages to all sharers. Once the invalidation acknowledgements are
received from the sharers, the home node replies granting ownership of the block to the
requester. Even though the above transactions are intended to be performed serially, the
protocol allows invalidation requests from a write miss to be delivered concurrently.
However, generating a separate packet for each request at the home node may serialize
the transaction. This packet serialization exacerbates the miss latency and causes
inefficient use of bandwidth by generating multiple unnecessary messages in the network.
In addition, the multiple packets increase contention at the injection port of the router and
the occupancy of the home node, thus creating hot-spots [22].
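The serialization problem can be illustrated with a minimal Python sketch of a home node handling a write miss; the class and message names are hypothetical and are not drawn from any particular protocol implementation:

```python
# Illustrative sketch (not the dissertation's protocol code): a home node
# handling a write miss under a directory protocol. Ack collection is
# simplified to show that one unicast packet is generated per sharer,
# which is the serialization at the injection port described above.
class HomeNode:
    def __init__(self, sharers):
        self.sharers = set(sharers)

    def handle_write_miss(self, requester):
        targets = self.sharers - {requester}
        # One invalidation packet per sharer, generated one at a time
        sent = [("INV", s) for s in sorted(targets)]
        acks = {s for (_, s) in sent}       # assume every sharer acks
        assert acks == targets
        self.sharers = {requester}          # requester becomes owner
        return len(sent)                    # number of unicast packets

home = HomeNode(sharers=[1, 2, 3])
print(home.handle_write_miss(requester=1))  # 2 invalidations (sharers 2, 3)
```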
To reduce the latency of the write operation, several previous works suggested multicast
communication for write invalidation. The authors of [22] proposed multidestination-based reservation
and gather worms in 2D networks. While the reserve worm delivers the invalidation
message, it reserves an acknowledgement entry in each router interface. The gather worm
then collects the acknowledgements passing through the routers. The set of the
destinations of a multicast message is partitioned as an up-and-down column grouping;
thus, a home node requires at most two reserve worms in each column for invalidation
requests.
2.3.3 Schemes for soft-error handling in NoC routers
This section reviews prior schemes of handling errors in NoC routers. The MIT reliable
router [24] implements a link-level fault-tolerant protocol, the Unique Token Protocol.
This flow control keeps at least two copies of a packet in the network at all times. If the
network is broken between two copies of the packet, each copy follows multiple paths to
the destination, and the token is changed to a replica token for all copies of the packet.
The destination node then recognizes that a packet is duplicated when it receives a packet
with a replica token, and it deletes the duplicates.
The reliable router is somewhat close to the proposed router in that it reconstructs a
fragmented packet, but the flow control scheme is different because the proposed router
maintains a single copy of the packet in the network. If a packet is fragmented, the
severed fragments, which were originally a single packet, are delivered to the destination
separately. In the reliable router, on the other hand, if a packet is split, the packet is
resent from the header, not from the flit at which the error was detected. Since the
reliable router keeps two complete copies of the packet in the network, packet-level
recovery is achieved at a greater cost in network throughput than in the proposed scheme.
Park et al. suggested a flit-based hop-by-hop (HBH) retransmission scheme upon link or
intra-router errors [83]. They assumed additional retransmission buffers per VC to perform the
link-level retransmission. The retransmission buffer is implemented as a barrel-shift
register; therefore, a flit is stored at the back of the buffer upon transmission on the link,
and it moves to the front by the time a possible NACK signal arrives from the receiving
node. Although they utilize the retransmission buffers for more purposes than just
providing link protection, such as deadlock recovery, the retransmission buffers are still
overhead, and the control logic is complicated.
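The barrel-shift behavior can be sketched as follows; this is an illustrative Python model under assumed timing (a fixed NACK window in cycles), not the design in [83]:

```python
from collections import deque

# Hedged sketch of a per-VC retransmission buffer in the spirit of [83]:
# a flit enters at the back when sent on the link and reaches the front
# after `window` cycles, when a possible NACK could arrive.
class RetransmissionBuffer:
    def __init__(self, window):
        self.window = window
        self.slots = deque()  # each slot: [flit, cycles_remaining]

    def send(self, flit):
        self.slots.append([flit, self.window])

    def cycle(self, nack=False):
        """Advance one cycle; return a flit to retransmit, or None."""
        for slot in self.slots:
            slot[1] -= 1
        if nack and self.slots:
            flit, _ = self.slots.popleft()
            self.send(flit)          # re-enter the shift register
            return flit
        if self.slots and self.slots[0][1] <= 0:
            self.slots.popleft()     # NACK window passed: implicit ACK
        return None

buf = RetransmissionBuffer(window=3)
buf.send("flit0")
assert buf.cycle() is None
assert buf.cycle() is None
assert buf.cycle(nack=True) == "flit0"   # NACK arrives: retransmit
```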
Kohler suggested distributed on-line fault diagnosis and deflection routing, which takes
the detailed fault status of NoC crossbar connections into account [56]. The fault status is
diagnosed on-line, distinguishing between permanent and transient faults, and the router
is equipped with an augmented architecture containing error detection units that use CRC checksums.
Diagnosis results are then stored in on-chip structures and used for determining fault-
adaptive deflection routing to avoid defective parts of the switch.
Soft-error handling
In on-chip interconnection networks, soft errors could be categorized into two types
based on the location and probability that errors occur: link errors and intra-router errors
[83]. Links are purely wires and are affected by cross-talk, whereas the individual router
components are logic circuits and errors occur in the form of transient faults, which may
also cause erroneous states. The transient faults may not affect the results of a
computation since masking effects (logical masking, electrical masking, and latching-
window masking [95]) reduce the manifested error probability. Although the errors in
links and intra-routers are distinct in terms of their probability of occurrence, they can
be tackled together in the proposed router.
Intra-router errors can further be divided into errors in datapath components and control
planes. As datapath components handle storage and movement of flits in input pipelines,
MUXes, buffers, a crossbar, and output pipelines, any errors in datapath components are
manifested at the output; thus, they can be handled together with link errors. If an error
occurs in any of the datapath elements, the erroneous flit is detected by the error detection
code at the next hop, just as if a link error had occurred. Thus, the location of an error
occurrence in the datapath is not important (except control parts of the datapath elements
such as selection bits of a crossbar). The significance of errors in the datapath is that
errors are reflected at the output. Therefore, we divide soft errors in the network into errors
in datapath components and errors in the control plane. In this dissertation, we focus on how errors
in the datapath and links are detected.
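The detection mechanism can be illustrated with a minimal sketch that uses a single parity bit as the error detection code; an actual design would likely use a stronger code such as a CRC:

```python
# Minimal sketch: a per-flit error detection code checked at the next
# hop. Any soft error in the datapath or on the link flips payload bits
# and is caught by the receiving router's check, regardless of where
# along the path the error occurred.
def parity(bits):
    return sum(bits) % 2

def send_flit(payload_bits):
    return payload_bits + [parity(payload_bits)]   # append check bit

def check_at_next_hop(flit_bits):
    *payload, p = flit_bits
    return parity(payload) == p    # False => request retransmission

flit = send_flit([1, 0, 1, 1])
assert check_at_next_hop(flit)         # error-free datapath and link
flit[1] ^= 1                           # soft error anywhere en route
assert not check_at_next_hop(flit)     # detected at the next hop
```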
As packet flows are controlled by handshaking between routers in on-chip networks,
faulty flits can be easily retransmitted at a link-level. The proposed router implements
retransmission of faulty flits at a link-level and reconstructs a discontinued packet due to
a fault while preserving the flit order. Based on the fragmentation scheme defined in [46],
we develop a new fault tolerant router, exploiting dynamic packet fragmentation for
handling a discontinued packet due to a detected error. Fragmentation renews the state
information in control planes through a VC reallocation, preventing corrupted states from
affecting the rest of the flits.
Chapter 3 Dynamic Packet Fragmentation for Increased
VC utilization in On-Chip Routers
3.1 Overview
Conventional packet-switched on-chip routers provide good resource sharing while
minimizing latencies through various techniques. A virtual channel (VC) is allocated on a
per-packet basis and held until the entire packet exits the VC buffer. This sometimes
leads to inefficient use of VCs at high network loads. A blocked packet can affect
adjacent routers, resulting in a congestion propagation effect. This chapter proposes a
dynamic packet fragmentation technique which releases the held VC buffers by
fragmenting packets and allowing other packets to use the freed VC buffers. Thus,
fragmentation increases VC utilization which improves load balancing and provides more
flexible flow control. Simulation experiments show performance improvements in
latency and throughput of up to 30% and 75%, respectively.
3.2 Introduction
Networks-on-Chip (NoCs) have been suggested as a scalable communication solution for
many-core architectures. As the number of System-on-Chip (SoC) cores increases, power
and latency limitations make conventional buses increasingly unsuitable. Buses are
appropriate for small-scale designs but cannot support scaled performance as the number
of on-chip cores increases. In contrast, NoCs offer fundamental benefits of high
bandwidth, low latency, low power and scalability.
Prior literature [86], [77], [78], [63] proposed highly performance-driven router
architectures with dynamic channel sharing. A packet-switched virtual channel (VC)
router exploits multiple VCs, achieving high throughput with dynamic allocation of
resources. A high utilization of the physical link is accomplished through multiple VCs,
but little effort has focused on VC utilization. Especially given the limited number of
VCs in on-chip routers, better VC utilization techniques must be considered.
As adopted in most VC routers, a VC is allocated on a per-packet basis and de-allocated
at the time that the entire packet exits the VC buffer. This characteristic sometimes leads
to inefficient use of VCs at high traffic loads. With a trend of shallow flit buffers per VC,
a blocked packet easily propagates congestion to neighboring routers and can cause
congestion to spread to the overall network. This exacerbates back-pressure for a specific
traffic flow, while preventing other flows from utilizing idle VCs on different routes.
Several studies [93], [17], [72], [90] characterized buffer utilization with various traffic
models and proposed power optimization by placing idle buffers in a sleep mode. Buffer
and VC usage have been mostly optimized in terms of power savings. This chapter,
however, proposes a technique to efficiently utilize VCs to increase throughput and
reduce latency.
We propose dynamic packet fragmentation for efficient VC utilization. By fragmenting a
packet and reallocating the VC, dynamic fragmentation avoids inefficient VC holding
and prevents VC blocking from propagating congestion to adjacent routers. Once a
packet is fragmented, the blocking is localized by releasing the hold of empty VC buffers,
thereby allowing other packets to use the freed VC. Thus, congested upstream routers do
not force inputs of downstream routers to throttle to a reduced rate. The fragmentation
router, rather, provides more flexible flow control and VC utilization at high network
loads.
The implemented fragmentation router is evaluated through various simulation
experiments with synthetic workloads. Performance benefits are demonstrated compared
to a baseline router, and accurate power and area measurements are analyzed from a
placed-and-routed layout. The results demonstrate that the fragmentation router
outperforms the baseline while consuming less energy. This reduction in packet latency
and increase in throughput justify the slightly more complex flow control
required.
The remainder of this chapter is organized as follows. Section 3.3 presents background.
Section 3.4 describes a baseline router used for comparison in the evaluation
experiments. Section 3.5 elaborates on the dynamic packet fragmentation technique and
the scheme applied to the proposed model, and the analytical model is described
in Section 3.6. Section 3.7 discusses the simulation results, and Section 3.8 concludes.
3.3 Related work
This section covers a few examples of work which have shown performance
enhancement based on efficient use of resources and flow control. Previous work has
addressed increasing buffer utilization [85], controlling flow by limiting VC allocation
[10], maximizing the number of VCs available [80], and dynamically routing around
congested paths [96], [40], but none of these works focuses on utilizing a limited number
of VCs. This chapter proposes efficient use of VCs and thoroughly evaluates dynamic
fragmentation in terms of throughput improvement in NoCs.
3.3.1 Flit-reservation flow control
Peh and Dally proposed flit-reservation flow control [85], in which control flits race
ahead and reserve buffers for data flits ahead of time. This advance scheduling yields
efficient use of buffers, allowing buffers to be reused immediately rather than waiting
until a credit is returned. For a fixed amount of buffer space, the saturation throughput is
increased because of the immediate buffer turnaround.
3.3.2 Flow aware allocation
Flow-aware allocation [10] allows VCs to be allocated to flows rather than to individual packets.
Once a packet of a flow has been allocated a VC within a router, other packets of the
same flow are not allocated parallel VCs. This limits the number of VCs required for the
congested flow and attempts to prevent regular traffic from being affected by the
congested flow's traffic. Because flows are identified by their destination, however, the
VC requirements exceed a conventional configuration of 4 VCs: the authors specified
8 VCs per link to cover nearly 80% of traffic in an 8x8 2D
mesh network.
3.3.3 Unified buffer structure
ViChaR [80] is a unified buffer structure, called the dynamic Virtual Channel Regulator,
which dynamically allocates VCs and buffer resources according to network traffic
conditions. ViChaR is an on-chip implementation of the unified buffer organization
which has been originally presented as the Dynamically Allocated Multi-Queue (DAMQ)
[101] buffer in the macro-network. Unlike other implementations, ViChaR can dispense a
variable number of VCs at each input port between v VCs (when each VC occupies the
maximum of k flits) and vk VCs (when each VC occupies the minimum of 1 flit).
Dynamic packet fragmentation in this chapter can emulate the unified buffer structure
closely in that a fragmented packet, originally a single packet, can be stored in multiple
VC buffers. Therefore, the fragmentation router provides efficient buffer usage with
fewer VCs. As VCs are efficiently re-allocated, a large number of VCs is not required,
and the control logic can also be simplified.
3.3.4 Adaptive routing
Adaptive routing schemes [96], [40] can be an alternative solution to flow-based control.
By re-routing around congestion, adaptive routing can balance the load and usually
provides better performance than oblivious routing. However, adaptive routing alone
does not solve the fundamental problem of inefficiently used VCs. Rather, it moves the
problem to a new set of links [10]. Since adaptive routing is bounded by the available path
diversity, VC utilization remains independent of adaptive routing's dynamic routing
characteristic. Thus, dynamic fragmentation can be applied to adaptive routing schemes, and the
combined techniques can provide synergistic performance enhancements through efficient
resource utilization of both VCs and links.
3.4 Baseline router
To evaluate the concept of dynamic fragmentation, we must first define a baseline router.
This section discusses key attributes of a baseline router used in our evaluation
experiments. A single cycle baseline router has been implemented based on the presented
parameter selections, and the same baseline router is used for incorporating the dynamic
fragmentation scheme.
Figure 3-1 shows the pipeline stages and steps of a baseline router. The basic steps are
comprised of buffer write (BW), pre-routing computation (preRC), virtual-channel
allocation (VA), switch allocation (SA), switch traversal (ST), and link traversal (LT)
[26]. The following sections describe each of the basic steps adopted in the baseline
router.
Figure 3-1 Pipeline stages of the baseline router
3.4.1 Pipeline stages
BW & preRC stages
When a flit is written into an input buffer, routing pre-computation (look-ahead routing
[38]) is concurrently performed to reduce the number of pipeline stages. This pre-
computation eliminates the RC stage from the critical path by overlapping it with the BW
stage.
SAVA stage
With the information of the routing computation, the VA for a free VC and SA for the
crossbar time slot are the next steps. Peh and Dally [86] suggested speculation techniques
for improving latency. Mullins et al. [77], [78] proposed low-latency routers by
predetermining the arbitration decisions one cycle earlier than they are requested. In this
baseline router, we followed the scheme suggested in [63]. VA and SA are done in the
same cycle, but VA is performed sequentially after the SA for a combined SAVA stage.
A SA winner is first selected through two separate stages of arbitration (local arbitration
and global arbitration). VA is then accomplished simply by finding a free output VC
from the requested output port of the SA winner.
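The combined SAVA stage can be sketched as follows; the two-level structure follows the description above, but the function names and round-robin policy details are illustrative assumptions, not the router's RTL:

```python
# Illustrative sketch of the combined SAVA stage: switch allocation via
# a local round-robin arbitration among a port's VCs (global arbitration
# among input ports would follow), then VA simply picks any free output
# VC at the winner's requested output port.
def round_robin(requests, last):
    n = len(requests)
    for i in range(1, n + 1):
        idx = (last + i) % n
        if requests[idx]:
            return idx
    return None

def sava(input_vc_reqs, output_free_vcs, last=0):
    """input_vc_reqs: per-VC request flags at one input port.
    output_free_vcs: list of free VC ids at the requested output port."""
    winner_vc = round_robin(input_vc_reqs, last)   # local arbitration
    if winner_vc is None:
        return None
    # (global arbitration across input ports is omitted for brevity)
    if not output_free_vcs:
        return None                                 # VA fails: stall
    return winner_vc, output_free_vcs[0]            # VA: first free VC

print(sava([False, True, True], output_free_vcs=[2]))  # (1, 2)
```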
ST stage
Bypassing the input buffer [104] is another technique to optimize performance. In this
case, upon successful arbitration, a flit goes directly to the crossbar without delaying one
cycle for input buffering. Figure 3-2 shows how bypassing is implemented. Since the
depth of the FIFO buffers in this design is assumed to be small (i.e., 5 entries), the
input buffers are implemented with synthesizable flip-flops. Each entry of the flip-flop
buffers can act as an input pipeline stage, eliminating separate input pipeline overhead.
read operation is done in parallel with local arbitration of the SA. The VC winner from
the local arbitration selects a request among the VCs and makes the selected flit ready at
the crossbar. While the local VC winner requests global arbitration, the flit from the input
port can cross the crossbar half-way to the output port, awaiting the final arbitration result.
If the SA request to the global arbitration is granted and the VA succeeds by finding a
free output VC, the header flit traversal is complete. At the LT stage, the header flit is
finally transferred to the next router through the link. The rest of the body and tail flits
use the same VC allocated to the header flit. Only the SA is performed while they
traverse the crossbar as shown in Figure 3-1.
Figure 3-2 Bypassing in Flip-Flop based buffers
3.5 Dynamic packet fragmentation
Now that a baseline router definition has been established, this section introduces the
concept of dynamic packet fragmentation to increase VC utilization. Given the trend of a
small number of VCs in NoCs to minimize router overhead, VC utilization is a key factor
for performance improvement. We first analyze blocking scenarios to show where poor
VC utilization occurs, motivating the need for dynamic fragmentation.
The baseline router described in the previous section is adequate in the absence of
blocking situations. Flits proceed in an orderly fashion through router pipelines when
there is little network contention. However, at high traffic loads router pipelines often
stall, propagating congestion throughout the network. Router pipeline stalls are generally
classified into two types: packet stalls and flit stalls. The packet stalls are related to the
packet processing functions of the pipeline whereas the flit stalls occur when a flit cannot
complete the switch allocation [26]. Flit stalls can further be subdivided into credit stalls,
buffer empty stalls, and stalls due to a failing SA for the switch time slot.
These flit stalls cause a packet to span multiple routers holding the VCs, as shown in
Figure 3-3. In Figure 3-3(a), the blocked packet in router 3 generates a credit stall to
router 2, and the credit stall propagates to upstream routers. The packet is stuck in the
VCs until the credit is returned from router 3 to router 1. The situation is even worse with
longer packets. Longer packets span more routers, so the chance that the flow is
congested by the holding of multiple network resources is increased.
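The credit-based flow control underlying these stalls can be sketched minimally as follows; the class name is illustrative:

```python
# Minimal credit-based flow control sketch: an upstream VC decrements
# its credit count for every flit it sends and stalls when the count
# reaches zero, until the downstream router frees a buffer slot and
# returns a credit (the situation in Figure 3-3(a)).
class CreditChannel:
    def __init__(self, buffer_depth):
        self.credits = buffer_depth   # one credit per downstream slot

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        assert self.can_send(), "credit stall"
        self.credits -= 1

    def credit_return(self):
        self.credits += 1

ch = CreditChannel(buffer_depth=2)
ch.send_flit()
ch.send_flit()
assert not ch.can_send()   # downstream blocked: credit stall here
ch.credit_return()         # a flit left the downstream buffer
assert ch.can_send()
```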
Figure 3-3 Examples of flit stalls: (a) credit stall, (b) buffer empty stall
Figure 3-3(b) shows that mid-packet blocking in a delayed packet leads to a buffer empty
stall. While packet 1 is being forwarded to router 2, router 1 may become congested due
to other conditions. The flits of packet 1 that are already forwarded to router 2 before
congestion occurs are routed to router 3 without any delay because router 2 is not
congested. The delayed packet in router 1 causes the VCs in the downstream routers to be
throttled to a reduced rate. Routers 2 and 3 may experience buffer empty stalls, preventing
the VC, although empty, from being assigned to packet 2. These stalls lead to
bandwidth loss and increased network congestion.
In fact, the VC router design mitigates the effect of packet blocking by sharing a physical
channel among several VCs. Although the VC router design increases physical link
utilization with dynamic sharing, it does not address VC utilization. Efficient utilization
of the limited number of VCs can be a key factor to further performance improvement.
Since the number of VCs cannot be increased without limit due to area, power and
latency constraints, a better utilization technique of VCs should be considered.
Dynamic packet fragmentation provides a means for VCs to be utilized more efficiently
by preventing blocking from propagating in some situations. Since the router dynamically
fragments packets in these blocking situations, the frequency of fragmentation is
dynamically adjusted to network traffic. Indeed, this router design avoids fragmentation
overhead at low traffic rates where there is little contention. The following sections
describe how fragmentation is applied to the blocking situations we address.
3.5.1 Fragmentation at credit stall
Since a credit stall is induced by the blocking of a downstream router, the stall is released
when a flit of the downstream router leaves the input buffer and a credit is returned. The
returned credit experiences latencies due to transmission and updating of the credit count.
The buffer can then be assigned to a new flit. At the time of a credit stall, even though
there may be idle buffers in other comparable VCs in the downstream router, the current
packet cannot use the empty buffers because of VC allocation restrictions. However, if
the packet is fragmented and considered as a different packet, the stalled packet of the
current node can be assigned to a new VC and be routed without credit stalls.
Figure 3-4 shows the process of dynamic packet fragmentation in the case of a credit stall.
The VC controller in the router senses the lack of credits when the VC has 1 credit left
and no more credits are returned (Figure 3-4(a)). When the router uses the last available
credit by forwarding a body flit, the VC controller fragments the packet to avoid credit
stalls. The fragmentation starts by changing the type field of the sending body flit to a
virtual-tail flit. The VC controller then regards the packet forwarding to be complete and
releases the hold of the output VC (in this example, output VC 0 is released, Figure
3-4(b)).
Figure 3-4 Packet fragmentation at credit stall
Now, the remaining body flits in the input buffer are considered as a new packet and
attempt to acquire a new VC. A copy of the header flit from the original packet must be
maintained for fragmentation. This header flit copy serves as a virtual header flit for all
fragments of the original packet and is treated as a normal header flit by VC controllers.
With the buffered header information, the VC controller can treat the remaining body and
tail flits as a new complete packet. The router performs VA and SA steps again as if a
new packet is delivered. If the VA and SA succeed, the VC controller sends a virtual-
header flit first using the information of the buffered header (in this example, output VC
1 is acquired, Figure 3-4(c)). After that, the body and tail flits follow the same route setup
by the virtual-header at the success of SA. Since a new VC is allocated to the fragmented
packet, the credit count is also updated to the amount of available buffers at the new VC.
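The sequence above can be sketched in Python; flit types and the fragmentation condition are simplified to a single VC's view, so this is an illustration of the mechanism rather than the router's actual control logic:

```python
# Hedged sketch of the credit-stall fragmentation steps of Figure 3-4.
# Flits are (type, data) pairs. The controller retains a copy of the
# header so that the remainder can later be re-issued as a new packet
# behind a virtual-header flit.
def fragment_on_credit_stall(flits, credits):
    """Forward flits until credits run out; on the last credit, convert
    the outgoing body flit to a virtual-tail, releasing the output VC,
    and leave the remainder to be re-allocated as a new packet."""
    header = flits[0]                      # retained header copy
    sent, remaining = [], list(flits)
    while remaining and credits > 0:
        ftype, data = remaining.pop(0)
        credits -= 1
        if credits == 0 and ftype == "body" and remaining:
            ftype = "vtail"                # fragment: close this VC
        sent.append((ftype, data))
    if remaining:
        # new packet = virtual-header copy + leftover flits
        remaining = [("vhead", header[1])] + remaining
    return sent, remaining

packet = [("head", 0), ("body", 1), ("body", 2), ("tail", 3)]
sent, rest = fragment_on_credit_stall(packet, credits=2)
print(sent)  # [('head', 0), ('vtail', 1)] -- output VC released
print(rest)  # [('vhead', 0), ('body', 2), ('tail', 3)]
```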
By regarding credit stalls as a congestion metric of downstream routers, a fragmented
packet could be routed to a different path in adaptive routing schemes. This enables
dynamic, flexible finer-grained flow control. We leave more details about coupling
fragmentation with adaptive routing as future work.
3.5.2 Fragmentation at buffer empty stall
Buffer empty stalls are caused by upstream routers, and such a stall is released when a
new flit is delivered to the input buffer. A buffer empty stall prevents empty storage from
being used, as shown in Figure 3-3(b). To give other packets a chance to use the empty
buffer, packet fragmentation can be performed. This buffer-empty fragmentation prevents
VC holding from spreading through the network and causing a lack of VCs at downstream
routers. It provides fast reuse of VCs, avoiding unnecessary VC holding.
Figure 3-5 shows the process for how a packet is fragmented in the case of a buffer
empty stall. From Figure 3-5(a), a router detects that it may encounter a buffer empty
stall when it has a single flit in the input buffer and no flit coming into this buffer. When
the last available flit in the input buffer is forwarded to the next router, the VC controller
fragments the packet by changing the type field of the sending flit to a virtual-tail (Figure
3-5(b)). With the fragmentation, the VC controller releases the output VC (in this
example, output VC 0 is released), regarding the packet forwarding as complete.
Figure 3-5 Packet fragmentation at buffer empty stall
When the delayed upstream router is released and a new flit arrives in the input buffer,
the VC controller treats the flit as a new packet. As described in the previous section,
since the newly arrived body flit does not carry any routing information, the VC
controller retains a copy of the header flit and uses it for VA and SA steps for packet
fragments.
For the routing pre-computation step, the pre-computed routing information is already
available in the header buffer; thus, the routing computation process can be skipped for
trailing packet fragments. If the VC controller succeeds in VA and SA, the buffered head
flit is forwarded first as a virtual-header (in this example, output VC 1 is acquired, Figure
3-5(c)) similar to that shown for credit stall fragmentation. The rest of the newly arrived
body flits follow the same route as the virtual-header flit until the VC controller detects
another buffer empty stall or tail flit.
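The buffer empty stall detection and fragmentation decision can be sketched as a simple predicate; the condition mirrors the single-flit-with-no-arrival rule described above, with illustrative names:

```python
# Sketch of the buffer-empty-stall condition of Figure 3-5: a VC with a
# single buffered flit and no flit arriving is about to run dry, so the
# controller turns the departing body flit into a virtual-tail and
# releases the output VC instead of holding it empty.
def forward_and_maybe_fragment(occupancy, incoming, flit_type):
    """Return (flit type to forward, whether the output VC is released).
    occupancy: number of flits currently buffered in this VC."""
    about_to_empty = (occupancy == 1) and not incoming
    if about_to_empty and flit_type == "body":
        return "vtail", True   # fragment: free the VC for other packets
    return flit_type, flit_type in ("tail", "vtail")

assert forward_and_maybe_fragment(1, False, "body") == ("vtail", True)
assert forward_and_maybe_fragment(2, True, "body") == ("body", False)
assert forward_and_maybe_fragment(1, False, "tail") == ("tail", True)
```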
Fragmentation prevents the mid-packet blocking from propagating to multiple routers.
Figure 3-6 shows how fragmentation mitigates the example of Figure 3-3(b). Once the
empty buffer in router 3 is released by fragmentation from the empty buffer of router 2,
the buffer can be utilized by packet 2 as soon as the head part of the fragmented packet 1
leaves the buffer. Thus, fragmentation prevents upstream blocking from propagating to
downstream routers. Congested upstream routers do not force inputs of downstream
routers to throttle to a reduced rate; instead the input of a downstream router can be
utilized by another packet.
Figure 3-6 After the fragmentation of Figure 3-3(b)
A fragmented packet can be fragmented again if the same situation occurs in another
router. Thus, the virtual-header flit overhead increases with the number of times
fragmentation occurs. Note that virtual-header and virtual-tail flits are created during
routing at the point of fragmentation; intermediate routers treat them identically as
normal head and tail flits, respectively.
3.5.3 Proposed router
This section describes how dynamic packet fragmentation is applied to the baseline router
described in Section 3.4. Although fragmentation is introduced for better resource
utilization, unnecessary fragmentation should be avoided due to the overhead involved.
The proposed router uses the following techniques to address this trade-off.
In the case of credit stall fragmentation, fragmentation caused merely by the credit loop
latency must be avoided. The flit buffers are recycled when a credit is returned. If the
number of flit buffers is below the number of cycles of the credit loop, the VC will incur
a credit stall without any blocking. In this chapter, a single-cycle router latency and
single-cycle link latency are assumed. So, a 5-entry buffer per VC is sufficient to cover
the credit loop of 5 cycles. Therefore, the router does not experience a credit stall just
because of the credit loop latency.
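As a back-of-the-envelope check, the credit loop can be accounted for as in the sketch below. The exact cycle breakdown is an assumed accounting chosen to match the 5-cycle loop stated above; the real loop composition depends on the router pipeline:

```python
def credit_loop_cycles(router_cycles, link_cycles):
    # Forward path: the flit crosses the link and is processed downstream;
    # return path: the credit crosses the link back and is processed
    # upstream, plus one cycle to reuse the freed buffer slot.
    return 2 * link_cycles + 2 * router_cycles + 1

def min_vc_buffer_depth(router_cycles, link_cycles):
    # A VC avoids credit stalls caused purely by credit-return latency
    # when its buffer depth covers the whole credit loop.
    return credit_loop_cycles(router_cycles, link_cycles)

# Single-cycle router and link -> 5-cycle loop -> 5-entry buffer per VC.
assert min_vc_buffer_depth(1, 1) == 5
```

With a deeper pipeline or longer links the loop lengthens, and the per-VC buffer must grow accordingly to keep credit stalls from triggering spurious fragmentation.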
Fragmentation that could otherwise occur due to buffer empty stalls can also be avoided with a priority-based arbitration scheme. Under a fair arbitration scheme, the VCs interleave their flits on a single physical channel, potentially causing downstream router input buffers to be throttled even when there is no blocking. If the next downstream router is not congested, a forwarded flit may be transferred onward immediately; the first downstream router would then fragment the packet because it experiences a lack of flits in its input buffer.
A winner-take-all arbitration [26], [63] for the SA, on the other hand, solves this problem. Since this scheme allocates all the bandwidth to a single packet until it sees a tail flit or the packet is blocked, the input buffer of the downstream router does not see any empty slots between consecutive forwarded flits. Once the stream is interrupted by a stall, the packet is fragmented, and the next packet gets priority to be forwarded until the end of its stream.
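The winner-take-all behavior can be sketched as a simple grant function (a simplified model; the real switch allocator arbitrates per output port, and the tie-breaking policy here is an assumption):

```python
def winner_take_all_grant(requests, current_winner, winner_done):
    """Grant the switch to one VC and keep granting it until its packet
    ends (tail or virtual-tail forwarded) or it stalls; only then pick
    a new winner.  `requests` is the set of VC ids currently requesting,
    `current_winner` the VC holding the grant (None if free), and
    `winner_done` is True once the holder has released the switch."""
    if current_winner is not None and not winner_done and current_winner in requests:
        return current_winner                 # keep all bandwidth on one packet
    return min(requests) if requests else None  # simple fixed-priority pick

# VC 2 holds the switch; VC 0 must wait even though it has lower index.
assert winner_take_all_grant({0, 2}, 2, False) == 2
# Once VC 2 finishes (tail forwarded), VC 0 is granted.
assert winner_take_all_grant({0}, 2, True) == 0
```

The key property for fragmentation is that consecutive flits of one packet leave back-to-back, so a downstream router never sees gaps unless a genuine stall occurred.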
With the techniques mentioned above, the input buffer of a VC always receives a complete packet stream from header (virtual-header) to tail (virtual-tail). Even though a packet is fragmented, each fragment is treated as a complete packet. Forwarding complete packets makes it possible to emulate a wormhole router in the proposed model. In fact, body and tail flits do not require the SA step in a wormhole router; they simply follow the path set up by the header flit. The proposed router performs similarly to a wormhole router, but with fragmentation it avoids the mid-packet blocking that arises from channels being held on a per-packet basis.
The pipeline stages of the proposed model can be seen in Figure 3-7. Unlike the baseline router, the proposed model does not require an SA stage for body and tail flits. The crossbar passage is set up by a header or virtual-header flit and remains valid until the tail or virtual-tail flit traverses the crossbar. The guaranteed single-hop latency for body and tail flits is achieved through the circuit-switching quality of the data streaming. The scheme is also beneficial for power consumption since SA activity is reduced to a per-packet basis.
Figure 3-7 Pipeline stages of the proposed router
Although the presented router can be applied to various routing algorithms, we assume
dimension-order (XY) routing with a two-dimensional mesh network for the simplicity of
design. Since intermediate routers forward packet fragments based on their arrival sequence at the input port, the fragments are delivered in order under deterministic routing. Packet reassembly at the destination is straightforward and can be left to a network interface controller.
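Because fragments arrive in order, reassembly reduces to concatenating payloads until a true tail flit is seen. The sketch below is illustrative; the flit-type encoding and the assumption that head and virtual-head flits carry only routing information are simplifications:

```python
# Illustrative flit-type encoding (not the actual on-chip format).
HEAD, BODY, TAIL, VIRT_HEAD, VIRT_TAIL = range(5)

def reassemble(flits):
    """Rebuild packets from an in-order flit stream.  Virtual heads are
    duplicated routing information and are dropped; a virtual-tail is a
    retyped body flit, so its payload is kept; only a true TAIL ends a
    packet."""
    packets, current = [], []
    for kind, payload in flits:
        if kind in (BODY, VIRT_TAIL, TAIL):
            current.append(payload)
        if kind == TAIL:
            packets.append(current)
            current = []
    return packets

# A packet fragmented once: the leading fragment ends in a virtual-tail,
# the trailing fragment starts with a virtual-head.
stream = [(HEAD, None), (BODY, 'a'), (VIRT_TAIL, 'b'),
          (VIRT_HEAD, None), (BODY, 'c'), (TAIL, 'd')]
assert reassemble(stream) == [['a', 'b', 'c', 'd']]
```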
3.6 Analytical model
To anticipate the performance impact of dynamic packet fragmentation, we derived an analytical model of VC utilization and throughput for both the baseline and fragmentation routers. Following the characterization of VC utilization (VCU) proposed in [77], VCU is defined as
defined as
,
) (
1
H
s L
VCU
H
s
∑
=
= 1 0 ≤ ≤ VCU
Eq. 3.1
where $L(s)$ indicates whether the VC is occupied in cycle $s$: $L(s) = 1$ when the corresponding VC is occupied during cycle $s$, and $L(s) = 0$ when the VC is released during cycle $s$. $H$ is the window size, sampled in router clock cycles.

As mentioned earlier, the conventional VC router allocates a VC on a per-packet basis and holds it until the packet exits the VC buffer. The VC allocation time can be expressed as the number of cycles the VC is held ($L(s) = 1$). Therefore, VCU as expressed in Eq. 3.1 incorporates inefficiently used cycles from blocking scenarios, such as empty buffer stalls and credit stalls.
On the other hand, the proposed dynamic packet fragmentation scheme eliminates the two blocking scenarios by fragmenting the blocked packet. The empty buffer stall and credit stall never happen since the packet is fragmented when those conditions occur, but the fragmentation adds the overhead of a virtual-header flit. For this scheme, the VCU, which counts flit forwarding only, without blocking scenarios, is defined as
$VCU = \frac{1}{H}\sum_{s=1}^{H} \bigl(L(s) - E(s) - C(s) + O(s)\bigr), \qquad 0 \le VCU \le 1$    (Eq. 3.2)
where $E(s)$, $C(s)$, and $O(s)$ indicate an empty buffer stall, a credit stall, and virtual-header overhead, respectively. Assuming $E(s) + C(s) - O(s) \ge 0$, which holds since the overhead is just one cycle per fragmentation and the number of blocked cycles is at least one (or else fragmentation would not occur), the VCU expressed in Eq. 3.2 is lower than the VCU from Eq. 3.1. Moreover, the VCU in Eq. 3.2 is a true VC utilization, avoiding unnecessary holding of the allocated VC.
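The two definitions can be contrasted on a toy per-cycle trace (illustrative numbers, not measured data):

```python
def vcu_baseline(L):
    # Eq. 3.1: fraction of cycles the VC is held, stall cycles included.
    return sum(L) / len(L)

def vcu_fragment(L, E, C, O):
    # Eq. 3.2: holding cycles minus empty-buffer and credit stall cycles,
    # plus the virtual-header overhead cycles added by fragmentation.
    return sum(l - e - c + o for l, e, c, o in zip(L, E, C, O)) / len(L)

# A 10-cycle window: the VC is held for 8 cycles, of which 3 are empty
# buffer stalls and 1 is a credit stall; one fragmentation adds a single
# virtual-header overhead cycle.
L = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
E = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
C = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
O = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
assert vcu_baseline(L) == 0.8            # Eq. 3.1 counts stalls as utilization
assert vcu_fragment(L, E, C, O) == 0.5   # (8 - 3 - 1 + 1) / 10
```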
Note that VCU is obtained by counting the number of cycles VCs are occupied within the window size $H$. VCU can then be regarded as the probability that a VC is occupied in the input buffer. If there are multiple VCs and the occupancy of each VC is independent, the probability of all VCs being occupied is the product of the individual probabilities. These probabilities can thus be derived from Eq. 3.1 and Eq. 3.2 for the baseline and fragmentation routers, respectively, as
$P_{VCO} = \left(\frac{1}{H}\sum_{s=1}^{H} L(s)\right)^{v}, \qquad 0 \le P_{VCO} \le 1$    (Eq. 3.3)
$P_{VCO} = \left(\frac{1}{H}\sum_{s=1}^{H} \bigl(L(s) - E(s) - C(s) + O(s)\bigr)\right)^{v}, \qquad 0 \le P_{VCO} \le 1$    (Eq. 3.4)
where $P_{VCO}$ is the probability that all VCs are occupied and $v$ is the number of VCs per input port.
Based on the analysis proposed in [23], the average amount of time, $x_i$, that a packet waits to acquire a VC at node stage $i$ is proportional to the probability of all VCs at the subsequent node being occupied, $P_{VCO}$. Since the fragmentation router has a lower probability of all VCs being occupied than the baseline router, it exhibits a lower average waiting time at each node.
From the perspective of throughput, the maximum throughput $\lambda_{max}$ of a network using VCs can be determined by solving the equation [23]

$\lambda_{max} = \frac{1}{t_0 + x_{n-1}}$    (Eq. 3.5)
where $n$ is the number of stages a packet traverses from source (stage $n-1$) to destination (stage $0$), and $t_0$ is the service time at the destination. Since the fragmentation router incurs a lower average waiting time at each node, it enables higher throughput than the baseline router.
The derived analytical performance model demonstrates that dynamic packet fragmentation reduces the VC occupation probability and the packet waiting time at each node; by Eq. 3.5, the reduced average waiting time results in increased throughput.
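Eq. 3.5's implication can be checked numerically (the waiting times below are hypothetical):

```python
def max_throughput(t0, x_last):
    # Eq. 3.5: lambda_max = 1 / (t0 + x_{n-1}), where x_{n-1} is the
    # accumulated VC-acquisition waiting time at the source-side stage.
    return 1.0 / (t0 + x_last)

# Lower all-VC-occupancy probability -> lower waiting time x_{n-1}
# -> higher maximum throughput, which is the fragmentation router's gain.
assert max_throughput(15.0, 10.0) < max_throughput(15.0, 5.0)
```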
3.7 Simulation results
This section describes performance and power analyses of the implemented designs. A
baseline router and the proposed dynamic packet fragmentation router were developed in
synthesizable VHDL code. The codes were synthesized using Synopsys Design
Compiler, and layouts were generated with Cadence SOC Encounter targeting the Artisan
standard cell library for IBM 90 nm technology. Table 3-1 summarizes the common
router features and network parameters for synthesis and circuit simulation. The
performance of the proposed router is evaluated in cycle-accurate simulations using
synthetic workloads generated with the specified traffic patterns.
Table 3-1 Design evaluation parameters

  Topology          Mesh 4x4                         Routing         Dimension-order (XY)
  # of ports        5                                Router uArch    Two-stage
  # of VCs          4                                Link latency    1 cycle
  Flit size         128 bits                         Packet length   15 flits
  Buffer per port   32 (8-entry depth per VC)
  Traffic workload  Uniform random, Bit-complement, Tornado, Hot-spot
3.7.1 Synthetic traffic
We evaluate the baseline and fragmentation routers using standard synthetic traffic
patterns of uniform random (UR), bit-complement (BC), and tornado (TN). The
workloads provide various congestion metrics with well balanced load or specific link
stress. Uniform random traffic, which is commonly used in network evaluation, sends the
packets to each destination with uniform distribution, whereas the other two patterns
concentrate load on individual source-destination pairs.
In addition to the standard synthetic traffic, hot-spot (HS) traffic was modeled for the
case where several processors simultaneously access a cache line on the same node.
Sometimes, memory controllers in a large scale on-chip network are regarded as hot-
spots based on their placement in the network [1]. Hot-spot traffic is frequently observed
in various applications. In some application domains, such as graphics, communication
processing, and scientific modeling, the majority of packets are reported to have pairs of
a single source and a small number of unique destinations. Reflecting this increasingly common hot-spot traffic pattern, four of the center nodes are selected as destinations, which are five times more likely to be chosen than the other nodes in the background random traffic [14], [96].
3.7.2 Performance
This section discusses performance results based on injected traffic patterns. In each of
the performance graphs, Baseline indicates the same baseline router mentioned in Section
3, and Fragment indicates the proposed router. In the following section, we can see how
the performance is affected by dynamic fragmentation and how many packets are
fragmented as a function of traffic load. To avoid unnecessary fragmentation for credit
stalls in a 6-cycle credit round trip time, an 8-entry buffer (7-entry regular buffer with an
extra head flit buffer) per VC has been specified. So that packets can be fragmented, 15-flit packets, which are longer than the depth of the input buffer, are injected.
Figure 3-8 shows the average latency for a 15-flit packet simulation. The baseline and
fragmentation routers exhibit similar latency until the baseline saturation point over all
the traffic patterns. This is because blocking of long packets is dominant in the baseline
router. Moreover, the increased resource utilization and dynamic reaction capability of
the fragmentation router more than compensate for the added overhead of a virtual header
flit. For the throughput comparison, the fragmentation router shows a significant benefit,
achieving a throughput ranging from 37% to 75% higher than that of the baseline router
for each of the traffic patterns. As expected, the throughput improvement is significant,
and this indicates that fragmentation utilizes resources more efficiently for longer packets
at high traffic rates.
Figure 3-8 Average latency: (a) Uniform random (UR), (b) Bit-complement (BC), (c) Tornado (TN), (d) Hot-spot (HS)
To assess the impact of the dynamic nature of the proposed fragmentation scheme, the
performance of dynamic fragmentation is compared to the performance of a static
fragmentation scheme in which packets are fragmented at the point of injection. Although the results are not shown in the figure, the dynamic scheme is observed to have better performance in terms of both latency and throughput. Since the static scheme processes more header flits from source to destination than the dynamic scheme, it suffers in latency at low traffic load and reduces network throughput by generating too many short packets. This indicates that the dynamic, reactive nature of our proposed router is an important element of the design.
Figure 3-9 shows the fragmentation rate versus the injection rate of the proposed router.
The fragmentation rate is measured by counting the number of virtual header flits among
the total number of packets received at the destination nodes. A 100% fragmentation rate thus indicates that, on average, every packet is fragmented in the 15-flit packet simulation. Note that the size of a fragment is determined by the number of buffer entries; therefore, with 8-entry input buffers (7-entry regular buffer + 1 header buffer), a 15-flit packet is fragmented only once. As shown in Figure 3-9, fragmentation gradually increases as more traffic is applied, as expected. At low traffic loads, most packets are delivered to their destination nodes without fragmentation because blocking is less likely. Near the saturation point, however, the fragmentation rate approaches 100% due to congestion.
Figure 3-9 Fragmentation rate
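The metric can be expressed compactly (an illustrative sketch; the packet-record format is assumed):

```python
def fragmentation_rate(packets):
    """Virtual-header flits observed at destinations, as a percentage of
    received packets.  With 8-entry buffers a 15-flit packet fragments at
    most once, so 100% means every packet fragmented exactly once."""
    virtual_headers = sum(p['virtual_headers'] for p in packets)
    return 100.0 * virtual_headers / len(packets)

# Toy sample: three of four received packets arrived fragmented.
received = [{'virtual_headers': 1}, {'virtual_headers': 0},
            {'virtual_headers': 1}, {'virtual_headers': 1}]
assert fragmentation_rate(received) == 75.0
```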
Figure 3-10 shows the corresponding VC utilization of the four traffic patterns. VC
utilization is characterized by counting the cycles of each of the VC states, as captured
during the sampled 10,000 cycles for both designs. In the VC utilization graphs, active
indicates a state where flits are being forwarded, whereas empty and credit are states
involved in the blocking situations of empty stall and credit stall, respectively.
active+credit+empty represents the cumulative reservation, or holding time for an
occupied VC. In the fragmentation router, however, active specifies the only occupied
VC state since the fragmentation router does not have empty or credit states for an
allocated VC.
Figure 3-10 VC utilization: (a) Uniform random (UR), (b) Bit-complement (BC), (c) Tornado (TN), (d) Hot-spot (HS)
As can be seen in Figure 3-10, the fragmentation router exhibits more active states for
occupied VCs. Although the baseline router occupies VCs longer than the fragmentation
router, the fragmentation router utilizes the occupied VCs more efficiently, avoiding
unnecessary holding of VCs. In addition, the VC occupation (active+credit+empty) in the Baseline increases abruptly around the baseline saturation point, whereas active remains low. This implies that when the network traffic nears the congestion point, most flits are simply stalled in input buffers, occupying VCs; only a small portion of flits are being forwarded at their maximum rate. When the network load exceeds the saturation point, the active state rate is slightly reduced due to increased resource contention.
On the other hand, the active state in Fragment increases past the baseline saturation
point and remains steady beyond the saturation point of the fragmentation router. The
increased active state in Fragment as compared to Baseline ranges from 46% to 116%.
This demonstrates that more throughput and lower contention at high traffic loads are
achieved by the fragmentation router, efficiently utilizing VCs without unnecessary VC
holding.
Flow-based fragmentation
Previous sections described packet fragmentation at the blocking scenarios of credit stall and buffer empty stall in a greedy fashion, regardless of congestion. Since packets are fragmented dynamically according to traffic conditions, each extra virtual-header flit is overhead that increases packet latency and network load. Intelligence in the fragmentation decision is thus necessary to fragment selectively while maintaining high network throughput.
In future many-core processors, NoCs are a critical shared resource that supports multiple flows running concurrently. In a hot-spot traffic scenario, a few flows can dominate the network by occupying VCs and block other flows, preventing idle VCs on different routes from being utilized. This interference between heavy and light flows causes early saturation and degrades network throughput.
Previous research efforts have suggested various flow-aware allocation schemes, such as static VC allocation, circuit-switched networks, and limiting VC allocation according to flow [10], so that multiple flows are routed to their destinations while avoiding resource contention. Some of these schemes provide QoS (Quality of Service) by guaranteeing minimum required bandwidth, but overall throughput is degraded by limiting the VCs available to a specific flow. Unlike the previous work, flow-based fragmentation allows flows to occupy any VC. There is no limitation on VC allocation, which improves network throughput, until a router detects congestion through local estimates. The local router state detects congestion of a specific flow and selectively fragments the packet that belongs to the congested flow. The fragmented packet releases its hold on an output VC, thereby allowing other flows to use the freed VC. The non-congested flow is then forwarded to the next router, avoiding the blocking.
To detail flow-based fragmentation, the flow must first be defined. We identify a flow as a packet stream destined for the same node. Although flows could be identified explicitly at a finer granularity according to the applications from a high-level system view, we limit the scope of a flow to the packets' final destination and do not consider interference among same-destination flows belonging to different applications.
As a congestion detection metric, we use local router state, which detects credit stalls at high traffic load. When a certain flow fills up every available VC, a common case in hot-spot traffic, the router enables fragmentation on a credit stall, targeting the congested flows that hold the output VCs. After fragmentation, the output VCs are utilized by other flows as soon as they are released, and the trailing part of the fragmented packet gets the lowest priority in the next arbitration. With selective fragmentation of the congested flow, the proposed router improves load balancing and increases network throughput.
Figure 3-11 shows performance results of the flow-based fragmentation router. In hot-
spot traffic with three times more traffic destined to the hotspot nodes, both hotspot and
non-hotspot packet latency is measured. The hotspot nodes are arbitrarily located at node
0 and 63, assuming two memory controllers are located at the edge of the chip, and 5-flit
packets are injected into the 4-entry buffer routers. As can be seen in the figure, the
fragmentation router improves the network throughput for both hotspot and non-hotspot
flows. A hotspot flow is fragmented upon a credit stall and releases the output VCs to
allow a non-hotspot flow to use the freed VCs. The non-hotspot flow is forwarded to its destination, avoiding the tree saturation caused by the hotspot flow; the reduced latency of the non-hotspot flow in turn helps improve the throughput of the hotspot flow.
Figure 3-11 Performance results of hotspot traffic
3.7.3 Place and route
The Fragment and Baseline routers were synthesized with clock gating to minimize
dynamic power. Table 3-2 demonstrates a breakdown of the router areas and critical path
delays from the synthesis results. Placed & routed layouts were then generated from
standard cell netlists. A general back-end flow was followed using Cadence SOC
Encounter with IBM 90nm technology targeting an Artisan standard cell library. Figure
3-12 shows the layout picture of the fragmentation router. This section provides accurate
timing, area, power, and energy analyses based on the generated layouts.
Table 3-2 Router synthesis results

                       Baseline                          Fragment
  Critical path delay  2.3 ns                            2.3 ns
  Total cell area      326100.1 µm²                      328409.5 µm²
    Flit buffer        60%                               42%
    Header buffer      0%                                12%
    Crossbar           7.4%                              7.6%
    VC ctrl            14%                               18%
    Other              18.6%                             20.4%
  Layout dimension     78.53% density in [680x680] µm²   77.02% density in [680x680] µm²
Figure 3-12 Fragmentation router layout
Critical path & area
The Fragment router is measured to have the same clock period as the Baseline router. Although the Fragment router has slightly more complex control logic for fragmentation, its wormhole-like characteristic, which simplifies the switch path setup, amortizes the critical path of the SAVA stage. As can be seen in Table 3-2, data-path elements (flit buffers, header buffer, and crossbar) dominate the control elements, taking around 60% of the entire area in both designs. This data-path dominance justifies the slightly more complex flow control, making the control logic overhead trivial.
Based on the synthesized results with a 2.3 ns clock period target, both designs were
placed & routed. Since the Fragment router has almost the same area as the Baseline
router for a 2.3 ns clock period synthesis target, both designs could be placed & routed in
the same die area with almost the same density.
Power
To see the impact of traffic on overall network power consumption, the network power is measured with the same workloads (UR and TN) used in the performance evaluation. The annotated switching activities are captured and reflected in the power calculation using Synopsys Power Compiler under typical operating conditions (1.0V, 25℃). Figure 3-13 shows the power dissipation versus various traffic rates. Given that the main overhead of the Fragment router is its more complex control logic, the additional power can be attributed to increased overall switching activity from utilizing otherwise-empty buffers.
Figure 3-13 Average power consumption: (a) Uniform random (UR), (b) Tornado (TN)
Energy
Power alone can be a misleading metric; the correlation of power and performance must be considered to evaluate the true merit of a design. The energy cost graphs in Figure 3-14, therefore, are a better reflection of design efficiency. Energy per packet for the applied synthetic workloads is obtained by multiplying the latency, the clock period, and the average power consumption [9]. As can be seen in Figure 3-14, the Fragment router consumes slightly more energy than the Baseline at low injection rates. At high injection rates, however, the Fragment router consumes much less energy than the Baseline router.
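The energy-per-packet figure of merit combines the three measured quantities; the numbers below are placeholders, not the reported results:

```python
def energy_per_packet(latency_cycles, clock_period_ns, avg_power_mw):
    # Energy (pJ) = latency (cycles) x clock period (ns) x power (mW),
    # since 1 mW x 1 ns = 1 pJ.
    return latency_cycles * clock_period_ns * avg_power_mw

# Hypothetical: 40-cycle packet latency at the 2.3 ns clock, 50 mW average.
energy_pj = energy_per_packet(40, 2.3, 50)
assert abs(energy_pj - 4600.0) < 1e-6
```

This is why a design with modestly higher power can still win on energy: shorter latency at high load shrinks the latency factor faster than the power factor grows.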
3.8 Conclusions
A dynamic packet fragmentation router has been presented and evaluated in terms of
performance, power, and energy. The fragmentation router increases performance in
terms of latency and throughput up to 30% and 75%, respectively. Moreover, simulation
results indicate an energy savings as well. Since dynamic fragmentation reacts to traffic
conditions, it increases VC utilization and relieves congestion. We believe that the
presented idea opens up another research area for flow control mechanisms in NoC router
design and have shown how it can be applied for deadlock avoidance in multicast
routing.
Figure 3-14 Energy consumption: (a) Uniform random (UR), (b) Tornado (TN)
Chapter 4 Multicast Routing with Dynamic Packet
Fragmentation
4.1 Overview
Networks-on-Chip (NoCs) have become a critical design factor as chip multiprocessors (CMPs) and systems on a chip (SoCs) scale up with technology. Beyond the fundamental benefits of high bandwidth and scalability in on-chip networks, a newly added multicast capability can further enhance performance by reducing the network load and facilitating coherence protocols of many-core CMPs. This chapter proposes a novel multicast router
with dynamic packet fragmentation in on-chip networks. Packet fragmentation is
performed to avoid deadlock in blocking situations, releasing the hold of an output virtual
channel (VC) and allowing another packet to use the freed VC. From circuit simulation
of the design implemented with IBM 90nm technology, the proposed router reduces
latency by 38.6% and consumes 9% less energy than a unicast baseline router at the
baseline saturation.
4.2 Introduction
The increased capability of VLSI technology makes it possible to build many-core
architectures such as chip multiprocessors (CMPs) on a single die. In such chips, prior
research has shown that a well structured network-on-chip (NoC) can provide more
flexible and scalable communication with fundamental benefits of increased bandwidth
and reduced latency as compared to conventional buses or point-to-point links. However,
there are still critical challenges in terms of power and latency [81]. Prior work [69], [81],
[35] proposed that hardware multicast support could enhance performance further and
enable the development of promising protocols for large numbers of cores. A single
packet with multiple destinations avoids unnecessary packet replications traversing the
network. The reduced network load yields a decreased average packet latency and power
savings. Thus, hardware support for multicast routing is essential for on-chip networks to
overcome latency and power consumption limitations in architectures where the
frequency of multicast operations is significant.
In this chapter, we present a novel multicast router for networks-on-chip. Deadlock-free
multicast routing is achieved by fragmenting packets in particular blocking situations.
Normally, deadlock avoidance in multicast routing is accomplished through the provision of extra virtual channels (VCs), but under the strict constraints of area and power, additional VCs are not necessarily an ideal solution [15], [12]. The proposed dynamic
packet fragmentation offers deadlock avoidance without additional VCs and increases
VC utilization. Since fragmentation prevents a packet from holding a channel when it has
no flits available for traversal due to upstream blocking conditions, not only are deadlock
conditions resolved but there is also potential for increased channel utilization. Once a
packet is fragmented, the last traversing flit of the fragmented packet releases the hold of
a VC thereby allowing other packets to use the freed output VC. Thus, fragmentation
prevents upstream blocking from propagating to downstream routers. Congested
upstream routers do not force inputs of downstream routers to throttle to a reduced rate;
instead the input of a downstream router can be utilized by another packet.
The proposed router has been implemented with synthesizable VHDL code, and a
physical layout was generated. From circuit simulation with synthetic workloads, the
performance and average energy consumption were measured and analyzed, compared to
a baseline unicast router. The results show that the proposed multicast router is superior
to the baseline design, reducing latency by 38.6% and energy consumption by 9% and
also providing 30% more throughput.
This chapter is organized as follows. Section 4.3 describes various multicast routing
schemes. Section 4.4 covers implementing tree-based multicast routing for write
invalidation messages. Section 4.5 discusses the router architecture and operation of the
proposed multicast routing scheme with dynamic packet fragmentation. In section 4.6,
performance and energy analyses are presented, and section 4.7 summarizes the chapter.
4.3 Multicast routing schemes
This section reviews prior hardware-based multicast routing schemes in off-chip and on-
chip routers. Several recent proposals have suggested circuit-switched multicast routing
schemes for NoCs. In Virtual Circuit Tree Multicasting (VCTM) [35], multicast data flits
follow paths set up by multiple unicast+setup packets. VCTM outperforms a unicast
packet-switched baseline in various environments, but multicast routing is achieved with
the cost of a CAM-based virtual circuit tree. Lu, et al., suggested connection-oriented
multicasting in wormhole switched networks [69]. This scheme exhibits better
throughput than unicast routing, but the multicast group establishment is a significant
overhead.
In packet-switched multicast routing, two major techniques are suggested: path-based and
tree-based. In path-based routing, the source node partitions the destinations into several
ordered lists. The multiple packet worms then visit destinations in a predefined order
through disjoint paths. Since the multicast packet visits all intermediate nodes until it
reaches the last destination, path-based routing typically results in high latency. In tree-
based routing, the multidestination packet is routed along a common path, and the packet
is copied into different channels at branches. However, tree-based multicast routing is
susceptible to packets blocking at branch nodes, where such blocking increases the
probability of deadlock in wormhole routed networks.
Prior literature suggested viable solutions to avoid deadlock by partitioning and
systematically allocating virtual channels [15]. Kumar proposed a hardware tree-based
multicast routing algorithm (HTA) with deadlock detection and a recovery mechanism
[60]. When a potential deadlock is detected, the packet is routed to the deadlock queue
and reinjected into the network after a predetermined amount of time. Tree-based
multicast with branch pruning [70] solved deadlocks by controlling multicast through a
pruning mechanism. An auxiliary buffer associated in each input channel holds data flits
until the tail flit exits a node. The auxiliary buffer makes it possible for multidestination
worms to branch since it prunes off the tree when any branch is blocked. The tree-based
multicast with branch pruning performs well, but the auxiliary buffer size is a limitation.
4.4 Implementing tree-based multicast routing for write invalidation
messages in networks-on-chip
Prior to applying dynamic packet fragmentation to the multicast routing, a tree-based
multicast routing for write invalidation messages has been implemented in NoCs [48].
Common distributed shared memory systems using a directory-based protocol operate
with unicast messages for write invalidations. The unicast messages serialize the write
invalidation transactions, which leads to increased network traffic and latency. This
section discusses an efficient multicast router for a single-flit write invalidation message
in on-chip networks. A tree-based routing scheme is followed for multicast routing with a
bit-string multidestination encoding. We implemented the tree-based write invalidation
router targeting IBM 90nm technology. In network simulation, the proposed design
demonstrated 10.5% reduced latency and 3.2% less energy consumption than the unicast
and dual-path router.
The computing power available in a many-core chip is delivered through an effective
communication scheme. The on-chip interconnection network, which determines the
communication scheme, is critical to performance with respect to bandwidth utilization
and latency. The bandwidth demands are tightly coupled to the number of network
transactions generated, whereas the latency is governed by the number of these
transactions that are in the critical path [21]. These primary performance limiters of latency and bandwidth can be mitigated with an efficient routing scheme that supports multicast.
In distributed shared memory (DSM) systems, directory-based protocols are often used as
a scalable design due to their point-to-point communication nature. The directory
protocol maintains the possible sharers associated with an entry of its home node. When a
node issues a write request to the home node for a write miss, the home node sends
invalidation request messages to all sharers. Once the invalidation acknowledgements are
received from the sharers, the home node replies granting ownership of the block to the
requester. Even though the above transactions are intended to be performed serially, the
protocol allows the invalidation requests from a write miss to be delivered concurrently.
However, generating a separate packet for each request at the home node serializes the
transaction. This packet serialization exacerbates the miss latency and wastes bandwidth
by injecting multiple unnecessary messages into the network. In addition, the multiple
packets increase contention at the router's injection port and the occupancy of the home
node, thus creating hot-spots [22].
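To make the invalidation flow concrete, the following Python sketch models a home node's handling of a write miss under unicast invalidation; the class and method names (Directory, handle_write_miss) are illustrative, not part of any real protocol implementation. The per-sharer loop is the packet serialization point that a multicast scheme removes.

```python
# Illustrative sketch (not the dissertation's implementation): a directory
# home node handling a write miss with unicast invalidations.
class Directory:
    def __init__(self):
        self.sharers = {}  # block address -> set of sharer node IDs

    def handle_write_miss(self, block, requester, network):
        pending = self.sharers.get(block, set()) - {requester}
        # Packet serialization point: one message generated per sharer.
        for node in pending:
            network.send(src="home", dst=node, msg=("INV", block))
        return pending  # home waits for this many acknowledgements

    def handle_inv_ack(self, block, node, pending, requester, network):
        pending.discard(node)
        if not pending:  # all sharers invalidated: grant ownership
            network.send(src="home", dst=requester, msg=("GRANT", block))
```

A multicast scheme would replace the loop in `handle_write_miss` with a single multidestination packet, eliminating the serialization and the extra injection-port contention.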
To reduce the latency of the write operation, several previous works suggested multicast
communication for write invalidation. [22] proposed multidestination-based reservation
and gather worms in 2D networks. While the reserve worm delivers the invalidation
message, it reserves an acknowledgement entry in each router interface. The gather worm
then collects the acknowledgements as it passes through the routers. The set of
destinations of a multicast message is partitioned as an up-and-down column grouping;
thus, a home node requires at most two reserve worms per column for invalidation
requests.
The presented idea in this section uses a tree-based multicast scheme with a single-flit
packet. The destination sets are partitioned by the four directions (NEWS), and all
destination nodes are reached through a shortest path, in contrast to path-based
approaches, which suffer from long path latencies. A single packet is injected into the
network to cover all destinations, and the packet is replicated and forwarded to the
requested outputs at router branches. The write invalidation message consists of a single-
flit packet using a bit-string encoding for the multidestination routing header. Even
though prior tree-based multicast schemes assume small flits and multi-flit packets for
write invalidation messages, we pack the multidestination addresses into a single flit.
This capability is enabled in a NoC environment where wiring resources are
abundant. The bit-string encoding is a function of the number of reachable nodes from a
home node, and the length is independent of the number of multicast destinations. If the
destination node is out of range of a reachable network, a level of indirection is used
where the destination node is covered with a unicast packet which has a binary
destination ID. Therefore, the presented router in this section supports two types of
destination address encoding schemes: binary encoding of destination ID for unicast
routing and bit-string for multicast routing. Since a sender initiates the multidestination
packets, the bit-string with sharers’ information can be easily extracted from the
directory, making the communication start-up process simple.
In this section, we present an efficient routing decision scheme using bit-string encoding
and a mechanism of single-flit packet replication for multicast messages. The presented
idea is then compared to a dual-path-based router and a unicast router through circuit
simulation. All three designs are implemented in synthesizable VHDL code, and physical
layouts are generated. From the circuit simulation, average power and energy
consumption are measured with a small synthetic workload. The result shows
that the tree-based write invalidation multicast is superior to the other two designs by
10.5% and 3.2% for latency and energy consumption, respectively.
4.4.1 Invalidation schemes
In this section, we briefly review the three invalidation schemes mentioned in the
previous section. As shown in Figure 4-1(a), unicast routing (called baseline routing in
the rest of the section) sends multiple packets to cover all the sharers. The baseline
routing scheme spends more time creating multiple packets than the other two designs.
Figure 4-1(b) shows dual-path based routing. The home node sends the multicast packets
in two groups along directed Hamiltonian paths. Each path has destinations that can be
reached from the source node without cyclic directed dependency. However, dual-path
based routing visits all intermediate nodes until it reaches the last destination; therefore, it
results in long latency. Figure 4-1(c) shows tree-based multicast routing. The
multidestination packet is routed along a common path and the packet is copied into
different channels at branches. Tree- based routing is susceptible to the deadlock in
wormhole- routed networks because of the branching. [70] proposed a tree-based
65
multicast that solved deadlocks through a pruning mechanism. Our proposed idea
implements tree-based multicast, but we assumed single-flit packets for the multicast
message since wire availability in NoC enables the use of single-flit packets. The single-
flit multidestination packet never requests multiple channels in a single cycle in the
proposed scheme, so it is not susceptible to deadlock.
Figure 4-1 Three invalidation routing schemes: (a) unicast (base), (b) dual-path, (c) tree-based
(Figure format adapted from [43])
4.4.2 Proposed router architecture
Header flit format
A good multi-address encoding scheme is needed to minimize the message header length
overhead and ease routing decisions. Bit-string encoding is a good option to limit the size
of a header in a small network, but it increases the overhead when the number of
destinations is large. The proposed idea uses bit-string encoding for a reachable network
limited to 16 nodes. For networks larger than this, a level of indirection is needed where a
unicast packet must be sent. Figure 4-2 shows header flit formats with the bit-string
encoding for multicast packets and binary destination encoding for unicast packets. The
bit-string is a simple bit vector, where each set bit identifies the destination at that bit
position. A type field identifies a flit's role in a packet as head, body, tail, or head-tail.
The head-tail type indicates a single-flit packet, which is used as a write
invalidation in this section. Since the head flit carries the packet’s routing information,
the head flit is handled differently from the body and tail flits. The head flit allocates
channel state for the packet and places the acquired channel ID in the virtual channel ID
(VCID) field. Body and tail flits have no routing or sequencing information, so they
simply follow the head flit along its route in-order.
Figure 4-2 Head flit encoding formats with 128-bit flit size. A type field indicates the flit as head,
body, tail, and head-tail
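As a hedged illustration of the two header encodings, the following Python sketch encodes and decodes the destination fields. The 16-node bound and the field names follow the discussion above, while the exact bit positions within the 128-bit flit are not specified here and are left abstract.

```python
# Illustrative flit type codes and header encodings (actual bit layouts in
# the 128-bit flit are not reproduced here).
HEAD, BODY, TAIL, HEAD_TAIL = 0, 1, 2, 3

def encode_multicast_header(destinations):
    """Bit-string encoding: bit i is set iff node i is a destination.
    The length depends only on the reachable network size (16 nodes),
    not on the number of destinations."""
    bits = 0
    for d in destinations:
        assert 0 <= d < 16, "out-of-range nodes need a unicast fallback"
        bits |= 1 << d
    # A write invalidation is a single-flit packet, hence head-tail type.
    return {"type": HEAD_TAIL, "dest_bits": bits}

def encode_unicast_header(dest_id):
    """Binary encoding of a single destination ID."""
    return {"type": HEAD, "dest_id": dest_id}

def decode_destinations(header):
    bits = header["dest_bits"]
    return [i for i in range(16) if bits & (1 << i)]
```

Note how the multicast header for four destinations is no longer than for one; a unicast packet with a binary destination ID handles nodes outside the 16-node reachable range.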
Pipeline stages and routing computation
This section briefly summarizes the pipeline stages (Figure 4-3) of the presented router.
After arrival of the head flit, the routing computation decides an output port to which the
packet must be forwarded. The result of the routing computation is used for switch
allocation (SA) and virtual channel allocation (VA) at the next cycle. VA is accomplished
simply by finding a free output VC from a SA winner as described in [63]. The packet is
then finally transferred to the next router after traversing the crossbar.
Figure 4-3 Proposed router pipeline (BW/RC: buffer write & routing computation, SA/VA:
switch allocation & virtual channel allocation, ST: switch traverse, LT: link traverse)
At the routing computation stage, to decode the output ports for the destination nodes, a
router must be aware of the network topology and its own location. The presented
router partitions the bit-string according to the direction based on the router location.
Figure 4-4 shows how the multidestination packet is copied and forwarded to different
directions, considering a multicast message from router 5 to routers 3, 4, 10 and 15 in a
4x4 mesh network. A single packet that includes all destination nodes is generated in
source node 5 and injected into the local port of the router. Upon a flit's arrival in the
input buffer, the routing computation partitions the destinations by direction.
Each router is required to know the topology and its location inside the network. From
Figure 4-4(a), source router 5 separates the directions easily based on the bit position of
node 5 in the bit-string. All nodes to the right are hard-wired to the east output port
encoding, nodes to the left to west, nodes above to north, and nodes below to south. After
the routing computation, the produced multiport bit-string information is fed, one port
per cycle, into the route field of the input VC state. From Figure 4-4(b), the first output port encoding
(10000) is written into the route field of the input VC state at cycle 2. When SA and VA
complete, a flit is duplicated and forwarded to the east port at the ST stage with updated
destination nodes (Figure 4-4(c)). At the next cycle, the last flit is forwarded to the west
port based on the information of the output port encoding (01000). The multidestination
packet is forwarded to the proper output port one by one in a pipelined fashion until all
requested output ports are satisfied.
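The direction-based partitioning above can be sketched as follows, assuming row-major node numbering in a 4x4 mesh (node 0 at the top-left) and X-Y dimension-order routing; the function and port names are illustrative, not the router's actual signal names.

```python
# Sketch of the routing computation's bit-string partitioning in a 4x4
# mesh with X-Y routing. Nodes are numbered row-major; smaller row = north.
COLS = 4

def partition_bitstring(my_node, dest_bits):
    """Split a destination bit-string into per-output-port bit-strings."""
    my_row, my_col = divmod(my_node, COLS)
    ports = {"E": 0, "W": 0, "N": 0, "S": 0, "L": 0}
    for node in range(16):
        if not dest_bits & (1 << node):
            continue
        row, col = divmod(node, COLS)
        if col > my_col:
            ports["E"] |= 1 << node      # X dimension first: go east
        elif col < my_col:
            ports["W"] |= 1 << node      # or west
        elif row < my_row:
            ports["N"] |= 1 << node      # then Y: north
        elif row > my_row:
            ports["S"] |= 1 << node      # or south
        else:
            ports["L"] |= 1 << node      # arrived: deliver locally
    return ports
```

For the example of Figure 4-4 (source router 5, destinations 3, 4, 10 and 15), this partitioning sends nodes 3, 10 and 15 to the east port and node 4 to the west port, matching the two output port encodings forwarded in consecutive cycles.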
(a) Partitioned bit-string and example of source and destination routers in 4x4 mesh
network
(b) Pipelined routing of a multidestination packet
(c) Block diagram of multidestination packet forwarding
Figure 4-4 How the multidestination packet is copied and forwarded in different
directions, for a multicast message from router 5 to routers 3, 4, 10 and 15 in a 4x4 mesh
network
4.4.3 Evaluation
In the previous section, we discussed the pipeline stages of the proposed router
architecture. This section describes performance and power analyses of the implemented
designs. A unicast baseline, a dual-path multicast router and the proposed tree-based
multicast router were developed in synthesizable VHDL code. The code was synthesized using
Synopsys Design Compiler, and layouts were generated with Cadence SOC Encounter
targeting the Artisan standard cell library for IBM 90 nm technology. Table 4-1
summarizes the common router features and network parameters for synthesis and circuit
simulation. Each of the entries for the Routing row corresponds to baseline, dual-path and
tree-based router, respectively.
Table 4-1 Design evaluation parameters
Topology         4x4 mesh
Routing          X-Y / Hamiltonian / X-Y
# of ports       5
Buffer per port  16 flits (4-entry depth per VC)
# of VCs         4
Flit size        128 bits
Temp/VDD         25℃ / 1.0 V
Table 4-2 presents the area and timing results of the above three designs. The tree-based
router takes 16.6% more cell area than the unicast baseline router and 2.2% less area than
the dual-path router. The dual-path router suffers from a complicated routing computation
and related input VC control logic as compared to the other two designs. The encoded
bit-string is more difficult to decode for routing, especially for path-based routers.
Therefore, the dual-path router needs a predefined ordered list of destinations, and the
home node needs an extra preparation phase to set up that order. On the other hand, the
tree-based router decodes the bit-string relatively easily, as mentioned in the previous
section. Supporting two destination address encoding schemes causes slightly more area
and a longer clock period than the baseline router, but this overhead is trivial when
overall network energy consumption and performance benefits are considered, as
shown in the following simulation results.
Table 4-2 Area and timing analysis
                                 Baseline    Dual-Path   Tree-Based
Input Unit (µm²)                   276169       352508       343367
 (Routing comp + input VC ctrl)   (24448)      (85852)      (78146)
Output Unit (µm²)                   27149        25440        26481
Crossbar (µm²)                       9523         9502         9517
Total Cell Area (µm²)              322551       395398       386924
Layout Dimension (µm)             654x654      724x724      714x714
Cell Density                        77.6%        77.3%        77.8%
Clock Period (ns)                     3.0          3.6          3.3
The performance of the three router designs is evaluated through a small synthetic
workload. The injected network load is a mix of unicast and multicast packets, 94% and
6% respectively, which is close to the 5.1% multicast rate of the directory protocol [35].
Each unicast packet is assumed to have 8 flits, and multicast packets target all
destinations in the network. A total of 256 packets are injected into the network by each node. In the
baseline router, all multicast packets are translated into multiple individual unicast
packets; thus, a higher number of packets are inserted in the baseline router. Table 4-3
describes various performance results of the three designs with the parameters presented
in Table 4-1. The tree-based router outperforms the baseline and dual-path routers in
terms of latency and throughput. Compared to the baseline, the tree-based router shows a
10.5% latency reduction at the saturation point and 13.5% at no-load. The dual-path
router exhibits the least latency at no-load because it occupies the network for the
shortest time by sending multicast packets in two disjoint directions in this workload.
However, the dual-path router shows poor latency at the maximum injection rate:
stopping at all intermediate destination nodes leads to longer latency than the baseline
router. Even though the workload used in this evaluation is synthetic, it possesses many
of the qualities we expect in application-driven workloads. We leave the router
evaluation with an application-driven workload as future work.
Table 4-3 Latency and energy for synthetic workload
                                          Baseline   Dual-Path   Tree-Based
Maximum injection rate                      43.53%      41.46%       48.64%
Latency at max injection rate (cycles)       44.60       47.94        39.91
Latency at no-load (cycles)                  25.75       22.83        23.08
Avg power per router (mW)                     22.6        23.5         23.2
Total network energy consumption (µJ)         5.97        7.75         5.78
Using the same workload, the average power consumption of each router and total
network energy consumption for the 4x4 mesh was measured. Power analysis was
performed using Synopsys Nanosim based on layout extraction including wire delays. As
can be seen in Table 4-3, the tree-based router consumes 2.7% more average power than
the baseline. However, the power metric alone does not imply that the design with the
least average power consumes the least energy over the whole simulation. From the
observed energy consumption metric, the tree-based router consumes less energy than the
baseline and dual-path routers. The presented total energy consumption reflects pure
network energy under the workload, and it should be considered a primary design
constraint in on-chip networks. With energy becoming an ever more important factor in
chip design, the proposed tree-based router is an attractive solution for on-chip routing:
the tree-based multicast router reduces latency, and the reduced latency yields
substantial energy savings in overall chip power under the workload.
Finally, Figure 4-5 shows the layouts of the three implemented routers. The pictures are
sized proportionally to the dimensions shown in Table 4-2.
Figure 4-5 Implemented router layout
4.4.4 Conclusions
In this section, we presented a tree-based write invalidation multicast router for
networks-on-chip. The scheme reduces the write invalidation latency and the number of
packets generated. We implemented the router targeting IBM 90nm technology, and the
analysis shows that the proposed design outperforms the other designs in both latency
and energy consumption, by 10.5% and 3.2%, respectively.
4.5 Router architecture and operation
This section proposes an efficient tree-based multicast routing scheme, which can be
implemented with minimum hardware. Deadlock is avoided through a dynamic packet
fragmentation technique. The following presents more details of the architecture and
operation of the proposed router.
Within the pipeline stages, the routing function differs from unicast routing in that it may
return multiple output ports. The router discriminates between a unicast and a multicast
packet based on the number of returned output ports. For unicast packet routing, the
result of the routing computation is used for SA and VA at the next cycle. VA is
accomplished simply by finding a free output VC from the requested output port and
assigning it to a SA winner as described in [63].
However, if the routing computation returns multiple output ports, the router considers
the inserted packet a multi-destination packet. The multi-destination packet is handled as
multiple unicast packets from the SA/VA stage onward, except that each flit is deleted from the
input buffer after all the requested output ports are satisfied. To track the state of each
forwarded multi-destination flit, the input VC maintains a separate VC state per output
port. Therefore, the flits of the packet are asynchronously replicated and forwarded
independently without coordination with other branches. Figure 4-6 shows an example of
VC states and input buffer status at an input unit. The decoded output ports from the
routing computation are latched and connected to five separate VC state units. Each VC
state corresponds to the output port to which the flit needs to be forwarded and controls
the pipeline stage of the forwarded packet to that output port. If the decoded output port
from the routing computation produces a set of valid output ports (north and south in
Figure 4-6), the corresponding VC state advances to SA/VA. In contrast, the output ports
that are not valid maintain idle VC states.
Figure 4-6 Input unit
Each input VC may now generate multiple output VC and switch allocation requests
from its multiple VC state units, but another level of arbitration in the input VC grants a
single request at a time. Granted requests are forwarded to the input unit until all requests
from the multiple VC states are satisfied. Since each VC state has private head and tail
pointers into the input buffer, it keeps track of which flits have and have not been
forwarded to the corresponding output port. At the end of the packet, when the tail flit
leaves the input buffer at the ST stage, the VC states for multicast return to the idle state.
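A simplified software model of this bookkeeping is sketched below: one VC state per output port, each with a private head pointer into the shared input buffer, so that branches replicate flits independently and a flit is retired only after every active branch has forwarded it. All names are illustrative; this is a conceptual model, not the VHDL implementation.

```python
# Conceptual model of the input unit's per-output-port VC states.
class InputVC:
    def __init__(self, ports=("N", "S", "E", "W", "L")):
        self.buffer = []                       # flits awaiting forwarding
        self.state = {p: None for p in ports}  # per-port VC state (None = idle)

    def route(self, active_ports):
        # Routing computation latches the decoded output ports; only the
        # valid ones advance to SA/VA, the rest stay idle.
        for p in active_ports:
            self.state[p] = {"stage": "SA/VA", "head": 0}

    def forward_one(self, port):
        st = self.state[port]
        if st is None or st["head"] >= len(self.buffer):
            return None                        # idle port or buffer-empty
        flit = self.buffer[st["head"]]         # replicate, don't consume
        st["head"] += 1
        return flit

    def retire_forwarded(self):
        # A flit is deleted only once every active branch has forwarded it.
        done = min(st["head"] for st in self.state.values() if st)
        del self.buffer[:done]
        for st in self.state.values():
            if st:
                st["head"] -= done
```

Because each branch advances its own head pointer, a fast branch never waits for a slow one to replicate a flit; only deletion from the shared buffer is coupled across branches.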
4.5.1 Deadlock avoidance with dynamic packet fragmentation
The operation described above works fine in the absence of any blocking. Flits
proceed in a pipelined manner without stalls in the ideal case where there is no
contention. However, the router pipeline can easily be blocked in the following deadlock
situation with multiple contending multicast packets.
The entire packet cannot be forwarded if any of its branches is blocked, and output VCs
are not released until the entire packet has been forwarded. Thus, deadlock arises when
any branch is blocked by either a buffer-empty stall or a credit stall while the entire
packet is not yet stored in the buffers. Since dynamic packet fragmentation cuts off the
blocked branches, releases the held output VCs, and gives other blocked packets a chance
to use the released VCs, packet fragmentation resolves deadlock in a tree-based multicast
routing environment.
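The fragmentation trigger can be sketched as follows: a branch is cut when it hits a credit stall, or a buffer-empty stall before the tail has arrived, releasing its output VC; the remainder of the packet later continues as a new packet with its own virtual head flit. Field names are illustrative, not the router's actual signals.

```python
# Hedged sketch of the dynamic fragmentation decision for one branch.
def maybe_fragment(branch):
    credit_stall = branch["credits"] == 0
    buffer_empty = branch["flits_ready"] == 0 and not branch["tail_seen"]
    if credit_stall or buffer_empty:
        branch["output_vc"] = None           # release the held output VC
        branch["needs_virtual_head"] = True  # remainder restarts as a new packet
        return True
    return False
```

Releasing the output VC at either stall is what breaks the resource dependency cycle: a blocked competitor can now acquire the freed VC instead of waiting forever.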
Figure 4-7 shows a situation where two multicast packets request the same channels
[1,0] and [2,3], but neither is granted both channels at the same time [47]. Packet A is
granted channels [1,0] and [1,2] but not [2,3]. Packet B is granted channels [2,3] and
[2,1] but not channel [1,0]. Parts of packets A and B are replicated and forwarded to each
of the granted output channels. Since the multiple VC state units maintain separate states
for the packet forwarded to each output channel, flits can be replicated and forwarded to
the granted output channels. After the last available flit in the buffer is forwarded, the
packets cannot progress any further because the other branch (packet A in router 2 and
packet B in router 1) is blocked. The blocking of the contending packets leads to a
deadlock situation with a resource dependency cycle. Conventional solutions avoid
deadlock by either using buffers large enough to hold entire packets or increasing the
number of VCs. However, neither solution is ideal for an area-constrained on-chip
router [15], [12].
Figure 4-7 Deadlock
The proposed router avoids deadlock by fragmenting packets and releasing held output
VCs in either blocking scenario. Figure 4-8 shows how the deadlock is broken after
fragmentation. Following the fragmentation process in the blocking situations, the
branches of packets A and B at routers 1 and 2 are pruned and freed to be forwarded to
their destination nodes. With the asynchronous replication assumed in this section, the
branches of packets A and B toward routers 2 and 1 are fragmented by credit stalls,
because the packets blocked in routers 2 and 1 cannot return credits to routers 1 and 2.
The branches of the same packets toward routers 0 and 3 are fragmented by buffer-empty
stalls, because no more flits are left in the input buffers of routers 1 and 2 to be replicated
and forwarded. The output channels [1,0] and [2,3] are then released, and the fragmented
packets at routers 0 and 3 are delivered to their destination nodes as separate packets.
Now, the blocked packets shown in dashed lines at routers 1 and 2 can be granted the
requested output channels, resolving the deadlock situation.
Figure 4-8 Deadlock avoidance in asynchronous replication
The same deadlock avoidance mechanism can be applied in synchronous replication, as
shown in Figure 4-9. In synchronous replication, the branches of a multi-destination
packet can progress only when all requested output channels are granted. Since flits can
only be forwarded when buffers are available in all granted output channels, packets A
and B experience blocking due to credit stalls. Although credits from routers 0 and 3 are
returned, the branches of packets A and B at routers 2 and 1 cannot return credits because
they are blocked. As credit availability across all granted outputs is ANDed, the total
credit count cannot increase while any branch is blocked. Therefore, both branches of
packets A and B at routers 1 and 2 are fragmented by the credit stall. The released output
channels [1,0] and [2,3] can then be granted to other blocked packets by a round-robin
virtual channel allocation policy, and the deadlock is resolved.
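The forwarding condition for synchronous replication reduces to a conjunction over the granted outputs, as in this one-line sketch (names illustrative):

```python
# Synchronous replication: a flit advances only when every granted output
# port has at least one credit; conceptually the credit signals are ANDed.
def can_forward(credits, granted_ports):
    return all(credits[p] > 0 for p in granted_ports)
```

A single zero-credit branch therefore stalls the whole packet, which is exactly why the credit-stall fragmentation case applies to both branches at once in this mode.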
Figure 4-9 Deadlock avoidance in synchronous replication
The proposed fragmentation scheme provides deadlock freedom in tree-based multicast
routing. Unlike previous solutions [70], [29], it neither limits the packet size nor requires
more buffering capacity. Fragmentation simply frees resources that may be needed by
blocked branches of other multicast packets.
4.6 Evaluation
This section describes performance and energy analyses of the implemented designs. A
unicast baseline and the proposed tree-based multicast router with dynamic packet
fragmentation were developed in synthesizable VHDL code. The code was synthesized
using Synopsys Design Compiler, and layouts were generated with Cadence SOC
Encounter targeting the Artisan standard cell library for IBM 90 nm technology. Table
4-4 summarizes the common router features and network parameters for synthesis and
circuit simulation.
Table 4-4 Design evaluation parameters
Topology         4x4 mesh
Routing          Dimension-order (X-Y)
# of ports       5
Buffer per port  16 flits (4-entry depth per VC)
# of VCs         4
Flit size        128 bits
Packet length    8 flits
The performance of the proposed router is evaluated through synthetic workloads
generated with uniform random traffic. The injected network load is a mix of unicast
and multicast packets. A uniform random traffic generator adjusts the overall network load
and injection sequences of multicast packets within the injected loads. For a multicast
packet, the number of target destinations varies from 4 to 12 with randomly selected
targets. Multicast packets are translated into multiple individual unicast packets in the
baseline router case; thus, the baseline router incurs a severe serialization penalty.
Figure 4-10(a) shows the average packet latency versus injected loads at a 10% multicast
rate. Based on prior literature [70], [35] assuming various multiprocessor environments,
we selected a 10% multicast rate for all simulations, but maintaining coherence for large
numbers of nodes in future coherence protocols may produce a much larger multicast rate.
From the performance graph, the proposed multicast router outperforms the unicast
baseline in terms of latency and throughput. The latency is reduced by 46% near the
saturation point and 14% at zero-load. A 30% throughput improvement is also observed
from the multicast routing. With multicast routing, even with fragmentation increasing
the overhead with virtual head flits, the proposed router shows better performance
because the number of packets in the network is reduced. To get a sense of how often
fragmentation occurs, we logged the number of virtual head flits received at destination
nodes. We recorded an average of 1.8 virtual head flits per multicast packet received;
thus, each received multicast packet was fragmented an average of almost 2 times along
its traversal. In other experiments we observed more performance benefit at higher
multicast rates, as expected. The average packet latency is reduced by 53% at a 20%
multicast rate at the baseline saturation point. From the presented performance result, the
multicast router with dynamic packet fragmentation shows much lower latency as the
multicast rate increases, so the router is very suitable for applications with high multicast
rates.
Figure 4-10(b) shows a generated layout of the proposed router. The critical path is
measured as 3.2 ns including wire delays. From a layout comparison between the
baseline and multicast routers, the proposed design occupies slightly more die area and
consumes slightly more dynamic power due to the added control logic supporting
multicast. However, from the
observed relative energy consumption in Figure 4-10(c), the proposed router starts to
require less energy than the baseline around a 33% injection load and consumes 9% less
energy at the unicast saturation point (35%). The energy crossover point would move to
the left if a workload with more than a 10% multicast rate is applied.
Figure 4-10 (a) Performance (b) Multicast router layout (c) Relative energy consumption
4.7 Conclusions
This chapter presented an on-chip router with hardware support for multicast using
dynamic packet fragmentation. The scheme makes deadlock-free tree-based multicast
routing possible by resolving cyclical dependencies in resource allocation through
packet fragmentation. The proposed router reduces latency by 38.6% and consumes 9%
less energy than a unicast baseline router at the baseline saturation point. Although we
introduced dynamic packet fragmentation for resolving deadlock situations in multicast
scenarios, in future work we will explore how it can be used to improve performance
even in unicast operations.
Chapter 5 Fault-Tolerant Flow Control in On-Chip
Networks
5.1 Overview
Scaling of interconnects exacerbates the already challenging reliability of on-chip
networks. Although many researchers have provided various fault handling techniques in
chip multi-processors (CMPs), the fault-tolerance of the interconnection network is yet to
adequately evolve. As an end-to-end recovery approach delays fault detection and
complicates recovery to a consistent global state in such a system, a link-level
retransmission is endorsed for recovery, keeping the higher-level protocol simple. In this
chapter, we introduce a fault-tolerant flow control scheme for soft error handling in
on-chip networks. The fault-tolerant flow control recovers errors at the link level by
requesting retransmission and ensures error-free transmission on a flit basis, incorporating
dynamic packet fragmentation. Dynamic packet fragmentation is adopted as a part of
fault-tolerant flow control to disengage flits from the fault containment region and to
recover the faulty flit transmission. Thus, the proposed router provides a high level of dependability
at the link-level for both datapath and control planes. In simulation with injected faults,
the proposed router is observed to perform well, gracefully degrading while exhibiting
97% error coverage in datapath elements. The proposed router has been implemented
using a TSMC 45nm standard cell library. As compared to a router which employs triple
85
modular redundancy (TMR) in datapath elements, the proposed router takes 58% less
area and consumes 40% less energy per packet on average.
5.2 Introduction
As process technology scales, the integration of billions of transistors comes with an
increased likelihood of failures. Smaller-dimension circuits are increasingly sensitive to
particle strikes, raising the probability that the charge deposited by a high-energy particle
flips the value stored in a cell [95]. With technology trends of device scaling, high clock
frequencies, and supply voltage decreases, fault rates are increasing, which makes a
reliable design a real challenge. Previous literature [5], [98] has suggested various fault
handling mechanisms in complicated processor cores, but fault tolerance of the
underlying interconnection networks is yet to adequately evolve. In particular, soft error
handling in on-chip networks has not been significantly addressed, with prior work
assuming that an end-to-end recovery approach works well.
Previous chip multi-processor (CMP) research [98], [84] has assumed that the CMP
network is unreliable. As the packet can be misrouted or corrupted during transmission,
end-to-end recovery is enforced by examining an error detection code at the destination.
However, end-to-end recovery involves additional recovery messages and delays fault
detection, complicating recovery to a consistent global state in a CMP. For example, if a
packet is lost, the system triggers recovery after a given amount of time to avoid deadlock.
This increases the fault detection latency of the packet, while requiring all nodes to
coordinate their validations for checkpoints.
Unlike end-to-end recovery, an approach of link-level recovery between routers has many
benefits as presented in [24], [26], [83]. Link-level retransmission does not require large
retransmission buffers to account for timeout latency. Note that the timeout latency can
be defined as the round-trip time between source and destination plus some slack for
contention delay [98], so the storage overhead required by end-to-end schemes is not
trivial. Moreover, as link-level
retransmission eventually guarantees end-to-end delivery, no explicit acknowledgment is
necessary, enabling a simple higher-level protocol and reducing the network traffic.
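A toy model of link-level recovery between adjacent routers is sketched below: the sender implicitly retains each flit until the per-flit check passes at the receiver, and a detected error triggers a resend. This is a conceptual illustration, not the chapter's actual protocol or its VHDL implementation.

```python
# Conceptual sketch of link-level retransmission: per-flit error detection
# at the receiver, retry until the flit arrives intact.
def send_over_link(flits, link, checksum_ok):
    delivered = []
    for flit in flits:
        while True:
            received = link(flit)          # the link may corrupt the flit
            if checksum_ok(received):      # per-flit error detection code
                delivered.append(received) # implicit ack frees the sender's copy
                break
            # NACK path: the sender still holds the flit, so resend it
    return delivered
```

Because recovery happens hop by hop, delivery over the link is eventually guaranteed and no end-to-end acknowledgment message is needed, which is the traffic-reduction argument made above.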
This chapter addresses a fault-tolerant flow control scheme for soft error handling in
on-chip interconnection networks. The importance of this study is that it presents and
evaluates a method for protecting packets at a link-level for both datapath and control
planes with little hardware overhead, rather than relying on end-to-end recovery. The
proposed fault-tolerant scheme ensures error-free transmission on a flit basis, using
dynamic packet fragmentation upon error detection. Thus, the proposed router
disengages trailing flits from a faulty flit and safeguards the flits that follow it.
Dynamic packet fragmentation in networks-on-chip (NoCs) was first introduced in [44],
[47] for enhanced virtual channel (VC) utilization and deadlock avoidance in multicast
routing. The proposed router exploits fragmentation as a response to error detection and
renews state information by reallocating a new VC. Upon fragmentation, trailing flits can
then avoid a faulty VC which has corrupted states.
The proposed router is evaluated in erroneous environments by fault injection
mechanisms. Statistical fault injection (SFI) [76] is applied at each bit of a network link
with independent probability. At the various fault rate levels, the proposed router is
observed to perform well, gracefully degrading while it shows resilience to link errors
without missing any flits. In the intra-router error analysis, the router is observed to have
97% error coverage for datapath elements.
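The SFI mechanism can be sketched as follows: each bit of a transmitted flit is flipped independently with the same probability; the function and parameter names here are illustrative, not those of [76].

```python
import random

# Sketch of statistical fault injection on a link: every bit of a flit is
# flipped independently with probability p per transmission.
def inject_link_faults(flit_bits, width, p, rng=random.Random(0)):
    for bit in range(width):
        if rng.random() < p:
            flit_bits ^= 1 << bit   # independent single-bit upset
    return flit_bits
```

Sweeping `p` across fault rate levels and checking each received flit against its error detection code yields the kind of coverage and graceful-degradation data reported above.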
The proposed router and comparable designs are implemented using a TSMC 45nm
standard cell library. As compared to a router [20] using triple modular redundancy
(TMR) for datapath elements, the proposed router occupies 58% less area and consumes
40% less energy per packet on average.
This chapter is organized as follows. Section 5.3 describes soft error handling in the
proposed router, and Section 5.4 evaluates the scheme in various simulation
environments. The related work is discussed in Section 5.5, and Section 5.6 concludes the
chapter.
5.3 Soft error handling
In on-chip interconnection networks, soft errors can be categorized into two types based
on the location and probability of occurrence: link errors and intra-router errors [83].
Links are purely wires and are affected primarily by crosstalk, whereas the individual
router components are logic circuits in which errors take the form of transient faults,
which may also corrupt state. Transient faults may not affect the results of a computation,
since masking effects (logical masking, electrical masking, and latching-window masking
[95]) reduce the manifested error probability. Although link and intra-router errors differ
in their probability of occurrence, they can be tackled together in the proposed scheme, as
described in the following discussion.
Intra-router errors can further be divided into errors in datapath components and control
planes. As datapath components handle storage and movement of flits in input pipelines,
MUXes, buffers, a crossbar, and output pipelines, any errors in datapath components are
manifested at the output; thus, they can be handled together with link errors. If an error
occurs in any of the datapath elements, the erroneous flit is detected by the error detection
code at the next hop, just as if a link error had occurred. Thus, the location at which an
error occurs in the datapath is unimportant (except for the control portions of datapath
elements, such as the selection bits of a crossbar); what matters is that datapath errors are
reflected at the output. Therefore, we divide soft errors in the network into errors in
datapath components and errors in the control plane. In the rest of this section, we focus
on how errors in the datapath and links are detected.
As packet flows are controlled by handshaking between routers in on-chip networks,
faulty flits can easily be retransmitted at the link level. The proposed router implements
link-level retransmission of faulty flits and reconstructs a packet discontinued by a fault
while preserving flit order. Based on the fragmentation scheme defined in [44], we
develop a new fault-tolerant router that exploits dynamic packet fragmentation to handle
a packet discontinued by a detected error. Fragmentation renews the state information in
the control plane through VC reallocation, preventing corrupted states from affecting the
rest of the flits.
5.3.1 Proposed router architecture
This section describes architectural details of the proposed router. Figure 5-1 features the
proposed router pipeline stages and microarchitecture with added overhead for fault
handling in gray. A combinational Cyclic Redundancy Check (CRC) unit using an 8-bit
CRC polynomial is attached at the input pipeline (for fault detection) and output of the
input unit (for new CRC generation). Although a CRC is used in the proposed router for
error checking, error correcting codes (ECC) [6] are an alternative, providing correction
as well as detection. There are trade-offs in design choices
between correction and retransmission, but the comparison of the two schemes is beyond
the scope of this chapter.
Figure 5-1 Proposed router microarchitecture and pipeline stages (preRC: pre-routing
computation, SAVA: switch & VC allocation, ST: switch traversal)
The CRC unit at the input port checks errors when a flit is received at the input pipeline,
and at the same time the flit is stored in the corresponding VC buffer. So the CRC is
performed in parallel with flit processing, and it does not impact the critical path. From
timing analysis of the synthesized design, the critical path is measured to be the VC and
switch (SW) allocation stage (SAVA) in a two-stage router pipeline. Thus, the CRC unit
does not affect the clock period because the CRC update is performed in the switch
traversal (ST) stage. Note that the VC ID and route fields of a flit are updated at the input
unit. The new CRC for a modified flit can be attached just before the flit leaves the input
unit.
The modified flit with new CRC is transferred intact throughout the router, traversing the
MUX, crossbar, and output pipeline. Therefore, the datapath from the output of the input
unit to the input pipeline at the next router is covered by the flit-level CRC.
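The per-flit check described above can be illustrated with a short software model. The sketch below is illustrative only: the dissertation does not name the exact 8-bit polynomial, so the common x^8 + x^2 + x + 1 polynomial is assumed, and the function names are invented.

```python
# Behavioral sketch of the flit-level CRC-8 (polynomial assumed: x^8+x^2+x+1).
CRC8_POLY = 0x07

def crc8(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-8 over a flit payload (MSB-first)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ CRC8_POLY) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def make_flit(payload: bytes) -> bytes:
    """Input unit: append the checksum just before the flit leaves."""
    return payload + bytes([crc8(payload)])

def check_flit(flit: bytes) -> bool:
    """Downstream CRC unit: recompute over payload + checksum; 0 means clean."""
    return crc8(flit) == 0

flit = make_flit(b"\x12\x34\x56\x78" * 4)               # a 128-bit flit body
assert check_flit(flit)                                 # error-free delivery
assert not check_flit(bytes([flit[0] ^ 1]) + flit[1:])  # single-bit error caught
```

Because the check runs on arrival, in parallel with writing the flit into the VC buffer, it sits off the critical path, as noted above.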
In order to provide full coverage of the datapath, we assume the input buffer is protected
with ECC since input buffers are primarily memory elements comprised of either SRAM
or dense latches. As a result, the flits are protected while they are transferred in the router
along the solid lines as shown in Figure 5-1.
In addition to the CRC, a header buffer is included at every VC for re-establishing a
fragmented packet. Upon error detection, the proposed router fragments the packet to
contain the error and renews the VC states. After fragmentation, body and tail flits do not
have routing information to reconstruct the trailing part of the packet. Thus a copy of the
most recent header flit should be maintained at every router which the packet passes
through until the tail flit exits the buffer. Using the buffered header information, the
fragmented packet requests a new VC and switch allocation [44]. If both requests are
granted, packet forwarding restarts by sending the buffered header (the virtual header),
and the rest of the flits follow it. As the header is used for reallocating a VC,
the critical information of the header must remain error-free. Therefore, we assume the
header buffer is also protected with ECC like the input buffers.
5.3.2 Fault-tolerant flow control
Now that a preliminary router infrastructure has been established for fault detection and
packet reconstruction, a new flow control scheme for fault handling is presented. A
conventional VC router, which adopts credit-based flow control, returns a credit when a
flit is forwarded to the next router and the corresponding buffer entry is free. The
upstream router that receives the credit then sends the next flit. This enhances buffer
recycling by signaling to the upstream router as soon as the buffer entry is empty.
However, if an error is detected at the downstream router after the buffer entry that held
the faulty flit in the current router has been overwritten by the next flit, the faulty flit
cannot be recovered unless retransmission is requested from the source node. A new flow
control scheme is therefore necessary, one that keeps each sent flit until its safe delivery
is confirmed at the downstream router.
Figure 5-2 illustrates cycles for credit return to the upstream router when the router frees
the associated buffer. In a conventional router (Figure 5-2(a)), the credit is sent at the
same time the flit is forwarded to the next router as described above. On the other hand,
the proposed router (Figure 5-2(b)) waits a turnaround time (5 cycles in the proposed
router, but turnaround time can differ based on the router pipeline stages) until the sent
flit is implicitly ACKed by the downstream router. The downstream router checks for
errors as soon as a flit is received and sends a NACK signal if an error is detected; if no
error is detected, the CRC unit simply waits for the next flit. If no NACK arrives within
the turnaround time (5 cycles), the current router knows the sent flit was delivered safely
and can return the credit on the next cycle (cycle 6). If a NACK is received, the router
starts the recovery process from the faulty flit using the proposed fragmentation
technique.
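This turnaround rule can be captured in a few lines of behavioral code. The model below is a sketch of the timing rule, not the RTL, and the class and method names are invented: hold the sent flit, return the credit at cycle 6 if no NACK arrives within the 5-cycle turnaround, and enter recovery otherwise.

```python
# Upstream-router bookkeeping for one sent flit (illustrative model).
TURNAROUND = 5  # cycles, as in the proposed two-stage router

class SentFlitTracker:
    def __init__(self):
        self.cycles = 0
        self.state = "in_flight"

    def tick(self, nack: bool) -> bool:
        """Advance one cycle; return True on the cycle the credit is returned."""
        if self.state != "in_flight":
            return False
        if nack:
            self.state = "recovering"  # NACK: start fragmentation-based recovery
            return False
        self.cycles += 1
        if self.cycles > TURNAROUND:   # implicit ACK: no NACK within 5 cycles
            self.state = "acked"
            return True                # credit returned at cycle 6
        return False

t = SentFlitTracker()
credits = [t.tick(nack=False) for _ in range(8)]
assert credits.index(True) == 5        # credit goes out on the sixth cycle

t = SentFlitTracker()
t.tick(nack=True)
assert t.state == "recovering"         # a NACK triggers recovery instead
```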
Figure 5-2 Flow control schemes: (a) credit-based flow control in a conventional VC
router; (b) fault-tolerant flow control in the proposed router
The presented fault-tolerant flow control ensures error-free flit delivery, but it limits
throughput by holding flits longer than in the conventional case. At low traffic loads,
however, the proposed router does not impact performance for an error-free scenario with
short packets. If the number of buffer entries covers the credit round trip time, packets are
transferred without delay because pipeline stages are not increased for flit forwarding and
the CRC is performed in parallel with flit processing.
The flow control described above is for the input buffers. As the header buffer is
maintained separately and cannot be occupied by regular flits, the flow control of the
header buffer should be controlled independently from the regular input buffers. For
example, when a header flit is forwarded to the next router, the corresponding header-flit
credit is returned after the implicit ACK. The upstream router may then send the next
body flit, assuming an available buffer entry at the downstream router, but the input
buffers at the downstream router may in fact be full, since the forwarded flit was the
header flit. Thus, a flit buffer might be overwritten due to overflow if this case is not
handled properly.
Overflow of the flit buffers can be prevented if an entire credit count is communicated
between routers, but such a scheme incurs wiring overhead. Dedicating an extra credit
line for the header flit in each VC is a better solution. By returning a header credit upon
header-flit forwarding, the upstream router can distinguish which type of flit was
forwarded and determine when it can send the next body flit. In the case of virtual header
forwarding, however, the header credit should not be returned, since a virtual header is
not a flit received from the upstream router but one created internally in the router. The
NACK signal does not need to be dedicated to the header flit, because the upstream
router can determine which of the sent flits was diagnosed as faulty by counting the
turnaround time.
Another technique that prevents overflow of the flit buffers is a state machine for the
credit counts. When the upstream router receives the first credit after forwarding a header
flit, it knows that the header has been safely forwarded at the downstream router and
waits for the next credit before sending the next body flit. Since the state machine costs
no extra wires between routers, we adopt this technique for header-flit flow control.
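A minimal sketch of this credit-count state machine follows; all names are invented. The key rule is that the first credit received after a header send is treated as the header-buffer credit and does not free a regular flit-buffer entry.

```python
# Illustrative per-VC credit bookkeeping at the upstream router.
class HeaderCreditFSM:
    def __init__(self, flit_credits: int):
        self.flit_credits = flit_credits      # regular flit-buffer credits
        self.header_credit_pending = False

    def send_header(self):
        # The header occupies the downstream header buffer, but its credit
        # returns on the shared credit wire, so remember to absorb it.
        self.header_credit_pending = True

    def send_body(self):
        assert self.flit_credits > 0, "must wait for a flit-buffer credit"
        self.flit_credits -= 1

    def credit_received(self):
        if self.header_credit_pending:
            self.header_credit_pending = False  # header safely forwarded;
        else:                                   # no flit entry freed yet
            self.flit_credits += 1

fsm = HeaderCreditFSM(flit_credits=4)
fsm.send_header()
fsm.credit_received()              # first credit after a header: absorbed
assert fsm.flit_credits == 4       # no regular entry freed -> no overflow
fsm.send_body(); fsm.credit_received()
assert fsm.flit_credits == 4       # a body-flit credit does free an entry
```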
Containment
Having established a flow control mechanism that detects errors at the downstream
router, containment and recovery are now considered for the fault-tolerant scheme.
Dynamic packet fragmentation is adopted as part of the fault-tolerant flow control to
disengage flits from the fault-containment VC and recover the faulty flit transmission. If
a VC is diagnosed as faulty during packet forwarding, the fragmentation technique severs
the packet and steers clear of the erroneous VC by requesting a new one. The remaining
flits can thus avoid the corrupted VC state and be transferred safely to the destination.
Errors in the control plane, such as states of the VC, can be detected via consistency
checks: protocol consistency (checking the sequence of the head, body, and tail flits),
credit consistency (checking the credits between the receiving end of the channel and the
sending end of the channel), and state consistency (checking the inputs received are
appropriate for the state) [26]. Sometimes the errors in the control plane are not
observable in the flits themselves and cannot be corrected unless the corrupted VC is
restarted and synchronized with an adjacent module. By signaling fault detection upon a
consistency mismatch, dynamic packet fragmentation can release the hold on an output
VC and renew state information by reallocating a VC. The fault-containment VC can thus
be avoided, and the spread of state corruption to adjacent modules prevented.
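As a concrete example of the first check, the head/body/tail sequence rule can be expressed as a simple monitor; the function name and flit-type encoding below are illustrative only. A legal stream is a series of packets, each a head, zero or more bodies, then a tail, and any violation would flag a control-plane fault.

```python
# Illustrative protocol-consistency monitor for one VC.
def protocol_consistent(flit_types) -> bool:
    open_packet = False
    for t in flit_types:
        if t == "HEAD":
            if open_packet:
                return False      # new head before the previous tail: fault
            open_packet = True
        elif t in ("BODY", "TAIL"):
            if not open_packet:
                return False      # body/tail with no packet in flight: fault
            if t == "TAIL":
                open_packet = False
        else:
            return False          # unknown flit type
    return not open_packet        # a packet left open is also inconsistent

assert protocol_consistent(["HEAD", "BODY", "BODY", "TAIL"])
assert not protocol_consistent(["HEAD", "HEAD", "TAIL"])   # duplicated head
assert not protocol_consistent(["BODY", "TAIL"])           # missing head
```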
Without the fragmentation and reallocation capability, errors in the control plane could
be recovered by using triple modular redundancy (TMR) or by checkpointing every state in
the VC. Note that TMR comes with a costly area overhead and checkpointing every state
requires complicated control logic. The proposed fault-tolerant scheme with dynamic
packet fragmentation efficiently handles erroneous states with little hardware overhead.
Moreover, dynamic packet fragmentation can be combined with various fault detection
schemes suggested in [54], [83], [37], [56] to build more reliable routers.
Recovery
The recovery process using fragmentation is as follows. Once an error is detected at the
downstream router, the faulty flit is used as a virtual tail flit, and the in-flight flits
already transmitted after the faulty flit are squashed at the downstream router. In the
upstream router, a VC is reallocated for retransmission, and packet forwarding restarts
from the virtual header using the buffered header information. The faulty flit is
retransmitted as the first body flit, and the remaining body and tail flits follow on the
reallocated VC.
A faulty header, however, is an exception. Faulty body and tail flits pose no problem, as
they can serve as virtual tail flits upon error detection, but a faulty header cannot be used
as a virtual tail since it is the first flit of the packet; it must simply be dropped. Therefore,
the faulty header is not transferred to the output unit, and its allocated VC is released
immediately at the next cycle.
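These recovery steps can be sketched behaviorally. The model below is illustrative, not the RTL, and the flit labels are invented: the corrupted flit closes the first fragment as a virtual tail at the downstream router, and retransmission restarts from a virtual header built from the buffered header copy, with the faulty flit resent as the first body flit.

```python
# Illustrative split of a packet [H, B1..Bn, T] at a faulty flit.
def fragment(packet, faulty_index):
    """Return (fragment kept downstream, fragment retransmitted upstream)."""
    if faulty_index == 0:
        # Faulty header: drop it and release the VC; nothing is delivered.
        return [], []
    head = packet[0]
    # Downstream: flits before the fault, sealed by the corrupted flit
    # repurposed as a virtual tail (its payload is discarded).
    delivered = packet[:faulty_index] + ["VTAIL"]
    # Upstream: virtual header from the header buffer, then the faulty flit
    # resent clean as the first body flit, then the remaining flits.
    resend = ["VHEAD(" + head + ")"] + packet[faulty_index:]
    return delivered, resend

done, resend = fragment(["H", "B1", "B2", "B3", "T"], faulty_index=2)
assert done == ["H", "B1", "VTAIL"]                 # first fragment delivered
assert resend == ["VHEAD(H)", "B2", "B3", "T"]      # retransmitted fragment
assert fragment(["H", "B1", "T"], 0) == ([], [])    # faulty header is dropped
```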
A fragmented packet can be refragmented if another error occurs for the same packet.
Note that the virtual header and virtual tail flits are created during error recovery;
intermediate routers treat them identically to normal header and tail flits, respectively.
On receiving the fragments, the destination node knows that the packet encountered an
error during routing, and a reassembly process can start. The detailed operation of
reassembly is not addressed in this chapter and is left as future work.
End-to-end recovery admits the possibility of latent errors that go undetected during
transmission; such errors may offset one another, making error detection at the end node
difficult. Link-level retransmission in the proposed router, however, detects each error
quickly before another occurs, greatly increasing the error containment capability. The
proposed router therefore not only provides dependability for the packet payload but also
safeguards the critical information of the header.
5.4 Evaluation
This section describes fault-tolerance analyses through performance, area, and power
evaluations. A baseline router and the proposed fault-tolerant router were developed in structural
VHDL code and synthesized using Synopsys Design Compiler targeting a TSMC 45nm
standard cell library. Table 5-1 summarizes the common router features and
network parameters for synthesis and simulation.
Table 5-1 Design evaluation parameters
Topology: 4x4 mesh                 Routing: dimension-order (XY)
# of ports: 5                      Router uArch: two-stage
# of VCs: 4                        Link latency: 1 cycle
Flit size: 128 bits                Packet length: 8-flit packets
Buffer per port: 16 flits (4-entry depth per VC)
Operating conditions: 0.9 V, 25°C
5.4.1 Performance
Figure 5-3 shows the average latency for 5- & 8-flit packet simulations under uniform
random traffic in an error-free scenario. Baseline indicates a generic two-stage router
without a fault-tolerant scheme, whereas FTFC models the proposed router, which
implements the fault-tolerant flow-control scheme using dynamic packet fragmentation.
The numbers in parentheses for each scheme indicate the number of buffer entries per
VC. As Baseline has 6-entry buffers per VC, the FTFC router is assumed to have 5
regular buffer entries plus an extra header buffer, for fairness of comparison.
Figure 5-3 Performance
In the 5-flit packet simulation, the FTFC router performs well in an error-free scenario
without performance degradation at low traffic loads. The less flexible buffer utilization
in FTFC, which dedicates a single entry as a header copy buffer, did not affect
performance. At the saturation point, however, the FTFC router suffers an 11.4%
increase in latency, because the fault-tolerant flow control delays buffer recycling and
increases congestion at high traffic loads.
In the 8-flit packet simulation, the FTFC router underperforms in terms of latency and
throughput by 16% and 8%, respectively. The given buffer entries neither cover the
increased credit round-trip time nor buffer an entire packet; therefore, credit stalls
dominate the latency. If the number of flit buffers exceeded the credit round-trip time, or
if a unified buffer structure were used for flexible buffer utilization, the FTFC router
would achieve latency close to the Baseline at low traffic loads, as shown in the 5-flit
packet case. For the rest of the evaluation, we assume 5-flit packets to rule out
performance degradation due to credit stalls.
Some throughput is indeed sacrificed to achieve the link-level retransmission capability.
A comparison of this cost, and of error coverage, against an end-to-end retransmission
scheme is left as future work.
5.4.2 Link error analysis
This section describes a link error analysis with statistical fault injection into the RTL
model. The analysis provides an expectation of the performance impact in erroneous
environments and verifies the behavior of the proposed router when link errors are
present. Figure 5-4(a) illustrates performance levels of the proposed router versus
different error rates. Errors are injected at each network link bit with statistically
independent probability. Although crosstalk effects are not accurately modeled by
statistical independence of errors on adjacent wires [108], the independent error model
simplifies error modeling and adequately reflects different types of errors.
Figure 5-4 Performance under link errors: (a) performance at various error rates;
(b) performance versus error rate at specific traffic injection rates; (c) 3D view of
performance over traffic injection and error rates
No error describes the FTFC router without errors, whereas each error rate indicates soft
errors inserted into the same router with the corresponding random distribution. We
modeled error insertion rates ranging from 1.0E-06 (errors/wire/cycle) to 1.0E-03
(errors/wire/cycle), the latter an extreme case in which transient errors occur with high
probability, significantly stressing the network. Since there are around 10,000 link wires
in the network, this range corresponds to roughly 1 error every 100 cycles up to 10 errors
every cycle on average across the network. Simulation is performed under these
conditions until every injected packet is delivered to its destination. Among the network
links into which errors are injected, we assume that critical control signals (credit and
NACK lines) and critical fields of the flit (e.g., the valid bit and VC ID) are protected
with redundancy such as TMR; under this assumption, routers never miss returned
credits, NACKs, or valid flits.
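The injection model can be sketched directly. The code below is an illustrative SFI model, not the actual testbench: each link wire flips independently with probability p per cycle, and the expected network-wide error counts match the rates quoted above.

```python
import random

def inject(flit_bits, p, rng):
    """Flip each bit of a flit independently with probability p."""
    return [b ^ 1 if rng.random() < p else b for b in flit_bits]

WIRES = 10_000                                  # approximate link wires in the network
assert abs(WIRES * 1e-6 - 0.01) < 1e-12         # ~1 error per 100 cycles
assert abs(WIRES * 1e-3 - 10.0) < 1e-9          # ~10 errors per cycle

rng = random.Random(0)
clean = [0] * 128                               # one 128-bit flit
assert inject(clean, 0.0, rng) == clean         # p = 0: never corrupts
assert all(b == 1 for b in inject(clean, 1.0, rng))  # p = 1: flips every bit
```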
As shown in Figure 5-4(a), the proposed router degrades gracefully in performance
without losing any packets in erroneous environments. At low error rates, the proposed
scheme shows nearly the same performance as the error-free scenario, with negligible
difference in latency, while it suffers more in latency and throughput at high error rates.
At the 1.0E-03 error rate in particular, almost every flit encounters at least one error
while traversing the network, so the network saturates at a low injection rate of 0.25.
Figure 5-4(b) demonstrates the average packet latency according to the error rates for
specific traffic injection rates. As error rates are increased, the average packet latency
gradually increases until around the 1.0E-04 error rate and starts to saturate above that. A
higher error rate increases the probability that more flits are affected, causing more
packets to be fragmented. This increases the network congestion since the fragmented
packets generate virtual header flits. Figure 5-4(c) shows a three-dimensional space of
average latency.
Figure 5-5 shows the statistics of the fragmented packets, using a logarithmic scale on the
Y-axis. The number of fragmented packets increases with increasing error rate and is
independent of the traffic rate. Since each fragmentation effectively adds to the number
of packets in the network, the network saturates earlier as more packets are fragmented
due to errors.
Figure 5-5 Fragmented packets
In this link error analysis, although we quantified the performance impact of erroneous
scenarios, a key qualitative outcome of these experiments was that the proposed router
provided full reliability in the presence of transient link errors with little hardware
overhead.
5.4.3 Intra-router error analysis
This section analyzes error coverage of the intra-router components and quantifies the
resilience of the overall design. In order to model all router states and logic accurately,
we generated a netlist of the proposed router. The netlist enables investigation of error
propagation and identifies which errors are detectable at the output. Error coverage is
thus measured by checking whether each candidate net error is reflected at the output.
Note that errors in a flit are detected at the downstream router and recovered through
retransmission, so a soft error reflected at the output flit implies a recoverable error in the
proposed router.
Among the nets connecting the synthesized cells in the netlist, we listed every possible
net on which an error can occur. The list is then analyzed to determine whether each net
error is reflected at the flit output of the router. From this list, the portions belonging to
datapath elements are extracted in Table 5-2. Although the CRC units in the proposed
router are extra logic added for error detection and are not present in the Baseline router,
errors in the CRC fields are also correctable through retransmission; hence, we included
the CRC units in the list of correctable datapath elements. As a result, 78.29% of the net
errors in the router can occur in datapath elements and are considered correctable with
the proposed scheme.
Table 5-2 Net percentages of datapath components
Input pipeline: 2.93%      CRC: 9.12%               Input MUXes: 8.5%
Crossbar: 10.85%           Output pipeline: 2.84%   Flit buffers: 44.05%
Datapath total: 78.29%     Total router nets: 47,470
Before we measure the error coverage of the intra-router logic, we evaluated the test
vectors in terms of toggle coverage. Table 5-3 shows the toggle coverage for both the
Baseline and FTFC router netlists: over 95% of the nets are toggled in both routers,
indicating that the input vectors reach almost every area of the router during online
operation. The FTFC router gets slightly less coverage than the Baseline router because
the test vectors do not in themselves generate erroneous scenarios, so the error-handling
parts of the proposed router are not triggered. Since toggle coverage indicates the
effectiveness of the test vectors, 95% toggle coverage is a good basis for the following
error evaluation.
Table 5-3 Toggle coverages
Baseline: 97.5%      FTFC: 95.4%
To quantify the error coverage of the proposed fault-tolerant scheme, we performed the
following procedures using Synopsys TetraMAX. At the beginning of the simulation,
every net is marked as an X state. As the simulation progresses, the X-marked errors
move to either a detected or an undetected state, depending on whether they are observed
at the output: a detected state means the input vector reached the error and the value
propagated to the flit output for observation, whereas an undetected state means the error
is not reflected at the outputs.
Figure 5-6 shows the error coverage profiling of each datapath element in the entire
router. Assuming the flit buffers are protected with ECC, 97% of the errors in the nets are
observed to be reflected at the output for the given datapath elements. The remaining 3%
of undetected errors on average may be regarded as masked or latent under the given test
vectors but cannot be confirmed as correctable. On analysis, the undetected errors were
mostly on nets connected to control units or not reached by the given test vectors. Even
excluding the ECC-protected flit buffers from the examined datapath elements, 94% of
datapath errors are observed to be covered.
Combining the results of the link and intra-router error analyses, the proposed router
provides good error coverage with little hardware overhead, as quantified in the
following section.
Figure 5-6 Error coverage in datapath components
5.4.4 Place & route
This section provides accurate timing, area, power, and energy analyses based on the
implemented designs. In addition to the two designs, Baseline and FTFC routers, we
implemented a comparable alternative fault-tolerant scheme, a bit-level TMR router [20].
TMR is applied only to the datapath elements of the Baseline router, to cover the same
functionality as the proposed router. The area overhead of the TMR router is therefore
three times the input pipeline, flit buffers, input MUXes, crossbar, and output pipelines
of the Baseline router, plus the voting MUXes. The voting MUXes are placed right after
the input pipeline, the same place the proposed router performs the CRC check, so intra-
router and link errors are detected at the input port and corrected with forward progress
in the TMR router.
The Baseline, FTFC, and TMR routers were then synthesized and placed & routed from
standard cell netlists. To minimize dynamic power, clock gating is applied to the flit
buffers in the three routers. A general back-end flow was followed using Cadence SOC
Encounter for the generated layouts. Figure 5-7 shows proportionally sized layout
pictures of the three implemented routers.
Figure 5-7 Router layouts
The FTFC router is measured to have the same clock period as the Baseline router
(0.7 ns), while the TMR router has a longer critical path delay (0.8 ns) due to the voting
MUXes. With forward error recovery (FER), the TMR router makes forward progress
without degradation in packet latency, but it lengthens the critical path and increases area
overhead significantly.
The extra logic in the proposed router adds only 13.7% area overhead relative to the
Baseline router and does not affect the critical path. The overhead, mostly dedicated to
control logic and CRC units, is far less than that of the TMR router. This computation
does not even count the link area, which requires twice as many additional wires in the
TMR router. Moreover, if error resiliency is considered, the FTFC router compares even
more favorably. Since an 8-bit CRC checksum detects any odd number of bit errors and
any error burst up to the checksum width [91], [56], the FTFC router provides a high
level of resiliency relative to TMR. The TMR router can be resilient to errors in up to
one third of the link and datapath bits, depending on the granularity of the TMR, but the
area increase raises the probability that an error occurs in the first place. The proposed
router with an 8-bit CRC, by contrast, requires negligible hardware overhead compared
to TMR, with a probability of missing a random error of less than 0.004 [56].
Combining the error coverage of the datapath elements with the area analysis, the error
coverage of the overall router area can be estimated. Assuming that the net-level error
coverage translates to area, 97% error coverage of the datapath implies 75.9% error
coverage of the overall router area. Note that according to the techniques analyzed in
[20], fault-tolerant router designs with various levels of protection typically incur 2-3x
area overhead. Even though the fault-tolerant capability of the proposed router as
designed is limited to the links and datapath elements, the level of overall protection
achieved is still remarkable, especially given that it is accomplished with little hardware
overhead.
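The arithmetic behind these figures can be checked directly, under the stated assumption that net-level coverage maps to area coverage; the CRC miss bound follows from the 8-bit checksum width.

```python
# Coverage arithmetic (values from Table 5-2 and the netlist analysis above).
datapath_share   = 0.7829      # fraction of router nets in datapath elements
datapath_covered = 0.97        # measured error coverage within the datapath
overall = datapath_share * datapath_covered
assert abs(overall - 0.759) < 0.001     # ~75.9% of the overall router area

# An 8-bit CRC misses a random error pattern with probability about 2^-8,
# consistent with the < 0.004 bound cited above [56].
assert 2 ** -8 < 0.004
```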
Figure 5-8 depicts a network power analysis of the presented routers. While the same
workload used in the performance evaluation is applied, a value change dump (VCD) file
capturing every switching activity is generated for each router in the network. The VCD
files are then fed to Synopsys PrimeTime PX, which analyzes power dissipation and
supports propagation of switching activity for accurate power analysis. The per-router
power values are aggregated to obtain the network power. As the figure shows, the
FTFC router consumes 35% more network power than the Baseline router but 41% less
than the TMR router on average, mainly a consequence of the FTFC router's smaller
area relative to the TMR router. The injected error rate has little effect on the power
consumption of the FTFC router.
Figure 5-8 Power consumption
Figure 5-9 compares the energy efficiency of the three routers across traffic loads; the
energy value indicates energy consumption per packet at each load. The proposed router
shows 40% lower energy than the TMR router. Although the TMR router provides FER,
which is advantageous for latency, it suffers from high energy dissipation. From the
analyses presented in this section, the proposed router clearly outperforms the TMR
router, providing remarkable dependability with far less hardware overhead.
Figure 5-9 Energy per packet
5.5 Related work
This section reviews prior schemes of handling errors in NoC routers. The MIT reliable
router [24] implements a link-level fault-tolerant protocol, the Unique Token Protocol.
This flow control keeps at least two copies of a packet in the network at all times. If the
network is broken between the two copies, each copy follows a separate path to the
destination, and the token is changed to replica for all copies. The destination node then
knows a packet is duplicated when it receives a packet with a replica token, and it
deletes the duplicates.
The reliable router is somewhat close to the proposed router in that it reconstructs a
fragmented packet, but the flow control differs because the proposed router maintains a
single copy of the packet in the network. If a packet is fragmented, the severed pieces of
what was originally a single packet are delivered to the destination separately. In the
reliable router, by contrast, if a packet is split, the packet is resent from the header, not
from the flit at which the error was detected. Since the reliable router keeps two
complete copies of the packet in the network, packet-level recovery is maintained at a
greater cost in network throughput than in the proposed scheme.
Park et al. suggested flit-based hop-by-hop (HBH) retransmission upon link or
intra-router errors [83], assuming additional retransmission buffers per VC to perform
the link-level retransmission. Although they use the retransmission buffers for purposes
beyond link protection, such as deadlock recovery, the buffers are still overhead and the
control logic is complicated.
5.6 Conclusions
This chapter presented a fault-tolerant NoC router that recovers faulty flits through
link-level retransmission. We demonstrated how a faulty flit is fragmented and
retransmitted in a fault-tolerant flow-control scheme, and evaluated the design under
various workload environments. The proposed router performs well, gracefully
degrading in erroneous environments and providing a remarkable level of reliability
with less hardware overhead than an alternative TMR design. The datapath elements
achieve 97% error coverage with a 13.7% area overhead compared to a baseline router.
Compared to a bit-level TMR router, the proposed router occupies less than half the
area and consumes 40% less energy per packet on average.
Chapter 6 Conclusions
Dynamic packet fragmentation has been explored for various uses, including fault
handling and increased VC utilization for improved performance in on-chip routers.
The fragmentation technique has been applied across multiple NoC research domains
and demonstrated as an enabler of viable solutions to challenging issues in on-chip
interconnection networks with minimal hardware overhead. Using this technique, a
packet is fragmented when certain blocking scenarios are encountered, and the VC
allocated to the blocked packet is then released for use by other packets. The resulting
efficient VC utilization provides more flexible flow control, preventing a blocked VC
from propagating congestion to adjacent routers. A dynamic packet fragmentation
router has been presented and evaluated in terms of performance, power, and energy.
The fragmentation router improves latency and throughput by up to 30% and 75%,
respectively.
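The core mechanism, fragmenting a blocked packet so that its VC can be released, can be sketched roughly as follows. The flit-type tuples and the `fragment_on_block` helper are illustrative assumptions rather than the router's actual RTL; the sketch only shows how the in-flight flits are closed off behind a tail (freeing the VC) and the trailing flits are regenerated behind a new head so they can re-acquire a VC later.

```python
# Minimal, hypothetical model of dynamic packet fragmentation on blocking.
# A flit is modeled as a (type, sequence-number) pair; VC bookkeeping is implied.

HEAD, BODY, TAIL = "head", "body", "tail"

def fragment_on_block(in_flight, remaining):
    """Close the current fragment and release its VC when the packet blocks.

    in_flight: flits already forwarded downstream (non-empty).
    remaining: flits still held in the blocked VC.
    Returns (closed_fragment, new_fragment), where the new fragment starts
    with a regenerated head flit so it can re-acquire a VC on its own.
    """
    # Retype the last forwarded flit as a tail, terminating the fragment
    # cleanly so the downstream VC state can be released.
    closed = in_flight[:-1] + [(TAIL, in_flight[-1][1])]
    if not remaining:
        return closed, []
    # The trailing part becomes an independent packet with its own head.
    new = [(HEAD, remaining[0][1])] + remaining[1:]
    return closed, new

pkt = [(HEAD, 0), (BODY, 1), (BODY, 2), (BODY, 3), (TAIL, 4)]
closed, new = fragment_on_block(pkt[:2], pkt[2:])
assert closed == [(HEAD, 0), (TAIL, 1)]   # fragment ends cleanly; VC freed
assert new[0] == (HEAD, 2)                # trailing part becomes its own packet
```

Because both fragments are self-delimiting packets, each can compete for VCs independently, which is what prevents the blocked packet from holding its VC and propagating congestion upstream.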
In tree-based multicast routing, fragmentation has been shown to solve the deadlock
problem without requiring additional buffering capacity. The scheme enables deadlock-
free tree-based multicast routing because it resolves cyclic dependencies in resource
allocation through packet fragmentation: fragmentation frees resources that may be
required by blocked branches of other multicast packets. The proposed router reduces
latency by 38.6% and consumes 9% less energy than a unicast baseline router at the
baseline saturation point.
Using packet fragmentation in fault-tolerant flow control helps recover faulty flits
through link-level retransmission. The proposed fault-tolerant scheme ensures error-
free transmission on a flit basis, applying dynamic packet fragmentation upon error
detection. Fragmentation renews the state information in the control planes through
VC reallocation, preventing corrupted states from affecting the rest of the flits. Thus,
the proposed router disengages the subsequent flits from the faulty flit and safeguards
them. We demonstrated how a faulty flit is fragmented and retransmitted in a fault-
tolerant flow-control scheme, and evaluated the design under various workload
environments. The proposed router performs well, gracefully degrading in erroneous
environments and providing a remarkable level of reliability with less hardware
overhead than an alternative TMR design. The datapath elements achieve 97% error
coverage with a 13.7% area overhead compared to a baseline router. Compared to a
bit-level TMR router, the proposed router occupies less than half the area and
consumes 40% less energy per packet on average.
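As a rough illustration of the link-level retransmission side of this scheme (the fragmentation of the trailing flits is omitted), the following sketch uses a simple even-parity check as a stand-in for the actual error-detection code; the function names and the ack/nack handshake are assumptions for illustration, not the router's real signaling.

```python
# Hypothetical sketch of link-level flit retransmission: the receiver checks
# each flit and nacks a corrupted one, prompting the sender to retransmit.

def parity(bits):
    # Even parity over the data bits.
    return sum(bits) % 2

def send_flit(bits):
    # Sender appends the parity bit before driving the link.
    return bits + [parity(bits)]

def receive_flit(word):
    bits, p = word[:-1], word[-1]
    if parity(bits) != p:
        return None, "nack"      # faulty flit: request link-level retransmission
    return bits, "ack"

word = send_flit([1, 0, 1, 1])
corrupted = word[:]
corrupted[1] ^= 1                # single-bit upset on the link
data, resp = receive_flit(corrupted)
assert data is None and resp == "nack"    # faulty flit rejected, not forwarded
data, resp = receive_flit(word)           # sender retransmits the clean copy
assert data == [1, 0, 1, 1] and resp == "ack"
```

In the full scheme, the nack would also trigger fragmentation, so the flits behind the faulty one are re-headed and kept clear of the corrupted VC state.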
Based on the presented architectural model, the fragmentation router has been
implemented in synthesizable VHDL and simulated under various workloads. Our
experimental results demonstrate that dynamic fragmentation improves performance
by utilizing VCs efficiently, and saves energy as well. In erroneous environments, the
fragmentation router provides a remarkable level of reliability. The reductions in
packet latency and increases in throughput justify the fragmentation router as a
suitable choice for future NoC designs.
This Ph.D. dissertation proposed dynamic packet fragmentation, which addresses
multiple issues pertaining to efficient VC utilization, deadlock-free multicast routing,
and fault-tolerant flow control. The benefits of the fragmentation technique have been
thoroughly evaluated and analyzed across various workloads. Throughout the chapters,
we have shown how packet fragmentation can be exploited in blocking situations and
upon fault detection, and observed that fragmentation resolves these issues by releasing
held VC buffers. As there are many other research areas enabled by the concept of
dynamic packet fragmentation, we leave other issues, such as dynamic packet
reassembly, adaptive routing of fragmented packets, and more intelligent fragmentation
policies, as future work.
Dynamic packet reassembly rolls fragmented flows back into a single flow, as if no
fragmentation had ever occurred. Note that reassembly, like fragmentation, adapts to
the flow of network traffic; fragmentation combined with dynamic reassembly may be
able to adjust to network traffic better than fragmentation alone.
In adaptive routing, after a packet is split through the fragmentation process, the part
of the fragmented packet remaining in the current router need not follow the same path
as the leading part. The remaining part can avoid a faulty or congested link by using
adaptive routing. Each fragment can therefore use a different physical link for the
remainder of its path toward the destination.
Lastly, deeper evaluations of dynamic packet fragmentation need to be performed in
full-system environments. The leading part of a fragmented packet can be delivered to
its destination with lower latency than in a network that does not support
fragmentation. If critical memory information requested by a processor is stored in the
leading part of a packet, the processor can resume execution as soon as it receives the
critical information, without waiting for the trailing part of the packet to be delivered.
Fragmentation can therefore improve system performance. Such benefits of dynamic
packet fragmentation should be analyzed in full-system environments.
References

[1] Dennis Abts, Natalie D. Enright-Jerger, John Kim, Dan Gibson, and Mikko H. Lipasti, “Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs”, In Proceedings of the 36th Annual International Symposium on Computer Architecture, June 2009.

[2] Niket Agarwal, Li-Shiuan Peh, and Niraj K. Jha, “In-Network Snoop Ordering (INSO): Snoopy Coherence on Unordered Interconnects”, In Proceedings of the 15th International Symposium on High-Performance Computer Architecture, February 2009.

[3] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha, “GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator”, In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, April 2009.

[4] Thomas William Ainsworth and Timothy Mark Pinkston, “On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus”, In Proceedings of the 1st International Symposium on Networks-on-Chip, May 2007.

[5] Todd M. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design”, In Proceedings of the 32nd International Symposium on Microarchitecture, November 1999.

[6] Michael A. Bajura, Younes Boulghassoul, Riaz Naseer, Sandeepan DasGupta, Arthur F. Witulski, Jeff Sondeen, Scott D. Stansberry, Jeffrey Draper, Lloyd W. Massengill, and John N. Damoulakis, “Models and Algorithmic Limits for an ECC-Based Approach to Hardening Sub-100-nm SRAMs”, IEEE Transactions on Nuclear Science, Vol 54, p.935-945, August 2007.

[7] Ali Bakhoda, John Kim, and Tor M. Aamodt, “Throughput-Effective On-Chip Networks for Manycore Accelerators”, In Proceedings of International Symposium on Microarchitecture, December 2010.

[8] James Balfour and William J. Dally, “Design Tradeoffs for Tiled CMP On-Chip Networks”, In Proceedings of the 20th Annual International Conference on Supercomputing, June 2006.

[9] Arnab Banerjee, Robert Mullins, and Simon Moore, “A Power and Energy Exploration of Network-on-Chip Architecture”, In Proceedings of the 1st International Symposium on Networks-on-Chip, May 2007.

[10] Arnab Banerjee and Simon W. Moore, “Flow-Aware Allocation for On-Chip Networks”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[11] Luca Benini and Giovanni De Micheli, “Networks on Chips: A New SoC Paradigm”, In IEEE Computer, January 2002.

[12] Tobias Bjerregaard and Shankar Mahadevan, “A Survey of Research and Practices of Network-on-Chip”, ACM Computing Surveys, Vol 38, March 2006.

[13] Jason A. Blome, Shantanu Gupta, Shuguang Feng, Scott Mahlke, and Daryl Bradley, “Cost-Efficient Soft Error Protection for Embedded Microprocessors”, In Proceedings of International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, October 2006.

[14] Kevin Bolding, Melanie Fulgham, and Lawrence Snyder, “The Case for Chaotic Adaptive Routing”, IEEE Transactions on Computers, Vol 46, p.1281-1292, December 1997.

[15] Rajendra V. Boppana, Suresh Chalasani, and C. S. Raghavendra, “Resource Deadlocks and Performance of Wormhole Multicast Routing Algorithms”, IEEE Transactions on Parallel and Distributed Systems, Vol 9, p.535-549, June 1998.

[16] Shekhar Borkar, “Thousand Core Chips: A Technology Perspective”, In Proceedings of the 44th Annual Design Automation Conference, June 2007.

[17] Xuning Chen and Li-Shiuan Peh, “Leakage Power Modeling and Optimization in Interconnection Networks”, In Proceedings of the 2003 International Symposium on Low Power Electronics and Design, August 2003.

[18] Chi-Ming Chiang and Lionel M. Ni, “Multi-address Encoding for Multicast”, In Proceedings of the 1st International Parallel Computer Routing and Communication Workshop, May 1994.

[19] Sangyeun Cho and Lei Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation”, In Proceedings of the 39th Annual International Symposium on Microarchitecture, December 2006.

[20] Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang, Valeria Bertacco, Scott Mahlke, Todd Austin, and Michael Orshansky, “BulletProof: A Defect Tolerant CMP Switch Architecture”, In Proceedings of the 12th International Symposium on High-Performance Computer Architecture, February 2006.

[21] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, “Parallel Computer Architecture”, Morgan Kaufmann Publishers, 1997.

[22] Donglai Dai and Dhabaleswar K. Panda, “Reducing Cache Invalidation Overheads in Wormhole Routed DSMs using Multidestination Message Passing”, In Proceedings of International Conference on Parallel Processing, August 1996.

[23] William J. Dally, “Virtual-Channel Flow Control”, IEEE Transactions on Parallel and Distributed Systems, Vol 3, p.194-205, March 1992.

[24] William J. Dally, Larry R. Dennison, David Harris, Kinhong Kan, and Thucydides Xanthopoulos, “The Reliable Router: A Reliable and High-Performance Communication Substrate for Parallel Computers”, In Proceedings of the 1st International Parallel Computer Routing and Communication Workshop, May 1994.

[25] William J. Dally and Brian Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks”, In Proceedings of the 38th Annual Design Automation Conference, June 2001.

[26] William J. Dally and Brian Towles, “Principles and Practices of Interconnection Networks”, Morgan Kaufmann Publishers, 2003.

[27] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das, “Application-Aware Prioritization Mechanisms for On-Chip Networks”, In Proceedings of International Symposium on Microarchitecture, December 2009.

[28] Jose Duato, “Improving the Efficiency of Virtual Channels with Time-Dependent Selection Functions”, In Proceedings of the 4th International Conference on Parallel Architectures and Languages Europe, June 1992.

[29] Jose Duato, Sudhakar Yalamanchili, and Lionel Ni, “Interconnection Networks: An Engineering Approach”, Morgan Kaufmann Publishers, 2003.

[30] Tudor Dumitras, Sam Kerner, and Radu Marculescu, “Towards On-Chip Fault-Tolerant Communication”, In Proceedings of Asia and South Pacific Design Automation Conference, January 2003.

[31] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser, “Active Messages: A Mechanism for Integrated Communication and Computation”, In Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992.

[32] Noel Eisley, Li-Shiuan Peh, and Li Shang, “In-Network Cache Coherence”, In Proceedings of the 39th International Symposium on Microarchitecture, December 2006.

[33] Noel Eisley, Li-Shiuan Peh, and Li Shang, “Leveraging On-Chip Networks for Cache Migration in Chip Multiprocessors”, In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008.

[34] Natalie Enright-Jerger, Li-Shiuan Peh, and Mikko Lipasti, “Circuit-Switched Coherence”, In Proceedings of the 2nd International Symposium on Networks-on-Chip, April 2008.

[35] Natalie Enright-Jerger, Li-Shiuan Peh, and Mikko Lipasti, “Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support”, In Proceedings of the 35th Annual International Symposium on Computer Architecture, June 2008.

[36] Natalie Enright-Jerger, Li-Shiuan Peh, and Mikko Lipasti, “Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence”, In Proceedings of the 41st International Symposium on Microarchitecture, November 2008.

[37] Arthur Pereira Frantz, Fernanda Lima Kastensmidt, Luigi Carro, and Erika Cota, “Dependable Network-on-Chip Router Able to Simultaneously Tolerate Soft Errors and Crosstalk”, In Proceedings of International Test Conference, October 2006.

[38] Mike Galles, “Spider: A High-Speed Network Interconnect”, IEEE Micro, Vol 17, p.34-39, January 1997.

[39] Paul Gratz, Changkyu Kim, Robert McDonald, Stephen W. Keckler, and Doug Burger, “Implementation and Evaluation of On-Chip Network Architectures”, In Proceedings of International Conference on Computer Design, October 2006.

[40] Paul Gratz, Boris Grot, and Stephen W. Keckler, “Regional Congestion Awareness for Load Balance in Networks-on-Chip”, In Proceedings of the 14th International Symposium on High-Performance Computer Architecture, February 2008.

[41] Daniel Greenfield, Arnab Banerjee, Jeong-Gun Lee, and Simon Moore, “Implications of Rent’s Rule for NoC Design and Its Fault-Tolerance”, In Proceedings of the 1st International Symposium on Networks-on-Chip, May 2007.

[42] Yatin Hoskote, Sriram Vangal, Arvind Singh, Nitin Borkar, and Shekhar Borkar, “A 5-GHz Mesh Interconnect for a Teraflops Processor”, IEEE Micro, Vol 27, p.51-61, September 2007.

[43] Hung-Chang Hsiao and Chung-Ta King, “An Application-Driven Study of Multicast Communication for Write Invalidation”, The Journal of Supercomputing, Vol 18, p.279-304, March 2001.

[44] Young Hoon Kang and Jeff Draper, “Design Trade-offs for Load/Store Buffers in Embedded Processing Environment”, In Proceedings of the 50th International Midwest Symposium on Circuits and Systems, August 2007.

[45] Young Hoon Kang and Jeff Draper, “Precise Exception Handling in Discontinuous Control Flow Scenarios for Area-Constrained Systems”, In Proceedings of the 51st International Midwest Symposium on Circuits and Systems, August 2008.

[46] Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper, “Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[47] Young Hoon Kang, Jeff Sondeen, and Jeff Draper, “Multicast Routing with Dynamic Packet Fragmentation”, In Proceedings of the 19th ACM Great Lakes Symposium on VLSI, May 2009.

[48] Young Hoon Kang, Jeff Sondeen, and Jeff Draper, “Implementing Tree-Based Multicast Routing for Write Invalidation Messages in Networks-on-Chip”, In Proceedings of the 52nd International Midwest Symposium on Circuits and Systems, August 2009.

[49] Young Hoon Kang, Taek-Jun Kwon, and Jeffrey Draper, “Fault-Tolerant Flow Control in On-Chip Networks”, In Proceedings of the 4th International Symposium on Networks-on-Chip, May 2010.

[50] Young Hoon Kang and Jeff Draper, “Fault-Tolerant Flow Control for Control Circuitry in On-Chip Routers”, TECHCON 2010, September 2010.

[51] John Kim, William J. Dally, and Dennis Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks”, In Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007.

[52] John Kim, James Balfour, and William J. Dally, “Flattened Butterfly Topology for On-Chip Networks”, In Proceedings of the 40th Annual International Symposium on Microarchitecture, December 2007.

[53] John Kim, “Low-Cost Router Microarchitecture”, In Proceedings of International Symposium on Microarchitecture, December 2009.

[54] Jongman Kim, Dongkook Park, T. Theocharides, N. Vijaykrishnan, and Chita R. Das, “A Low Latency Router Supporting Adaptivity for On-Chip Interconnects”, In Proceedings of the 42nd Annual Design Automation Conference, June 2005.

[55] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Mazin S. Yousif, and Chita R. Das, “A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks”, In Proceedings of the 33rd Annual International Symposium on Computer Architecture, June 2006.

[56] Adan Kohler and Martin Radetzki, “Fault-Tolerant Architecture and Deflection Routing for Degradable NoC Switches”, In Proceedings of International Symposium on Networks-on-Chip, May 2009.

[57] Michihiro Koibuchi, Hiroki Matsutani, Hideharu Amano, and Timothy Mark Pinkston, “A Lightweight Fault-Tolerant Mechanism for Network-on-Chip”, In Proceedings of the 2nd International Symposium on Networks-on-Chip, May 2008.

[58] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun, “Niagara: A 32-Way Multithreaded Sparc Processor”, IEEE Micro, Vol 25, p.21-29, March 2005.

[59] Tushar Krishna, Amit Kumar, Patrick Chiang, Mattan Erez, and Li-Shiuan Peh, “NoC with Near-Ideal Express Virtual Channel Using Global-Line Communication”, In Proceedings of Hot Interconnects, August 2008.

[60] Dianne R. Kumar, Walid A. Najjar, and Pradip K. Srimani, “A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes”, IEEE Transactions on Computers, Vol 50, p.647-659, July 2001.

[61] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen, “Interconnects in Multi-Core Architectures: Understanding Mechanisms, Overheads, and Scaling”, In Proceedings of the 32nd Annual International Symposium on Computer Architecture, June 2005.

[62] Amit Kumar, Li-Shiuan Peh, Partha Kundu, and Niraj K. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric”, In Proceedings of the 34th International Symposium on Computer Architecture, June 2007.

[63] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K. Jha, “A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS”, In Proceedings of International Conference on Computer Design, October 2007.

[64] Amit Kumar, Li-Shiuan Peh, and Niraj K. Jha, “Token Flow Control”, In Proceedings of the 41st International Symposium on Microarchitecture, November 2008.

[65] Ying-Cherng Lan, Shih-Hsin Lo, Yueh-Chi Lin, Yu-Hen Hu, and Sao-Jie Chen, “BiNoC: A Bidirectional NoC Architecture with Dynamic Self-Reconfigurable Channel”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[66] Xiaola Lin and Lionel M. Ni, “Multicast Communication in Multicomputer Networks”, IEEE Transactions on Parallel and Distributed Systems, Vol 4, p.1105-1117, October 1993.

[67] Xiaola Lin and Lionel M. Ni, “Deadlock-Free Multicast Wormhole Routing in 2D Mesh Multicomputers”, IEEE Transactions on Parallel and Distributed Systems, Vol 5, p.793-804, August 1994.

[68] Zhonghai Lu and Axel Jantsch, “Flit Ejection in On-chip Wormhole-switched Networks with Virtual Channels”, In Proceedings of the IEEE Norchip Conference, November 2004.

[69] Zhonghai Lu, Bei Yin, and Axel Jantsch, “Connection-oriented Multicasting in Wormhole-switched Networks on Chip”, In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, March 2006.

[70] M.P. Malumbres, Jose Duato, and Josep Torrellas, “An Efficient Implementation of Tree-Based Multicast Routing for Distributed Shared-Memory Multiprocessors”, In Proceedings of the 8th International Symposium on Parallel and Distributed Processing, October 1996.

[71] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright-Jerger, and Yatin Hoskote, “Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives”, In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol 28, January 2009.

[72] Hiroki Matsutani, Michihiro Koibuchi, Daihan Wang, and Hideharu Amano, “Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks”, In Proceedings of the 2nd International Symposium on Networks-on-Chip, April 2008.

[73] George Michelogiannakis and Daniel Sanchez, “Evaluating Bufferless Flow-Control for On-Chip Networks”, In Proceedings of the 4th International Symposium on Networks-on-Chip, May 2010.

[74] Thomas Moscibroda and Onur Mutlu, “A Case for Bufferless Routing in On-Chip Networks”, In Proceedings of the 36th Annual International Symposium on Computer Architecture, June 2009.

[75] Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A. Wood, “Coherent Network Interfaces for Fine-Grain Communication”, In Proceedings of International Symposium on Computer Architecture, May 1996.

[76] Shubhendu S. Mukherjee, Joel Emer, and Steven K. Reinhardt, “The Soft Error Problem: An Architectural Perspective”, In Proceedings of International Symposium on High-Performance Computer Architecture, February 2005.

[77] Robert D. Mullins, Andrew F. West, and Simon W. Moore, “Low-Latency Virtual-Channel Routers for On-Chip Networks”, In Proceedings of the 31st International Symposium on Computer Architecture, June 2004.

[78] Robert D. Mullins, Andrew F. West, and Simon W. Moore, “The Design and Implementation of a Low-Latency On-Chip Network”, In Proceedings of the 11th Asia and South Pacific Design Automation Conference, January 2006.

[79] Srinivasan Murali, Luca Benini, Mary Jane Irwin, and Giovanni De Micheli, “Analysis of Error Recovery Schemes for Networks on Chips”, IEEE Design and Test of Computers, October 2005.

[80] Chrysostomos A. Nicopoulos, Dongkook Park, Jongman Kim, N. Vijaykrishnan, Mazin S. Yousif, and Chita R. Das, “ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers”, In Proceedings of the 39th International Symposium on Microarchitecture, December 2006.

[81] John D. Owens, William J. Dally, Ron Ho, D.N. Jayasimha, Stephen W. Keckler, and Li-Shiuan Peh, “Research Challenges for On-Chip Interconnection Networks”, IEEE Micro, Vol 27, p.96-108, September 2007.

[82] Ruoming Pang, Timothy Mark Pinkston, and Jose Duato, “The Double Scheme: Deadlock-free Dynamic Reconfiguration of Cut-Through Networks”, In Proceedings of International Conference on Parallel Processing, August 2000.

[83] Dongkook Park, Chrysostomos Nicopoulos, Jongman Kim, N. Vijaykrishnan, and Chita R. Das, “Exploring Fault-Tolerant Network-on-Chip Architectures”, In Proceedings of International Conference on Dependable Systems and Networks, June 2006.

[84] Ricardo Fernandez-Pascual, Jose M. Garcia, Manuel E. Acacio, and Jose Duato, “A Fault-Tolerant Directory-based Cache Coherence Protocol for CMP Architectures”, In Proceedings of International Conference on Dependable Systems and Networks, June 2008.

[85] Li-Shiuan Peh and William J. Dally, “Flit-Reservation Flow Control”, In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, p.73-84, January 2000.

[86] Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers”, In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, January 2001.

[87] Timothy Mark Pinkston, Ruoming Pang, and Jose Duato, “Deadlock-Free Dynamic Reconfiguration Schemes for Increased Network Dependability”, IEEE Transactions on Parallel and Distributed Systems, Vol 14, August 2003.

[88] Valentin Puente, Ramon Beivide, Jose Gregorio, J. M. Prellezo, Jose Duato, and Cruz Izu, “Adaptive Bubble Router: A Design to Improve Performance in Torus Networks”, In Proceedings of the International Conference on Parallel Processing, September 1999.

[89] Yue Qian, Zhonghai Lu, and Wenhua Dou, “Analysis of Worst-case Delay Bounds for Best-effort Communication in Wormhole Networks on Chip”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[90] Amir-Mohammad Rahmani, Masoud Daneshtalab, Ali Afzali-Kusha, Saeed Safari, and Masoud Pedram, “Forecasting-based Dynamic Virtual Channels Allocation for Power Optimization of Network-on-Chips”, In Proceedings of the 22nd International Conference on VLSI Design, January 2009.

[91] Justin Ray and Philip Koopman, “Efficient High Hamming Distance CRCs for Embedded Networks”, In Proceedings of International Conference on Dependable Systems and Networks, June 2006.

[92] Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan, Saurabh Drolia, M.S. Govindan, Paul Gratz, Divya Gulati, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger, “Distributed Microarchitectural Protocols in the TRIPS Prototype Processor”, In Proceedings of the 39th Annual International Symposium on Microarchitecture, December 2006.

[93] Li Shang, Li-Shiuan Peh, and Niraj K. Jha, “Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks”, In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, February 2003.

[94] Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, G. Edward Suh, and Srinivas Devadas, “Static Virtual Channel Allocation in Oblivious Routing”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[95] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi, “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic”, In Proceedings of International Conference on Dependable Systems and Networks, June 2002.

[96] Arjun Singh, William J. Dally, Amit K. Gupta, and Brian Towles, “GOAL: A Load-Balanced Adaptive Routing Algorithm for Torus Networks”, In Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003.

[97] Vassos Soteriou and Li-Shiuan Peh, “Exploring the Design Space of Self-Regulating Power-Aware On/Off Interconnection Networks”, IEEE Transactions on Parallel and Distributed Systems, Vol 18, p.393-408, March 2007.

[98] Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood, “SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery”, In Proceedings of International Symposium on Computer Architecture, May 2002.

[99] Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, and Anant Agarwal, “Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures”, In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, February 2003.

[100] Michael B. Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal, “Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams”, In Proceedings of the 31st International Symposium on Computer Architecture, June 2004.

[101] Y. Tamir and G. L. Frazier, “High-Performance Multiqueue Buffers for VLSI Communication Switches”, In Proceedings of the 15th Annual International Symposium on Computer Architecture, May 1988.

[102] Anjan K. V. and Timothy Mark Pinkston, “An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA”, In Proceedings of the 22nd Annual International Symposium on Computer Architecture, May 1995.

[103] I. Walter, I. Cidon, R. Ginosar, and A. Kolodny, “Access Regulation to Hot-Modules in Wormhole NoCs”, In Proceedings of the 1st International Symposium on Networks-on-Chip, May 2007.

[104] Hangsheng Wang, Li-Shiuan Peh, and Sharad Malik, “Power-driven Design of Router Microarchitectures in On-chip Networks”, In Proceedings of the 36th Annual International Symposium on Microarchitecture, December 2003.

[105] Lei Wang, Yuho Jin, Hyungjun Kim, and Eun Jung Kim, “Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for On-Chip Networks”, In Proceedings of the 3rd International Symposium on Networks-on-Chip, May 2009.

[106] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations”, In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

[107] Bilal Zafar, Jeff Draper, and Timothy Pinkston, “Cubic Ring Networks: A Polymorphic Topology for Network On-Chip”, In Proceedings of the 39th International Conference on Parallel Processing, October 2010.

[108] Heiko Zimmer and Axel Jantsch, “A Fault Model Notation and Error-Control Scheme for Switch-to-Switch Buses in a Network-on-Chip”, In Proceedings of CODES+ISSS Conference, October 2003.
Abstract (if available)
Abstract
Networks-on-Chip (NoCs) have been suggested as a scalable communication solution for many-core architectures. As the number of System-on-Chip (SoC) cores increases, power and latency limitations make conventional buses increasingly unsuitable. Buses are appropriate for small-scale designs but cannot sustain performance as the number of on-chip cores grows. In contrast, NoCs offer the fundamental benefits of high bandwidth, low latency, low power, and scalability. ❧ NoCs have evolved to provide high-performance routers with effective resource sharing, multicast routing, and fault tolerance through various techniques. Although many prior research efforts have suggested viable techniques for tackling challenges in NoC design, none have proposed a single underlying technique that addresses resource sharing, multicast routing, and fault tolerance together. This Ph.D. dissertation proposes dynamic packet fragmentation, a technique that spans multiple NoC research domains and serves as an enabler for viable solutions to challenging issues in on-chip interconnection networks with minimal hardware overhead. Dynamic packet fragmentation addresses a broad range of subjects, from performance to fault handling. A proposed router using this technique is shown to increase virtual channel (VC) utilization for performance improvement, provide deadlock avoidance in tree-based multicast routing, and support fault-tolerant flow control for fault handling. ❧ Using this technique, a packet is fragmented when certain blocking scenarios are encountered, and the VC allocated to the blocked packet is then released for use by other packets. The resulting efficient VC utilization provides more flexible flow control, preventing a blocked VC from propagating congestion to adjacent routers. In tree-based multicast routing, fragmentation enables deadlock-free routing by resolving cyclic dependencies in resource allocation: fragmentation frees resources that may be required by blocked branches of other multicast packets. In fault-tolerant flow control, packet fragmentation helps recover faulty flits through link-level retransmission. The proposed fault-tolerant scheme ensures error-free transmission on a flit basis, applying dynamic packet fragmentation upon error detection. Fragmentation renews the state information in control planes through VC reallocation, preventing corrupted state from affecting the remaining flits. Thus, the proposed router disengages subsequent flits from a faulty flit and safeguards the flits that follow it. ❧ The implemented fragmentation router is evaluated through various simulation experiments with synthetic workloads. Performance benefits are demonstrated relative to a baseline router, and accurate power and area measurements are analyzed from a placed-and-routed layout. The results demonstrate that the fragmentation router improves latency by up to 30% and throughput by up to 75%, utilizing VCs efficiently and saving energy as well. In error-sensitive environments, the fragmentation router provides a remarkable level of reliability and is observed to perform well, degrading gracefully while exhibiting 97% error coverage in datapath elements. ❧ The reduction in packet latency and the increase in throughput thus justify the fragmentation router as a suitable choice for future NoC designs.
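The core mechanism described in the abstract can be illustrated with a short sketch. This is not the dissertation's RTL implementation; the class and function names below are hypothetical, and the model abstracts a router's VC allocator to a single ownership field. The idea shown: when a packet blocks mid-transmission, the in-flight portion is closed off with a tail flit, the VC is released for other packets, and the remainder becomes a new fragment with its own head flit.

```python
# Illustrative model (hypothetical names, not the dissertation's code):
# dynamic packet fragmentation releasing a blocked virtual channel.

class VirtualChannel:
    def __init__(self, vc_id):
        self.vc_id = vc_id
        self.owner = None          # packet ID currently holding this VC

    def allocate(self, packet_id):
        assert self.owner is None, "VC already allocated"
        self.owner = packet_id

    def release(self):
        self.owner = None


def fragment_on_block(vc, sent_flits, remaining_flits):
    """On a blocking event, close the in-flight fragment with a tail flit
    so the VC can be reallocated; the remaining flits are re-issued later
    as a new fragment with its own head flit."""
    fragment = sent_flits + ["TAIL"]         # terminate current fragment
    vc.release()                             # free the VC for waiting packets
    remainder = ["HEAD"] + remaining_flits   # remainder forms a new fragment
    return fragment, remainder


vc = VirtualChannel(0)
vc.allocate(packet_id=42)
# Packet 42 has sent its head and two body flits when it blocks:
frag, rest = fragment_on_block(vc, ["HEAD", "BODY0", "BODY1"], ["BODY2", "BODY3"])
# vc.owner is now None, so a waiting packet can acquire the VC instead of
# stalling behind packet 42 and propagating congestion upstream.
```

The same release-and-reissue step is what breaks cyclic resource dependencies in the tree-based multicast case and what discards corrupted VC state after a detected flit error.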
Asset Metadata
Creator
Kang, Young Hoon
(author)
Core Title
Dynamic packet fragmentation for increased virtual channel utilization and fault tolerance in on-chip routers
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering (VLSI Design)
Publication Date
06/02/2011
Defense Date
04/11/2011
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
fault tolerant routers,multicast routing,networks-on-chip,OAI-PMH Harvest,on-Chip Interconnection Networks,on-chip networks,on-Chip Routers
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Draper, Jeffrey T. (committee chair), Nakano, Aiichiro (committee member), Pinkston, Timothy M. (committee member)
Creator Email
yhkangkr@gmail.com,yhkangkr@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c127-615767
Unique identifier
UC1380688
Identifier
usctheses-c127-615767 (legacy record id)
Legacy Identifier
etd-KangYoungH-14.pdf
Dmrecord
615767
Document Type
Dissertation
Rights
Kang, Young Hoon
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu