DESIGN OF LOW-POWER AND RESOURCE-EFFICIENT
ON-CHIP NETWORKS
by
Lizhong Chen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
August 2014
Copyright 2014 Lizhong Chen
Abstract
Many-core processors will continue to proliferate in the next decade across the entire
computing landscape. While on-chip networks provide a potentially scalable interconnec-
tion solution for many-core chips, they present serious challenges in achieving the power
and resource-efficiency needed for satisfying the constraints of future chip multiproces-
sors. This dissertation explores the opportunities, challenges and viable solutions at the
architecture level in designing low-power and resource-efficient on-chip networks.
This research provides insight into the key factors affecting the effectiveness of pow-
er-saving techniques, particularly as it relates to power-gating on-chip network routers.
Two schemes are proposed that effectively decouple computational and communication
resources to maximize static power savings while also minimizing performance penalty
by dynamically powering on/off resources based on runtime application traffic load. The
two schemes are generally applicable to direct network and indirect network topologies,
respectively.
This research also addresses the challenging problems that remain in reducing the re-
source requirements of on-chip networks, which also affect power efficiency. This work
investigates different ways of conveying global information using only local resources to
solve a couple of difficulties that have hindered efficient flow control designs in virtual
cut-through and wormhole-switched networks for over a decade. The proposed theoreti-
cal support and implementation schemes are applicable to a broad set of network designs
to ensure freedom from both routing-induced and protocol-induced deadlock, thus having
important theoretical value and practical applications.
This research also investigates the compelling opportunities for recognizing and ex-
ploiting the emerging regional traffic behaviors exhibited in on-chip networks. The pro-
posed schemes for enhancing resource utilization, along with the previous resource min-
imizing schemes, allow more on-chip resources to be powered off dynamically, thus
widening the entire spectrum of trade-offs among power, resource and performance.
Dedication
This dissertation is dedicated to my son Daniel,
my wife Luyao, and my dearest parents.
Acknowledgments
I am deeply indebted to my advisor, Professor Timothy M. Pinkston, who has been a
true role model exemplifying what mentorship really means. His inspiring guidance, gener-
ous support, and consistent trust have contributed tremendously to the success of this re-
search. Professor Pinkston has not only drawn out my potential for conducting impactful
research, but also offered me invaluable guidance in my pursuit of career and life goals,
helping me to achieve more and go further. It is my great honor and fortune to have him
as my advisor.
Much gratitude is owed to Professor Massoud Pedram for providing me the oppor-
tunity to participate in his research group meetings. His wisdom in the discussions has
inspired fruitful collaborative works, and the presentations by other students have broad-
ened my research vision. I am also grateful to Professor Kai Hwang for sharing his life-
time of experiences, which I found very enlightening and helpful. I learned quite a lot through
taking his course, co-authoring a paper with him, and eventually contributing modestly to
his new book.
During my Ph.D. studies, I was also very fortunate to have received insightful advice
from Professors Murali Annavaram, Michel Dubois, and Jeffery Draper. It was my great
pleasure to collaborate with several outstanding graduate students, including Lihang Zhao,
Ruisheng Wang, Di Zhu and Siyu Yue. In addition, I would like to thank Professor Li-
Shiuan Peh’s group at MIT for providing timely assistance with the Orion simulator and
Dr. Yuho Jin (formerly a Postdoc in the SMART Interconnects group and now an Assis-
tant Professor at New Mexico State University) for the training he provided me on the
GEMS simulator during the early stages of my graduate study.
Many thanks go to the three other members of my guidance and dissertation committees,
Professor John Slaughter, Professor Viktor Prasanna, and Professor Aiichiro Nakano (be-
sides the aforementioned Professors Pinkston, Annavaram and Draper), for their valuable
time, feedback, and suggestions.
Special thanks also go to my numerous friends, including my fellow Ph.D. students
in the SMART Interconnects group, the SPORT lab, and the SCIP group, as well as all my
other friends. Their presence has made my five-year graduate research life a pleasant and
memorable journey.
Finally, I would like to express my sincerest appreciation to my dear wife and parents
for their unconditional love and support. Without their understanding, encouragement and
sacrifice, this dissertation would not have been possible.
Contents
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Dissertation Question and Hypothesis
1.3 Research Contributions
1.4 Dissertation Outline
2 Background and Related Work
2.1 On-chip Network Basics
2.1.1 NoC Examples in Chip-Multiprocessors and MPSoCs
2.1.2 Basic operations
2.2 NoC Power Problem and Previous Studies
2.3 NoC Resource-efficiency and Previous Studies
2.4 Summary
3 Reducing NoC Power with Effective Power-gating
3.1 NoC Power-gating and Associated Challenges
3.2 Node-Router Decoupling
3.2.1 The Basic Idea
3.2.2 Decoupling Bypass
3.2.3 Transition between Gated-on and Gated-off States
3.2.4 Asymmetric Wakeup Thresholds
3.3 Evaluation Methodology
3.3.1 Simulator Configuration
3.3.2 Workloads
3.4 Results and Analysis
3.4.1 Wakeup Thresholds
3.4.2 Impact on Static Energy
3.4.3 Reducing Power-gating Overhead
3.4.4 Impact on Dynamic Energy
3.4.5 Impact on Performance
3.4.6 Effects on Hiding Wakeup Latency
3.4.7 Behavior across Full Range of Network Loads
3.4.8 Discussion
3.5 Summary
4 Performance-aware NoC Power Reduction
4.1 Motivation
4.1.1 Need for Performance-aware Power Reduction
4.1.2 Limitations in Power-gating Mesh Networks
4.1.3 Opportunities and Challenges in Power-gating Clos NoCs
4.2 Minimal Performance Penalty Power-gating
4.2.1 The Basic Idea
4.2.2 Guaranteed Connectivity
4.2.3 Dynamic Traffic Diversion
4.2.4 Rapid Wakeup
4.2.5 Impact of MP3 on Performance and Energy
4.3 Evaluation Methodology
4.4 Results and Analysis
4.4.1 Impact on Performance
4.4.2 Impact on Router Static Energy
4.4.3 Comparison of Power-gating Overheads
4.4.4 Impact on NoC Energy
4.4.5 Comparison across Full Network Load Range
4.4.6 Effect of Rapid Wakeup
4.4.7 Wakeup Latency Tolerance
4.4.8 Discussion
4.5 Summary
5 Reducing Resources for VCT-switched Networks
5.1 Globally-Aware Flow Control
5.1.1 Pros and Cons of Globally-aware Flow Control
5.1.2 Bubble Flow Control
5.1.3 Inefficiency in Handling Protocol-Induced Deadlock
5.2 Critical Bubble Scheme
5.2.1 The Basic Idea
5.2.2 Detailed Description of the Critical Bubble Scheme
5.2.3 Impact on Implementation and Performance
5.2.4 Deadlock Freedom
5.3 Router Architecture
5.4 Evaluation
5.4.1 Simulation Methodology Using Synthetic Loads
5.4.2 Effects of Critical Bubble Scheme under Synthetic Traffic
5.4.3 Simulation Methodology Using Full System Simulation
5.4.4 Effects of CBS in Handling Message-Dependent Deadlock
5.5 Discussion
5.5.1 Scalability of Critical Bubble Scheme
5.5.2 Implementing Occupancy-Based Global Flow Control
5.6 Summary
6 Reducing Resources for Wormhole-switched Networks
6.1 Challenges in Extending BFC for Wormhole
6.1.1 Atomic Buffer Allocation in Wormhole Switching
6.1.2 Small Buffer Size
6.2 Worm-Bubble Flow Control Theory
6.2.1 A Sufficient Condition for Deadlock Freedom
6.3 Worm-Bubble Flow Control Implementation
6.3.1 The Basic Idea
6.3.2 Combating the Small-buffer Problem
6.3.3 Combating the Starvation Problem
6.3.4 Formal Description of Injection Rules
6.3.5 Deadlock Freedom of WBFC
6.3.6 Reducing Injection Delay
6.3.7 Modifications to Router Microarchitecture
6.4 Evaluation
6.4.1 Performance for 4x4 Torus Network
6.4.2 Performance for 8x8 Torus Network
6.4.3 Injection Delay
6.4.4 Performance for PARSEC Benchmarks
6.4.5 Area Comparison
6.4.6 Energy
6.4.7 Impact of Buffer Size
6.5 Applications and Extensions
6.6 Summary
7 Improving Resource Utilization by Interference Reduction
7.1 Regionalized NoC and Its Opportunities
7.1.1 Formation of Regionalized NoC (RNoC)
7.1.2 Regional Behaviors of RNoC
7.1.3 Intra-region Traffic vs. Inter-region Traffic
7.2 Problems with Existing Related Techniques
7.2.1 Region-oblivious Techniques
7.2.2 Region-aware Techniques
7.3 Region-aware Interference Reduction
7.3.1 The Basic Idea
7.3.2 Removing Restrictions
7.3.3 Enforcing Prioritization
7.3.4 Utilizing Load Heterogeneity
7.3.5 Avoiding Starvation and Deadlock
7.3.6 Putting It All Together and Router Microarchitecture
7.4 Evaluation
7.4.1 Simulation Methodology
7.4.2 Impact of Multi-stage Prioritization
7.4.3 Impact of Routing Algorithm
7.4.4 Impact of Dynamic Priority Adaptation
7.4.5 Effects of RAIR on Synthetic RNoC
7.4.6 Effects on Different Traffic Patterns
7.4.7 Effects on Applications
7.4.8 Discussion
7.5 Summary
8 Conclusions and Future Research
8.1 Conclusions
8.2 Future Research
Bibliography
List of Tables
Table 3.1: Key parameters used in simulating Node-Router Decoupling.
Table 5.1: Parameters used in full system simulation for CBS.
Table 6.1: Configuration for evaluating worm-bubble flow control.
Table 7.1: Full-system simulator configuration for RAIR.
List of Figures
Figure 2.1: NoC example in CMP: Intel 80-core Teraflop research chip.
Figure 2.2: NoC example in MPSoC: Samsung Exynos Octa 5410 processor.
Figure 2.3: Depiction of on-chip network architecture.
Figure 2.4: Static power vs. dynamic power of on-chip routers.
Figure 2.5: Basics of power-gating technique.
Figure 3.1: Application of power-gating technique to on-chip routers.
Figure 3.2: Intermittent packet arrival breaks long idle period into fragments.
Figure 3.3: Decoupling bypass (shaded components are not power-gated).
Figure 3.4: Handshaking in NoRD (PG: power-gate, WU: wakeup, IC: incoming).
Figure 3.5: Impact of powering-on routers.
Figure 3.6: Determining wakeup threshold.
Figure 3.7: Static energy comparison (normalized to No_PG).
Figure 3.8: Reduction of power-gating overhead.
Figure 3.9: Overall NoC energy breakdown.
Figure 3.10: Impact of NoRD on performance.
Figure 3.11: Impact of wakeup latency.
Figure 3.12: Packet latency and power of 16-node for different load ranges.
Figure 3.13: Packet latency and power for 64-node.
Figure 4.1: Normalized runtime vs. average packet latency.
Figure 4.2: Direct network (mesh) vs. indirect network (Clos) for connecting 64 PEs.
Figure 4.3: Illustration of MP3 for guaranteeing connectivity, dynamically steering traffic, and reducing transient contention by rapid wakeup.
Figure 4.4: Partially-on design for saving energy while maintaining minimal forwarding functions.
Figure 4.5: Average packet latency comparisons.
Figure 4.6: Execution time comparisons.
Figure 4.7: Router static energy comparisons.
Figure 4.8: Comparison of power-gating overhead.
Figure 4.9: Breakdown of NoC energy.
Figure 4.10: Behavior across full range of network loads.
Figure 4.11: Rapid wakeup.
Figure 4.12: Wakeup latency tolerance.
Figure 4.13: Concentrated mesh and flattened butterfly with high-radix routers.
Figure 5.1: Theoretical BFC requires global information in the ring to avoid deadlock.
Figure 5.2: Simultaneous injection in Theoretical BFC requires global coordination.
Figure 5.3: Localized BFC decisions are not optimal across the network.
Figure 5.4: Localized BFC requires more free buffers than Theoretical BFC.
Figure 5.5: The Critical Bubble Scheme avoids deadlock without requiring explicit global coordination.
Figure 5.6: Transfer of a Critical Bubble between routers.
Figure 5.7: Simultaneous injections have no uncertainty with Critical Bubble Scheme.
Figure 5.8: Simultaneous injection of multiple packets with the Critical Bubble Scheme.
Figure 5.9: Typical virtual cut-through router microarchitecture. The shaded areas are modified to implement the proposed Critical Bubble Scheme.
Figure 5.10: Effects of Critical Bubble Scheme under different synthetic traffic patterns.
Figure 5.11: Effect of Critical Bubble Scheme on reducing buffer access delay.
Figure 5.12: Comparison of normalized execution time for PARSEC benchmarks.
Figure 5.13: Comparison of the breakdown of average packet latency.
Figure 6.1: Comparison of atomic and non-atomic buffer allocation.
Figure 6.2: Deadlock caused by atomic buffer allocation in wormhole switching.
Figure 6.3: Problems with buffer size smaller than packet.
Figure 6.4: Walk-through example of the use of white and black WBs in WBFC.
Figure 6.5: Walk-through example of avoiding starvation by the use of gray WB.
Figure 6.6: Router microarchitecture for WBFC.
Figure 6.7: Performance comparison for 4x4 torus.
Figure 6.8: Performance comparison for 8x8 torus.
Figure 6.9: Injection delay comparisons.
Figure 6.10: Execution time comparisons for PARSEC benchmarks.
Figure 6.11: Router area comparisons.
Figure 6.12: Overall router energy breakdown.
Figure 6.13: Impact of buffer size.
Figure 7.1: Example of performance criticality for regional and global traffic.
Figure 7.2: Prioritization on a per-application basis is possible in STC (left) but not on a per-region basis (right).
Figure 7.3: (a) Valid mapping and (b) invalid mapping in LBDR.
Figure 7.4: DBAR becomes ineffective when packets traverse outside the originating region (more heavily loaded regions have darker shade).
Figure 7.5: RAIR router microarchitecture. Light-shaded blocks are added; dark-shaded blocks are modified.
Figure 7.6: Arbitration steps in (a) VA stage; (b) SA stage.
Figure 7.7: Hysteresis priority transition for native traffic.
Figure 7.8: Impact of multi-stage prioritization.
Figure 7.9: Two applications with varying percentage of inter-region traffic.
Figure 7.10: Impact of routing algorithm.
Figure 7.11: Two contrasting scenarios to evaluate DPA.
Figure 7.12: Impact of dynamic priority adaptation.
Figure 7.13: Six-application scenario with various global traffic patterns.
Figure 7.14: Average packet latency comparison of different techniques under uniform random global traffic.
Figure 7.15: Reduction of average packet latency under different global traffic patterns.
Figure 7.16: PARSEC simulation setup for RAIR.
Figure 7.17: Average packet latency slowdown on PARSEC workloads with adversarial traffic.
Chapter 1
Introduction
Computing systems in various forms have influenced virtually every aspect of human
life: from mobile phones and tablets for wireless audio and video communications, to
numerous embedded computing systems in biomedical devices, automobiles, avionics,
and home entertainment, to general-purpose computing systems in various configurations
to satisfy the needs of small businesses and professionals, to supercomputers for continued
scientific discoveries and datacenters for providing essential daily services such as email,
map, Internet search, and on-line shopping, to many other scientific, economic, and social
computing applications touching numerous disciplines and sectors. Clearly, computing
systems in past decades have contributed significantly in helping to advance science,
technology and society at large.
Going forward, the unrelenting progress in semiconductor technologies and continual
emergence of new application areas will spur unprecedented development of many-core
chip multiprocessors (CMPs) to satisfy increasing performance expectations and tighten-
ing power/resource constraints of future computing systems. Meanwhile, growing com-
mercialization of many-core CMPs has been demonstrating the viability and importance
of integrating tens [50] and even hundreds [112] of cores on a single chip. With this
many-core parallel processing paradigm come many significant challenges for efficiently
interconnecting up to hundreds of cores within each CMP, and possibly many thou-
sands of CMPs comprising stand-alone or distributed systems. To meet the challenges,
on-chip interconnect solutions must be developed that provide scalable performance
while requiring minimal power demand and resource requirements. The research con-
ducted in this dissertation addresses the challenges by investigating innovative and viable
approaches that facilitate the design of low-power and resource-efficient on-chip net-
works and systems.
1.1 Motivation
Earlier on-chip interconnects made use of traditional buses (or segmented buses) and
dedicated wires for connecting cores. Those solutions have very limited scalability and,
thus, are unlikely to satisfy the needs of future large CMPs. Consequently, the notion of
an on-chip network or network-on-chip (NoC) [27] has been proposed as a more scalable
approach to meet the growing communication demand among the cores, and has been
adopted as the de facto architecture for many-core chips. Unfortunately, NoCs composed
of routers and links can draw a substantial percentage of a chip’s power (10% to 36%)
and area (17% to 25%), as shown in several industrial and research chips [48, 49, 110]. A
recent study also shows that, even incorporated with many energy-efficient micro-
architectural features, current NoC designs still consume significantly more energy than
traditional wires, leaving a gap of up to 4X compared with ideal interconnects [56].
Therefore, it is imperative to devise new innovative techniques that substantially reduce
NoC power and resource consumption while, at the same time, satisfying performance con-
straints in order for NoCs to be suitable for future many-core CMPs.
Power has become one of the main constraints in chip design in recent years given
that the pace of lowering supply voltage can no longer keep up with the pace of shrinking
transistor lengths in accordance with Dennard scaling [32]. The power consumption of a
chip consists of two components, namely dynamic power and static power. Much re-
search has been conducted in the past to reduce dynamic power consumption. Neverthe-
less, while static power has been of some concern for many years, only since technology
scaling has started to penetrate deep into the nanometer regime has it become more of a
first-order design constraint. Take the router of an on-chip network as an example. In [19],
the impact of technology scaling on router static power is quantified by taking into ac-
count runtime statistics from executing a suite of representative benchmarks on a simu-
lated many-core processor. The study finds that the percentage of router static power con-
sumption increases considerably as the transistor size and operating voltage decrease,
from 17.9% at 65nm and 1.2V, to 35.4% at 45nm and 1.1V, to 47.7% at 32nm and 1.0V,
and so on. This trend clearly illustrates that the static power of on-chip routers has be-
come a significant part of the overall router power consumption, and the problem will
only worsen with each generation of technology scaling into the sub-15nm technology
nodes due to exponentially increasing leakage power.
An important aspect that has largely been neglected in previous works on reducing
NoC power is the lack of architectural support for applying circuit-level power-saving
techniques effectively. For example, power-gating is a promising circuit-level technique
to mitigate the static power of a circuit block by cutting off the power supply of that
block, thereby removing the leakage currents from both subthreshold conduction and re-
verse-biased diodes. This technique can be applied to NoC routers. However, a funda-
mental requirement for effective use of power-gating is for the idle periods of the power-
gated block to be sufficiently long as to compensate for the power and performance over-
heads in transitioning between “on” and “off” power states. Our research study shows
that, although on-chip routers exhibit very good idleness on average due to network load-
ing conditions, more than 61% of all idle periods have lengths less than
the breakeven time (BET) needed to offset power-gating overhead [19]. Therefore, naïve-
ly applying power-gating to router blocks for those idle periods could actually lead to
energy penalty as opposed to energy savings. The primary reason for this counterproduc-
tive use of a promising technique is the coupling of the NoC router and processor node
(e.g., processor core, cache, and memory controller, collectively). Any packet sent, re-
ceived or forwarded must wake up the router before being transported, thus breaking po-
tentially long idle periods of even a lightly loaded network into fragmented intervals.
Thus, novel architecture-level support remains to be investigated for effectively removing
the reliance on powered-on routers for packet transfer, thereby maximizing the effec-
tiveness of circuit-level power-gating techniques. Note that reducing static power is not
an isolated issue in chip design. This is because power-gating may also have negative
impact on system performance (e.g., delay in waking up routers or elongated execution).
The increased execution time can potentially lead to increased overall energy consump-
tion even though both static and overall power are reduced. These issues have largely
been neglected in previous research and need to be addressed satisfactorily by perfor-
mance-aware power-saving techniques.
Additionally, unlike off-chip environments, on-chip designs are much more con-
strained by limited available resources (e.g., chip area). Hence, on-chip networks must be
designed using the minimal resources needed to perform correct operations while maximiz-
ing resource utilization. Different from other system components, minimizing NoC re-
sources comes with several new challenges due to the NoC’s unique function and characteristics.
One such challenge deals with ensuring freedom from routing-, protocol-, and reconfigu-
ration-induced deadlock at all times. Deadlock is the situation in which none of a set of
packets can make forward progress because they are waiting on each other to release
network resources, i.e., router buffers. In interconnection network designs, deadlock must
be handled in all cases due to its catastrophic consequences to the network and overall
computer system performance. Many techniques have been developed in the past to avoid
deadlock in interconnection networks, but the resources needed to realize those tech-
niques, which safeguard against only the rare cases of deadlock, are not optimized for
on-chip networks [99]. For example, many deadlock-avoidance techniques are based on the use of
virtual channels (VCs). Multiple VCs are employed to avoid routing-induced deadlock
within each message class and separate sets of VCs are used to avoid protocol-induced
deadlock across dependent message classes. Consider a typical chip multiprocessor with
a MOESI coherence protocol. Even with a relatively small buffer size of 3 flits per VC
and 3 VCs per message class, the VC cost already exceeds 43% of the total
router area [22]. This presents a significant obstacle for reducing NoC resources. These
VCs account for over 20% of the total router power and cause over a 30% increase in the
router critical-path delay (as compared to a minimal configuration) due to their more
complicated router arbiters [12].
The importance of on-chip networks will continue to grow as technology scales. The
increasing core count and tightening power/area constraints will only push the NoC sub-
system to be designed with higher efficiency and lower overhead. Meeting these targets in
current and future many-core chips is the primary objective of this research.
1.2 Dissertation Question and Hypothesis
Dissertation Question
How can on-chip networks be designed with low-power and high resource efficiency
while incurring little or no performance penalty?
Dissertation Hypothesis
Low-power and resource-efficient on-chip networks with performance-awareness can
be achieved by reducing the minimal resources required for maintaining correct NoC
operation and by dynamically powering on/off resources as needed via effective power-
gating schemes and resource utilization-enhancing schemes for upholding performance
guarantees.
1.3 Research Contributions
This research addresses the critical and looming issues of power and resource effi-
ciency in on-chip networks – the key communication subsystem in current and future
many-core processors. The main contributions of this dissertation are the following.
This research increases understanding of the key factors affecting the effectiveness of
power-gating on-chip network routers. The proposed node-router decoupling scheme
is one of the first known works that quantifies the power-gating opportunities of on-
chip routers and demonstrates the viability of achieving energy-proportional NoCs.
This research is among the first studies to explore power-gating trade-offs for indirect
networks, such as Clos networks, and demonstrates the viability of realizing minimal
performance penalty when applying power-gating to NoCs. The proposed methods
which include guaranteeing connectivity, dynamic traffic steering and wakeup-signal
relay provide valuable insights on how to address performance and energy issues in
power-gating techniques. They are generally applicable to a variety of interconnec-
tion networks.
This research investigates different ways of conveying global information using only
local resources to solve a couple of difficulties that have hindered efficient designs of
bubble-based flow control [100] for over a decade. The work serves as a foundation
on which new bubble-based flow control schemes can be based that reduce buffer re-
quirements down to the theoretical minimal amount and are applicable to a broader
set of network designs with varying buffer sizes and allocation assumptions.
This research also analyzes the formation of regional NoCs resulting from a series of
recent optimizations that leverage non-uniformity in many-core chips. The proposed
scheme identifies key regional traffic behaviors and exemplifies how these emerging
behaviors can be exploited to improve interference reduction. This opens up new op-
portunities in co-optimizing on-chip networks with techniques targeting system com-
ponents other than the NoC to increase resource utilization.
1.4 Dissertation Outline
The remainder of this dissertation is organized as follows. Chapter 2 reviews some
necessary basics of on-chip networks to facilitate the discussion of this dissertation and
summarizes prior and ongoing research in the area of low-power and resource-efficient
NoC designs. Chapter 3 analyzes the fundamental and critical problems of applying pow-
er-gating techniques to on-chip networks and presents the Node-Router Decoupling ap-
proach to enable effective power-gating of on-chip routers. To further reduce the perfor-
mance impact of power-gating, Chapter 4 describes a useful scheme that allows power-
gating to be used in NoCs with near-zero performance penalty. To reduce the minimal NoC
resource requirements, Chapter 5 and Chapter 6 present the theoretical support and effi-
cient implementation of deadlock-free flow control mechanisms for VCT-switched and
wormhole-switched networks, respectively. In Chapter 7, a region-aware interference
reduction technique is described, which captures and exploits regional traffic behaviors in
improving the effectiveness of interference reduction and resource utilization. Finally,
Chapter 8 concludes this dissertation with a discussion of key results and directions for
future research.
Chapter 2
Background and Related Work
There is an emerging body of related research dedicated to improving on-chip net-
works by reducing power/energy consumption and by optimizing NoCs together with
some other parts of the system. In addition, much research has been carried out in reduc-
ing NoC resource requirements, especially in handling deadlocks. However, very little
research explores architecture-level support to enhance and extend the application of cir-
cuit-level power-saving techniques, and even less optimizes overall system metrics that
are affected by the interaction between dynamic application behaviors and the NoC.
Moreover, prior works have very limited ability to reduce minimal resource requirements
of on-chip networks, which have a much tighter resource budget than their off-chip coun-
terparts. To facilitate understanding and motivate the importance of our proposed
work on low-power and resource-efficient NoCs, this chapter reviews some key basics of
on-chip networks and summarizes prior and ongoing research in the related areas.
2.1 On-chip Network Basics
2.1.1 NoC Examples in Chip-Multiprocessors and MPSoCs
Figure 2.1 and Figure 2.2 illustrate examples of NoC-based architectures in many-
core CMP and MPSoC settings, respectively. As depicted in Figure 2.1, the Intel Tereflop
11
research chip-multiprocessor has 80-cores that are organized in a tile-based layout. In
each tile, there is the computation part consisting of floating-point units, register file, on-
chip cache/memory, etc., and there is the communication part, which mainly consists of
an on-chip router to connect this tile to other tiles. In general, each tile in a typical CMP
architecture may have a processing element (which can be a core, a cache bank, or a
memory controller), and the processing element connects to a router. All the routers then
connect with each other to form the on-chip network.
Figure 2.1: NoC example in CMP: Intel 80-core Teraflop research chip [48].
Figure 2.2 presents another NoC example, but for the case of an MPSoC. The left side
of the figure shows the layout of the Samsung Exynos Octa processor, which finds use in
the Samsung Galaxy and Samsung Note lineups. As shown in the figure, this particular
Samsung Exynos 5410 processor integrates 4 small cores (i.e., A7) and 4 large cores (i.e.,
A15) along with various other functional blocks for audio, video, display, etc. In general,
an MPSoC may have a NoC that not only connects the homogeneous/heterogeneous CPU
cores as in CMPs, but also connects GPU, DSP, IP blocks, and various controllers and
interfaces, as shown on the right hand side of Figure 2.2. With continued development of
semiconductor technologies, future MPSoCs will have even more components integrated
on each chip, and all those components will rely on the NoC to provide fast, efficient and
reliable communication. Therefore, on-chip networks, which serve as the primary com-
munication subsystem on chips, will gain increasing importance in many-core based
computer architectures.
Figure 2.2: NoC example in MPSoC: Samsung Exynos Octa 5410 processor [31].
2.1.2 Basic operations
At the architecture level, on-chip networks, either in CMPs or in MPSoCs, can be
represented by the abstraction shown in Figure 2.3.
Figure 2.3(a) depicts a NoC architecture that employs a two-dimensional (2-D) mesh
topology. Each processing element (denoted by a circle) is attached to an on-chip router
(denoted by a square) through a network interface. Routers are connected to each other
through links. Collectively, all routers form a 4x4 mesh topology, maintaining the con-
nectivity of all processing elements.
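To make the wiring pattern concrete, the short Python sketch below (an illustration written for this discussion, not code from the dissertation) enumerates the router adjacencies of such a mesh; corner routers end up with two neighbors and interior routers with four.

    def build_mesh(width=4, height=4):
        """Return {router_id: [neighbor router_ids]} for a 2-D mesh."""
        neighbors = {}
        for y in range(height):
            for x in range(width):
                rid = y * width + x
                adj = []
                if x > 0:
                    adj.append(rid - 1)       # west neighbor
                if x < width - 1:
                    adj.append(rid + 1)       # east neighbor
                if y > 0:
                    adj.append(rid - width)   # north neighbor
                if y < height - 1:
                    adj.append(rid + width)   # south neighbor
                neighbors[rid] = adj
        return neighbors

    mesh = build_mesh()
    assert len(mesh[0]) == 2   # corner router: two links
    assert len(mesh[5]) == 4   # interior router: four links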
Figure 2.3(b) shows the internal architecture of one of those routers (i.e., canonical
router architecture), assuming credit-based flow control and wormhole-switching. The
datapath of the router is comprised of the input units, crossbar and output units. Multiple
virtual channels (VCs) can be associated with each input port to reduce head-of-line
blocking and help prevent deadlock. The control logic of the router consists of route
computation and VC/switch allocators. A packet that arrives at the router is first buffered
at one of the VCs in the corresponding input port. The packet then goes through the rout-
ing computation to determine its output port, and arbitrates for the output VC in the VC
allocator and for the crossbar in the switch allocator. After that, the packet can traverse
the crossbar and be forwarded through the output port to the next router. Typically, these
operations are compacted into a 4-stage or 3-stage pipeline. Detailed descriptions of the
required microarchitecture to implement these functions can be found in [26].
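As a rough illustration of this pipeline (a simplified, contention-free walk-through written for this discussion; the stage abbreviations RC, VA, SA, and ST follow common NoC terminology rather than any specific design evaluated in this dissertation), a head flit visits the four stages in order, while body and tail flits reuse the head flit’s route and output VC:

    PIPELINE_STAGES = [
        "RC (route computation)",
        "VA (virtual-channel allocation)",
        "SA (switch allocation)",
        "ST (switch traversal through the crossbar)",
    ]

    def head_flit_schedule(start_cycle=0):
        """Cycle-by-cycle stage occupancy for an uncontended head flit."""
        return [(start_cycle + i, stage) for i, stage in enumerate(PIPELINE_STAGES)]

    for cycle, stage in head_flit_schedule():
        print("cycle %d: %s" % (cycle, stage))
    # Body and tail flits inherit the route and output VC computed by the
    # head flit, so they skip RC and VA and only compete in SA before ST.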
(a) An example 2-D mesh network (b) Canonical router architecture
Figure 2.3: Depiction of on-chip network architecture.
2.2 NoC Power Problem and Previous Studies
While on-chip networks provide a more scalable interconnection solution for many-
core CMPs compared with traditional buses and point-to-point interconnects, the added
complexities of buffers, crossbars and control logic in the NoC greatly increase power
demand. Industrial and research chips have shown that on-chip networks can draw a sub-
stantial percentage of chip power [4, 48, 49, 90] (e.g., 28% in the aforementioned Intel Tera-
flops chip). In particular, a large percentage of the NoC power consumption comes from
static power, which is trending upward as technology scales.
To illustrate, Figure 2.4(a) plots the percentage of static power of on-chip routers at
3GHz for various manufacturing generations and operating voltages. Results are obtained
from the Orion 2.0 [65] power model. To reflect realistic workloads, Orion is fed with
statistics from full system simulation – Simics [83] plus GEMS [84] – running multi-
threaded PARSEC 2.0 benchmarks [14]. As shown in the figure, the percentage of static
power consumption increases as the feature size and operating voltage decrease, from
17.9% at 65nm and 1.2V, to 35.4% at 45nm and 1.1V, to 47.7% at 32nm and 1.0V. This
trend clearly illustrates that the static power of on-chip routers has become a significant
part of the overall router power consumption, and only worsens for process technologies
beyond 45nm. Figure 2.4(b) further breaks down the total power consumption of on-chip
routers at 45nm with 1.0V into dynamic and various static components. As can be seen,
buffers consume 55% of the static power (21% of the total power) while other router
components consume 45% of the static power (17% of the total power). This indicates
that the static power consumption in router components other than buffers is significant
and that appropriate techniques need to be adopted to reduce all contributors to static
power.
Previously, a number of works have been proposed that aim at reducing dynamic and
static NoC power at the circuit level and register-transfer level, using techniques such as
low-swing signaling, multi-threshold CMOS, link encoding, clock-gating and operand
isolation [66, 77, 101, 109, 121].
At the micro-architecture level, dynamic voltage and frequency scaling (DVFS) has
been extensively studied as an effective means of reducing interconnection network pow-
er consumption [15, 47, 78, 118] by lowering the supply voltage and clock frequency of
less utilized components/links. As technology continues to scale down, DVFS gradually
loses its advantages as the dynamic range of supply voltage has been dramatically shrink-
ing from 2V in earlier technology generations to 0.3V in recent generations [68]. In addi-
tion, DVFS has a detrimental impact on the already worsening static power problem due
to the elongated computation interval.
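The leverage that DVFS has over dynamic power, and its side effect on static energy, can be seen from the standard first-order CMOS relations (textbook approximations stated here for clarity, not results of this dissertation):

    P_{dyn} = \alpha \, C \, V_{dd}^{2} \, f, \qquad E_{static} = P_{static} \cdot T_{exec}

Lowering V_{dd} and f together shrinks P_{dyn} roughly cubically, but it also stretches the execution time T_{exec}, which inflates the static energy term; and with the usable V_{dd} range narrowing from roughly 2V to 0.3V, the quadratic V_{dd} lever has little headroom left.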
(a) Router static power percentage (b) Router power decomposition
Figure 2.4: Static power vs. dynamic power of on-chip routers. (Panel (b) breakdown: Dynamic 62%, Buffer_static 21%, VA_static 7%, Xbar_static 5%, Clock_static 4%, SA_static 2%.)
Some architecture-level techniques have also been proposed to save buffer power
consumption by dynamically and adaptively allocating buffer resources based on traffic
patterns and buffer utilization [24, 62, 95], while other aggressive techniques try to com-
pletely eliminate buffers in on-chip networks [37, 55, 93] at the cost of having to handle
flit-by-flit routing, livelock and packet reassembly. Although buffers are estimated to be a
major source of power consumption, the above techniques are limited in scope, applying
only to buffers and ignoring other power consumers in the routers.
One promising technique that can potentially reduce the static power of all compo-
nents in a block such as an entire on-chip router is the power-gating technique. As depict-
ed in Figure 2.5(a), power-gating is implemented by inserting appropriately sized header
or footer transistor(s) with high threshold voltage (non-leaky “sleep switch”) between
Vdd and the block, or the block and GND. By asserting the sleep signal when the power-
gated block is idle, the supply voltage to the block can be turned off, thus avoiding static
power consumption by removing the leakage currents in both subthreshold conduction
and reverse-biased diodes.
(a) Power-gating technique (b) Energy vs. time
Figure 2.5: Basics of power-gating technique.
Figure 2.5(b) depicts the key intervals of power-gating. At time t0, the sleep signal is
asserted and distributed to the sleep transistor with a certain energy overhead. At t1, this
signal arrives at the sleep transistor and turns it off, so the virtual Vdd starts to drop. Cor-
respondingly, the leakage current also decreases and the cumulative energy savings start
to increase. From this moment, the block stays in the power-gated off state until t2, when
the sleep signal is de-asserted and distributed again, initiating the wakeup process. From
t2 to t3, another energy overhead is incurred in distributing the sleep signal and waking up
the gated-off block. The cumulative energy savings stop increasing at t3, when the virtual
Vdd restores to full Vdd and the wakeup process concludes. Consequently, an important
parameter in power-gating is the "breakeven time" (BET), which is defined to be the
minimum number of consecutive cycles that a gated block needs to remain in the idle state
before being awoken in order to offset the power-gating energy overhead [79, 82]. Prior re-
search using analytical modeling and simulation [51, 87] estimates the BET value to be
around 10 cycles for on-chip routers under current technology parameters.
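To first order, the BET simply balances the one-time energy overhead of a gating transition against the leakage energy saved in each gated cycle. The numbers below are assumed purely for illustration (they are not parameters reported in [51, 87]), chosen to land in the cited ballpark:

    \mathrm{BET} = \frac{E_{overhead}}{P_{static} \cdot t_{cycle}}
    \approx \frac{2\,\mathrm{pJ}}{0.6\,\mathrm{mW} \times 0.33\,\mathrm{ns}}
    = \frac{2\,\mathrm{pJ}}{0.2\,\mathrm{pJ/cycle}} = 10\ \mathrm{cycles}

Idle periods shorter than the BET thus cost more energy to gate than they save, which is exactly the problem that fragmented router idleness creates.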
In previous research studies, power-gating as a circuit-level technique has been pro-
posed for some time and has been applied to cores and execution units in CMPs [51, 79,
82]. Several works have also applied power-gating to reduce link power consumption by
predicting link idleness and turning links on/off [107, 120]. However, only recently has it
been investigated for on-chip network routers [87, 88, 89]. These works directly, perhaps
naïvely, apply power-gating to routers without fully taking into account network commu-
nication behavior. Thus, their effectiveness is severely limited by the breakeven time
requirement, as will be described in detail in Section 3.1. In addition, due to router-node
coupling, these techniques incur a large performance penalty as packets that encounter gat-
ed-off routers suffer additional transport latency in waiting for routers to wake up. If
routed over multiple hops, packets are likely to experience successive wakeup latencies
accumulated on the critical path. Unlike powering down other types of structures such as
computational units or cache ways, turning off routers/links in an on-chip network can
result in frequent routing algorithm reconfiguration, unroutable packets, disconnected
cores, routing deadlocks and livelocks, and other unwanted network anomalies which
either complicate the network design or limit the opportunity for powering down system
resources. These problems are non-trivial.
2.3 NoC Resource-efficiency and Previous Studies
In addition to the power/energy issue, another major design objective for on-chip
networks is high resource efficiency, as many-core processors are greatly constrained by
limited available on-chip resources (e.g., chip area, buffers and virtual channels). This
design objective demands that on-chip networks minimize the resources required to
perform correct operations while maximizing resource utilization.
Resource requirements for interconnection networks are usually lower-bounded by
the amount of resources needed to implement deadlock avoidance or recovery schemes.
Several fundamental theories and methods for handling deadlock have been established
in [8, 33, 34, 36]. Message-dependent deadlock was studied in detail in [105], and the
formal model proposed in [116] can be used to analyze deadlocks. Several works have
investigated techniques to improve buffer utilization and reduce buffer requirements. Flit-
reservation flow control was proposed in [97] to use buffers efficiently and reduce laten-
cy by scheduling buffer usage ahead of time. In [106], unoccupied buffer slots or “bub-
bles” are drawn to heavily-congested spots within the network to relieve congestion and
maintain deadlock freedom. In [81], multiple short packets are allowed to coexist in a
buffer queue to improve buffer utilization provided that enough free buffers are available.
Numerous routing algorithms have been proposed with the aim of using buffer and chan-
nel resources efficiently. Although those theories, models, and schemes are applicable to
off-chip and on-chip networks alike, the resource requirements for on-chip networks are
much tighter and call for even further reduction. This is increasingly important as scal-
ing of the number of cores per chip continues. However, it has proven very difficult to
devise viable ways of further reducing minimal resource requirements for on-chip envi-
ronments beyond that which has already been achieved with highly optimized deadlock-
free mechanisms previously proposed in the off-chip network context.
One interesting approach for reducing the router buffer resources needed to avoid
deadlock in virtual cut-through (VCT) switched torus networks is based on flow control
of bubbles in the network [16, 100]. To work optimally, the scheme requires global in-
formation about router buffer occupancy within each dimension. Extensive evaluations
using global congestion control are presented in [53, 54, 91], demonstrating the potential
performance benefits that globally-aware flow control can achieve. Assuming an ideal global controller, it is shown that approximately half-full buffer occupancy results in high and sustained peak throughput. Other congestion control techniques based on
global knowledge are proposed in [111] and [10], but such schemes must endure long
delay in gathering the required global information. These prior works establish that if
global awareness of buffer occupancy can somehow be implemented feasibly, minimal
buffer resources for ensuring deadlock freedom and increased flexibility for maximizing
performance may be attainable. However, due to the difficulty of acquiring global infor-
mation with low cost, only a version of bubble flow control which localizes the infor-
mation needed to maintain deadlock freedom has been adopted in practice, requiring
twice the theoretical minimum amount of buffer resources [5, 17].
A key difference between off-chip and on-chip networks is the prevalent use of
wormhole switching instead of VCT switching, which poses serious difficulties in
providing global information locally. In VCT switching, the VC buffer size is at least as large as the longest packet, so that all flits belonging to the same packet are guaran-
teed to be received at the downstream VC once the head flit gains access. As a result, a
variety of deadlock issues caused by indirect dependence can be simplified [33], making
VCT the preferred switching method in off-chip interconnects where buffer resources are
more abundant [3, 5]. In contrast, wormhole switching adds complexities to deadlock
avoidance owing to indirect channel dependencies, but without consideration of dead-
locks, wormhole switching minimally requires only one flit-sized buffer per VC. Hence,
due to the low buffer requirement, wormhole switching is much preferred in resource-
constrained on-chip environments [41, 48]. Unfortunately, the added channel dependence
introduced by wormhole switching and the allowance for buffers smaller than the packet size cause several major difficulties in extending both the bubble flow control
theorem [100] and its implementation [18] to wormhole switching. Hence, despite the
large body of work on this subject, key problems in minimizing resources for on-chip
networks have remained unsolved for over a decade now.
Meanwhile, NoC resource efficiency can also be improved by increasing resource uti-
lization. One approach to utilizing NoC resources better is to design application-aware on-
chip networks, particularly when multiple applications are executing in the system simul-
taneously, for the purpose of reducing interference and providing fair or differentiated
services. Early works in this category employ time-division-multiplexing (TDM) tech-
niques [40, 92], where network bandwidth is often under-utilized. An alternative is to use
a frame-based mechanism in which time is coarsely quantized into frames, and flows are
allowed to use reserved frames as well as excess bandwidth [45, 59, 76]. In addition, sev-
eral works mitigate contention on the resources among multiple applications or threads
by prioritizing or separating packets based on application/thread criticality [28, 60], traf-
fic classes [61], and packet latency slack [29].
While the above non-exhaustive set of techniques exemplifies recent works aimed at
improving NoC resource efficiency in significant ways, very little research explores traf-
fic behaviors resulting from optimizations applied to many-core CMPs. For example, a
number of recent optimizations targeting system components other than the NoC place
threads belonging to the same application closer to each other [94, 117] or move fre-
quently accessed cache data closer to the requesting cores [52, 58, 69] so as to reduce the
communication delay and minimize overall traffic volume. These techniques essentially
transform a considerable amount of chip-wide traffic into short-range traffic within phys-
ically-close groups or regions, leaving a small amount of traffic needing to traverse more
distant regions. In this dissertation, this phenomenon is referred to as regional traffic
behavior. Until recently, most NoC interference reduction techniques, including round-
robin age-based techniques [2], have been oblivious to regional traffic behaviors. These region-
oblivious techniques are inherently unable to exploit regional behavior and, thus, have
limited effectiveness. Recent works on multiple concurrently running applications have
started to consider regional behavior in their interference reduction techniques [46, 80,
113], but these techniques either place various heavy restrictions on traffic patterns (e.g.,
inter-region traffic is disallowed in [113]) or reduce merely part of the possible interfer-
ence, thereby limiting their usefulness to only some particular scenarios.
2.4 Summary
While the on-chip network provides a potentially scalable interconnection solution for
many-core chips, it presents serious challenges in achieving the power and resource-
efficiency needed for satisfying the constraints of future chip multiprocessors. From the
power perspective, although dynamic power is an important component of router power
and has been investigated extensively, router static power reduction is a growing issue
that lacks effective solutions and has largely gone unexplored. In this research, we attack
these open problems by developing innovative architecture-level techniques that effec-
tively decouple computational and communication resources in many-core chips to max-
imize static power savings while also minimizing performance penalty to improve overall
system trade-offs. In addition, to improve NoC resource efficiency, in this research, we
propose two creative solutions to address challenging problems that remain in reducing
NoC resource requirements to avoid deadlock and other network abnormalities in VCT-
and wormhole-switched networks. We also investigate the compelling opportunities for
recognizing and exploiting regional traffic behaviors exhibited in on-chip networks, and
propose a region-aware interference reduction scheme to improve NoC resource utiliza-
tion.
Chapter 3
Reducing NoC Power with Effective Power-gating
With the diminishing return of existing dynamic power reduction techniques and the
increasing static power percentage of on-chip routers, an effective approach to reduce
NoC power is to reduce its static power component. A promising way to achieve this goal
is to power-gate off unneeded NoC resources when traffic load conditions allow. However, for power-gating to be beneficial, a fundamental requirement is for the idle
periods to be sufficiently long to compensate for the power-gating and performance over-
head. On-chip routers are potentially good targets for power optimizations, but few works
have explored effective ways of applying power-gating to them due to the intrinsic de-
pendence between the node and router – any packet sent, received or forwarded must
wake up the router before being transferred, thus breaking the potentially long idle period
into fragmented intervals. Simulation results show that directly applying conventional
power-gating techniques to NoC routers would cause frequent state-transitions and signif-
icant energy and performance overhead. In this chapter, we present NoRD (Node-Router
Decoupling) [19], a novel power-aware on-chip network approach that provides for
power-gating bypass to decouple the node’s ability for transferring packets from the
powered-on/off status of the associated router, thereby maximizing the length of router
idle periods.
3.1 NoC Power-gating and Associated Challenges
In a chip multiprocessor, each node may consist of a processor core, caches, and an
associated router. Node-router dependence means that the ability for a node to send, re-
ceive or forward a packet depends directly on the on/off status of the associated router.
For example, a node can inject a packet into the network only when the associated router
is in the powered-on state. Conversely, routers become idle when the associated nodes
have no packet to send, receive or forward. Our full system simulation results show that
on-chip routers can be idle 30%~70% of the time (with x264 having the lowest of 30.4%
and blackscholes having the highest of 71.2%), depending on the physical location of the
routers in the NoC and the load intensity of the applications. Therefore, power-gating
techniques can be applied to on-chip routers to take advantage of their idleness.
When the internal datapaths of a router are empty (i.e., input ports, output latches, and
the crossbar), the router microarchitecture can be power-gated off to save static power
after notification of all its neighbors. Figure 3.1 shows an example of power-gating router
B and handshaking with one of its upstream routers, router A. A small non-power-gated
controller is added in the router to monitor the emptiness of the datapath and the wakeup
signals from neighbors. When the datapath of router B is detected as empty and the WU
(wakeup) signals are clear, the controller in router B asserts a sleep signal to put router B
into gated-off state and asserts a PG (power-gate) signal to notify router A. Upon detect-
ing the asserted PG signal, router A tags the output port that leads to router B as being
power-gated and hence unavailable in the SA stage. (To ensure the receipt of packets that are already in the ST and LT stages, either router B needs to wait two more cycles before deciding to enter the gated-off state, or WU should be generated early enough.) Later, after router B is
power-gated, some packet in router A or another neighbor of router B may request an
output port to router B in the SA stage, triggering the WU signal to be asserted which
causes the controller in router B to de-assert its sleep signal. The packet will then be
stalled in the SA stage while waiting for router B to wake up and de-assert the PG signal.
According to previous studies [87, 89], the wakeup latency for on-chip routers under typ-
ical technology parameters is a few nanoseconds, or around 10~20 cycles depending on
the frequency. In what follows, we use the term conventional power-gating of routers to
refer to the above mechanism of applying conventional power-gating to on-chip routers.
Figure 3.1: Application of power-gating technique to on-chip routers (routers A and B, each with FIFO input buffers, VA & SA logic, and a small non-power-gated controller exchanging WU and PG handshake signals).

There are three major challenges to achieving effective power-gating of on-chip routers. The first one is the intensified limitation caused by the breakeven time (BET). It has been observed that, when applying power-gating to functional units, the BET limitation may cause a large energy penalty for some applications where functional units do not exhibit long enough idle periods [79]. Unfortunately, when applying conventional power-
gating to on-chip routers, the BET limitation becomes much more prevalent due to inter-
mittent packet arrivals seen by the routers. Figure 3.2 illustrates the problem even in the
case where the NoC has substantial idleness, as given by a low average arrival rate of 0.1
flits/cycle (i.e., 10% traffic load). In (a), with two successive single-flit packets arriving
in the first two cycles, the router has up to 18 idle cycles for useful power-gating; where-
as in (b), discrete packet arrivals cut down idle periods to below the BET, leading to an
energy penalty as opposed to savings if power-gated. Our evaluation on PARSEC
benchmarks shows that the number of idle periods having a length less than or equal to
the BET constitutes more than 61% of the total number of idle periods. Thus, on the one
hand, routers on average exhibit very good idleness that could benefit from applying
power-gating, but on the other hand, a large percentage of these idle periods are too short
to meet the BET requirement as any sending, receiving or forwarding operation of a node
would generate packets for the associated router to process, thus severely limiting the
effectiveness of conventional power-gating of routers.
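The BET requirement can be made concrete with a short back-of-the-envelope calculation. In the sketch below, the unit static power and the 10-cycle BET are assumed purely for illustration; the two scenarios correspond to Figure 3.2.

    # Net benefit of power-gating one idle period. By definition, the breakeven
    # time (BET) is the number of gated-off cycles whose static-energy savings
    # exactly offset the power-gating energy overhead.
    P_STATIC = 1.0                    # static power saved per gated-off cycle (a.u.)
    BET = 10                          # assumed breakeven time in cycles
    E_OVERHEAD = BET * P_STATIC       # overhead energy implied by the BET

    def net_saving(idle_cycles):
        return idle_cycles * P_STATIC - E_OVERHEAD

    print(net_saving(18))                  # scenario (a): one 18-cycle period -> +8.0
    print(net_saving(9) + net_saving(9))   # scenario (b): two 9-cycle periods -> -2.0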
One direct way to address this problem is to reduce the BET through better circuit-
level design or advanced manufacturing processes, which unavoidably have physical lim-
itations (e.g., transistor sizing of the inverter-chain has limited ability in mitigating the energy overhead of sleep-signal distribution).

Figure 3.2: Intermittent packet arrival breaks a long idle period into fragments: in (a), two back-to-back arrivals leave one 18-cycle idle period; in (b), discrete arrivals leave two 9-cycle idle periods.

Another possibility is to apply conventional
power-gating to smaller individual components within each router, such as per input port
or per virtual channel [88, 89]. This method, however, can only mitigate the impact of the
BET problem as individual components have only slightly longer idle periods, and even if
the BET condition is satisfied, many power-gated cycles are wasted to offset the energy
overhead. Moreover, this requires prohibitive hardware implementation overhead. For
example, there are 35 power domains in a single router in [89] to implement this method
of power-gating in addition to the complex coordination needed among different compo-
nents, which incurs significant energy and area overhead with considerable design effort.
Thus, a much more effective way of removing the dependence between the node and
router is needed, so as to combat the BET limitation from the source by reducing the
number of wakeups while maintaining the ability to transport packets in the NoC.
The second challenge is the cumulative wakeup latency in multi-hop networks. Just
as the BET limitation of energy-savings is magnified in power-gated on-chip routers, the
wakeup latency problem is also exacerbated in NoC environments, which affects perfor-
mance negatively. Conventional power-gating of routers requires routers to be in the on-
state to forward packets, which exposes the wakeup latency directly on the critical path of packet transport to downstream routers. A packet routed in a multi-hop NoC can
experience wakeup latency multiple times as routers at many hops along the path could
be gated-off. To make things worse, power-gating works best when load rates are low,
but in those situations more routers are in the gated-off state, making packets more likely
to encounter multiple wakeups. One approach is to use early wakeup signal generation
(e.g., generate the wakeup signal as soon as the output port is computed). However, this
has limited ability to hide router wakeup latency, e.g., 3 cycles maximum out of the
10~20 cycles of wakeup latency for a 4-stage pipeline. Look-ahead wakeup is also possi-
ble [87, 89], in which the candidate router monitors all the wakeup signals two hops away
so that it can hide at most 6 cycles of wakeup latency. This technique, though still limited, re-
quires monitoring hardware that is very complex and expensive to implement as every
router essentially has to monitor every input port in up to 12 routers within a 2-hop dis-
tance, assuming a 2-D mesh topology. A much better approach would be to effectively
remove the wakeup latency from the critical path by providing bypass of powered-off
routers.
The third major and most obvious problem in applying conventional power-gating to
on-chip routers is the network disconnection problem. This problem is caused by the
node-router dependence, as whenever a router is power-gated off, the associated node is
disconnected from the rest of the network. The disconnection problem impacts the system in
two ways. First, the local node cannot send/receive packets to/from the network if the
associated router is powered-off, which limits the opportunity of power-gating to only
those cases when the core and cache associated with the node are completely idle. Second,
remote nodes cannot access any resource on the local node either, particularly the cache
line and coherence directory. For a typical shared last level cache (LLC) configuration,
this essentially decreases the effective cache size. For example, if half of the routers are
power-gated off, the accessible LLC size available to the remaining nodes is reduced by
50%. Especially worth noting is that a private LLC does not help much due to the need to maintain cache coherence. For instance, a dirty line in the private LLC of the
local node is the unique last copy of the data in the entire system. Any other request to
this line from remote nodes must wake up the local router to access the data and resume
correct execution, even if the local core is idle. Therefore, a more effective way to cir-
cumvent powered-off routers and maintain the connectivity of on-chip resources using
some alternative path is needed.
3.2 Node-Router Decoupling
3.2.1 The Basic Idea
The proposed approach is based on the idea of breaking node-router dependence via
wakeup-avoidance decoupling bypass paths. Recall that in conventional power-gating of
routers, due to the node-router dependence, any incoming packet from either a local node
or other nodes would first have to wake up the gated-off router before further packet
transport could occur. This wakeup incurs energy overhead and performance penalty on
each occurrence. By providing decoupling bypass for each router, the ability to transport
packets in the network is decoupled from the on/off status of the routers. This solves all
three problems of conventional power-gating of routers. First, packets (sent, received or
forwarded) have the option to go through bypass paths instead of powering-on the routers
to continue progress, thus avoiding unnecessary wakeups and the associated energy over-
head which causes BET in the first place. Second, bypass allows packets to be transferred
while the router is being awoken, which removes the wakeup latency completely from the
critical path of packet transport. Third, when the associated router is powered-off, the
local node can still be connected with the rest of the network through the decoupling by-
pass paths, thus eliminating the disconnection problem.
While NoRD conceptually is a simple yet attractive solution, it is not straightforward to implement decoupling bypass that provides chip-wide connectivity even when many or all routers are gated-off and that manages transitions between the gated-on and gated-off states. In the pro-
posed design, we add internal bypass paths in each router that can forward packets direct-
ly from a selected input port to the network interface (NI) and then forward the packets
from the NI back to a selected output port. The input/output port pairs from all routers
form – in the worst case – a unidirectional ring across the chip, so that all the NIs are
always connected. The resulting bypass paths, together with all remaining paths provided
by the normal deadlock-free routing algorithm, allow packets to be transported without
deadlock in NoCs comprised of any combination of powered-on and powered-off routers.
In the rest of this section, we present the detailed design of NoRD, addressing the con-
struction of bypass paths, the implementation of NI forwarding, the transition and inter-
face between routers in bypass mode and normal mode, the avoidance of deadlock and
other network abnormalities under the presence of both on and off routers, and asymmet-
ric wakeup thresholds to further increase the efficiency of NoRD.
Figure 3.3: Decoupling bypass (shaded components are not power-gated): (a) chip-level Bypass Ring over nodes 0-15; (b) example bypass datapath in router 2; (c) bypass datapath in the network interface (NI).
3.2.2 Decoupling Bypass
Without loss of generality, we start by describing the microarchitecture of bypass us-
ing a 4x4 2D mesh as an example. Decoupling bypass is achieved through two-level co-
ordination. At the chip level, an input port (referred to as a Bypass Inport) and an output
port (referred to as a Bypass Outport) from each router are chosen in a way such that,
collectively across the network, they form a unidirectional ring (referred to as Bypass
Ring) connecting all nodes, as shown in Figure 3.3(a). At the individual router level, two
datapaths are added as follows. In order to inject packets from the local node (e.g., pro-
cessor core), a datapath is added from the NI input to the Bypass Outport (the bottom
bold line in Figure 3.3(b)). In order to receive packets destined to the local node from the
network, a second datapath is added from the Bypass Inport to the NI outport to eject
packets from the router (the top bold line in Figure 3.3(b)). The bypass paths consisting
of minimal hardware described here are not power-gated.
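One simple way to choose the (Bypass Inport, Bypass Outport) pair of each router so that the pairs collectively form a single unidirectional ring is to follow a Hamiltonian cycle over the mesh. The sketch below shows one such construction; the specific ring used in Figure 3.3(a) may be chosen differently.

    # One possible Bypass Ring for a rows x cols mesh (rows must be even):
    # snake through columns 1..cols-1 row by row, then return up column 0.
    # Every consecutive pair in the cycle is a physical mesh link.
    def mesh_bypass_ring(rows, cols):
        path = [0]                                    # start at node (0, 0)
        for r in range(rows):
            cs = range(1, cols) if r % 2 == 0 else range(cols - 1, 0, -1)
            path += [r * cols + c for c in cs]
        path += [r * cols for r in range(rows - 1, 0, -1)]   # back up column 0
        return list(zip(path, path[1:] + path[:1]))          # directed ring hops

    for u, v in mesh_bypass_ring(4, 4):
        # The link u -> v serves as router u's Bypass Outport
        # and router v's Bypass Inport.
        print(u, "->", v)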
To forward packets through a powered-off router, a bypass path from the router’s By-
pass Inport to its Bypass Outport is established through the node’s NI. Flits are ejected
from the powered-off router to the NI and injected back into the same router along the
path of the Bypass Ring, as shown in Figure 3.3(c). In a typical NoC with wormhole switch-
ing, the NI is responsible for accepting data from the node and encapsulating it into pack-
ets and flits (NI core), allocating a virtual channel and checking flow control credits in
the NI input port of the associated router, and injecting the formatted flits into the net-
work. Receiving data from the network to the node has a similar but reversed process.
Now, to implement router bypassing through the NI of the node, we add a latch and a
demultiplexer ahead of the ejection queue, insert a multiplexer after the NI’s injection
queue, and create a path between the input and output ports of the NI according to Figure
3.3(c). With this forwarding path, a flit can now be forwarded from the gated-off router’s
Bypass Inport to its Bypass Outport in three stages, as annotated in Figure 3.3(b) and (c):
① at the end of link traversal, instead of being written into the router’s input buffer as
done when the router is powered-on, the flit is written directly into the NI bypass latch
through the bypass datapath; ② based on the packet’s destination header bits, the NI
either sinks this flit in the local node or forwards the flit by allocating a VC (and check-
ing its credits); ③ the flit is re-injected into the power-gated router’s Bypass Outport
through the bypass datapath. The bypass datapath is enabled only when the router is in
the power-gated off state.
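The three forwarding stages can be summarized functionally as in the sketch below. Timing and credit signaling are abstracted away, and the queue and field names are illustrative rather than taken from the actual design.

    # Functional sketch of NI bypass forwarding at a gated-off router
    # (stages 1-3 annotated in Figure 3.3(b) and (c)).
    from collections import deque

    class BypassNI:
        def __init__(self, node_id):
            self.node_id = node_id
            self.bypass_latch = None          # one-flit latch ahead of the ejection queue
            self.ejection_q = deque()         # flits sunk to the local node
            self.to_bypass_outport = deque()  # flits re-injected onto the Bypass Ring

        def stage1_receive(self, flit):
            # (1) At the end of link traversal, the flit is written into the NI
            # bypass latch instead of the gated-off router's input buffer.
            assert self.bypass_latch is None, "latch busy; upstream holds the credit"
            self.bypass_latch = flit

        def stage2_route(self, vc_credit_available):
            # (2) Based on the destination header bits, sink the flit locally or
            # allocate a VC (checking its credits) to forward it.
            flit, self.bypass_latch = self.bypass_latch, None
            if flit["dest"] == self.node_id:
                self.ejection_q.append(flit)
            elif vc_credit_available:
                # (3) Re-inject through the gated-off router's Bypass Outport.
                self.to_bypass_outport.append(flit)
            else:
                self.bypass_latch = flit      # stall until a credit frees up

    ni = BypassNI(node_id=2)
    ni.stage1_receive({"dest": 5, "payload": "head"})
    ni.stage2_route(vc_credit_available=True)
    print(list(ni.to_bypass_outport))         # the flit continues around the ring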
The above two-level coordination essentially decouples nodes from the on/off status
of routers, as now a node can send, receive and forward packets through the decoupling
bypass even if the associated router is in the gated-off state. Moreover, it ensures the
connectivity of all nodes. Packets can route through a combination of Bypass Ring paths
to circumvent gated-off routers and normal paths of gated-on routers to minimize hop
count. Even in the extreme case of all routers being gated-off, packets can still traverse
along the Bypass Ring to reach any destination.
Owing to the decoupling bypass that provides network connectivity in all cases, dead-
lock-free adaptive routing based on Duato’s Protocol [33] is easily supported. Escape
resources are comprised of the unidirectional ring formed by the (Bypass Inport, Bypass
Outport) pairs in both gated-on and gated-off router state, where two VCs can be used to
break cyclic dependence. Additional VCs can be used as adaptive resources for adaptive
routing over the NoC.
The deadlock- and livelock-free routing of NoRD is as follows. Every router has
adaptive VCs and escape VCs (powered-off routers have no VCs but still have the corre-
sponding adaptive/escape latches for bypassing). At normal routers, packets on adaptive
VCs use minimal adaptive routing to choose the next hop, but packets on escape VCs are
confined to choose the Bypass Outport (i.e., move along the Bypass Ring) and remain on escape VCs until the destination is reached. For packets on adaptive VCs, misrouting occurs only when
all of the downstream routers on the minimal path are powered-off AND the Bypass Out-
port forces a detour (note that the Bypass Outport could, in fact, also be on the minimal
path). In that case, packets must choose the Bypass Outport and traverse to the next router (which could be either normal or gated-off), misrouted by one hop. However, packets are still allowed
to remain on adaptive VCs for normal routers or the corresponding adaptive latches for
bypassed routers (i.e., the entire set of adaptive resources) if the total misrouted hops are
below a threshold; otherwise packets are forced to enter escape VCs (or the correspond-
ing escape latches for bypassed routers) and route along the unidirectional ring without
returning to adaptive resources until the destination is reached. At the next router, if
packets are still on adaptive VCs, they will repeat the above process (i.e., use minimal
adaptive routing if available on the bypass ring or mesh, or enter escape resources on the
Bypass Ring if needed) until reaching the destination. No U-turns are allowed at any hop.
The above routing for NoRD follows Duato’s Protocol for deadlock-free adaptive routing
as the escape VCs on the Bypass Ring have no cycles in the extended channel depend-
ence graph and the adaptive channels allow for fully adaptive routing. As detoured pack-
ets have a cap on the number of misroutes allowed before being forced to enter escape
VCs with a bounded hop count, NoRD avoids both deadlock and livelock. Also, any ad-
ditional hops from detours are partially offset by gains in completely hiding router
wakeup latency as compared to conventional power-gating and reduced per hop latency
of the bypass path. Finally, starvation for NI resources by the local node is easily avoided
by granting priority over bypass traffic to the local node if not served for a predetermined
number of consecutive cycles. However, this should happen rarely as the router is as-
sumed to be power-gated off only when the load is low and contention is minimal.
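The per-hop decision just described can be condensed into the sketch below. The misroute cap value and the port-availability predicate are illustrative stand-ins for the actual VC allocation logic.

    # High-level sketch of NoRD's per-hop routing decision (Duato's Protocol
    # with the Bypass Ring as escape resources).
    MISROUTE_CAP = 2   # assumed threshold on accumulated misrouted hops

    def select_output(pkt, minimal_ports, bypass_outport, port_is_on):
        """minimal_ports: output ports on a minimal path to the destination;
        port_is_on(p): whether port p leads to a powered-on router."""
        if pkt["vc_class"] == "escape":
            return bypass_outport              # confined to the ring until destination
        usable = [p for p in minimal_ports if port_is_on(p)]
        if usable:
            return usable[0]                   # minimal adaptive routing (any policy)
        # All minimal next hops are gated-off: take the Bypass Outport, which is
        # a one-hop misroute unless it happens to lie on the minimal path.
        if bypass_outport not in minimal_ports:
            pkt["misroutes"] += 1
        if pkt["misroutes"] > MISROUTE_CAP:
            pkt["vc_class"] = "escape"         # forced onto escape VCs, no return
        return bypass_outport

    pkt = {"vc_class": "adaptive", "misroutes": 0}
    out = select_output(pkt, minimal_ports=["X+", "Y+"], bypass_outport="Y-",
                        port_is_on=lambda p: False)   # both minimal hops are off
    print(out, pkt)   # detours via the Bypass Outport; one misroute recorded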
3.2.3 Transition between Gated-on and Gated-off States
To transition between gated-on and gated-off states and to interface with neighboring
routers for correct flow control, several handshaking signals are needed as illustrated in
Figure 3.4. In this example, we focus on the state-transition of router B, and the bypass of
router B is from router A through the NI of router B to router D.
To transition from gated-on to gated-off state, similar to the conventional power-
gating mechanism described in Section 3.1, if router B is empty and both IC and WU are
clear (these two signals will be explained shortly), it asserts the PG signals, enables by-
pass and goes into gated-off state by asserting the sleep signal (not shown). Upon detect-
ing the asserted PG signal, routers C, D and E tag the output port that leads to router B as
power-gated (and becomes unavailable in the SA stage) and stop tracking credits, while
router A, which is the Bypass Ring upstream router, sets the credit of each VC in that
output port to 1, as router B now has only one output buffer available, as shown in Figure 3.3(b). To ensure the receipt of packets that are already in the ST and LT stages of the
neighboring routers, an IC (incoming) signal is generated at the beginning of SA if there
is a flit in the SA stage and propagates to router B. In this way, the IC signal is always
two cycles ahead of flits to notify router B that a flit is incoming and router B should not
enter the gated-off state. Finally, any flits that are in the VA and SA stages of routers C, D and E will restart the pipeline from RC using the new output port availability
information as they are still in the input channel. Note that these flits must be head flits;
otherwise, if the head flits have left router C/D/E for B but the body/tail flits have not yet ar-
rived at router B, then the virtual channel is not de-allocated and router B is not consid-
ered as empty.
Figure 3.4: Handshaking in NoRD (PG: power-gate, WU: wakeup, IC: incoming) between router B, its neighbors A, C, D and E, and the NI of router B.

To transition router B from the gated-off state to the gated-on state, the WU signal first needs to be generated according to a wakeup metric. Ideally, the wakeup metric should keep WU de-asserted when the load is low, and assert the signal when the load rises above a threshold. A naïve way is to use the number of flits transmitted by the gated-
off router in a fixed period of time, but this may not necessarily generate a wakeup signal
when the load is high as flits could be stalled due to network congestion. Another tradi-
tional metric is to use router buffer utilization [107], which also is not suitable as input
buffers are not used in the gated-off state. As all traffic to gated-off routers is forwarded
through the NI and allocated a VC there to (re)inject into the network, we use as a thresh-
old parameter the number of VC requests at the local NI over a period of time (10 cycles)
for the wakeup metric. This metric works for both low and high load as the number of
VC requests goes up even if the flits are stalled, and it remains valid in the extreme case
when all the routers are gated-off, as the wakeup signal is generated locally.
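A minimal sketch of this wakeup metric is given below. The 10-cycle window is taken from the text; the class and signal names are illustrative, and the per-router threshold values are discussed in Sections 3.2.4 and 3.4.1.

    # Windowed wakeup metric: count VC requests seen at the local NI over a
    # fixed window and assert WU once the count reaches the router's threshold.
    WINDOW = 10   # cycles

    class WakeupMonitor:
        def __init__(self, threshold):
            self.threshold = threshold   # e.g., 1 (performance-centric) or 3 (power-centric)
            self.count = 0
            self.cycle = 0

        def tick(self, vc_request):
            if self.cycle % WINDOW == 0:
                self.count = 0           # a new observation window begins
            self.cycle += 1
            self.count += int(vc_request)
            return self.count >= self.threshold   # the WU signal

    mon = WakeupMonitor(threshold=3)
    wu = [mon.tick(vc_request=(c in (2, 4, 7))) for c in range(10)]
    print(wu.index(True))   # WU first asserts at cycle 7, on the third request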
With the number of VC requests used as threshold wakeup metric, the operation of
turning on a gated-off router is straightforward. When the WU signal is asserted, router B
starts to wake up while the bypass is still functioning. When wakeup finishes, router B
de-asserts the PG signal. Upon detecting the de-asserted PG signal, routers C, D and E
reset the credits to full while router A adds back (full-1) credits. Once the flit in the NI
bypass datapath is written into the input buffer of router B, the bypass of router B is disa-
bled to complete the state-transition.
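The credit bookkeeping around these transitions can be sketched as follows. The class structure and names are illustrative; the buffer depth and VC count are taken from Table 3.1.

    # Credit state kept by a neighbor for its output port toward router B,
    # updated on the PG handshake of Section 3.2.3.
    FULL, VCS = 5, 4   # 5-flit input buffers, 4 VCs per class (Table 3.1)

    class UpstreamPort:
        def __init__(self, on_ring):
            self.on_ring = on_ring     # True only for the Bypass Ring upstream (router A)
            self.credits = {vc: FULL for vc in range(VCS)}
            self.available = True      # usable by the SA stage?

        def on_pg_asserted(self):      # router B enters the gated-off state
            if self.on_ring:
                # Only B's one-flit output buffer is reachable via the bypass.
                self.credits = {vc: 1 for vc in self.credits}
            else:
                self.available = False # routers C, D and E stop using this port

        def on_pg_deasserted(self):    # router B has finished waking up
            if self.on_ring:
                # Router A adds back (FULL - 1) credits per VC.
                self.credits = {vc: c + FULL - 1 for vc, c in self.credits.items()}
            else:
                self.credits = {vc: FULL for vc in self.credits}
                self.available = True

    a = UpstreamPort(on_ring=True)
    a.on_pg_asserted();   print(a.credits[0])   # 1 while B is bypassed
    a.on_pg_deasserted(); print(a.credits[0])   # restored to FULL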
3.2.4 Asymmetric Wakeup Thresholds
While previous subsections describe the necessary operations to keep NoRD func-
tional, the efficiency of NoRD can be increased using asymmetric wakeup thresholds. For
certain topologies and constructions of the Bypass Ring, some routers may have greater
impact on performance than others based on their location in the NoC. For example,
powering on Routers 4 and 5 in Figure 3.3(a) has larger performance benefits than pow-
ering on Routers 0 and 1, as the former provide a shortcut to route packets that would
otherwise be detoured through 9->13->12->8. Therefore, taking the placement of bypass
paths and routers into account, additional performance gains can be obtained.
To differentiate between routers in NoRD, asymmetric wakeup thresholds can be
used. For example, NoC routers can fall broadly under two classes – performance-centric
and power-centric – based on their importance, where a low wakeup threshold is assigned
to the performance-centric class and a high wakeup threshold is assigned to the power-
centric class. The intuition behind this is to wake up early a few performance-critical
routers while waking up the rest (the majority) of the routers late. In this way, not only does performance improve due to the added shortcuts in routing paths, but also more static pow-
er can be saved by allowing non-performance-critical routers to stay in the gated-off state
for a longer time. As a threshold metric is needed for wakeup anyway, no additional
hardware is required.
To select the set of routers that are more critical to performance, a short off-line pro-
gram based on the Floyd-Warshall all-pairs shortest path algorithm [39] was used. Figure
3.5 plots the best node-to-node average distance and per-hop latency that can be achieved
with a given number of powered-on routers for the 2-D mesh example in Figure 3.3(a).
As expected, with more routers turned on, the average hop distance between nodes in
NoRD decreases rapidly due to the added flexibility in routing paths.

Figure 3.5: Impact of powering-on routers: average node-to-node distance (hops) and average per-hop latency (cycles) versus the number of powered-on routers (0 to 16).

Meanwhile, more
packets are routed through the normal pipeline of powered-on routers instead of the sim-
pler and shorter bypass pipeline, thus gradually increasing the per-hop latency. Figure 3.5
also shows that, by turning on six routers, the average hop distance can be greatly re-
duced with moderate increase in the per-hop latency, indicating a viable trade-off point.
The corresponding router set that achieves this data point consists of Routers 4, 5, 6, 7, 13
and 14 in Figure 3.3(a). In this example, these routers are designated as the performance-
centric routers, and the remaining routers are classified as the power-centric routers. Oth-
er classifications are still possible and an optimal classification could be determined dy-
namically with comprehensive consideration of topology, traffic patterns, bypass place-
ment, and routing algorithm. For instance, the routing algorithm may adaptively steer
packets to a few performance-centric routers and the rest of the routers can be designated
as power-centric routers. While further work can be conducted to investigate the design
complexity of finding the optimal classification and the trade-off in doing so, the intention here is to show that asymmetric wakeup thresholds, even with a simple dual-mode
classification, can provide additional benefits in both performance and energy to com-
plement the proposed decoupling bypass mechanism.
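The off-line selection can be reproduced with a direct application of Floyd-Warshall. In the sketch below, the edge model (Bypass Ring hops always usable; other mesh links usable only from powered-on routers into powered-on routers) is our simplified reading of the procedure, and the ring is one hypothetical construction.

    # Estimate the average node-to-node hop distance for a candidate set of
    # powered-on routers on a 4x4 mesh, using Floyd-Warshall [39].
    INF = float("inf")
    ROWS = COLS = 4
    PATH = [0, 1, 2, 3, 7, 6, 5, 9, 10, 11, 15, 14, 13, 12, 8, 4]  # Hamiltonian cycle
    RING = set(zip(PATH, PATH[1:] + PATH[:1]))                     # directed ring hops

    def neighbors(u):
        r, c = divmod(u, COLS)
        if r > 0: yield u - COLS
        if r < ROWS - 1: yield u + COLS
        if c > 0: yield u - 1
        if c < COLS - 1: yield u + 1

    def avg_distance(on_set):
        n = ROWS * COLS
        d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
        for u, v in RING:
            d[u][v] = 1                # Bypass Ring hops exist in any on/off state
        for u in on_set:
            for v in neighbors(u):
                if v in on_set or (u, v) in RING:
                    d[u][v] = 1        # full routing flexibility at on routers
        for k in range(n):             # Floyd-Warshall all-pairs shortest paths
            for i in range(n):
                for j in range(n):
                    if d[i][k] + d[k][j] < d[i][j]:
                        d[i][j] = d[i][k] + d[k][j]
        return sum(d[i][j] for i in range(n) for j in range(n) if i != j) / (n * n - n)

    print(avg_distance(set()))                  # all routers off: ring only (8.0)
    print(avg_distance({4, 5, 6, 7, 13, 14}))   # the performance-centric set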
In summary, NoRD maximizes the opportunity for saving energy by allowing frag-
mented idle periods that are even shorter than the BET to be exploited, which is not pos-
sible in conventional power-gating of routers. Moreover, by steering short packet spikes
to bypass paths without waking up the routers, the energy overhead in distributing the
sleep signal and powering-on the router is also largely avoided. Therefore, NoRD is able
to increase the cumulative energy savings while reducing the power-gating energy over-
head. Meanwhile, NoRD also minimizes the performance penalty of power-gating tech-
niques from the following aspects: (1) the use of decoupling bypass reduces the number
of state-transitions and, hence, avoids the wakeup latency when routers do not need to be
turned on; (2) when router wakeup is unavoidable, decoupling bypass provides temporary
paths for packets while the router is being awoken, thus hiding wakeup latency; (3) a few
performance-centric routers with low thresholds can be awoken earlier to guard perfor-
mance. With these features, NoRD can greatly reduce the performance penalty of con-
ventional power-gating of routers, as the following evaluation shows.
3.3 Evaluation Methodology
3.3.1 Simulator Configuration
The proposed NoRD scheme is evaluated quantitatively under full-system simulation
using Simics [83], with GEMS [84] and Garnet [6] for detailed timing of the memory
system and on-chip network. Orion 2.0 [65] is integrated in Garnet for NoC power and
area estimation using technology parameters from an industry-standard 45nm CMOS
process and 1.1V operating voltage. The saved static power is modeled after [51] and the
overhead is modeled after [51, 65]. A wakeup latency of 12 cycles is used assuming a 4ns
wakeup delay and 3GHz frequency, and 3 cycles can be hidden when the early wakeup
technique [87] is applied. The simulators are modified to model all the key additional
hardware for power-gating and bypass, including the extra power consumption in the NI
buffering and forwarding logic. The additional dynamic (static) power of the NI in NoRD
is lumped into router dynamic (static) power to provide fair comparison across different
schemes. Step 2 in Figure 3.3(c) that checks VC availability in the NI is assumed to take
one cycle, as this step essentially reuses the original function in the NI which is modeled
as one cycle in Garnet. Wormhole switching with credit-based flow control is assumed,
although NoRD is agnostic to the switching and flow control mechanism used. Table 3.1
lists the key parameters used in the evaluations. Full system simulation uses a 16-node mesh, and synthetic traffic simulation uses both 16- and 64-node configurations to evaluate scalability.

Table 3.1: Key parameters used in simulating Node-Router Decoupling.
Core model: Sun UltraSPARC III+, 3GHz
Private I/D L1$: 32KB, 2-way, LRU, 1-cycle latency
Shared L2 per bank: 256KB, 16-way, LRU, 6-cycle latency
Cache block size: 64 Bytes
Coherence protocol: MOESI
Network topology: 4x4 and 8x8 mesh
Router: 4-stage, 3GHz
Virtual channels: 4 per protocol class
Input buffer: 5-flit depth
Link bandwidth: 128 bits/cycle
Memory controllers: 4, located one at each corner
Memory latency: 128 cycles
We compare the following designs: (1) No_PG: baseline design with no power-gating;
(2) Conv_PG: applying conventional power-gating to routers; (3) Conv_PG_OPT: con-
ventional power-gating optimized with early wakeup (this optimized design not only im-
proves performance by partially hiding wakeup latency, but also reduces power-gating
overhead by not powering off during idle periods shorter than 4 cycles); (4)
NoRD: our proposed approach based on node-router decoupling. In addition, all designs
under evaluation are augmented with adaptive routing algorithms using Duato’s Protocol
[33]. The only difference is that (1)~(3) use adaptive routing in adaptive VCs and XY
routing in escape VCs, whereas (4) uses adaptive routing and the ring-based escape
mechanism described in Section 3.2.2.
3.3.2 Workloads
Multi-threaded PARSEC 2.0 benchmarks [14] are used for the majority of simula-
tions, as the performance and power consumption of realistic workloads are of primary
concern. Each core is warmed up for a sufficiently long time (with a minimum of 10 mil-
lion cycles) and then run until completion. We also perform simulations with synthetic
traffic (uniform random and bit-complement [3]) to provide insight on the behavior of
different designs across a wide range of load rates and parameter values. In those cases,
packets are uniformly assigned two lengths. Short packets are single-flit while long pack-
ets have 5 flits. For synthetic traffic, the simulator is warmed up for 10,000 cycles and
then the statistics are collected over another 100,000 cycles.
3.4 Results and Analysis
3.4.1 Wakeup Thresholds
To simulate NoRD, the appropriate wakeup thresholds must first be found. This is
done empirically. All routers are forced into sleep mode without waking up – concentrat-
ing traffic on the Bypass Ring – and the number of VC requests (averaged over all routers)
is recorded while varying the load rate. It can be seen from Figure 3.6 that the maximum
achievable throughput of the Bypass Ring is low (i.e., 14% of the throughput when all
routers are turned on), indicating that some routers need to be awoken when network traf-
fic increases, as measured by VC requests.
The objective of choosing the wakeup thresholds is to maximize the static power sav-
ings opportunity while not significantly increasing packet latency.

Figure 3.6: Determining the wakeup threshold: average latency (cycles) versus injection rate (flits/node/cycle) for VC-request counts Req = 1 to 5.

In this sense, the dual-
threshold technique in asymmetric wakeup thresholding provides more flexibility in
achieving a good trade-off. In the current implementation of NoRD, the performance-
centric routers are assigned a threshold of 1 as they are critical to performance and need
to be awoken early. The remaining power-centric routers can use a higher threshold to
enable more power-savings. Considering that a threshold value of 4 VC requests can lead to a nearly 60% increase in packet latency, the power-centric routers are assigned a thresh-
old of 3 to avoid large performance penalty. Although the thresholds here are determined
empirically, they work very well across all benchmarks.
3.4.2 Impact on Static Energy
Figure 3.7 presents the static energy results of the different designs normalized to No_PG. It can be seen that Conv_PG reduces the static energy slightly more than
Conv_PG_OPT by 4.2% on average (51.2% vs. 47.0%). This is because Conv_PG does
power-gating as long as the routers are empty whereas Conv_PG_OPT power-gates rout-
ers only if the idle periods are longer than 3 cycles as indicated by the early wakeup signal.

Figure 3.7: Static energy comparison of No_PG, Conv_PG, Conv_PG_OPT and NoRD (normalized to No_PG).

As shown later, early wakeup pays off for Conv_PG_OPT in terms of performance.
The lowest static energy is achieved by the proposed NoRD approach for all benchmarks, with an average reduction of 62.9% compared with No_PG. Relative to Conv_PG and Conv_PG_OPT, NoRD provides savings of 23.9% and 29.9% on average, respectively. This improvement mainly comes from the increased opportunity in
utilizing short idle periods and the reduced number of wakeups through bypass paths.
3.4.3 Reducing Power-gating Overhead
To provide more insight into the effectiveness of NoRD in reducing power-gating
overhead, Figure 3.8(a) compares the energy overhead caused by router wakeup for con-
ventional power-gating designs and the bypass design, normalized to Conv_PG (No_PG
is not shown in the figure as it does not have any wakeups). As can be seen, the power-
gating overhead in NoRD is considerably reduced by 80.7% and 74.0% compared with
Conv_PG and Conv_PG_OPT, respectively. Figure 3.8(b) shows the reduction in the
total number of wakeups in different designs normalized to Conv_PG. NoRD decreases
the number of wakeups by 81.0% and 73.3% over Conv_PG and Conv_PG_OPT, respec-
tively, which explains the above substantial reduction of power-gating overhead and
demonstrates the usefulness of the decoupling approach.
3.4.4 Impact on Dynamic Energy
Due to the detour of some packets in bypassing powered-off routers, the dynamic en-
ergy of NoRD may increase. Figure 3.9 plots the breakdown of NoC energy across the
benchmarks, so that the relative impact of each NoC energy component can be examined.
For the NoC dynamic energy (routers plus links), NoRD incurs an overhead of 10.2% on
average, which constitutes 4.0% of the total NoC energy consumption. However, the stat-
ic energy and wakeup overhead savings offered by NoRD constitute 24.7% of the total
NoC energy. Compared to No_PG, Conv_PG and Conv_PG_OPT, this renders NoRD a
net NoC energy savings of 9.1%, 9.4% and 20.6%, respectively.
Figure 3.8: Reduction of power-gating overhead for Conv_PG, Conv_PG_OPT and NoRD (normalized to Conv_PG): (a) power-gating energy overhead; (b) reduction in router wakeups.
Figure 3.9: Overall NoC energy breakdown (normalized to No_PG): power-gating overhead, router static power, router dynamic power, link dynamic power and link static power for No_PG, Conv_PG, Conv_PG_OPT and NoRD across the PARSEC benchmarks and their average.
3.4.5 Impact on Performance
After presenting the energy statistics, we now compare the performance impact of
different designs, which is another important objective of power-gating techniques. Fig-
ure 3.10(a) shows the average packet latency, and Figure 3.10(b) compares the execution
time of the four designs. No_PG does not have any performance penalty as there is no
power-gating, and hence provides a lower bound on average packet latency and execution
time. As can be seen, the aggressive power-gating scheme, Conv_PG, significantly de-
grades the average packet latency by 63.8% on average; whereas Conv_PG_OPT with
early wakeup mitigates this degradation to 41.5% on average. These large penalties in
conventional power-gating designs mainly come from the fact that once a router is pow-
er-gated off, any packet from either local traffic or in-network traffic suffers additional
wakeup latency before being processed by the node. The comparison between
Conv_PG_OPT and Conv_PG indicates that early wakeup does help a lot in reducing the
performance penalty, but still cannot entirely mask the negative effects of wakeup latency.
In contrast, NoRD decouples nodes from routers, effectively removing the wakeup laten-
cy from the critical path. The latency overhead in NoRD is caused by packet detours,
which is partially offset by reduced per hop latency and avoidance of long wakeup laten-
cy as discussed before. As a result, the overall degradation of average packet latency in
NoRD is only 15.2%, on average.
The disparities in average packet latency among these designs result in different execution times, as shown in Figure 3.10(b). Although different benchmarks exhibit variations in the specific percentage of degradation due to their differences in network sensitivity, the trend is similar: NoRD has the smallest performance penalty compared to Conv_PG and Conv_PG_OPT.
Figure 3.10: Impact of NoRD on performance: (a) average packet latency (cycles) and (b) execution time (normalized to No_PG) for No_PG, Conv_PG, Conv_PG_OPT and NoRD.
Overall, Conv_PG, Conv_PG_OPT and NoRD increase the execution time by 11.7%, 8.1% and 3.9%, respectively, in order to achieve the energy savings described previously.
3.4.6 Effects on Hiding Wakeup Latency
So far, the effectiveness of NoRD has been demonstrated in real applications using
full system simulations. In addition to the above primary results, we also perform simula-
tions with synthetic uniform random traffic to highlight key characteristics of NoRD.
Recall that cumulative wakeup latency is one of the big obstacles to power-gating routers,
particularly in multi-hop networks. To illustrate that NoRD fundamentally solves this
problem, Figure 3.11 shows the average packet latency of Conv_PG, Conv_PG_OPT and
NoRD while varying the wakeup latency across a wide range. The load rate is set to the
average load rate of PARSEC benchmarks. As can be seen, the latency of Conv_PG and
Conv_PG_OPT increases by nearly 1.5X when the wakeup latency increases from 9 to 18 cycles, whereas the latency of NoRD remains similar across wakeup latencies,
which clearly demonstrates its ability to hide wakeup latency.
Figure 3.11: Impact of wakeup latency: average packet latency versus wakeup latency (9, 12, 15 and 18 cycles) for Conv_PG, Conv_PG_OPT and NoRD.
3.4.7 Behavior across Full Range of Network Loads
Next, we investigate the behavior of different designs across the entire network load
range: from zero load to saturation loads. Figure 3.12 presents the performance and pow-
er results of a 16-node mesh under uniform random traffic, and Figure 3.13 presents for
64-node under uniform random and bit-complement traffic. Here, while the behavior of
No_PG is very typical, interesting results are found for Conv_PG_OPT and NoRD. These
are explained by separating the loads into three regions.
(1) Low to medium load region: When the load is very low, many routers are in the
gated-off state for the majority of the time in both Conv_PG_OPT and NoRD. For
Conv_PG_OPT, packets are likely to experience wakeup latency once or multiple times,
so the average packet latency is high. For NoRD, packets use bypass more often, so aver-
age latency is increased due to detours. When load gradually increases, more routers are
in the on-state, which tends to reduce the latency. This factor actually offsets the effect of
increased load on average latency, leading to a net decrease in latency for Conv_PG_OPT
and NoRD. As can be seen, NoRD achieves both lower average latency and lower power
than Conv_PG_OPT. Note that, in this region, NoRD has increased benefits compared to
Conv_PG_OPT for larger networks. This is because the cumulative wakeup latency prob-
lem in Conv_PG_OPT is more severe due to the increased NoC diameter in larger net-
works. A gated-off router at any hop of a packet’s route adds extra wakeup latency, and
every router has a high probability of being gated-off under low load. For instance, at 10%
injection rate under uniform traffic, the latency for No_PG, Conv_PG_OPT and NoRD
for a 4x4 mesh is 24, 34 and 29 cycles, respectively, whereas for an 8x8 mesh it is 36, 52 and 44 cycles, respectively.

Figure 3.12: Packet latency and power of the 16-node network across load ranges: average packet latency and NoC power versus injection rate under uniform random traffic for No_PG, Conv_PG_OPT and NoRD.

Figure 3.13: Packet latency and power for the 64-node network (top two plots: uniform random; bottom two plots: bit-complement).

This indicates that for a 64-node network, the latency of
NoRD is lower than Conv_PG_OPT with an increased difference compared to the 16
node network. The power curves for an 8x8 NoC are also similar in shape to those for the 4x4, indicating that the net energy savings of NoRD, considering all energy contributors, remain more favorable than conventional power-gating for larger networks.
(2) Medium to high load region: In this region, the three schemes have very similar
latency and power characteristics. The relatively high load causes most of the routers to
be turned on, making little difference between the designs with or without power-gating.
(3) Saturation region: In this region, as nearly all routers are in the on-state, both
Conv_PG_OPT and NoRD are reduced to No_PG, except that they use different escape
mechanisms. In this regard, as the escape ring in NoRD has less flexibility in routing
packets as compared to escape XY routing of Conv_PG_OPT, NoRD saturates a little
earlier. However, this is not an inherent limitation of node-router decoupling, as more
efficient deadlock-free routing algorithms such as in [36] can be used for the bypass ring
to close the throughput difference.
Full system simulations show that real application loads, in practice, typically stay
within the low-to-medium region where NoRD has clear advantages over Conv_PG_OPT
in both performance and power.
3.4.8 Discussion
Area Overhead
For any power-gating technique, there is hardware overhead for the sleep switch and
the distribution of the sleep signal. While it greatly depends on the optimization level of
circuit design, the area overhead of a well-designed power-gating block is usually be-
tween 4~10% [51, 57]. More of a concern for NoRD is the area overhead of the added
bypass and related hardware. In evaluating this, the Orion 2.0 [65] on-chip network mod-
el is used with 45nm technology parameters. The simulator is modified to model all the
additional key components of NoRD, including the added forwarding logic in the NI.
Results show that NoRD has an area overhead of only 3.1% compared with
Conv_PG_OPT.
Bufferless Routing
Recently, bufferless routing has been proposed as a means of reducing router power
consumption [37]. Although the bufferless approach may introduce livelock, deflection
and packet reassembly issues, it can eliminate buffers and their associated power con-
sumption. However, as mentioned previously, while buffers are the largest contributor of
static power, other router components consume a considerable percentage (e.g., 45%) of
total static power, which would remain even if a bufferless approach is used. In fact,
bufferless routing is complementary to power-gating techniques in general, as both can
be applied at the same time to reduce router power consumption. For example, flits in
bufferless routing have the option to be deflected through the bypass paths in NoRD if
needed.
Shorter router pipelines and aggressive NoRD design
In the baseline, a canonical router is used which takes 4 cycles for the pipeline plus 1
cycle for LT; whereas the bypass for gated-off routers in NoRD takes 2 cycles plus 1
cycle for LT. There are some techniques such as look-ahead routing [71] and speculative
SA [98] that can potentially shorten the 4-cycle router pipeline to 2 cycles. However,
NoRD is still competitive in that case for the following reasons. First, shortening the
pipeline by two also reduces the number of cycles that can hide wakeup latency by two,
making the total time (pipeline delay plus wakeup latency) to go through a gated-off
router to remain the same. Second, these techniques come with overheads. Look-ahead
routing requires contention information to be propagated one-hop ahead, while specula-
tive SA may not always succeed, making 2 cycles a best-case scenario. Ironically, specu-
lative SA is likely to succeed at low load, in which routers are also likely to be gated-off
and the wakeup latency dominates the delay at those routers. Third, the bypass in NoRD
can also be optimized to become more aggressive by directly connecting the Bypass In-
port to the Bypass Outport. This has a similar rationale as for speculation in that the for-
warding of flits optimistically assumes that there is no local flit to inject, thereby bypass-
ing the router in just one cycle. In case of conflict, additional cycles are needed, just as in speculative SA. Therefore, when optimizations are used for both the baseline and
NoRD, there are no clear advantages for the baseline, and NoRD remains competitive.
Other Related Techniques
Bypass has been used for various purposes in on-chip networks. In [74], default
backup paths are proposed to allow fault-tolerance with graceful performance degrada-
tion. This scheme assumes all routers are notified each time a router becomes faulty and
requires re-computing the routing table for all routers for each fault occurrence. There-
fore, it is not suitable for run-time power-gating in which the status of routers may
change more frequently. In comparison, each router in the proposed NoRD approach can
be powered-on/off independently without notifying all other routers or re-computing any
routing tables. A modular router architecture is proposed in [72] that can bypass some
internal faults within a router. However, this design does not provide chip-wide connec-
tivity and does not explore the application of power-gating techniques as proposed in this
work. Express VC [75] also makes use of bypass in that it virtually bypasses routers to
improve both performance and dynamic power. However, it does not reduce router static
power. Another bypass design is proposed in [55] for adaptive flow control between
bufferless and buffered router modes. It is based on bufferless design and is subject to the
associated constraints, such as flit-by-flit routing, livelock and packet reassembly issues.
Moreover, it only targets the buffers in a router and applies power-gating techniques con-
ventionally, whereas our approach is able to bypass the entire router and implement node-
router decoupling.
Many prior works have investigated techniques to save dynamic and static power of
links [70, 107, 120]. These techniques can readily be used together with NoRD to provide
more energy-efficient NoC designs. These works and other general-purpose dynamic
power-saving techniques (such as clock-gating) have different targets other than router
static power and, therefore, are orthogonal and complementary to this work.
3.5 Summary
While power-gating is a promising technique to reduce static power, node-router de-
pendence severely limits its effective use in on-chip routers due to the BET limitation,
wakeup delay and disconnection problem. In this chapter, a novel approach that provides
separate power-gating bypass to decouple the node’s ability for sending, receiving and
forwarding packets from the on/off status of the associated router is proposed. The result-
ing design can significantly reduce the number of state transitions, increase the length of
idle periods, completely hide the wakeup latency from the critical path and eliminate
node-network disconnection problems. Full system simulations show that, compared to
an optimized conventional power-gating technique applied to on-chip routers, NoRD can
further reduce the router static energy by 29.9% and improve average packet latency by
26.3%, with only 3% additional area overhead.
Chapter 4
Performance-aware NoC Power Reduction
While node-router decoupling demonstrates the viability of achieving low-power
NoC with power-gating and reduces the performance overhead significantly over conven-
tional power-gating, its performance impact cannot be entirely ignored. To further reduce
its performance penalty, we explore another class of topology – indirect networks. A rep-
resentative example of indirect networks is the Clos network. The Clos is potentially a
good target for power-gating because of its path diversity and decoupling between pro-
cessing elements and most of the routers. While power-gated Clos networks can perform
better than power-gated direct networks such as meshes, a significant performance penal-
ty still exists when conventional power-gating techniques are used. In this chapter, we
present an effective power-gating scheme, called MP3 (Minimal Performance Penalty
Power-gating) [23], which is able to achieve minimal (i.e., near-zero) performance penal-
ty and save more static energy than conventional power-gating applied to Clos networks.
MP3 is able to completely remove the wakeup latency from the critical path, reduce long-
term and transient contention, and actively steer network traffic to create increased pow-
er-gating opportunities.
4.1 Motivation
4.1.1 Need for Performance-aware Power Reduction
The design of the on-chip network is key to supporting fast communication among
various on-chip resources. Care should be taken when trading off NoC performance for
power-savings as any non-local data access, coherence messaging and handshaking sig-
naling relies on the on-chip network which is critical to maintaining system performance.
Figure 4.1, from our simulations, shows that, on average, the runtime of PARSEC bench-
marks on a 64-node CMP is increased by 15% and 36% when the average on-chip packet
latency increases from 32 cycles to 44 cycles and 56 cycles, respectively. With more
cores integrated on a chip in the near future, the on-chip network will have an even larger
impact on system performance. Given the worsening problem of static power consump-
tion and the growing importance of low packet latency, it is imperative to design effec-
tive techniques that can dramatically reduce NoC static power without sacrificing per-
formance.
Figure 4.1: Normalized runtime vs. average packet latency.
4.1.2 Limitations in Power-gating Mesh Networks
Prior works on applying power-gating techniques to on-chip network routers all as-
sume mesh-based topologies [19, 30, 87, 88, 89, 103]. While the mesh is a popular net-
work owing to its planar topology, there are several fundamental limitations in applying
power-gating usefully to meshes and other direct networks. Below we discuss these limi-
tations and summarize their impacts on energy and performance that may have been men-
tioned sporadically in previous chapters.
As shown in Figure 4.2(a), in direct networks such as the mesh, every router (denoted
by the labeled square) is connected to a processing element (PE, denoted by the circle);
whereas in indirect networks such as the Clos of Figure 4.2(b), only the input and output
routers at the edge of the network are associated with PEs, so that packets sent from PEs
are forwarded indirectly through the middle-stage routers. Compared with Clos, there are
two distinctive properties of mesh networks that greatly limit the effectiveness of apply-
ing power-gating: 1) dependence between each PE-router pair and 2) less path diversity.
From the energy perspective, due to the PE-router dependence, a mesh router must be
awoken whenever the connected PE needs to send a packet to the network or receive a
packet from the network, thus breaking the potentially long idle period of the router into
fragmented intervals that may fall below the required BET. Moreover, the BET limitation
is further intensified in meshes due to the fewer alternative paths as more non-local pack-
ets have to be forwarded through the local router, making the idle intervals even shorter.
For example, any packet sent from routers 0-5 in Figure 4.2(a) needs to be forwarded
through router 6 to get to router 7 assuming a minimal routing algorithm.
In addition, from the performance perspective, power-gating of mesh routers can have
a considerable negative impact on NoC performance. When a PE needs to send/receive a
packet, due to the PE-router dependence, a wakeup is inevitable if the associated router is
in the powered-off state, and the wakeup latency is exposed directly to the critical path of
the packet’s transport to the next hop. Furthermore, a packet routed over multiple hops
can experience wakeup latency multiple times as routers at many hops along the path
could be gated-off. This cumulative wakeup latency problem is severe in meshes as there
are few alternative node-disjoint paths from which to choose at any particular hop.
To improve the effectiveness of power-gating mesh networks, several optimization
techniques can be used. However, they all have limited capability in mitigating the above
energy and performance issues. For example, early-wakeup signal generation [87] can
only hide up to 3 cycles of the entire wakeup latency, assuming a canonical 3-stage router
with look-ahead routing. The Idle-detect [51] technique can usually only filter out idle
intervals that are shorter than around 4 cycles [30] without substantially losing static
power saving opportunities. It is also possible to implement power-gating for smaller
circuit blocks within each router, such as per input port or per virtual channel [88, 89].
However, individual components have only slightly longer idle periods, and this method
requires prohibitive implementation overhead (e.g., 35 power domains are needed in a
single router [89] to implement this method in addition to the complex coordination
among different components). These techniques have only limited effectiveness as they
can neither remove the inherent dependence between the PE and router in a mesh nor
increase path diversity. We address these issues by exploring power-gating on another
class of topology that expands the possibility for power-performance tradeoffs.
Figure 4.2: Direct network (mesh) vs. indirect network (Clos) for connecting 64 PEs. All links are unidirectional. PE-router dependence only at I/O routers in Clos. (a) Mesh (direct network, 64 5x5 routers); (b) 5-stage Clos (indirect network, 80 4x4 routers).
4.1.3 Opportunities and Challenges in Power-gating Clos NoCs
Clos Networks
Whereas most of the NoC power-gating work to-date tries to combat critical problems
in applying power-gating to mesh networks, very little research has explored the oppor-
tunities of power-gating Clos NoCs belonging to the large class of indirect networks. The
Clos topology has long been studied since first being proposed in 1953 [25]. Early appli-
cations of Clos were for circuit switching in telephone exchange systems due to the to-
pology’s superior capability for establishing many concurrent connections. More recently,
packet-switched Clos and its variants have been proposed for off-chip networks in super-
computers as well as on-chip networks for chip multiprocessors [67, 104, 126].
A packet-switched Clos network consists of three types of routers: input routers (IRs)
that receive input packets from PEs through the injection channels, output routers (ORs)
that output packets to PEs through ejection channels, and middle-stage routers (MRs) that
do not connect to any PE and only perform forwarding functions. In general, a Clos net-
work can be made of any odd number of stages. Figure 4.2(b) shows an example of a 5-
stage Clos composed of 4x4 routers to connect 64 PEs using unidirectional links. The
respective top and bottom PEs are the same, repeated for simplicity of representation, per the usual convention.
In the past, the main concern for adopting Clos NoCs was long wires. With special-
ized floor-planning optimizations to reduce total wire length of the Clos [67, 115] and
with routing-over-logic techniques to largely remove the area overhead of long wires [96,
126], the hardware complexity of Clos NoCs can be greatly mitigated. Moreover, Clos
also has the flexibility to be implemented with lower radix routers (e.g., 2x2 router) to
increase clock frequency or with higher radix routers (e.g., 8x8 routers) to reduce hop
count, making Clos very competitive to other traditional and ad hoc topologies [67].
These recent optimizations and flexibility on implementation make it interesting to ex-
plore Clos NoCs and their power-savings capabilities.
Opportunities
As an indirect network, Clos has at least three major advantages for applying power-
gating. First, except for the input and output routers, all the middle-stage routers are not
coupled to PEs. Therefore, sending and receiving packets in PEs do not necessarily trig-
ger the wakeup of most routers. This not only reduces the number of router wakeups but
also mitigates the energy overhead and delay associated with the wakeup. It also increas-
es the chances of routers being idle longer than the required BET.
Second, for a given network size, the number of stages in a reasonably designed Clos
is usually smaller than the average hop count in a mesh. As routers at each hop could be
gated-off, the Clos topology can essentially alleviate the aforementioned cumulative
wakeup latency problem by reducing the total number of encountered routers that are in
gated-off state. The result is accelerated packet forwarding.
Third, the Clos provides path diversity, so that in-transit packets have multiple rout-
ing options and can avoid waiting for router wakeup as long as one of the downstream
routers allowed by the routing algorithm is not in the gated-off state. It is possible, in
theory, for packets to avoid all wakeups along a packet’s entire path from source to desti-
nation, thereby eliminating the wakeup delay and minimizing the overall performance
penalty of applying power-gating.
Challenges
Although the above opportunities suggest that indirect Clos networks are promising
candidates for power-gating, applying the circuit-level power-gating technique conven-
tionally (or conventional power-gating for short) to Clos can have limited effectiveness,
especially in terms of reducing performance penalty. Simulation results show that con-
ventional power-gating of the Clos, even with early-wakeup and idle-detect optimizations,
can still incur a 38% increase in average packet latency and a 15% increase in execution time.
This significant performance penalty is caused by a number of reasons, as explained be-
low.
First, as can be observed, besides the middle-stage routers, there are still a sizable
number of input or output routers (e.g., 32 out of 80 routers in the 5-stage Clos). Similar
to the mesh, these routers connect to PEs directly, thus suffering from the same problem:
either the router idleness is upper-bounded by the local PE’s traffic, or packets from/to
PEs have to experience the wakeup latency of the directly associated router. Therefore, a
way to allow packets to be forwarded through the input and output routers with low over-
head is needed while allowing part or all of the static energy of these routers to be saved.
Second, even though the Clos has better path diversity and smaller average hop count,
the wakeup latency is still on the critical path of packet transport. Moreover, the cumula-
tive wakeup latency remains as packets at some particular routers may be left with one
unique path to the destination. For instance, there are 16 different paths from R1 to R64
overall (we use Ri to denote the labeled router in Figure 4.2). However, if a packet is cur-
rently in R32 and destined to PEs connected to R64, then the only reachable path is R32 =>
R48 => R64. If both R48 and R64 are powered-off, the packet will experience wakeup
latency twice with no alternative paths. To make things worse, power-gating saves more
static energy when network load is low, in which case routers are also likely to be pow-
ered-off, making packets more likely to encounter multiple wakeups. One effective ap-
proach to solve this problem is to completely remove the wakeup latency from the critical
path by always providing a minimal set of carefully selected powered-on paths between
any PE pair, as proposed in the next section.
Third and most importantly, conventional power-gating of the Clos is uncoordinated
in the sense that every router makes routing decisions unaware of the global network sta-
tus, thus switching between powered-on and off states independently based only on local
traffic information. This wastes energy-saving opportunities and incurs unnecessary per-
formance penalties in various ways. For example, even when the overall network load is
low, packets in the upstream router can still be routed to multiple downstream routers,
requiring more powered-on routers that could otherwise be gated-off. Also, due to the
unhidden portion of the wakeup latency, if a gated-off router starts to wake up only after
it receives a wakeup signal, the router will not be ready by the time the packets actually
arrive, unless some hints about the traffic between the up and downstream routers can be
exchanged in advance. In addition, as some routers in the network may be in the sleep state, a sudden increase in injected packets must temporarily be forwarded only through the remaining powered-on routers, which may cause transient congestion and
performance degradation until more routers are gradually awoken. To address these is-
sues, we need a more coordinated way to efficiently power-gate all the routers in the net-
work.
In summary, while Clos networks have great potential to reap energy benefits without
incurring excessive performance overhead, this is hard to achieve through conventional
power-gating approaches but instead requires considerable support at the architecture
level as proposed next.
4.2 Minimal Performance Penalty Power-gating
4.2.1 The Basic Idea
In this section, we propose an effective power-gating scheme called MP3 (Minimal
Performance Penalty Power-gating) for Clos NoCs which is able to achieve minimal (i.e.,
near-zero) performance penalty and, at the same time, save more static energy than con-
ventional power-gating. The basic idea is to first guarantee network connectivity by con-
structing a minimum resource set that is always powered-on so that regardless of the
on/off status of other resources, packets always have the last resort of using this resource
set for transporting packets without suffering any wakeup latency. Then, dynamic traffic
diversion actively steers traffic between the minimum and maximum available resources
of the network in a coordinated fashion based on load conditions. In this way, contention
at any particular load level is kept low while more resources can be powered-off through
increased power-gating opportunities. Finally, rapid wakeup further reduces any transient
contention that may occur during sudden load increases by powering on a selective and
necessary set of downstream routers in advance. This enables those routers to be ready
when packets arrive. The following subsections describe these techniques in detail.
4.2.2 Guaranteed Connectivity
To minimize the performance penalty of power-gating, the foremost task is to remove
wakeup latency from the critical path of packet transport. We achieve this by providing
guaranteed connectivity in the Clos. The basic idea is to turn “on” a minimal set of re-
sources, S, to ensure that at least one powered-on path always exists between any source
and destination PE pair. The set S can be composed of routers or components within rout-
ers. As this set of resources is always “on” regardless of the network load, the key is to
minimize S to maximize energy savings, with low implementation overhead. We use the
example in Figure 4.3 to explain our method of constructing S. The procedure is general-
ly applicable to other Clos instances.
There are two main steps. The first step is to reduce the number of powered-on rout-
ers in the NoC to a minimum, and the second step is to reduce the amount of “on” com-
ponents within that minimum set of routers. Specifically, as can be seen immediately from
the figure, no PE is disconnected even if all the black routers are gated-off. Hence, the
resources associated with the 39 black routers are not needed in S. However, this is not
sufficient as every input or output router is still needed. To reduce S further, notice that
when all the black routers are turned off, each input router only needs to forward packets
from four input ports to one output port (e.g., R0 only forwards packets to R16). Based on
this observation, we split the resources of input routers into two power domains.
As depicted in Figure 4.4, the striped components are in one power domain and are
needed in S, whereas the rest of the components form the other power domain. Essential-
ly, to maintain the connectivity from four input ports to one output port, only one of the
four 4-to-1 multiplexers in the crossbar is needed in S. Also, only one virtual channel
(VC) for each message class is needed in an input port to correctly buffer packets without
message-dependent deadlock. In general, assuming the original router has m dependent
message classes, p input ports, and v VCs per class per port, the minimal number of VCs
needed in S is m×p – one VC for each message class per port. Hence, the amount of VC
resources in S is 1/v of the total VC resources. The higher the value of v, the more static
energy that can be saved. In most wormhole routers, the value of v is typically two or
more in order to mitigate head-of-line blocking effectively (e.g., Intel’s 48-core SCC chip
Figure 4.3: Illustration of MP3 for guaranteeing connectivity, dynamically steering traffic, and reducing transient contention by rapid wakeup.
has 8 VCs for two message classes [49]). In addition to the four input ports and one out-
put port, we also conservatively put all the router arbitrator components into S given that
arbitrators usually consume a very small portion (less than 5%) of the total router energy.
A two-domain separation for router arbitrators can also be used if some customized rout-
er designs employ very large arbitrators. The two-domain split approach incurs much
lower hardware overhead than implementing power-gating at per port or per VC level,
and allows the majority of router components to be powered off without losing the re-
quired forwarding functionality.
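As a worked instance of the VC accounting above, the short sketch below computes the always-on share of VC resources for a hypothetical router configuration; the values of m, p, and v are assumptions chosen only for illustration.

```cpp
#include <cstdio>

// Computes the VC resources kept in the always-on set S: one VC per
// message class per input port, i.e., m*p out of m*p*v total VCs.
int main() {
    const int m = 2;  // dependent message classes (assumed)
    const int p = 5;  // input ports (assumed)
    const int v = 4;  // VCs per class per port in the full design (assumed)

    const int totalVCs   = m * p * v;  // all VC buffers in the router
    const int minimalVCs = m * p;      // VCs that must stay powered-on in S

    printf("VCs in S: %d of %d (1/%d of VC resources stay on)\n",
           minimalVCs, totalVCs, v);
    return 0;
}
```

With these assumed values, only 10 of 40 VCs stay on, confirming that the powered-on fraction shrinks as 1/v grows.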
Likewise, R16-R19 perform the same 4-to-1 minimal forwarding and can follow the
same two-domain design. Similarly, all the output routers and R48-R51 only need to for-
ward packets from one input port to four output ports, so the minimal resources in S for
these routers include one input port with m (out of m×v) VCs, one-fourth of the crossbar,
Figure 4.4: Partially-on design for saving energy while maintaining minimal forwarding functions.
four output ports with m (out of m×v) latches per port, and control logic. All the remain-
ing resources are put in the other power-domain.
Overall, the above approach based on identifying a minimal resource set enables a
wide range of power-gating configurations. At one end of the spectrum, all 80 routers in
Figure 4.3 can be turned on to support high network load during data-intensive phases of
an application’s execution. At the other end, when the load intensity allows it, only 1
router (white) needs to be fully powered-on while 40 routers (gray) can be partially pow-
ered-off and 39 routers (black) can be fully powered-off, allowing maximum static ener-
gy savings. More importantly, network connectivity is guaranteed at all times, so that any
packet can always use the resource set S as the last resort for transporting packets regard-
less of the on/off status of other resources, thus eliminating wakeup latency from the crit-
ical path of packet forwarding.
4.2.3 Dynamic Traffic Diversion
While the guaranteed connectivity approach lays the foundation for effective power-
gating of Clos NoCs, it accomplishes only part of our objective as packets are not auto-
matically concentrated to only those resources needed for a specific load. In order to per-
form power-gating in a more coordinated fashion, we propose dynamic traffic diversion
which systematically steers traffic to certain resources based on prevailing load condi-
tions to 1) allow non-essential resources to be powered off via concentration and 2) grad-
ually power on more resources as load increases to reduce contention and balance per-
formance via distribution. To achieve these objectives, an appropriate metric is first se-
lected for monitoring traffic intensity and then, based on the load status, the routing algo-
rithm is augmented to enforce a network-wide consistent order of concentrating and dis-
tributing traffic to essential resources. Finally, a handshaking mechanism is carefully
designed to power on/off resources correctly. The details are explained below.
Traffic Intensity Metric: An appropriate metric is needed as an indicator of traffic in-
tensity. R. Das, et al. found that several intuitive metrics are actually ineffective in as-
sessing load status [30]. For example, the metric of average buffer occupancy per router
does not perform well as some input buffers along the congested paths may be heavily
occupied while the average occupancy is still low. Injection rate also is not satisfactory as
there is no universal threshold that works well for all traffic patterns (e.g., uniform ran-
dom and transpose saturate at different injection rates, making it difficult to choose a pre-
determined threshold). In addition, the average blocking delay per flit is theoretically
accurate but prohibitively expensive to implement in practice. Therefore, the use of the
maximum buffer occupancy as an appropriate metric is suggested [30], where the occu-
pancy of each input port is counted, and then the maximum value among all the input
ports is computed and compared with predetermined thresholds. A similar metric with a
slight difference is used for MP3 in that the thresholds are adjusted based on the number
of powered-on VCs to make the metric suitable for both partially-on and fully-on routers.
This metric allows the threshold to be determined empirically and performs well for dif-
ferent traffic patterns and benchmarks.
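A minimal sketch of this metric computation is given below; the counter and threshold names are hypothetical, and the proportional scaling of thresholds by the number of powered-on VCs is one plausible reading of the adjustment described above.

```cpp
#include <algorithm>
#include <vector>

// Maps the maximum input-port buffer occupancy to a load level (1..4).
// baseThresholds holds empirically chosen, ascending occupancy thresholds
// for a fully-on router; they are scaled down when fewer VCs are on.
int loadLevel(const std::vector<int>& portOccupancy,   // flits buffered per port
              const std::vector<int>& baseThresholds,  // e.g., 3 ascending values
              int poweredOnVCs, int totalVCs) {
    int maxOcc = *std::max_element(portOccupancy.begin(), portOccupancy.end());
    int level = 1;
    for (int t : baseThresholds) {
        int scaled = t * poweredOnVCs / totalVCs;  // adjust for partially-on router
        if (maxOcc > scaled) ++level;
    }
    return std::min(level, 4);  // clamp to the 4-level configuration
}
```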
Routing: After an appropriate metric is selected, the next augmentation is to allow the
routing algorithm to become aware of the load status reflected in the metric and to steer
traffic accordingly. For Clos networks, packets in earlier stages have more routing free-
dom than packets in later stages. For example, in Figure 4.3, packets in input routers (IRs)
or upper routers (URs) have up to four output port choices, but packets in center routers
(CRs) or lower routers (LRs) can choose only one output port to reach the destination.
Therefore, steering of traffic is achieved during earlier stages of packet forwarding.
In the case of the example depicted in Figure 4.3, we assume the metric threshold is
divided into four ranges to create a 4-level configuration that corresponds to increasing
load conditions. The threshold moves up one level when the load condition makes the packet latency exceed the zero-load latency by 15% under the router on/off configuration for the previous level. For each router belonging to IR or UR, its four downstream routers
are numbered 1 to 4 from left to right. When the load condition of a router reaches level k
(k = 1, 2, 3, 4), the router is allowed to forward packets to its downstream routers num-
bered from 1 to k, but not above k (i.e., the leftmost k downstream routers). Adaptive
routing among the k options is used based on the number of available credits (or any oth-
er commonly used criteria). In this way, 1) no downstream routers are used beyond the
minimally needed k routers corresponding to current load conditions and 2) among the
downstream routers, utilization is maximized through load-balancing adaptive routing. At
the highest load, all four downstream routers can be used in this method, which is the
same as the no-power-gating case with no sacrifice in throughput. It is worth noting that,
by enforcing the left-to-right order at every router, the entire network agrees on a con-
sistent order of which resource set to concentrate or expand (e.g., when load is on level-1,
R0, R4, R8 and R12 will consistently all forward packets to R16), thus avoiding the inef-
ficiencies of uncoordinated power-gating. Also, if some routers at a particular stage, e.g.,
CRs, are accidently turned off, the upstream stage routers, URs, will experience higher
maximum buffer occupancy and consequently wake up more downstream routers, which
are the exact same stage CRs. This will restore the balance between load intensity and
powered-on routers.
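The steering rule itself is simple enough to sketch; the following is a minimal, illustrative version that picks among the leftmost k downstream routers using credit counts as the adaptive criterion (any other commonly used criterion would work equally well).

```cpp
#include <vector>

// Chooses a downstream router for an IR or UR at load level k (1..4).
// Downstream routers are indexed 0..3 left to right; only the leftmost k
// are eligible, and the one with the most credits is picked adaptively.
int pickDownstream(int k, const std::vector<int>& credits) {
    int best = 0;  // index 0 (leftmost) is always eligible
    for (int i = 1; i < k && i < static_cast<int>(credits.size()); ++i)
        if (credits[i] > credits[best]) best = i;
    return best;   // routers beyond index k-1 are never selected
}
```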
Handshaking: Finally, we discuss the details of the required conditions and handshak-
ing mechanism for routers to correctly transition between power states. No extra signal is
needed between up and downstream routers besides what is already provided in conven-
tional power-gating. Based on the types of routers, there are four cases (a brief code sketch of these transition conditions follows Case 4 below):
Case 1 – white routers: As white routers are always powered-on, no transition is needed.
Case 2 – black routers: Black routers transition from on to off if 1) the datapath of the
router is empty and 2) all of the wakeup signals from its upstream routers are de-asserted.
The router transitions from off to on if any of its upstream routers asserts the wakeup
signal. Here, an upstream router asserts the wakeup signal to a downstream router if a
packet needs to be forwarded to that router. An optimization for wakeup signal genera-
tion will be presented in the next subsection, but the conditions for state transitions are
the same.
Case 3 – gray routers in IRs and URs: A gray router in this category transitions from
fully-on to partially-on if 1) the metric indicates load is on level-1 and 2) the datapaths to
the three rightmost downstream routers are empty (any new incoming packets will be
forwarded only to the leftmost downstream router after detecting the low load). The rout-
er transitions from partially-on to fully-on if the load is on level-2 or above. Note that a
fully-on router does not necessarily need to forward packets to all its downstream routers.
Case 4 – gray routers in LRs and ORs: A gray router in this category transitions from
fully-on to partially-on if 1) the datapath of the resources that are not in S is empty and 2)
all of the wakeup signals from its three rightmost upstream routers are de-asserted. The
router transitions from partially-on to fully-on if any of its three rightmost upstream rout-
ers asserts the wakeup signal.
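As a concrete illustration, the sketch below captures the on/off conditions for a black router (Case 2); the structure and signal names are assumptions for illustration, not the actual handshaking logic.

```cpp
#include <vector>

// Power-state update for a black (middle-stage) router, per Case 2 above.
// Wakeup signals already exist in conventional power-gating, so no extra
// wires are assumed.
struct BlackRouterPG {
    bool poweredOn = true;

    void update(bool datapathEmpty, const std::vector<bool>& upstreamWakeup) {
        bool anyWakeup = false;
        for (bool w : upstreamWakeup) anyWakeup |= w;

        if (poweredOn && datapathEmpty && !anyWakeup)
            poweredOn = false;  // on -> off: empty datapath, no upstream demand
        else if (!poweredOn && anyWakeup)
            poweredOn = true;   // off -> on: some upstream router asserts wakeup
    }
};
```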
4.2.4 Rapid Wakeup
One intended effect of dynamic traffic diversion is to reduce performance degradation
caused by power-gating, as more resources will eventually be turned on to accommodate
the increase of traffic in the long run. However, transient contention may still be possible
during the time that the new resource is being awoken. For example, suppose the network
load suddenly jumps from level-1 to level-2 at R0, so that R0 tries to wake up R20 to dis-
tribute the traffic. Yet, R20 will not be ready until it is fully awake after the unhidden
portion of wakeup latency, during which the packets still have to be forwarded to R16.
Then, after R20 is powered on, packets routed through R20 will find that R36 is asleep.
So again, packets need to wait for R36 to wake up, and so on. In such cases, while R20,
R36, R52 are sequentially waking up, incoming packets are queued in the input buffers
along this path. When the backpressure propagates back to R0, most of the new packets
of level-2 load are still forwarded through R16. This leads to transient contention since
the resources from R16 and beyond are supposed to handle only level-1 load without con-
tention.
To avoid this type of pathological performance degradation, we propose rapid
wakeup, which relays the wakeup signal from upstream routers to downstream routers in
a chained fashion to wake up needed downstream routers in advance so that those routers
will be ready when packets arrive. In order to realize rapid wakeup effectively, the key is
to minimize the needed router set, which is achieved by limiting the breadth and depth of
the signal relay tree from the upstream router. First, to limit the breadth of the relay tree,
the wakeup signal is relayed to only one downstream router if multiple options are avail-
able. For instance, R20 relays the wakeup signal only to R36 as R37-R39 are not addi-
tionally needed for packets to reach any destination provided that R36 is powered on. In
contrast, R36 relays the wakeup signal to R52-R55 as they are indispensable for packets
to reach any destination. This is because the destination of a particular packet is unknown
beforehand and, more importantly, most of the destinations will in fact be visited since a
batch of packets likely will arrive due to the load increase.
Second, notice that packets themselves take a few cycles to traverse each router, so
downstream routers that are several hops away do not need to be awoken too early. In
general, an N_hop-away downstream router can wake up in time if

    N_hop × T_link + T_unhidden_wakeup ≤ N_hop × (T_router + T_link)

This reduces to

    N_hop_min = T_unhidden_wakeup / T_router
which is about 2-3 hops depending on actual parameter values. This means that the relay
depth only needs to be 2-3 hops. After limiting the breadth and depth, the remaining relay
tree from a particular router is minimal in the sense that all the remaining relays are nec-
essary and any delay in waking up these downstream routers will cause some perfor-
mance penalty. Note that, although the gated-off time of these routers may be slightly
reduced, the reduced amount is only a few cycles, upper-bounded by T_unhidden_wakeup, assuming the above formula is used to limit the depth while still being able to wake up in time. The
majority of routers are not affected. Hence, rapid wakeup can largely remove the transi-
ent contention penalty while retaining most of the power-gating opportunities. Moreover,
since the wakeup signal is required for power-gating anyway, no additional signaling
network is needed.
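As a worked instance of this bound, the sketch below plugs in the parameter values used elsewhere in this chapter (3-cycle router traversal, 8-cycle wakeup latency with 3 cycles hidden by early-wakeup); treating these as the operative values is an assumption, and the ceiling is taken because hop counts are integral.

```cpp
#include <cstdio>

// Computes the minimal relay depth N_hop_min = ceil(T_unhidden_wakeup / T_router):
// routers at least this many hops away wake up in time when the wakeup
// signal is relayed immediately along the chain.
int main() {
    const int T_router = 3;  // cycles to traverse one router (assumed)
    const int T_wakeup = 8;  // total wakeup latency (circuit-level estimate)
    const int T_hidden = 3;  // portion hidden by early-wakeup signaling
    const int T_unhidden_wakeup = T_wakeup - T_hidden;

    const int N_hop_min = (T_unhidden_wakeup + T_router - 1) / T_router;
    printf("relay depth needed: %d hops\n", N_hop_min);  // prints 2 here
    return 0;
}
```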
4.2.5 Impact of MP3 on Performance and Energy
Putting the three techniques together, the proposed MP3 scheme fully exploits the
power-gating potential offered by indirect Clos networks while effectively addressing its
performance and energy challenges.
From the performance perspective, the guaranteed connectivity technique is first used
to remove the wakeup latency from the critical path of packet forwarding. Then, dynamic
traffic diversion is used to guard against contention in the long-run and rapid wakeup is
used to reduce transient contention. Therefore, MP3 removes all the major sources of
possible performance degradation, thereby minimizing performance penalty of power-
gating the Clos.
From the energy perspective, guaranteed connectivity enables a wide spectrum of en-
ergy-performance tuning opportunities by constructing a minimally needed resource set.
Dynamic traffic diversion then utilizes these opportunities to coordinate router power-
gating by steering traffic and turning on/off resources dynamically. This not only extends
the idle periods of the majority of routers, but also reduces the number of wakeups and
the associated energy overhead that causes BET limitation in the first place. As a result,
MP3 is able to save more static energy with less energy overhead, thus effectively in-
creasing the net energy savings, as shown by evaluation results given in the next section.
4.3 Evaluation Methodology
The proposed MP3 scheme is evaluated quantitatively under full system simulation
with the combined use of multiple architecture-level and circuit-level simulators. Cycle-
accurate SIMICS and GEMS are used for processor functional and memory timing simu-
lation. GARNET [6] is used for detailed NoC performance evaluation, from which the
network activity statistics are collected and fed into DSENT [108] for network power
estimation. The simulators are modified to model all the key additional hardware in MP3,
such as handshaking logic, buffer occupancy comparators, wakeup signal relay, and so on.
Each PE in the network contains an UltraSPARC III+ core running at 2GHz, a 32KB I/D
private L1 cache and a 256KB shared L2 cache slice. Coherence is managed by the
MOESI protocol. Four memory controllers are provided. The cache and memory control-
ler on a PE share the injection channel in the network interface. All topologies under
comparison have the same bisection bandwidth of 1TB/s. To accurately reflect the link
delays of the Clos, the floorplan optimization in [67] is followed to estimate every link
length. GARNET is configured to have the delay of each link to be proportional to its
length. A canonical 3-stage router with look-ahead routing [71] is modeled. Two virtual
channels per message class are provided, though MP3 can achieve more energy savings
with more VCs, as mentioned in Section 4.2.2. Also, as the Clos has more bisection links
than the mesh, for comparison purposes, both the Clos and mesh networks are configured
with the same total bisection bandwidth and the same total buffer sizes.
Given that the metric of maximum buffer occupancy is insensitive to traffic patterns
(one of the main benefits), thresholds for congestion levels are determined empirically.
However, router wakeup latency has a large impact on system performance. To estimate
wakeup latency accurately, the physical layout of a router at 45nm technology with 1.0V
voltage is generated using a standard VLSI design flow. Synopsys Design Compiler is
used for logic synthesis, and Cadence Encounter is used to process the gate-level netlist
to generate the power grid, floorplan, clock trees and routes. Parasitic extraction is per-
formed on a 451um-by-451um layout to obtain the parasitic resistance and capacitance as
well as the cell load on the Vdd wiring. Finally, the extracted data is fed into a SPICE RC
model, providing a wakeup latency of 8 cycles. Because of the criticality of wakeup la-
tency, additional sensitivity studies are also conducted to shed more light on the applica-
bility of different schemes.
The following schemes are compared on a 64-core system: (1) Mesh-No-PG: mesh
network with no power-gating; (2) Mesh-ConvOpt-PG: conventional power-gating of
mesh optimized with early-wakeup and idle-detect – these optimizations not only im-
prove performance by hiding 3 cycles of wakeup latency, but also reduce energy over-
head by avoiding powering-off all idle periods that are shorter than 4 cycles; (3) Clos-No-
PG: Clos network with no power-gating; (4) Clos-ConvOpt-PG: conventional power-
gating of Clos with early-wakeup and idle-detect optimizations; (5) Clos-MP3: Clos with
the proposed power-gating scheme. All the five schemes allow adaptive routing for fair
comparison. While the mesh is included in the evaluation as a point of reference, the
main objective is to evaluate the power-gating opportunities of Clos and how its power-
gating potential can be exploited by our proposed scheme.
4.4 Results and Analysis
4.4.1 Impact on Performance
As one of the primary targets, we first examine the performance impact of different
schemes by running multi-threaded PARSEC benchmarks [14]. Figure 4.5 compares the
Figure 4.5: Average packet latency comparisons.
average packet latency, and Figure 4.6 shows the execution time of the five schemes
normalized to Clos-No-PG. Results are consistent across the range of benchmarks. As
Mesh-No-PG and Clos-No-PG do not use power-gating, they provide a lower bound of
performance for the mesh and Clos, respectively.
As can be seen from Figure 4.5, even with early-wakeup and idle-detect optimizations,
the conventional power-gating scheme for the mesh, Mesh-ConvOpt-PG, still significant-
ly increases the average packet latency by 64.5% on average compared with Mesh-No-
PG; whereas Clos-ConvOpt-PG causes 38.6% increase in the average packet latency
compared with Clos-No-PG. This indicates that the indirect network nature of Clos in-
deed helps to reduce performance degradation as compared to the mesh, but it still cannot
entirely mitigate the negative effects of wakeup latency. In contrast, Clos-MP3 complete-
ly removes the wakeup latency from the critical path and reduces both long-term and
transient contention. Consequently, Clos-MP3 achieves a remarkable reduction of aver-
age packet latency, having only 1.8% increase on average. This is equivalent to a 36.8%
Figure 4.6: Execution time comparisons.
improvement compared with Clos-ConvOpt-PG. Similar trends are also reflected in exe-
cution time. As shown in Figure 4.6, Mesh-ConvOpt-PG and Clos-ConvOpt-PG increase
execution time by 36.3% and 14.9% on average, respectively. In comparison, the pro-
posed Clos-MP3 incurs a minimal increase of only 0.65% in execution time, effectively
realizing near-zero performance penalty.
4.4.2 Impact on Router Static Energy
The performance advantage of Clos-MP3 does not sacrifice its energy savings at all.
Figure 4.7 presents the results of router static energy of different designs normalized to
Clos-No-PG. As can be seen from the figure, Mesh-ConvOpt-PG reduces the router static
energy by 38.2% relative to Mesh-No-PG. In comparison, Clos-ConvOpt-PG is able to
reduce the static energy by 41.1% relative to Clos-No-PG, which is slightly better than
Mesh-ConvOpt-PG due to the inherent suitability of Clos for power-gating. The lowest
router static energy is achieved in the proposed Clos-MP3, with an average reduction of
Figure 4.7: Router static energy comparisons.
47.7%. This improvement mainly is attributed to the ability of Clos-MP3 to dynamically
concentrate traffic and actively create power-gating opportunities. When compared rela-
tively, the proposed Clos-MP3 saves 9.8% more router static energy than Clos-ConvOpt-
PG. This highlights the effectiveness of Clos-MP3 in both providing higher performance
and lower energy simultaneously.
4.4.3 Comparison of Power-gating Overheads
To gain more insight into the ability of Clos-MP3 to reduce unnecessary wakeups by
steering traffic, Figure 4.8 shows the energy overhead (left vertical axis) caused by router
wakeup for the conventional power-gating schemes and the Clos-MP3 scheme, normal-
ized to Mesh-ConvOpt-PG. As can be observed, the power-gating overhead in Clos-MP3
is substantially lower than the other two schemes. Figure 4.8 further compares the reduc-
tion in the total number of wakeups in the different schemes (right vertical axis). Whereas
Clos-ConvOpt-PG decreases the number of wakeups by 60.3% compared to Mesh-
Figure 4.8: Comparison of power-gating overhead.
ConvOpt-PG, Clos-MP3 is able to reduce wakeups by 87.6%, on average, owing to its
coordinated power-gating among all routers in the network. This explains the large reduc-
tion in energy overhead and demonstrates the usefulness of Clos-MP3.
4.4.4 Impact on NoC Energy
Figure 4.9 plots the breakdown of NoC energy across the benchmarks normalized to
Mesh-No-PG, showing the relative impact of each energy component. Several observa-
tions can be drawn from the figure. First, although Clos may consume more link energy
than mesh for the no power-gating cases, the total NoC energy of the Clos is still lower
than that of the mesh, indicating that Clos is a competitive NoC topology. Second, the
large power-gating overhead in Mesh-ConvOpt-PG makes it very ineffective, leading to a
less than 6.3% reduction in overall NoC energy; whereas Clos with conventional power-
gating saves 19.4% of overall NoC energy. Third, the proposed Clos-MP3, while signifi-
cantly reducing the performance penalty, can also save 22.5% of NoC energy. This
means that, compared with the state-of-the-art power-gating scheme for Clos (i.e., Clos-
ConvOpt-PG), MP3 is better in terms of both performance and energy.
4.4.5 Comparison across Full Network Load Range
In order to understand the behavior of different schemes more fully, the schemes are
examined under synthetic traffic in which the network load is varied across the full range:
from zero load to saturation load. Figure 4.10 presents the performance and power results
for uniform random, transpose and bit-complement traffic patterns. On the performance
side, while typical behavior is observed for Clos-No-PG, interesting results are found for
Clos-ConvOpt-PG. It can be seen that, at low load, the average packet latency of Clos-
ConvOpt-PG is actually very high. This is because many routers are gated-off at this load,
so packets are likely to experience wakeup latency multiple times (i.e., cumulative
wakeup latency), which increases the packet latency considerably. When load increases,
the average packet latency first decreases as more routers are awoken, and then starts to
rise again as load approaches saturation. In contrast, the average packet latency of Clos-
MP3 follows Clos-No-PG closely across the entire load range, showing that it only incurs
minimal performance penalty. It is important to note that Clos-MP3 can indeed reach the
maximum throughput of the no-power-gating case. This means that all routers can be
correctly woken up in Clos-MP3 if needed, which is important and necessary for support-
ing high network load phases of application execution.
When static power is compared, the proposed Clos-MP3 clearly has a significant ad-
vantage for various traffic patterns. As shown in the figure, the static power savings of
Clos-ConvOpt-PG are less than 10% when the load rate only reaches 50% of saturation.
In comparison, Clos-MP3 saves more than 10% of the static power even when load pass-
es 75% of saturation. These results suggest that Clos-MP3 is much more energy-
proportional than conventional power-gating.
Figure 4.9: Breakdown of NoC energy (Link, Router_dynamic, Router_static, Power-gating overhead; norm. to Mesh-No-PG).
Figure 4.10: Behavior across full range of network loads. (a) Uniform random; (b) Transpose; (c) Bit-complement.
4.4.6 Effect of Rapid Wakeup
Simulations are also performed to demonstrate the ability of rapid wakeup to reduce
transient contention. To assess this effect quantitatively, the injection rate is quickly in-
creased from 5% to 25% when the simulation time passes the 10k-cycle mark (sufficient
for reaching steady state in synthetic uniform random traffic). Figure 4.11 plots the
changes of average packet latency as more resources are waking up to accommodate the
new load. Compared with the normal wakeup, rapid wakeup mitigates transient conten-
tion in two ways: 1) rapid wakeup stabilizes the packet latency within 75 cycles, which is
42% shorter than the normal wakeup and 2) rapid wakeup also reduces the peak increase
of average packet latency during the transition by 34%. Due to these features, rapid
wakeup is very helpful in minimizing performance penalty of Clos-MP3.
4.4.7 Wakeup Latency Tolerance
As mentioned previously, cumulative wakeup latency is a big obstacle for reducing
the performance overhead of conventional power-gating, particularly in multi-hop net-
Figure 4.11: Rapid wakeup.
Figure 4.12: Wakeup latency tolerance.
works. The evaluation so far has shown that Clos-MP3 incurs minimal performance pen-
alty when the wakeup latency is 8 cycles (which is obtained from our detailed circuit-
level simulation). To illustrate that Clos-MP3 can effectively address the challenge of
wakeup latency, Figure 4.12 compares the average packet latency of Clos-No-PG, Clos-
ConvOpt-PG and Clos-MP3 with varying values of wakeup latency. The load rate is set
to the average load rate of PARSEC benchmarks. As can be seen, the average packet la-
tency of Clos-ConvOpt-PG increases by 56% when the wakeup latency increases from 5
to 14 cycles; whereas the latency of Clos-MP3 remains very similar (less than 3.5% in-
crease) for different wakeup latencies. This demonstrates the ability of Clos-MP3 to hide
wakeup latency and its wide applicability to various designs (e.g., under different fre-
quencies).
4.4.8 Discussion
Scalability
The proposed Clos-MP3 does not have any particular element that limits its scalabil-
ity (e.g., no central controller, no global signaling, etc.) and can be used for any size of
Clos NoCs. Thus, the scalability of Clos-MP3 is only bounded by the Clos topology itself
which has been shown to have similar scalability as mesh NoCs [67].
High-radix Routers
Thus far, a 5-stage Clos is used as a case in point to illustrate the proposed MP3
scheme. This Clos example is compared to a traditional mesh network that is also com-
posed of low radix routers. When higher radix routers are allowed under design con-
straints (e.g., to meet certain clock frequency criteria), several other topologies are avail-
able to increase network performance. For example, with an 8x8 router radix, mesh can
use a concentration degree of 4 to reduce the network diameter to 6 for a 64-node system
[9], as shown in Figure 4.13(a) [44]. Flattened butterfly [73] can further reduce this di-
ameter to 2 by directly connecting the nodes in a dimension, with a router radix of 10x10,
as shown in Figure 4.13(b) [44]. In addition, folded Clos (fat-tree) also has a network
diameter of 2 with 8x8 router radix. While these topologies are able to reduce packet la-
tency considerably, Clos remains competitive given that a 3-stage Clos can also achieve a
network diameter of 2 with 8x8 router radix, resulting in a similar reduction of packet
latency.
Prior research has shown that high-radix Clos networks have comparable hardware
complexity but higher power efficiency (assuming no power-gating) than several other
mainstream topologies [67]. However, in terms of static power savings potential, the
aforementioned topologies (i.e., concentrated mesh, flattened butterfly and fat-tree) are
Figure 4.13: Concentrated mesh and flattened butterfly with high-radix routers. (a) Concentrated mesh; (b) Flattened butterfly.
much more limited by the router-PE coupling than Clos. This is because, with high-radix
routers, a packet to any of the many input ports needs to wake up the router, which re-
duces the router idleness and causes wakeup delay. In contrast, this effect is greatly miti-
gated in Clos with the MP3 technique. For example, all the 8 input routers and 8 output
routers in a 3-stage Clos can benefit from the two-domain partial power-gating (which is
even better than the 5-stage Clos as now roughly only 1/8th of the router needs to be
turned on minimally). Dynamic traffic diversion also works better due to increased adap-
tivity (8 outputs to choose from at each router). Rapid wakeup may have reduced benefits
but can still hide the majority of wakeup latency. Hence, high-radix Clos can have similar
packet latency advantages as other high-radix topologies while being a better target for
power gating. These findings together with the 5-stage Clos example presented in previ-
ous sections lead to the conclusion that Clos is a competitive topology for both low-radix
and high-radix networks.
Applicability
The proposed MP3 scheme is an important extension to enhance the power-gating ca-
pabilities of a variety of interconnection networks. First, MP3 is applicable to both on-
chip and off-chip Clos networks with different radices and network sizes. Second, MP3
can also be applied to other topologies that have multiple node-disjoint paths (excluding
edge routers), which are often provided in indirect networks such as Beneš, Omega, and
non-flattened butterfly with extra stages. A similar methodology of dividing the edge
routers into two power domains and dynamically turning on and off edge/middle routers
with wakeup signal relay can be applied. The proposed MP3 scheme, however, has lim-
ited applicability to direct networks (even with multiple paths) due to direct coupling
between router and PE that may trigger the power-state to transition very frequently and
because the two power domains may not be sufficient to maintain network connectivity
(e.g., concentrated mesh requires all inputs and outputs to be on).
Other Related Work
A couple of related works have already been mentioned in previous sections. In addi-
tion, a multiple network-on-chip power-gating design is proposed in [30]. This design
mainly targets power-gating mesh networks, and the increase in average packet latency is
sizable. A router parking scheme is introduced in [103] to power-gate routers in meshes
when the core is idle, but it needs to flush private caches before turning off routers, which
may cause serious performance issues. Our work differs from these works in that we ex-
plore power-gating opportunities for Clos networks, and the proposed scheme is able to
entirely remove wakeup latency from the critical path, thus achieving near-zero perfor-
mance penalty.
Some work has gone into improving Clos for off-chip interconnects [104], and recent
research has shown that it is also very promising to adopt Clos for on-chip networks [63,
64, 67, 126] as new floorplan and layout techniques emerge. However, none of the off-
chip or on-chip works looked into the power-gating of Clos. Our work provides insight
on the power-gating characteristics of Clos networks and helps to facilitate more efficient
use of Clos NoCs.
4.5 Summary
Current and future many-core systems require on-chip networks to be designed with
both power and performance awareness. While mesh networks have several fundamental
limitations for effective power-gating, this chapter presents the power-gating opportuni-
ties and challenges of Clos networks. To combat the various limitations and inefficiencies
in conventional power-gating of Clos, a minimal performance penalty power-gating
scheme (MP3) is proposed. MP3 not only removes the wakeup latency from the critical
path and reduces long-term and transient contention, but also actively steers network traf-
fic to create increased power-gating opportunities coordinated globally across the net-
work. Simulation results show significant reduction in the performance penalty while
saving more router static energy than conventional power-gating. These results demon-
strate the viability of using power-gating in NoCs with only minimal performance over-
head.
Chapter 5
Reducing Resources for VCT-switched Networks
While previous chapters focus on reducing the number of powered-on resources in
a dynamic manner to reduce power, it is also important to reduce the resource requirements
for maintaining correct NoC operation in the first place. This allows on-chip networks to
be built with minimal hardware and allows more resources to be powered off dynamical-
ly, thus widening the entire spectrum of trade-offs between power, resource and perfor-
mance. This chapter discusses how resources requirements can be reduced effectively for
VCT-switched networks; the next chapter discusses the case for wormhole-switched net-
works.
Resource requirements for interconnection networks are usually lower-bounded by
the amount of resources needed to provide deadlock-freedom. Along with routing, net-
work flow control aims to maximize resource utilization while preventing oversubscrip-
tion and deadlock in resource usage. It is, thus, critical to on-chip networks.
Network flow control mechanisms that are aware of global conditions potentially can
achieve higher performance and resource efficiency than flow control mechanisms that
are only locally aware. Owing to high implementation overhead, globally-aware flow
control mechanisms in their purest form are seldom adopted in practice, leading to less
efficient simplified implementations. In this chapter, we present an efficient implementa-
tion of a globally-aware flow control mechanism, called Critical Bubble Scheme [18, 20],
for VCT-switched k-ary n-cube networks. This scheme achieves near-optimal perfor-
mance with the same minimal buffer requirements of globally-aware flow control and can
be further generalized to implement the general class of buffer occupancy-based network
flow control. The proposed scheme is also very useful in handling protocol-induced dead-
locks in on-chip environments.
5.1 Globally-Aware Flow Control
5.1.1 Pros and Cons of Globally-aware Flow Control
Fundamental metrics used to evaluate interconnection networks (such as throughput,
latency, power and cost) are global measures as overall performance rarely is determined
by the status of a given link or router but, rather, on communication paths comprised of
multiple links and routers across the network. If resource allocation decisions are made
locally at nodes, the effect should be optimized over the entire network. Given this, glob-
ally-aware flow control in which nodes take into consideration conditions from across the
network in making decisions locally is preferred over locally-aware flow control where
only local information is taken into account.
Globally-aware flow control has at least three advantages. First, as the status across
the network typically is non-uniform, a locally-aware node may have quite different local
status information than remote nodes. Hence, an allocation decision based solely on this
information may not only be suboptimal but can even be counter-optimal. Second, as
mentioned in [111], global awareness allows changes in network conditions to be detected earlier than with local information only, as the latter suffers the propagation delay of backpressure to the local node. Third, globally-aware flow control enables finer-
granularity in optimally tuning network resources and restrictions on the use of those
resources. For example, local-only flow control may unnecessarily place restrictions on
the use of some set of local resources (e.g., channel buffers), thus requiring more re-
sources as every node must enforce the same restrictions. With globally-aware flow con-
trol, local restrictions can be eased by enforcing restrictions amortized over a larger set of
resources.
Along with these advantages, unfortunately, are some challenges in efficiently im-
plementing globally-aware flow control. Key global information about network condi-
tions must be gathered and acted upon by an omniscient global controller. This requires
either prohibitively high overhead cost to implement or simplified implementations
which can be quite inefficient. The following discusses a representative example.
5.1.2 Bubble Flow Control
Bubble Flow Control proposed in [100] is a well-known globally-aware flow control
mechanism that reduces to a locally-aware flow control mechanism in simplified form. In
what follows, we use the term Theoretically-optimal Bubble Flow Control (Theoretical
BFC) to refer to the theoretically-optimal instantiation of Bubble Flow Control and use
the term Localized Bubble Flow Control (Localized BFC) to refer to the simplified im-
plementation in which only local information is used to control bubble flow. This is the
scheme adopted in BFC implementations to-date [5, 17]. Also in what follows, a bubble
denotes a free packet-sized buffer.
Theoretical BFC
Bubble Flow Control is applicable to k-ary n-cube networks (e.g., 2D tori). Fully
globally-aware in its theoretically-optimal form, it applies virtual cut-through flow con-
trol in a way to avoid deadlock while requiring nominal buffer resources across the net-
work. Dimension-order routing (DOR) in tori eliminates cyclic routing dependencies that
can occur across various network dimensions. It does not prevent, however, cyclic de-
pendencies and routing-induced deadlock that can occur within dimensions caused by
torus wraparound links. A classic solution to this deadlock problem is to use a dateline
technique in which two virtual channels (high and low) are associated with each physical
channel [26]. When packets transported on low channels cross the dateline, they switch to
high channels thus breaking cyclic dependency within network dimensions. A drawback
of this approach, however, is the requirement of two virtual channels per link with their
corresponding buffer resources.
Theoretical BFC reduces buffer requirements to only one virtual channel by imposing
two simple rules on injection and forwarding of packets across dimensions. The idea is to
prevent packets from using the potentially last free buffer of a dimensional ring [100].
The two rules are the following.
(i) Forwarding of a packet within a dimension is allowed if the receiving channel
buffer has at least one packet-sized free buffer, i.e., a bubble.
(ii) Forwarding of a packet from one dimension to another (including injection of a new packet into a dimension) is allowed if the receiving channel buffer has a packet-sized free buffer and there is at least an additional free buffer located anywhere among the channel buffers of any router within that directional ring.
The first rule is the same as virtual cut-through flow control. In order to understand
the second rule, Figure 5.1 shows a simple illustration. Let’s say this ring is a unidirec-
tional ring of an arbitrary dimension in a k-ary n-cube network. Shaded rectangles indi-
cate full buffers whereas non-shaded ones indicate empty buffers. Packet P wishes to
enter into this dimension either from a different dimension or from an injection point (i.e.,
attached processor node). This access will be allowed only if the receiving channel buffer
in Router I has a packet-sized free buffer, and there is an additional free buffer located
anywhere (for example Router J) in the same direction. In this way, after accepting pack-
et P, there is always at least one free buffer in the ring, which guarantees that at least one
packet is able to make progress. This free buffer acts as a bubble and ensures deadlock
freedom. A detailed mathematical proof can be found in the original paper [100].
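To make these admission checks concrete, the following C++ sketch models a unidirectional ring as per-router free-buffer counts and applies the two rules. It is an illustration only; the names (RingState, canForward, canInject) are ours rather than from [100]. Note how rule (ii) presumes ring-wide occupancy information.

    #include <cstddef>
    #include <vector>

    // Minimal model of one unidirectional ring: each entry is the number of
    // packet-sized buffers currently free in that router's channel buffer.
    struct RingState {
        std::vector<std::size_t> freeBuffers;
    };

    // Rule (i): intra-dimensional forwarding needs one free packet-sized
    // buffer (i.e., a bubble) at the receiving channel.
    bool canForward(const RingState& ring, std::size_t receiver) {
        return ring.freeBuffers[receiver] >= 1;
    }

    // Rule (ii): injection or dimension change needs a free buffer at the
    // receiver plus at least one additional free buffer anywhere in the
    // same ring -- a check requiring global knowledge of ring occupancy.
    bool canInject(const RingState& ring, std::size_t receiver) {
        if (ring.freeBuffers[receiver] < 1) return false;
        std::size_t total = 0;
        for (std::size_t f : ring.freeBuffers) total += f;
        return total >= 2;  // the receiving bubble plus one more elsewhere
    }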
Difficulties with Theoretical BFC
The allocation of buffer resources in Theoretical BFC requires buffer utilization in-
formation of other nodes in the directional rings of the network. A major difficulty in
implementing Theoretical BFC is the need for a global controller. The global controller
must gather and distribute global information about free buffer status within each net-
work ring so that every node has sufficient global knowledge to decide on the allocation
of buffers to requesting packets. Even with perfect global knowledge of free buffer status
provided by the global controller, Theoretical BFC requires additional global control to
handle multiple simultaneous injection requests in a deadlock-free manner. Consider the
scenario in Figure 5.2 in which, in the same cycle, two packets request to inject into the
same directional ring containing only two free buffers. At most only one of the injections
should be granted to avoid a potential deadlock configuration. However, according to the
rules in [100] defining Theoretical BFC, each of the two routers will allow the injections
based on global knowledge of the existence of two free buffers in the dimensional ring.
Hence, besides enforcing the rules as defined in [100], a global controller must also de-
termine which one of the multiple simultaneous injection requests should be granted to
avoid deadlock and eliminate uncertainty. The above major difficulties hinder the adop-
tion of Theoretical BFC in practice.
Localized BFC and Its Shortcomings
To obviate the need for a global controller, a Localized BFC scheme was proposed
and adopted to simplify BFC implementation [5, 100]. Instead of checking for the exist-
ence of two free buffers anywhere along the directional ring as in Theoretical BFC, Localized BFC checks only that there are two free buffers in the channel buffer of the receiving router.

Figure 5.1: Theoretical BFC requires global information in the ring to avoid deadlock.

Figure 5.2: Simultaneous injection in Theoretical BFC requires global coordination.

For example, if the channel buffer of Router I in Figure 5.1 has two free
buffers, access will be granted to packet P. This simplification essentially reduces the
globally-aware Theoretical BFC technique to a locally-aware technique as, now, all deci-
sions are made based solely on local information as opposed to global information across
the network’s dimensional ring.
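In contrast to the ring-wide check sketched above, the localized simplification reduces to a single, purely local predicate (the function name is again ours):

    #include <cstddef>

    // Localized BFC: rule (ii)'s ring-wide condition is approximated by
    // requiring two free packet-sized buffers in the local receiving
    // channel; no information beyond the local router is consulted.
    bool canInjectLocalized(std::size_t localFreeBuffers) {
        return localFreeBuffers >= 2;
    }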
Localized BFC has three shortcomings that can degrade performance. First, by requir-
ing two free buffers in the local channel buffer for packets to enter a new dimension, Lo-
calized BFC increases the buffer access delay component of packet latency. As shown in
Figure 5.3, both packets satisfy the Theoretical BFC rules and should be granted access
given that overall buffer occupancy within the dimension is far from saturation. However,
both accesses are denied with Localized BFC as only one free buffer exists in the corre-
sponding local channels. Hence, the packets suffer a longer buffer access delay than in-
curred by Theoretical BFC.
Figure 5.3: Localized BFC decisions are not optimal across the network.

Figure 5.4: Localized BFC requires more free buffers than Theoretical BFC.
Second, Localized BFC does not use channel buffers to their theoretically optimal
level as more than the minimum number of required buffers remain unused within each
dimension. Figure 5.4 depicts a representative scenario. Five packets wish to enter the
directional ring simultaneously. According to Theoretical BFC, the minimum number of
free buffers needed is six: one for each of the five packets plus one bubble to avoid dead-
lock. However, Localized BFC requires there to be at least ten free buffers: two free
buffers for each packet. Thus, in this case, Localized BFC requires 66% more free buffers
to guarantee deadlock freedom, which is highly inefficient.
Third, in this simplified implementation of BFC, the minimum number of buffers
needed to avoid deadlock in each channel buffer of routers now becomes two instead of
one as required by Theoretical BFC. This introduces another inefficiency, particularly
when channels have shallow/minimum buffers. For example, given tight buffer budgets,
some on-chip network or network-on-chip (NoC) designs may allow for only one packet-
sized buffer per channel, which precludes the use of Localized BFC. Since NoC-based
systems are gaining increasing importance in recent years, the following subsection dis-
cusses this issue in more detail.
5.1.3 Inefficiency in Handling Protocol-Induced Deadlock
So far, the deadlock that we have discussed is routing-induced deadlock. However,
there is also another type of deadlock – protocol-induced or message-dependent deadlock
that can occur in computer systems such as shared-memory chip multiprocessors (CMPs).
While its formal model and detailed description are provided in [105, 116], essentially,
message-dependent deadlock is caused by multiple dependent message classes in cache
coherence protocols. For example, reply messages can only be generated by nodes receiv-
ing request messages. In other words, reply messages depend on request messages. This
inter-message dependency creates additional arcs in the channel dependency graph, and if
it happens to complete a cycle, then deadlock may occur.
To avoid message-dependent deadlock, separate logical networks (e.g., implemented
as virtual channels) are often employed for separating different message classes [26].
Since the actual frequency of deadlock typically is very low [99, 105], those logical net-
works acting as escape channels should be designed to use as few resources as possible
so that the remaining buffer resources can be shared among all message classes as adap-
tive channels, thus maximizing resource utilization [85].
However, when applying Localized BFC to handle message-dependent deadlock in
torus networks, resource utilization is far from optimal. For example, in the MOESI di-
rectory cache protocol [84], different message types can be classified into three depend-
ent message classes which require at least three independent layers of virtual channels in
Localized BFC to serve as escape resources. This would increase to six layers if the tra-
ditional dateline technique is used. Each layer of virtual channels for Localized BFC
would require at least two buffers per VC. An additional layer of virtual channels with at
least one buffer per VC is needed to implement minimal adaptive routing. This configura-
tion devotes a large portion of available resources (i.e., 3 out of 4 virtual channel layers,
or 6 out of 7 buffers) to avoid message-dependent deadlock in adaptive routing and safe-
guard against likely rare cases of potential deadlock.
5.2 Critical Bubble Scheme
5.2.1 The Basic Idea
The proposed Critical Bubble Scheme is an efficient implementation of Theoretical
BFC that retains global awareness properties for realizing near-maximal benefits, while
avoiding costly global coordination. The principle behind the Critical Bubble Scheme is
to mark and track as critical at least one bubble in each directional ring of a network and
restrict the use of critical bubble(s) only to packets traveling within dimensions to prevent
deadlock, as in Theoretical BFC. Critical bubble(s) flow within and are confined to direc-
tional rings of the network. Their movement is tracked using nominal control signals be-
tween neighboring routers. In essence, their presence (or non-presence) at router nodes
conveys global information across each network dimension about what restrictions
should be enforced locally to avoid deadlock.
Figure 5.5: The Critical Bubble Scheme avoids deadlock without requiring explicit global coordination.
As an illustration of how the scheme works, consider an arbitrary ring within a 2D to-
rus as shown in Figure 5.5. Initially, for each unidirectional ring of the network, a free
buffer in any router along the ring is marked as the critical bubble for the entire ring (i.e.,
striped rectangle). This critical bubble is transferred backwards between routers and
tracked along the ring as intra-dimensional packets displacing it move forward. Assume a
packet P from another dimension wishes to enter this dimension through router R. If the
one free buffer in R’s local receiving channel is not the critical bubble (i.e., it is a non-
critical bubble as shown in the figure), the packet is allowed to enter this dimension (i.e.,
it is allocated the free buffer) immediately. The packet need not wait for a second buffer
in the local channel to become free as in Localized BFC nor does it have to check buffer
occupancy of other nodes along the ring for explicit global awareness. The absence of the
critical bubble at the local router indicates the existence of a free buffer elsewhere in the
ring, thus providing implicit global awareness.
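As a minimal sketch of this implicit global awareness (illustrative names; the formal rules are given in Section 5.2.2), the local decision needs only a per-channel free-buffer count and a record of whether the critical bubble currently resides here:

    #include <cstddef>

    // Per-channel state visible to the local router only.
    struct ChannelState {
        std::size_t freeBuffers;   // free packet-sized buffers in this channel
        std::size_t criticalHere;  // 1 if this channel holds the critical bubble
    };

    // Intra-dimensional forwarding: any free buffer, critical or not, works.
    bool cbsCanForward(const ChannelState& ch) {
        return ch.freeBuffers >= 1;
    }

    // Injection or dimension change: allowed only on a normal (non-critical)
    // free buffer. The absence of the critical bubble here implies, without
    // any global query, that a free buffer exists elsewhere in the ring.
    bool cbsCanInject(const ChannelState& ch) {
        return ch.freeBuffers > ch.criticalHere;
    }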
5.2.2 Detailed Description of the Critical Bubble Scheme
Initialization
Assume a k-ary n-cube in which every dimension is composed of two opposite unidi-
rectional rings. For each unidirectional ring, a random free buffer from any router chan-
nel belonging to this direction can be marked as the critical bubble. The resulting network
has one critical bubble in every dimension and direction. The other free buffers operate as
normal buffers (i.e., non-critical bubbles).
Formal Rules
In implementing Theoretical BFC, the Critical Bubble Scheme imposes the following
two rules on the forwarding of packets to avoid deadlock.
(i) Forwarding of a packet within a dimension is allowed if the receiving channel
buffer has one packet-sized free buffer, no matter whether it is a normal free buff-
er or a critical bubble.
(ii) Forwarding of a packet from one dimension to another (including injection of a
new packet into a dimension) is allowed only if the receiving channel buffer has
at least one packet-sized normal free buffer. It is not allowed if the only packet-
sized free buffer is a critical bubble.
Transfer of Critical Bubbles
A key requirement of the Critical Bubble Scheme is always to maintain a free buffer
(critical bubble) in each unidirectional ring of the network. According to the types of
packet forwarding, there are two cases to be considered.
a) If a packet is allowed to change dimension or inject into a new dimension, then ac-
cording to the second rule, there is a normal free buffer ready to accept this packet.
Hence, the critical bubble remains untouched.
b) If a packet is forwarded from router A to router B within a dimension, then if the
receiving buffer in router B has a normal free buffer, the critical bubble remains unused.
Figure 5.6: Transfer of a Critical Bubble between routers.
Otherwise, as depicted in Figure 5.6, the arrival of packet 1 at router B displaces the criti-
cal bubble backward to router A. This is done by router B asserting a special control line
to indicate to router A that it should mark the newly freed buffer in router A as the critical
bubble for the ring.
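A minimal sketch of case b), reusing the ChannelState type from the sketch in Section 5.2.1 and collapsing the credit and control-line handshake described in Section 5.3 into a single step:

    // Intra-dimensional forwarding from an upstream to a downstream channel.
    // If the packet consumes the downstream channel's only free buffer and
    // that buffer is the critical bubble, the bubble is displaced backward
    // into the buffer just freed upstream (the control-line assertion of
    // Figure 5.6, shown here as a direct state update).
    void forwardWithinDimension(ChannelState& upstream, ChannelState& downstream) {
        // caller has already checked cbsCanForward(downstream)
        downstream.freeBuffers -= 1;  // packet occupies a downstream buffer
        upstream.freeBuffers += 1;    // packet's old upstream buffer frees up
        if (downstream.criticalHere == 1 && downstream.freeBuffers == 0) {
            downstream.criticalHere = 0;  // critical bubble was consumed...
            upstream.criticalHere = 1;    // ...and reappears one hop upstream
        }
    }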
5.2.3 Impact on Implementation and Performance
The Critical Bubble Scheme provides an elegant and efficient implementation solu-
tion to overcome the difficulties of implementing Theoretical BFC while still avoiding
the deficiencies of Localized BFC. It is also particularly useful in supporting multiple
message classes in buffer-resource-limited on-chip network environments.
Implementing Global Awareness of Theoretical BFC
As discussed in Section 5.1, the most difficult problem in implementing Theoretical
BFC is the need for a global controller. The main advantage of the Critical Bubble
Scheme is its global awareness without the need for a complex global controller. No sta-
tus information needs to be gathered or distributed remotely across the network, and no
uncertainty arises as to whether local routers can allocate buffer resources to packets
when multiple simultaneous requests are made for dimensional resources. This is illus-
trated in Figure 5.7 which shows the same scenario shown in Figure 5.2 but for the Criti-
cal Bubble Scheme. Again, we have two free buffers in the dimension but also have two
packets wishing to turn into the dimension. At most one of these packets should be grant-
ed access. It is clear that router A will deny access as there is no normal free buffer left,
and router B will grant access. Uncertainty is removed without needing explicit global
coordination.
Avoiding the Deficiencies of Localized BFC
The Critical Bubble Scheme also avoids all three deficiencies of Localized BFC. First,
there is no increased access delay. Figure 5.8 depicts the scenario in Figure 5.3 but for the
Critical Bubble Scheme. As can be seen, even though there is only one free buffer at ei-
ther of the receiving channels in routers A and B, both packets can access this dimension
immediately without incurring extra buffer access delay.
Second, the scheme improves buffer utilization over Localized BFC. As shown in
Figure 5.4, no matter whether the critical bubble is in one of the five injecting router
channels or in the rightmost router, only six free buffers are required to grant all accesses,
achieving the minimum number of free buffers as in the theoretically-optimal case.
Figure 5.7: Simultaneous injections have no uncertainty with Critical Bubble Scheme.

Figure 5.8: Simultaneous injection of multiple packets with the Critical Bubble Scheme.
Third, as mentioned before, Localized BFC requires a minimum of two packet-sized
buffers in every receiving channel while Theoretical BFC requires only one. The reason
is that Localized BFC has to check that there are at least two free buffers in the channel
buffer before allowing packets access. With the Critical Bubble Scheme, just one packet-
sized buffer is required in every channel as this scheme checks only that there is at least
one normal free buffer in the channel before granting access to packets. Therefore, the
Critical Bubble Scheme effectively achieves the minimum number of buffers in every
channel as dictated by Theoretical BFC.
Efficiently Handling Message-Dependent Deadlock
As described earlier, in order to eliminate deadlock among the various message clas-
ses, routers require at least one escape virtual channel per port for each message class.
With the Critical Bubble Scheme, it is possible to halve the buffer resources required in
each channel and allocate the saved buffers to adaptive channels to speed up the common
case. For the previous MOESI directory protocol example, the Critical Bubble Scheme
minimally requires three packet-sized buffers – one for each of the three message classes
to implement escape paths – which allows the adaptive channel to use the four remaining
buffers assuming the same total channel buffer budget. This greatly increases the poten-
tial buffer utilization and can improve system performance, as shown in the evaluation.
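To illustrate this buffer accounting concretely (the message-class names below are hypothetical placeholders for the three dependent MOESI classes; the partition sizes follow the text), a 7-buffer channel budget could be divided as follows:

    #include <cstddef>

    struct VirtualChannel {
        const char* msgClass;     // message class served, or "any" (adaptive)
        std::size_t bufferDepth;  // packet-sized buffers
    };

    // Localized BFC: three 2-buffer escape VCs leave 1 buffer for adaptivity.
    const VirtualChannel localizedBFC[4] = {
        {"class1", 2}, {"class2", 2}, {"class3", 2}, {"any", 1}};

    // Critical Bubble Scheme: three 1-buffer escape VCs free 4 buffers for
    // the adaptive VC shared by all classes, speeding up the common case.
    const VirtualChannel criticalBubble[4] = {
        {"class1", 1}, {"class2", 1}, {"class3", 1}, {"any", 4}};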
5.2.4 Deadlock Freedom
This subsection provides formal proof sketches of the deadlock freedom of the pro-
posed Critical Bubble Scheme for virtual cut-through switched torus networks for all the
cases discussed above. To facilitate the proofs, some basic definitions are first derived
from [35].
Definition 1. Q represents the set of input queues associated with router nodes. Each queue q_i ∈ Q has a capacity of cap(q_i) packets, and the current number of packets occupying the queue is denoted as size(q_i). Q_I is a subset of Q consisting of all injection queues. Q_y is a subset of Q consisting of all input queues belonging to some unidirectional ring y. Each queue q_i associates with a binary number b_i, which is set to 1 if the next free buffer of this queue is a critical bubble.

Definition 2. F(q_i, q_j) is the flow control function for a packet that wishes to enter q_j from q_i. It can be either true or false, and the packet is allowed to take this move only if F is true.

Definition 3. If at any given cycle, there is at most one injection request to Q_y (or forwarding request from another dimension to dimension y), then the request pattern is called sequential injection requests. Otherwise, if there are multiple injection requests to Q_y (including forwarding requests from another dimension) in the same cycle, then the request pattern is called simultaneous injection requests.

Definition 4. A dependency between message classes M_i and M_j in which the generation of messages in M_j depends on the consumption of messages in M_i at nodes, defined by the communication protocol, is denoted by M_i ← M_j, in which ← is a partial order relation indicating that M_i precedes M_j, which is the terminus of the dependency.
Lemma 1. The Critical Bubble Scheme is deadlock-free under sequential injection requests.

Proof Sketch. Assume there is a packet that wishes to move from q_i in unidirectional ring x of a dimension to q_j in unidirectional ring y in another dimension. The rules of Theoretical BFC are the following:

Rule 1. When x = y, F(q_i, q_j) is True if: size(q_j) ≤ cap(q_j) - 1 (1)

Rule 2. When x ≠ y ∨ q_i ∈ Q_I, F(q_i, q_j) is True if: size(q_j) ≤ cap(q_j) - 1 (2)
∧ Σ size(q_k) ≤ Σ cap(q_k) - 2, over all q_k ∈ Q_y (3)

The rules of the Critical Bubble Scheme are the following:

Rule 1*. When x = y, F(q_i, q_j) is True if: size(q_j) ≤ cap(q_j) - 1 (4)

Rule 2*. When x ≠ y ∨ q_i ∈ Q_I, F(q_i, q_j) is True if: size(q_j) ≤ cap(q_j) - 1 (5)
∧ b_j = 0 (6)

We now prove by considering all cases that if a flow control function F_CBS satisfies Rules 1* and 2*, it must also satisfy Rules 1 and 2. First, comparing (1) with (4) and (2) with (5), they are exactly the same. Second, since F_CBS follows (6), the critical bubble is not in the next free buffer of input queue q_j but is somewhere else in Q_y. This indicates that there are at least two free buffers in unidirectional ring y: one is the free buffer found in (5), and the other is the critical bubble elsewhere in the ring. This is exactly the condition of (3). Therefore, by enforcing the two new rules, the Critical Bubble Scheme also satisfies the rules enforced by Theoretical BFC. As Theoretical BFC is proved to be deadlock-free under sequential injection requests with its two rules in [100], the Critical Bubble Scheme is also deadlock-free under sequential injection requests. □
Lemma 2. The Critical Bubble Scheme is deadlock-free under simultaneous injection
requests.
Proof Sketch. First consider the case of packets being transported only within unidirec-
tional rings of a network. As long as a critical bubble exists within each unidirectional
ring, there is no intra-dimensional deadlock. This is because, in the worst case of all other
buffers in the unidirectional rings being occupied, the critical bubble within each ring
serves as the last free buffer, guaranteeing that at least one packet in each ring can make
forward progress. This is the fundamental principle behind Bubble Flow Control.
Now consider the situation in which there are k simultaneous injection requests to
ring y in cycle T by packets from outside the ring and ring y has no intra-dimensional
deadlock before T. Each injection request is either rejected or granted. If every request is
rejected, then no more free buffers of ring y are consumed, and ring y remains deadlock-
free. Otherwise at least one request is granted. Consider the worst case in which all re-
quests are granted. Since each granted request must satisfy (5) and (6), there must be a
normal free buffer (i.e., non-critical bubble) ready to accept the injected packet. Hence,
even if all packets are injected into ring y, a critical bubble continues to exist within the
ring guaranteeing that at least one packet in the ring can make forward progress. Together
with dimension-order routing, no inter-dimensional deadlock can occur either. Thus, the
Critical Bubble Scheme is deadlock-free under simultaneous injection requests. □
Lemma 3. The Critical Bubble Scheme applied to resources used for each message class
of a communication protocol is deadlock-free under either sequential or simultaneous
injection requests.
Proof Sketch. Assume there are n message classes and the total virtual channel resources C are divided into two disjoint subsets C_1 and C_2. C_1 can be shared amongst all message classes (i.e., adaptive channels) while C_2 is further divided into n independent subsets, S_k (k = 1, 2, …, n), one for each message class, which is the minimum needed to separate message classes into distinct sets of escape resources, according to [105]. We prove the lemma by induction, considering all cases.

Without loss of generality, assume the n message classes have dependency M_1 ← M_2 … ← M_{n-1} ← M_n as defined by the protocol. As any set S_k is independent of the other sets and a partial ordering exists such that S_1 ← S_2 … ← S_{n-1} ← S_n, there is no cyclic dependency among the n sets of escape virtual channels. When applying the Critical Bubble Scheme to each message class M_k separately, according to Lemmas 1 and 2, S_k is deadlock-free. Moreover, as messages in M_n, which is the terminus of the dependency, can drain from S_n, messages in M_{n-1} can also drain from S_{n-1}, and so on until messages in M_1 drain from S_1. Thus, the union of the S_k (i.e., the C_2 channels) serves as escape resources for all channels C, including the C_1 channels. According to [105], no deadlock can occur. □
5.3 Router Architecture
This section discusses the modifications to a standard router architecture needed to support the Critical Bubble Scheme.
Figure 5.9 shows a block diagram of the architecture of a typical virtual cut-through router. Arriving packets first get stored in the input buffer and advance in a FIFO manner.
The routing module is responsible for computing the output port for packets. When a
packet is at the head of the FIFO queue and ready to move, the arbitration unit will con-
figure the switch to set up a path for the packet to the allocated virtual channel in the out-
put. The packet then traverses the switch to the output port and moves to the next hop. In
a virtual cut-through router, virtual channel allocation and switch arbitration can be
merged into a single module [26], denoted as the Arbitration Unit here. Each output vir-
tual channel also contains several state fields to track its status, including a state I that
records the input port and virtual channel that are forwarding packets to this output virtu-
al channel, and a state C that counts the number of credits in the downstream input chan-
nel buffer.
The router architecture to enable the Critical Bubble Scheme is very similar to the
typical virtual cut-through router architecture. We need three modifications to the router
Figure 5.9: Typical virtual cut-through router microarchitecture. The shaded areas are modified to implement the proposed Critical Bubble Scheme.
architecture as shown shaded in Figure 5.9: a counter B at the output channel to count the
number of critical bubbles in the input channel of the downstream router, a 1-bit control
line to indicate the increase of B, and a slightly modified Arbitration Unit.
To illustrate the modification needed for the arbitration unit, the original arbitration
unit is compared against the modified one. During arbitration, the original arbitration unit
checks the availability of output virtual channels and whether there is packet entering or
waiting in the input channel. The modified arbitration unit adds the following check in
parallel with the original checks:
Is input channel in the same dimension as the output channel?
No => B should be less than C
Yes => If B=C, assert control line
In the above condition, if the input is not in the same dimension as the output, then the added check ensures that there is at least one normal free buffer in the downstream input channel, as the downstream channel has C free buffers but only B critical bubbles (e.g., B = 1). Since the output channel has a field I which records the input port and virtual channel, only two comparators are needed: one to compare the input/output ports and one to compare B with C.
In the case of intra-dimensional packet forwarding, the arbitration unit checks whether B equals C before the packet is forwarded. If B equals C, the forwarded packet will occupy a critical bubble in the downstream input channel, causing the critical bubble to be displaced and transferred backward to the input channel of the router from which the packet came. This is done by decreasing B by 1 and decreasing the credit count. As the number of critical bubbles of the current router is stored in the upstream router, the 1-bit control line associated with the upstream router (denoted by control line in the figure) is asserted. When the upstream router detects an asserted control line signal, it will increase its B in the next cycle and de-assert the control line to complete the critical bubble transfer.
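The added check can be summarized in behavioral form as below; this is a sequential sketch with illustrative names (OutputVC, arbitrate), standing in for what the router realizes with two comparators and a 1-bit control line:

    #include <cstddef>

    struct OutputVC {
        std::size_t credits;        // state C: free buffers downstream
        std::size_t criticalCount;  // state B: critical bubbles downstream
    };

    // Returns true if the packet may be forwarded; sets assertControlLine
    // when an intra-dimensional forward displaces the critical bubble.
    bool arbitrate(bool sameDimension, OutputVC& out, bool& assertControlLine) {
        assertControlLine = false;
        if (!sameDimension) {
            // injection / dimension change: need a normal free buffer (B < C)
            if (out.criticalCount >= out.credits) return false;
            out.credits -= 1;
            return true;
        }
        // intra-dimensional forwarding: any free buffer suffices
        if (out.credits == 0) return false;
        if (out.criticalCount == out.credits) {  // B = C: displace the bubble
            assertControlLine = true;            // upstream increments its B
            out.criticalCount -= 1;
        }
        out.credits -= 1;
        return true;
    }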
5.4 Evaluation
In this section, a detailed evaluation of the proposed Critical Bubble Scheme is pre-
sented. This scheme is first stressed under synthetic traffic to validate its correctness
quantitatively across a wide range of load rates, examine its effects on buffer utilization,
and compare its performance against other bubble-based flow control mechanisms. Next,
the effectiveness of the Critical Bubble Scheme in handling message-dependent deadlock
is evaluated using full system simulation with the PARSEC benchmark suite.
5.4.1 Simulation Methodology Using Synthetic Loads
To evaluate the proposed scheme using synthetic loads, CBS is implemented on GARNET [6], a general-purpose cycle-accurate interconnection network simulator. The simulator is modified to support virtual cut-through switching, adaptive routing and the Critical
Bubble Scheme. Routers use a standard 4-stage pipeline with 1 cycle for link traversal.
An 8-ary 2-cube torus network with bidirectional physical channels is simulated. Dead-
lock avoidance based on bubble flow control is assumed using one escape and one adap-
tive virtual channel per physical channel and each virtual channel has two packet-sized
buffers.
All simulations are run for 100,000 cycles with a warm-up period of 10,000 cycles.
Packets are randomly generated to be either short packets with single-flit or long packets
with nine flits. To stress the network, four synthetic traffic patterns are used in these
evaluations: uniform random, perfect-shuffle, bit complement and transpose [26]. We
compare the performance of Theoretical BFC, Localized BFC, and the Critical Bubble
Scheme.
5.4.2 Effects of Critical Bubble Scheme under Synthetic Traffic
As the basis of this evaluation, the correctness of the Critical Bubble Scheme is first
validated. Figure 5.10 plots the performance of three versions of bubble flow control:
Theoretical BFC which is possible only in simulation and impractical in reality, Local-
ized BFC which is a simplified implementation used in the original BFC paper [100], and
the proposed Critical Bubble Scheme (CBS). As can be seen in the figure, the perfor-
mance of CBS closely follows Theoretical BFC for all four traffic patterns, indicating
that the proposed scheme is able to efficiently implement Theoretical BFC and, therefore,
achieve similar potential maximum benefits.
In the comparison of the performance of CBS with Localized BFC under these four
traffic patterns, CBS clearly has advantages. Specifically, CBS has much lower latency at
medium and high load rates (relative to the load rate at the saturation point unless other-
wise stated). When the applied load rate is low, buffer occupancy also is low in both
schemes. Therefore Localized BFC, which requires at least two free buffers in the chan-
nel when injecting or changing dimensions of a packet, has almost the same effect as
CBS which requires only one normal free buffer. However, as the applied load rate in-
creases, the deficiencies of increased buffer access delay and low buffer utilization for
Localized BFC gradually manifest while CBS, which has lower buffer requirements, en-
joys a higher success rate for injection and changing dimensions. These factors lead to performance gains for the Critical Bubble Scheme under medium and high load rates.
Figure 5.10: Effects of Critical Bubble Scheme under different synthetic traffic patterns: (a) Uniform Random, (b) Perfect Shuffle, (c) Bit Complement, (d) Transpose. Each plot shows average latency (cycles) versus applied load (flits/cycle/node) for Localized BFC, the Critical Bubble Scheme, and Theoretical BFC.
To further demonstrate the effect of CBS on reducing the buffer access delay portion
of packet latency, Figure 5.11 plots the buffer access delay of CBS normalized to that of
Localized BFC at medium and high load rates under the four traffic patterns. While the buffer access delay for injecting or changing dimensions of packets is very close between CBS and Theoretical BFC, the histogram shows a significant reduction of up to 62% for CBS over Localized BFC. This indicates that the proposed Critical Bubble Scheme successfully overcomes the deficiencies of Localized BFC while achieving the lower access delay and higher buffer utilization offered by Theoretical BFC.

Figure 5.11: Effect of Critical Bubble Scheme on reducing buffer access delay: (a) Uniform Random, (b) Perfect Shuffle, (c) Bit Complement, (d) Transpose. Each plot shows buffer access delay normalized to Localized BFC versus applied load (flits/cycle/node) for Localized BFC, the Critical Bubble Scheme, and Theoretical BFC.
5.4.3 Simulation Methodology Using Full System Simulation
The Critical Bubble Scheme is also evaluated under real application workloads using
full system simulation. The simulation infrastructure is based on SIMICS [83] enhanced with GEMS [84] for detailed memory-system timing, together with the modified GARNET [6]. A 16-node chip connected by a 4x4 2-D torus is simulated. Each node has an in-order core,
a 16KB private L1 cache and a 512KB bank of shared L2 cache. The MOESI directory
cache coherence protocol is used in the system with 2 memory controllers. Messages are
divided into two lengths of packets. Short packets are 8-byte single-flit while long pack-
ets carrying 64-byte cache lines have nine flits. Additional parameters are listed in Table
5.1. The PARSEC [13] benchmarks compiled with the pthreads programming model are used.
To investigate the effectiveness of the Critical Bubble Scheme in handling message-
dependent deadlock, we compare two configurations using the same buffer budget. Since
Table 5.1: Parameters used in full system simulation for CBS.
Network topology: 4x4 torus
Router: 4-stage, 2GHz
Link bandwidth: 64 bits/cycle
Core model: Sun UltraSPARC III+, 2 GHz
Private I/D L1$: 16KB, 2-way, LRU, 1-cycle latency
Shared L2 per bank: 512KB, 16-way, LRU, 6-cycle latency
Cache block size: 64 Bytes
Virtual channels: 1~2 VCs per protocol class
Coherence protocol: MOESI
Memory controllers: 4, located one at each corner
Memory latency: 128 cycles
the MOESI directory protocol has three dependent message classes, configuration 1 employs Localized BFC as the flow control for the three escape virtual channels, each having two packet-sized buffers. There is also one packet-sized buffer for the adaptive virtual channel shared by all message classes. These are the minimal buffer resources required by Localized BFC with adaptive routing. Configuration 2 employs the Critical Bubble Scheme on the escape virtual channels, each having only one packet-sized buffer. The remaining four packet-sized buffers are used for adaptive virtual channels. Hence, both configurations have the same budget (i.e., 7 long-packet-sized buffers) for implementing the 4 virtual channels per physical channel.
5.4.4 Effects of CBS in Handling Message-Dependent Deadlock
Figure 5.12 compares the execution time of configuration 1 using Localized BFC and
configuration 2 using the Critical Bubble Scheme across the benchmark suite. To present
data from multiple applications more effectively, execution time is normalized to config-
uration 1. The results vary among applications as different applications may generate distinct network loads and have different sensitivities to network performance. On average, there is a 7.2% reduction in overall execution time with configuration 2.

Figure 5.12: Comparison of normalized execution time for PARSEC benchmarks.
There are two major reasons for this improvement. The first is similar to the synthetic
traffic case as the Critical Bubble Scheme reduces buffer access delay and increases buff-
er utilization. The second stems from the larger amount of buffer resources that can be
used by adaptive virtual channels owing to the reduced resource requirements for escape
virtual channels with our proposed scheme. This advantage of the Critical Bubble
Scheme over Localized BFC increases with more complex protocols that have a greater
number of dependent message classes.
We further analyze the impact of the Critical Bubble Scheme on network performance in this environment of multiple message classes. Figure 5.13 breaks down the average packet latency into zero-load latency (consisting of serialization latency and hop latency) and contention latency [26]. Compared with configuration 1, configuration 2 using the Criti-
cal Bubble Scheme achieves an average of 18.8% reduction in average packet latency.
This is mainly because the competition for scarce adaptive resources in configuration 1
incurs a large increase in contention latency. By using limited buffer resources more efficiently, configuration 2 is able to achieve better performance.

Figure 5.13: Comparison of the breakdown of average packet latency (serialization, hop, and contention latency) for each PARSEC benchmark, under configuration 1 (C1, Localized BFC) and configuration 2 (C2, Critical Bubble Scheme).
5.5 Discussion
5.5.1 Scalability of Critical Bubble Scheme
We consider the scalability in terms of different network sizes and buffer sizes. With
continuing technology scaling, the number of nodes to be connected in the network likely
will increase. For larger networks, the difference between locally- and globally-aware
flow control is more pronounced as local information can hardly reflect the global status
of the network and the delay in propagating network state to local nodes will increase.
Thus, the benefits of globally-aware CBS will be even greater for larger networks. Simu-
lation results show that under uniform random traffic and high injection rate (i.e., 95% of
the corresponding saturation load rate), average packet latency difference of CBS over
Localized BFC for network sizes of 4x4 and 8x8 are 22.3% and 27.2%, respectively.
When buffers are large, there is little difference between Localized BFC and CBS as
many free buffers are available anyway. However, in the more interesting and practical
case (e.g., for NoCs) of shallow buffers, the relative fraction of free buffers needing to be
reserved to avoid deadlock in Localized BFC is higher than with CBS, further reducing
resource efficiency relative to CBS and causing increased performance degradation. The
extreme case of one packet-sized buffer per channel precludes the use of Localized BFC.
Simulation results show that when buffer size decreases from 4 to 3 to 2 buffers per
channel under uniform random traffic and high injection rate, average latency difference
of CBS over Localized BFC increases from 6.6% to 12.5% to 27.2%, respectively.
5.5.2 Implementing Occupancy-Based Global Flow Control
It is interesting to see that BFC is actually a special case of buffer occupancy-based
flow control mechanisms which restrict the use of buffers based on their occupancy. A
typical example of this is congestion control, which restricts packet injection into the
network when the buffer occupancy is above a threshold to avoid a drop in throughput
from network saturation. Throttling can be locally-aware if it uses buffer occupancy of
the local router or globally-aware if it uses aggregate buffer occupancy across a network
dimension or throughout the network. The globally-aware version has many potential
benefits as discussed, but efficient implementations remain a challenge.
The Critical Bubble Scheme can be generalized to implement this general class of
globally-aware flow control. The idea is to allow multiple critical bubbles instead of just
a single critical bubble as a quasi means of throttling packet injection to relieve network
pressure. The only modification to the operation of the Critical Bubble Scheme described
above is to mark multiple free buffers as critical bubbles during network initialization.
The same two rules are enforced for forwarding packets, and deadlock freedom remains
guaranteed. In this way, a broader class of globally-aware flow control mechanisms—
such as global flow control with a preset buffer occupancy threshold T, self-tuned global
congestion control with adaptive threshold T, and so on—can be implemented efficiently.
Preliminary results show that global congestion control implemented using multiple criti-
cal bubbles successfully sustains throughput after network saturation giving 11% higher
throughput than local congestion control. Further investigation of this topic is left for
future work.
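A minimal sketch of this generalization (restating the per-channel state used in the Section 5.2 sketches; T is the desired number of critical bubbles per ring, with T = 1 reducing to plain CBS):

    #include <cstddef>
    #include <vector>

    struct ChannelState {
        std::size_t freeBuffers;   // free packet-sized buffers in this channel
        std::size_t criticalHere;  // critical bubbles currently held here
    };

    // Mark T free buffers in the ring as critical at initialization. Because
    // injection is denied unless a normal free buffer is present locally,
    // injection is throttled once ring occupancy approaches capacity - T.
    void initRingWithThreshold(std::vector<ChannelState>& ring, std::size_t T) {
        std::size_t marked = 0;
        for (ChannelState& ch : ring) {
            while (marked < T && ch.freeBuffers > ch.criticalHere) {
                ch.criticalHere += 1;
                ++marked;
            }
            if (marked == T) return;
        }
    }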
5.6 Summary
Globally-aware flow control in interconnection networks has many potential ad-
vantages over locally-aware flow control but faces several serious implementation diffi-
culties. The primary contribution of the proposed Critical Bubble Scheme (CBS) is to
provide a way to implement globally-aware flow control mechanisms correctly and effi-
ciently so as to increase resource utilization. By marking a certain number of free buffers
as critical bubbles (minimally one) and appropriately using them to restrict packet injec-
tion, this scheme achieves near-optimal performance offered by globally-aware flow con-
trol mechanisms while avoiding costly explicit global control and coordination. The pro-
posed scheme can be readily applied to both off-chip and on-chip networks that employ
virtual cut-through and bubble flow control to ensure freedom from both routing-induced
and protocol-induced deadlock.
Chapter 6
Reducing Resources for Wormhole-switched Networks
Wormhole switching allows the channel buffer size to be smaller than the packet size and thus is preferred for on-chip networks where area and power resources are greatly constrained. However, wormhole packets can span multiple routers, thereby creating additional channel dependences and adding complexity to both deadlock and starvation avoidance. While the Bubble Flow Control theory that avoids deadlock in VCT-switched tori with only one virtual channel was proposed a decade ago (with its efficient implementation, the Critical Bubble Scheme, presented previously), there has been no working solution for wormhole switching that achieves a similar objective. In this chapter, we present WBFC (Worm-Bubble Flow Control) [22], a new flow control scheme that can avoid deadlock in wormhole-switched tori using minimally 1-flit-sized buffers per VC and one VC in total. Moreover, any wormhole-switched topology with embedded rings can use WBFC to avoid deadlock within each ring of the network. Thus, WBFC has both theoretical value and practical applications.
6.1 Challenges in Extending BFC for Wormhole
The difference between VCT and wormhole switching lies in the condition under which packets are allowed to advance to the next hop. For VCT switching, the head flit of a packet p is allowed to enter the downstream receiving queue q (where L(p) denotes the length of p in flits) if:
cap(q) – size(q) ≥ L(p)
whereas for wormhole switching, the condition is:
cap(q) – size(q) ≥ 1.
For VCT switching, in order to allow any packet to access a specific queue q, the cor-
responding condition should hold for all packets, thereby requiring the capacity of q to be
at least the length of the longest packet. Meanwhile, as all flits belonging to the same
packet are guaranteed to be received at q once the head flit gains access, a variety of
deadlock issues caused by indirect dependence can be simplified [33]. Therefore, VCT
switching is preferred when buffer resources are abundant [3, 5]. In contrast, wormhole
switching adds complexities to deadlock avoidance, but it requires minimally only one
flit-sized buffer per VC according to its above condition. Hence, due to the low buffer
requirement, wormhole switching is preferred in resource-constrained on-chip environments [41, 48]. Unfortunately, the added channel dependence introduced by worm-
hole switching and the allowance of buffer size being smaller than the packet size cause
several major difficulties in extending both the BFC theorem and its implementation to
wormhole.
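The two advance conditions can be written as simple predicates (a sketch; the names are ours):

    #include <cstddef>

    // VCT: the head flit advances only if the receiving queue can hold the
    // entire packet of L(p) flits.
    bool vctCanAdvance(std::size_t cap, std::size_t size, std::size_t packetLen) {
        return cap - size >= packetLen;
    }

    // Wormhole: the head flit advances as soon as one flit slot is free.
    bool wormholeCanAdvance(std::size_t cap, std::size_t size) {
        return cap - size >= 1;
    }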
6.1.1 Atomic Buffer Allocation in Wormhole Switching
One difficulty in extending the BFC technique to wormhole switching arises from the
requirement of atomic buffer allocation which typically is assumed in wormhole net-
works to ease deadlock handling and provide more routing flexibility. Figure 6.1 com-
pares atomic buffer allocation with non-atomic buffer allocation, for the case of worm-
hole switching. Each small rectangle represents a flit-sized buffer. The numbers in the
rectangles identify to which packet the head (H), body (B) and tail (T) flits belong. In (a),
atomic buffer allocation does not allow a non-empty VC occupied by one packet to be
allocated to another packet (e.g., VC2 cannot be allocated to P2). In other words, with
atomic buffer allocation, the head flit of a packet p is allowed to enter the downstream
queue q only if size(q) = 0. In contrast, non-atomic buffer allocation in (b) allows multi-
Figure 6.1: Comparison of (a) atomic and (b) non-atomic buffer allocation. Each small rectangle represents a wormhole flit-sized buffer.
ple packets to be collocated in the same VC (e.g., flits from both P2 and P1 can reside in
VC2). The condition for packet movement is therefore also being cap(q) – size(q) ≥ 1.
VCT switching typically employs non-atomic buffer allocation. This is because once
a packet is allocated a VC, all flits belonging to the same packet are guaranteed to be
received at this VC. However, unlike VCT, wormhole does not guarantee enough space
for the entire packet, so it is possible that if the head flit is blocked in a VC, the rest of the
flits of the same packet can be stuck in the upstream VC (e.g., in (b), B2 cannot enter
VC2 if H2 is blocked by T1). Essentially, non-atomic buffer allocation in wormhole
switching creates additional dependencies between upstream and downstream channels
[33], which may lead to deadlock configurations. Two detailed deadlock examples
caused by these dependencies are provided in [36] and [81]. To avoid such deadlock is-
sues and support general adaptive routing functions, wormhole switching typically as-
sumes atomic buffer allocation [26, 33, 34, 36] though it is possible that, in some cases,
non-atomic buffer allocation can be used for wormhole. As will be addressed later, the
approach proposed in this work can be tailored to handle those situations as well.
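A sketch of the atomic-allocation condition (illustrative names): a head flit may claim a downstream VC only if the VC is completely empty, while body and tail flits follow into the VC their head already owns:

    #include <cstddef>

    enum class FlitType { Head, Body, Tail };

    bool atomicCanAdvance(FlitType f, std::size_t cap, std::size_t size) {
        if (f == FlitType::Head) return size == 0;  // VC must be empty
        return cap - size >= 1;                     // one flit slot suffices
    }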
With atomic buffer allocation, extension of BFC to wormhole switching is not
straightforward. One natural extension of BFC to wormhole is to always ensure the exist-
ence of a flit-sized bubble in the ring (as opposed to a packet-sized bubble in VCT).
However, as depicted in Figure 6.2, deadlock can still occur. In the figure, none of the
downstream VCs are empty; thus, according to atomic buffer allocation, no head flit can
acquire a VC to move forward. To make things worse, even if BFC is extended to have
more than one wormhole flit-sized bubble in the ring, it is still not enough, as illustrated
in the same example (i.e., there are 3 empty flit-sized buffers in the ring but still no pack-
et can advance with atomic buffer allocation). Therefore, to make bubble-based flow con-
trol useful for wormhole, a new sufficient condition for deadlock freedom under worm-
hole switching needs to be derived, as revealed in the proposed theoretical support of
WBFC in Section 6.2.
6.1.2 Small Buffer Size
On the implementation side, regardless of whether buffer allocation is atomic or non-
atomic, a key feature of wormhole is the allowance of buffer size to be smaller even than
the packet size. Unfortunately, this makes BFC and CBS non-applicable to wormhole
switching. Specifically, the small buffer size in wormhole presents two obstacles in im-
plementing a wormhole version of BFC and CBS.
First, small local buffers require more global information to make the grant/denial de-
cision for packet injection, while the presence/absence of the critical bubble is no longer
enough to reflect all the needed global information. Figure 6.3 shows an example. Let us
first consider that there is only one packet, P1, which needs to be injected into the ring
(ignoring P2 for the moment). Similar to CBS, assume there are enough critical wormhole flit-sized bubbles maintained in the ring that cannot be used by any injecting packet.
Figure 6.2: Deadlock caused by atomic buffer allocation in wormhole switching.
Recall that in CBS, as long as the local buffer is not the critical bubble, a packet can be
injected immediately based only on local information. However, for the wormhole case,
the local buffer with size of 3 flits is not large enough to hold the entire 5-flit packet.
Even if VC2 is empty, P1 is still unable to make a decision since P1 is not informed as to
whether or not the ring has enough non-critical empty buffers to accept it. Therefore, with
small local buffers, more global information on other nodes is needed (e.g., the infor-
mation of an empty buffer in R4). For the remainder of the chapter, this is referred to as
the small-buffer problem. To combat this problem, the injecting packet needs to be able
to reserve some extra resources (e.g., VC4). In Section 6.3, an implementation is pro-
posed not only to reserve resources efficiently, but also to convey the global reservation
status locally.
Second, new starvation issues arise when multiple packets try to inject simultaneously.
Consider the example in Figure 6.3 again with both P1 and P2 injecting. The packets are
5-flit whereas the local buffers are 3-flit. As a result, even if both packets are able to re-
serve resources, it is possible that none of them can be injected. For instance, if P1 re-
serves VC4 and P2 reserves VC2, then both packets cannot acquire any more empty VCs
and starvation occurs. This problem does not happen in CBS as buffers in VCT are al-
Figure 6.3: Problems with buffer size smaller than the packet size.
ways large enough to receive the entire packet. This new starvation problem is more dif-
ficult to solve than the small-buffer problem because, essentially, we need a global coor-
dination scheme that does not require a central coordinator and uses local information
only. This is addressed in Section 6.3, but note that the small-buffer problem and the
starvation problem are independent of atomic or non-atomic buffer allocation.
6.2 Worm-Bubble Flow Control Theory
In this and the next section, we propose the theoretical support and implementation for Worm-Bubble Flow Control (WBFC) to extend BFC from VCT switching to wormhole switching. The goal is to ensure that the buffers in each ring of a torus never become fully occupied.
6.2.1 A Sufficient Condition for Deadlock Freedom
First, a sufficient condition is derived for deadlock-free flow control on torus unidi-
rectional rings that supports atomic buffer allocation and the small buffer size property of
wormhole switching.
Definition 5. A worm-bubble (WB) is the control unit used in worm-bubble flow control.
It refers to an empty VC buffer queue and associated color field. The capacity of a worm-
bubble wb is the number of flits in the buffer queue, denoted as cap(wb), minimally one
wormhole flit.
Put simply, for atomic buffer allocation, a worm-bubble is a VC-sized buffer which
can be as small as only one wormhole flit; whereas in BFC, a bubble is a packet-sized
buffer. Worm-bubble flow control treats each VC, minimally one flit buffer queue, as
either empty or occupied for atomic allocation.
Theorem 1. For wormhole switching with atomic buffer allocation, a torus unidirectional
ring y is deadlock-free if there exists at least one worm-bubble located anywhere in the
ring after packet injection.
Proof Sketch. Let Q_y be a subset of Q consisting of all input queues belonging to ring y. By definition, a worm-bubble exists only if the associated buffer queue is empty. Therefore, the condition of the theorem says that, after packet injection,

∃ q_i ∈ Q_y, such that size(q_i) = 0.

Now examine q_i's upstream buffer queue q_u (i.e., along the reverse direction of the unidirectional ring). If size(q_u) also equals 0, then the upstream buffer queue of q_u is examined, and let q_u denote the new buffer queue. This search process continues until either (i) all buffer queues along the ring are examined and are all empty, thus no deadlock can occur; or (ii) a non-empty q_u is found for the first time along the upstream direction of q_i before the search goes back to q_i.

We prove the deadlock freedom of case (ii) by contradiction. Consider the downstream buffer queue of q_u, say q_d. Since the downstream direction is opposite to the search direction, q_d must be examined before q_u and is empty (i.e., size(q_d) = 0). Now assume that there is a deadlock in the ring, so that no packet can make progress. In particular, the flit f at the head of q_u cannot move into downstream buffer queue q_d. However, if this f is a body or tail flit, it can advance to q_d as long as size(q_d) < cap(q_d), which is currently satisfied as size(q_d) = 0; otherwise, if f is a head flit, atomic buffer allocation requires that f can advance to q_d only if size(q_d) = 0, which is also satisfied by q_d. Therefore, in either case the flit f can advance, contrary to the assumption that there is a deadlock configuration.

Considering both (i) and (ii), each unidirectional ring is guaranteed to be deadlock-free if there exists at least one worm-bubble anywhere in the ring. □
The theoretical support for WBFC proposed here is a significant extension of BFC
from the VCT domain to the wormhole domain. As long as the packet injection process
maintains the existence of a WB in the ring, together with DOR, WBFC can achieve
deadlock-free routing in wormhole-switched tori using only one VC.
6.3 Worm-Bubble Flow Control Implementation
6.3.1 The Basic Idea
To maintain the existence of a worm-bubble after single and simultaneous packet injections (injection includes initial injection by the local node as well as changing from one dimension into another), WBFC implementation schemes (particularly those that use local information only) need to solve the small-buffer problem and the starvation problem caused by insufficient global information. The basic idea of the proposed WBFC scheme is first to reserve (mark) some additional buffers in the ring for an injecting packet and then, after the packet is injected and traversing along the ring, progressively unmark the previously reserved buffers, thereby freeing buffers so that they can be used by other injecting packets.
6.3.2 Combating the Small-buffer Problem
We first discuss how to overcome the small-buffer problem that occurs under single
packet injection. Since this problem is caused by the small local buffer size, extra buffers
located elsewhere in the ring are needed for a packet to inject.
To achieve that, WBs are marked as either white or black. The white WBs are normal
empty buffer queues that can be used by any packet, whereas the black ones can only be
used by in-transit packets. Similar to CBS, the black WBs can be displaced backward in
the ring either proactively (e.g., if the upstream WB is white and there is a packet waiting
to access the black WB), or by the in-transit packet that occupies it (i.e., black WB is
transferred backwards). The initial colors of WBs are all white.
For the convenience of this discussion, a quantity M is defined for each packet as be-
low, which is useful for marking buffers.
Definition 6. The M value of a packet p represents the minimal number of worm-bubbles needed to receive the packet into a ring, denoted as M_p. That is,

M_p = ⌈L(p) / cap(wb)⌉

For example, if a long packet p is 5 flits long and the WB is 3 flits large, then M_p = ⌈5/3⌉ = 2, meaning that it takes 2 WBs to receive the injecting packet. For short packets, say 1-flit-sized, M_p = ⌈1/3⌉ = 1. Additionally, in what follows, packets that are already completely injected and propagating within a ring are referred to as in-transit packets.
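As a quick, illustrative rendering of Definition 6 (a minimal sketch; the helper name m_value is ours and not part of the design):

import math

def m_value(packet_len_flits, wb_capacity_flits):
    # M_p = ceil(L(p) / cap(wb)): the minimal number of worm-bubbles
    # needed to receive a packet into a ring (Definition 6).
    return math.ceil(packet_len_flits / wb_capacity_flits)

assert m_value(5, 3) == 2   # 5-flit packet, 3-flit WBs: two WBs needed
assert m_value(1, 3) == 1   # 1-flit packet: a single WB suffices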
There are four steps in the proposed WBFC scheme:
Step 1) Initialization: For each unidirectional ring of a torus, one white WB from any
router channel belonging to this direction is marked as black. For example, in Figure
6.4(a), the WB in R3 is marked as black (to illustrate more clearly, individual flits within
each WB are not shown).
Step 2) Injection: An injecting packet p needs to reserve M_p WBs in order for the ring to accept it. To achieve that, each injection channel is associated with a counter C_I. Every time p sees a white WB in the downstream receiving channel, p will mark it as black and increase C_I by one. Once C_I reaches (M_p – 1) and p sees a white WB again, p can be injected. For example, a packet p with M_p of 2 wishes to inject into the ring in Figure 6.4(a). Since VC2 is a white WB, p marks it as black and increases C_I to 1 in (b). Then, due to the backward movement of the black WB (either proactively or by an in-transit packet), the black WB in VC2 is transferred to VC1 in (c). At this time, p sees a white WB again in VC2 with C_I at (M_p – 1) = 1. Now p can be injected safely, as it knows for sure that there are two WBs in the ring to accept it (one reserved as indicated by C_I plus the white WB in VC2). Note that short packets with M_p of 1 can be injected immediately after they see a white WB in the downstream receiving channel, without waiting for more white WBs.
Step 3) In-transit: To free the reserved buffers after the packet is injected, the head flit of each packet p is augmented with a counter C_H. When p is injected into the ring, the number in C_I is copied to C_H, and C_I is set to 0, as exemplified in (d). Then p continues traversing along the ring. If p encounters a black WB and C_H is not 0, then p unmarks the black WB and decreases C_H, as shown in (e). If C_H is already 0, then it is just a normal backward displacement of the black WB.
Step 4) Ejection: It is possible that p may not encounter enough black WBs when it reaches its destination, as exemplified in (f) where the initial black WB is in R0 instead of R3. In that case, the remaining C_H is added to the C_I of the injection channel at the destination node (e.g., in (g), the remaining count in C_H is added to the C_I of the injection channel in R4) for the conservation of this global WB information. The added count to C_I can be reused by the next injecting packet at R4. For instance, in (h), another long packet p wishes to inject from R4. Since C_I is already (M_p – 1) = 1 and VC5 is a white WB, the packet can be injected immediately according to the condition in Step 2 above, without reserving more white WBs.
Figure 6.4: Walk-through example of the use of white and black WBs in WBFC. (a) Initially, the WB in R3 is marked as black and C_I is 0. (b) P marks the WB in R2 as black and increases the counter. (c) With C_I = M_p – 1, P can inject to the white WB in R2. (d) When P is injected, C_I is copied to C_H and then reset. (e) P unmarks the encountered black WB and decreases C_H. (f) In another case, P does not encounter enough black WBs. (g) The remaining C_H is added to the C_I at the destination node. (h) Another packet reuses the C_I and gets injected immediately.

Although not emphasized above, changing dimension is treated similarly to injection. In fact, changing dimension is equivalent to ejecting from the first dimension using Step 4 above and then injecting into the second dimension according to Step 2 above. The above process always ensures that there are just enough empty buffer queues in the ring to enable forward progress before granting any packet injection, even if the local buffer is not large enough to hold the entire packet. Thus, it solves the small-buffer problem. Moreover, each injecting packet only checks information locally (i.e., C_I and the status of the direct downstream channel), thus avoiding any costly global communication. During the process, as all packets collectively cannot unmark more black WBs than they have marked, then together with the initially marked black WB, at least one black WB is always maintained in the ring. According to Theorem 1, the ring is deadlock-free. A proof sketch of the deadlock freedom is provided shortly.
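To make the bookkeeping of Steps 2~4 concrete, the following minimal Python sketch models the counters of a single injection channel; the class and function names are illustrative assumptions rather than the actual implementation, and the backward displacement of black WBs is assumed to be handled by the surrounding simulation:

from dataclasses import dataclass

WHITE, BLACK = "white", "black"

@dataclass
class InjectionChannel:
    c_i: int = 0  # C_I: worm-bubbles this channel has reserved (marked black)

    def try_inject(self, m_p, downstream_color, downstream_empty):
        # One injection attempt per cycle (Step 2).
        # Returns (injected?, C_H for the head flit, new downstream WB color).
        if not downstream_empty or downstream_color != WHITE:
            return False, None, downstream_color
        if self.c_i >= m_p - 1:
            c_h = self.c_i      # Step 3 entry: C_I is copied into the head flit's C_H
            self.c_i = 0        # ... and C_I is reset
            return True, c_h, downstream_color
        self.c_i += 1           # not enough WBs yet: mark this white WB black
        return False, None, BLACK

def unmark_on_transit(c_h):
    # Step 3: an in-transit packet meeting a black WB unmarks it (the WB
    # turns white) and decrements C_H; with C_H already 0 it is just a
    # normal backward displacement of the black WB.
    return c_h - 1 if c_h > 0 else 0

def eject(dest, c_h):
    # Step 4: any leftover C_H is folded into the destination channel's C_I,
    # conserving the global WB information for later injecting packets.
    dest.c_i += c_h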
6.3.3 Combating the Starvation Problem
The starvation problem caused by multiple simultaneous packet injections must also
be solved. Figure 6.5(a) shows a rare but theoretically possible corner case in which five
long packets want to inject during the same cycle. All packets see a white WB in the
downstream receiving channel and a zero local counter. With the above scheme, all pack-
ets mark the white WB as black. However, this would cause all the WBs in the ring to become black and, hence, none of the packets can ever inject.
To solve this problem without a global coordinator, the scheme is augmented based on the following rationale. A gray WB is introduced to act like a token. The gray WB can only be used by one of the packets that have partially completed the reservation process (i.e., only when C_I > 0). In order to break the starvation configuration, any packet (with C_I > 0) that sees a gray WB in its downstream receiving channel is able to inject immediately. In this way, the starvation problem can be avoided using local information only and without a central coordinator. Correspondingly, the initialization step needs some modifications in order to allow instant packet injection with the gray WB.
Instead of marking one black WB in the initialization, one gray WB and (M_L – 1) black WBs are marked, where M_L is the M value of the longest packet. By doing that, any packet that grabs the gray WB (i.e., the token) is guaranteed to be provided with enough
buffer queues already in the ring to accept it. In addition, to ensure the circulation of the
gray WB in the ring, black WBs can proactively move backward and transfer the gray
WB forward to the downstream channel.
For example, in Figure 6.5(b), with M_L = 2, initially one black WB and one gray WB are marked in R0 and R1, respectively. Again, consider the same scenario of five simultaneously injecting packets. With the above augmented implementation, as shown in (c), P1 cannot use the gray WB since C_I = 0, while P2 ~ P5 mark the white WB as black and increase C_I to 1. Then, when the gray WB moves from R1 to R2 as shown in (d), P2 sees the gray WB and finds C_I > 0. Hence, P2 can start injection immediately, breaking the starvation configuration. P2 is completely injected in (e). Note that the gray WB can only be used by one injecting packet at a time. So once a head flit grabs the gray WB, the gray WB (now occupied) will be associated with the head flit until it reaches its destination and is converted back to a gray WB, which can then be used by other packets or be displaced forward. The proof sketch for deadlock freedom is provided shortly.
Figure 6.5: Walk-through example of avoiding starvation by the use of gray WB. (a) Starvation occurs when all packets mark simultaneously. (b) Initially, one gray WB and (M_L – 1) black WBs are marked. (c) P1 cannot acquire the gray WB because C_I = 0. (d) After the gray WB is displaced to R2, P2 with C_I > 0 can acquire the gray WB and start injection immediately. (e) P2 is injected completely. Note that C_H is decreased due to the encounter of a black WB in R3.
6.3.4 Formal Description of Injection Rules
While the prior explanation of how WBFC works is detailed, the essential injection
rules (and thus the implementation) are relatively simple and elegant, as described below.
Definition 7. For a buffer queue q_i, the function color(q_i) returns the color of its associated worm-bubble.
The rules for F(q_i, q_j) have three cases:

(i) Same-dimension move, i.e., x = y. In this case, it is the same as wormhole with atomic buffer allocation, so F(q_i, q_j) is True if:

size(q_j) = 0. (7)

(ii) Short packet injection or changing dimension, i.e., M_p = 1 and (q_i ∈ Q_I ∨ x ≠ y). In this case, F(q_i, q_j) is True if:

size(q_j) = 0 ∧ color(q_j) ≠ black. (8)

(iii) Long packet injection or changing dimension, i.e., M_p > 1 and (q_i ∈ Q_I ∨ x ≠ y). In this case, F(q_i, q_j) is True if:

size(q_j) = 0 ∧ color(q_j) = white ∧ C_I ≥ M_p – 1, or
size(q_j) = 0 ∧ color(q_j) = gray ∧ C_I > 0. (9)
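Transliterated into executable form, the three cases can be sketched as follows (a minimal sketch; the argument names are ours, and the caller is assumed to supply whether the move stays within the same dimension, together with the relevant output channel's C_I):

def f_allowed(same_dim_move, m_p, size_qj, color_qj, c_i):
    # Case (i), Eq. (7): same-dimension move, plain atomic wormhole rule.
    if same_dim_move:
        return size_qj == 0
    # Case (ii), Eq. (8): short packet (M_p = 1) injecting or changing dimension.
    if m_p == 1:
        return size_qj == 0 and color_qj != "black"
    # Case (iii), Eq. (9): long packet (M_p > 1) injecting or changing dimension.
    return size_qj == 0 and (
        (color_qj == "white" and c_i >= m_p - 1) or
        (color_qj == "gray" and c_i > 0))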
The above F(q_i, q_j) of WBFC together with DOR can realize deadlock-free deterministic routing using only one VC. To achieve adaptive routing, similar to BFC and CBS, this one VC acts as an escape resource and more VCs can be added as adaptive resources. The rules of WBFC apply only when packets wish to enter escape VCs from adaptive VCs and when they inject or change dimension in escape VCs (i.e., to keep the escape resource deadlock-free). Packets on adaptive VCs can inject or change dimension freely without being restricted by WBFC.
6.3.5 Deadlock Freedom of WBFC
Lemma 4. Worm-bubble flow control is deadlock-free for each unidirectional ring of a
torus.
Proof Sketch. Consider an arbitrary unidirectional ring y of a torus. We prove that, in each of the following possible cases, there exists a WB in the ring after packet injection.

Case (i): The maximum M_L = 1. This means that all packets are short and, initially, one gray WB is marked. Since C_I = M_p – 1 is always 0 for short packets, the gray WB can never be used during packet injection according to Equation (6). Thus, an empty gray WB always exists in ring y (as all packets are short, this case is equivalent to CBS).
Case (ii): M_L > 1 and a packet p uses a gray WB during injection. Consider the number of black and gray WBs in ring y. Initially, there are at least 1 gray WB and (M_L – 1) black WBs. A packet that uses a gray WB to inject must have C_I ≥ 1, meaning that it must have marked at least one more black WB. Consequently, there are at least 1 gray and M_L black WBs in the ring before injection, and p consumes at most 1 gray and (M_L – 1) black WBs. Therefore, at least one black WB remains in ring y after the injection. The gray WB cannot be used by other packets until p reaches the destination, at which point the consumed WBs will be released to maintain the WB counts.
Case (iii): M_L > 1 and a packet p uses only black and white WBs during injection. In this case, before injection, p has reserved at least (M_p – 1) black WBs and there is a white WB in the receiving channel. The injection of p can at most consume all of these WBs. Hence, the initially marked WBs are not touched, which includes at least (M_L – 1) ≥ 1 black WBs in ring y.

Combining cases (i)~(iii), there always exists a black or gray WB in ring y after packet injection. According to Theorem 1, ring y is deadlock-free. As y is selected arbitrarily, each ring of a torus is deadlock-free. □
6.3.6 Reducing Injection Delay
As with all bubble-based flow control mechanisms, a concern is whether the flow
control technique causes a large delay in packet injection. For example, packets in WBFC
need to wait for white or gray WBs before being able to inject. However, there are five
features in WBFC that can substantially mitigate the increase in injection delay.
First, as just mentioned above, packets on adaptive VCs are not subjected to the added rules, so they incur no extra delay. Second, short packets with M_p of 1 can be injected immediately after they see a white WB; only long packets need to wait for non-local white WBs. Third, similar to CBS, idle white and gray WBs are proactively displaced to reduce the waiting time of packets in need. Fourth, the gray WB not only serves as a starvation-avoidance token, but can also accelerate packet injection, as packets that are qualified for grabbing the gray WB can be injected immediately even if C_I has not reached (M_p – 1). Fifth, when packets arrive at the destination node, the remaining counts in C_H are passed on to the destination C_I so that later injecting packets can reuse the already marked WBs, thereby reducing the injection time.
6.3.7 Modifications to Router Microarchitecture
Figure 6.6 shows the modifications needed to implement WBFC on a typical wormhole router, with the shaded areas indicating the added or modified components.

Figure 6.6: Router microarchitecture for WBFC. The shaded blocks are added or modified to implement WBFC.

At the output port, a 2-bit Clr field is added to record the color of the downstream WB, and a C_I field is added to count the number of reserved/marked WBs for an injection channel. The counter is used whenever packets need to inject (or change dimension) into the escape VC from the injection queue or from adaptive VCs. To prevent the WBs in a unidirectional ring from being overly reserved, the counter is shared by all injection sources to an output direction. The injection source that wins the VA/SA will use the counter until the packet marks enough WBs and gets injected (note that in-transit packets can still use the output while the injecting packet is waiting for a white WB). The number of bits of the C_I field is ⌈log₂ k⌉, assuming k nodes in a ring. In addition, only the escape VC needs Clr and C_I; therefore, the overhead for the extra two fields is only 4 bits (for a 4x4 torus) or 5 bits (for an 8x8 torus) per output port.
To support the rules of WBFC, VA logic is slightly modified based on Equations
(7)~(9). However, two things are worth mentioning: (i) the logic needed to implement
(7)~(9) is no more than a few comparators and AND/OR gates, hence the hardware over-
head is small; (ii) the added logic operates largely in parallel with the original VA logic
(e.g., the checking of WB color and counter in WBFC is performed while the original VA
is checking the emptiness of the VC), thus it does not penalize the router critical path.
To control the WB transfer between routers, three control signals (wbt_a, wbt_b and
wb_clr) are added. When the black WB in the downstream router needs to be transferred
backwards (either proactively or due to incoming packets), wbt_a is asserted to request a
color transfer to the upstream router. Upon detecting the asserted wbt_a, if there is a free
white/gray WB in the upstream router, then the upstream router asserts the wbt_b signal,
puts the white/gray color on the wb_clr line, and changes the current WB color to black.
When the downstream router sees an asserted wbt_b, it changes the black WB to
white/gray color. Finally, the wbt_a and wbt_b signals are deasserted, in turn, to complete
the color transfer.
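The following minimal sketch models one such color transfer, with dictionaries standing in for the escape-VC state of the two routers (the actual design uses the three wires described above; the field names here are illustrative assumptions):

def transfer_wb_color(upstream, downstream):
    # Downstream holds a black WB and asserts wbt_a to request a backward transfer.
    if downstream["color"] != "black":
        return False
    # Upstream grants (wbt_b) only if it holds a free white or gray WB; the
    # granted color travels on wb_clr while upstream's own WB turns black.
    if upstream["empty"] and upstream["color"] in ("white", "gray"):
        downstream["color"] = upstream["color"]
        upstream["color"] = "black"
        return True
    return False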
6.4 Evaluation
The proposed WBFC is evaluated experimentally under full-system simulation using
Simics [83], with GEMS [84] and Garnet [6] for detailed timing of the memory system
and on-chip network. Orion 2.0 [65] is integrated in Garnet for NoC power and area es-
timation using technology parameters from an industrial standard 45nm CMOS process
and 1.1V operating voltage. All the key additional hardware described in Section 6.3 is modeled in the simulators and accounted for in the evaluation. A canonical wormhole
router with credit-based buffer control is assumed. Table 6.1 lists the key parameters of
the simulation configuration. With a typical 128-bit link width, short packets of 16B are
single-flit while long packets carrying 64B data plus a head flit have 5 flits. Theoretically,
buffers with wormhole switching could be as small as a single flit, but this causes long
packets to span across many routers, thus substantially increasing the performance penal-
ty of other packets if one packet is blocked. Due to these considerations, a medium buffer
size per VC (e.g., 3-flit depth) is used in the majority of simulations. The impact of buffer size is assessed separately in a sensitivity study.

Table 6.1: Configuration for evaluating worm-bubble flow control.
Network topology: 4x4 and 8x8 torus
Router: 4-stage, 2 GHz
Input buffer: 3-flit depth
Link bandwidth: 128 bits/cycle
Core model: Sun UltraSPARC III+, 2 GHz
Private I/D L1$: 32KB, 2-way, LRU, 1-cycle latency
Shared L2 per bank: 256KB, 16-way, LRU, 6-cycle latency
Cache block size: 64 Bytes
Virtual channels: 1~3 VCs per protocol class
Coherence protocol: MOESI
Memory controllers: 4, located one at each corner
Memory latency: 128 cycles

The same clock frequency is assumed for
all designs, although WBFC with fewer VCs could be clocked faster to further reduce
latency and increase throughput.
The following designs are compared: (a) WBFC-1VC: the minimal buffer configura-
tion for WBFC with DOR; (b) DL-2VC: the minimal buffer configuration for Dateline
with DOR; (c) WBFC-2VC: WBFC with adaptive routing; (d) DL-3VC: Dateline with
adaptive routing; (e) WBFC-3VC: WBFC with adaptive routing using 3VC in total. All
adaptive routing is minimal routing based on Duato’s Protocol [33], and Dateline is opti-
mized with balanced channel utilization [26]. With these five designs, we are able to
compare Dateline and the proposed WBFC under the same buffer resources as well as the
minimal buffers required for each. Buffers are assumed to be allocated atomically.
Both synthetic traffic and multi-threaded applications are used as workload. For syn-
thetic traffic, the simulator is warmed up for 10,000 cycles and the statistics are collected
over another 100,000 cycles. Four traffic patterns are simulated: uniform random (UR),
transpose (TP), bit complement (BC) and tornado (TO) [26]. Packets are uniformly as-
signed short and long lengths. Since the link width is 128 bits, the 3-bit counter C_H can be easily encoded in the head flit without requiring an extra flit. For real applications, multi-threaded PARSEC
easily coded in head flit without extra flit. For real applications, multi-threaded PARSEC
benchmarks [13] are used. Each core is warmed up for sufficiently long time and then run
until completion.
6.4.1 Performance for 4x4 Torus Network
We first examine the performance impact of Dateline and the proposed WBFC. Fig-
ure 6.7 plots the performance of these two flow control mechanisms with varying number
of VCs in a 4x4 torus. As can be seen, WBFC-1VC is able to perform normally across a
wide range of load rates under all four traffic patterns, indicating that WBFC correctly
realizes deadlock-free DOR with only one VC. This enables WBFC to achieve minimal
adaptive routing with just two VCs; whereas Dateline can only implement DOR with
2VCs. As a result, WBFC-2VC delivers much higher throughput than DL-2VC (throughput corresponds to the load rate at which the average latency is three times that of the zero-load latency), with
improvements of 46%, 98%, 8.6% and 25% for UR, TP, BC and TO, respectively. When
supplied with the same amount of three VCs, both DL-3VC and WBFC-3VC support
adaptive routing, but WBFC-3VC has more VCs serving as adaptive resources. Thus,
WBFC-3VC continues to perform better than DL-3VC for all traffic patterns.
We also observe that increasing the number of VCs can greatly improve the perfor-
mance for both techniques, although doing so may incur additional area and power cost.
This phenomenon is most evident in TP as packets routed with DOR in transpose would
severely congest a few “turn” routers, leading to low throughput. When provided with
even one adaptive VC to balance the load on links, the throughput can be improved by
118% from DL-2VC to DL-3VC and by 168% from WBFC-1VC to WBFC-2VC. In ad-
dition, although WBFC-2VC uses one fewer VC than DL-3VC, its performance is only
11% less than DL-3VC.
WBFC has large performance improvement over Dateline for UR, TP and TO, while
in BC, the improvement is relatively small. This is because DOR is the perfect routing
algorithm for bit-complement, so DL-2VC without adaptivity in routing can perform rela-
tively well. However, each packet in DL-2VC is restricted to use a certain VC to avoid
deadlock, whereas packets in WBFC-2VC have more VCs to use and circumvent con-
gested routers. As a result, WBFC-2VC still exhibits a higher throughput of 8.6% than
DL-2VC. Similarly, WBFC-3VC improves throughput by 7.2% over DL-3VC.
Figure 6.7: Performance comparison for 4x4 torus: average latency (cycles) vs. injection rate (flits/node/cycle) under (a) uniform random, (b) transpose, (c) bit-complement and (d) tornado traffic for WBFC-1VC, DL-2VC, WBFC-2VC, DL-3VC and WBFC-3VC.
6.4.2 Performance for 8x8 Torus Network
Figure 6.8 compares the performance of Dateline and WBFC in an 8x8 torus. A simi-
lar trend of WBFC performing better than Dateline can be observed for all patterns, but
with larger gaps among different designs. For instance, for the 4x4 network, the through-
put improvement in UR is 46% and 19% from DL-2VC to WBFC-2VC and from DL-
3VC to WBFC-3VC, respectively. For the 8x8 network, these gains rise to 66% and 31%,
respectively, indicating an increased benefit of WBFC over Dateline for larger network sizes.

Figure 6.8: Performance comparison for 8x8 torus: average latency (cycles) vs. injection rate (flits/node/cycle) under (a) uniform random, (b) transpose, (c) bit-complement and (d) tornado traffic.

The advantage of Dateline is that packets can be injected sooner, whereas WBFC
places more restrictions on packet injection in exchange for fewer VCs for escape and
more VCs for adaptive routing. However, as shown in the next subsection, the injection
delay in WBFC is mitigated in larger networks; hence, the gain of WBFC over Dateline
is increased in 8x8 torus.
WBFC does not have any design elements that limit its scalability. In fact, a valuable
advantage of WBFC implementation is its ability to convey global status by examining
only local information. No central coordinator is needed in solving either the small-buffer
problem or the starvation problem. Thus, WBFC has good scalability.
6.4.3 Injection Delay
As mentioned in Section 6.3.6, injection delay is an important aspect for bubble-
based flow control techniques. Figure 6.9 shows the injection delay of WBFC and Date-
line at low, medium and high uniform loads, corresponding to 10%, 50% and 90% of the throughput, respectively.

Figure 6.9: Injection delay comparisons (cycles) at 10%, 50% and 90% load for 4x4 and 8x8 tori, for WBFC-1VC, DL-2VC, WBFC-2VC, DL-3VC and WBFC-3VC.

To provide a fair comparison, the percentage in each design is
relative to its own saturation throughput. Moreover, the injection delay includes both the
delay incurred for injection and for changing dimensions on all VCs.
As can be seen, WBFC-1VC has higher injection delay than DL-2VC, for instance,
by 2.1 and 2.3 cycles under 50% load for 4x4 and 8x8, respectively. This increased injec-
tion delay is expected as WBFC places more restrictions on packet injection. However,
when compared to DL-2VC, WBFC-2VC is observed to have a reduced injection delay.
This is mainly because packets on the adaptive VC in WBFC-2VC are not subjected to
the injection restrictions and there are more packets traveling on the adaptive VC than on
the escape VC due to higher preference for adaptive resources in Duato’s Protocol.
Therefore, although the injection delay on the escape VC (not shown separately in the
figure) is similar to that for WBFC-1VC, the overall injection delay is lower than that of
DL-2VC. This trend has been observed for both 2VC and 3VC configurations under all
three load ranges. These results indicate that the waiting time for white/gray WB in
WBFC does not increase the overall injection delay much.
6.4.4 Performance for PARSEC Benchmarks
Figure 6.10 plots the execution time of PARSEC benchmarks for the five designs,
normalized to WBFC-1VC. As can be seen, WBFC-2VC/3VC performs better than DL-
2VC/3VC for all workloads, with the largest reduction of 8.0%/8.3% seen for dedup.
Overall, the execution time difference among the five designs is relatively small. Com-
pared with WBFC-1VC, the reduction in execution time for DL-2VC, WBFC-2VC, DL-
3VC and WBFC-3VC are 3.1%, 5.0%, 4.3% and 5.9% on average, respectively. However, one interesting result is that WBFC-2VC performs better than DL-3VC for blackscholes, dedup, fluidanimate and swaptions. Considering that WBFC-2VC only uses two VCs while DL-3VC uses three, WBFC-2VC is actually superior to DL-3VC in both performance and cost, as shown next.

Figure 6.10: Execution time comparisons for PARSEC benchmarks, normalized to WBFC-1VC.

6.4.5 Area Comparison

Figure 6.11 compares the router area of Dateline and WBFC with the ability of supporting deterministic and adaptive routing. All key additional overhead of WBFC has been accounted for, including the extra fields in the output unit, the modified VA logic and the control lines for transferring WBs. As area remains constant regardless of the workload, this figure represents the area for both synthetic and application scenarios.

WBFC decreases the minimal number of VCs needed to achieve deadlock-free deterministic routing from two to one. As a result, compared to DL-2VC, WBFC-1VC considerably reduces the buffer area and the control logic area by 50% and 61%, respectively,
resulting in an overall router area reduction of 17% (overhead included). As for adaptive
routing, WBFC-2VC is the minimal configuration for enabling adaptivity in routing,
whereas DL-3VC is the minimal for Dateline. Accordingly, compared to DL-3VC,
WBFC-2VC reduces buffer area, control logic area and overall area by 33%, 52% and
15%, respectively. Also plotted in the figure is the area for WBFC-3VC, which has been
shown to have the highest performance among all five designs. Since WBFC-3VC utiliz-
es the same number of VCs as DL-3VC, the main difference lies in the area overhead
caused by WBFC, which accounts for only 3.4% of the total router area.
6.4.6 Energy
Figure 6.12 presents the router energy from running PARSEC benchmarks. Results
are normalized to DL-3VC, which has the highest energy. The dynamic (and static) energy cost of additional hardware in WBFC has been lumped into the dynamic (and control logic static) energy consumption.
Figure 6.11: Router area comparisons: breakdown of router area (buffer, crossbar, WBFC overhead, control logic) for the deterministic and adaptive configurations.
The router energy is determined by both power consumption and execution time. On
average, although WBFC-1VC has the longest execution time, it still achieves the lowest
energy consumption, with a reduction of 53.4% in static energy and 27.2% in total energy
compared to DL-3VC. WBFC-2VC and DL-2VC also reduce the overall router energy by
approximately 15% compared to DL-3VC. Besides, WBFC-2VC (and WBFC-3VC) has
lower energy than DL-2VC (and DL-3VC) due to the shorter execution time. Moreover,
the energy consumption in control logic also drops as the number of VCs decreases,
demonstrating the impact of reducing the number of VCs on control logic in addition to
buffer resources.
6.4.7 Impact of Buffer Size
In this subsection, we study the impact of buffer size on the performance of Dateline
and WBFC. The buffer size is configured to be as small as a single flit, and as large as an
entire long packet of 5 flits.

Figure 6.12: Overall router energy breakdown (buffer static, control static, crossbar static and dynamic) for the PARSEC benchmarks, normalized to DL-3VC.

Figure 6.13 shows the results for an 8x8 torus with uniform traffic. The suffix of the names indicates the buffer size (e.g., 3F means the buffer is configured to be 3 flits deep).
As seen from the figure, WBFC exhibits higher throughput than Dateline for all three
buffer sizes, with an improvement of 42.8%, 30.8% and 21% for 1-flit, 3-flit and 5-flit
buffers, respectively. For each technique (either Dateline or WBFC), the throughput in-
creases when the buffer size increases, just as expected. However, it is also observed that
WBFC-3VC using 3-flit buffer can actually outperform DL-3VC using 5-flit buffer, with
a throughput gain of 13.3%. This highlights the effectiveness of WBFC in utilizing buffer
resources.
6.5 Applications and Extensions
The proposed WBFC is a very important and useful flow control mechanism for
wormhole switching. The theoretical support and implementation approach proposed can
be applied to other designs.
Figure 6.13: Impact of buffer size: average latency vs. injection rate for DL-3VC and WBFC-3VC with 1-flit, 3-flit and 5-flit buffers.
WBFC can be generally applied to any k-ary n-cube network. Furthermore, Theorem
1 does not limit the unidirectional ring to be only the ring in a torus. In fact, any worm-
hole-switched topology that contains a ring(s) can use the theoretical support and imple-
mentation of WBFC to avoid deadlock within the ring(s). For example, the Rotary Router
[1] is a novel yet very efficient router architecture. However, its current instantiation is
limited to only VCT switching as it relies on BFC to avoid the deadlock in its internal
ring. With WBFC proposed in this work, the Rotary Router can be implemented using
wormhole switching as well, which could greatly increase the buffer utilization. Similarly,
hierarchical rings [102], augmented ring networks [7] and other ring-based topologies can
utilize WBFC to avoid deadlock in the ring(s).
Thus far, in this analysis, atomic buffer allocation and buffer sizes being smaller than
packet sizes for wormhole networks have been assumed, which is the most common but
most difficult case. However, the WBFC developed for this case is also a general ap-
proach that can be extended to handle other more relaxed wormhole switching assump-
tions. Depending on whether the buffer allocation is atomic and whether the buffer size is
smaller than packet size, there are four cases in total:
(a) Atomic buffer allocation with buffer size smaller than packet size. In this case,
WBFC can be used directly.
(b) Atomic buffer allocation with buffer size larger than packet size. This case should
not happen in practice, as it is a waste of resources to make the buffer larger than a packet
when at most one packet is allowed in the buffer.
(c) Non-atomic buffer allocation with buffer size larger than packet size. This case re-
laxes all the more restrictive assumptions of wormhole switching. A direct extension of
CBS mentioned in Section 6.1.1 can be used, in which a wormhole flit-sized buffer is
marked as the critical bubble (instead of a packet-sized buffer in the original CBS), and
everything else follows similarly as with CBS.
(d) Non-atomic buffer allocation with buffer size smaller than packet size. In this case, packets can enter the downstream channel as long as there is a free flit-sized buffer, as opposed to a free VC-sized buffer in atomic buffer allocation. Therefore, we can re-define a WB to be a flit-sized buffer, and everything else is the same as before. Note that with this definition, M_p equals L(p). Basically, injecting packets need to reserve M_p white WBs (i.e., M_p free flit-sized buffers) in the ring before being injected, and then progressively free the reserved buffers when traversing along the ring. The gray WB also works in the same way to avoid starvation. Any packet that has partially completed the reservation process can grab the gray WB and be injected instantly. Essentially, all the rules of WBFC operate at the flit-size level instead of at the VC-size level, and the procedures remain the same.
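In terms of the m_value sketch given earlier in this chapter, case (d) simply corresponds to a worm-bubble capacity of one flit:

# Case (d): a WB is redefined as a single flit-sized buffer, so cap(wb) = 1
# and M_p = L(p); the earlier m_value helper captures this directly.
assert m_value(5, 1) == 5   # a 5-flit packet must reserve 5 flit-sized WBs
assert m_value(1, 1) == 1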
6.6 Summary
Resource-constrained on-chip networks demand deadlock-free flow control that im-
poses minimal VC requirements and, at the same time, supports wormhole switching.
The Dateline technique requires at least two VCs per link to avoid the deadlock caused
by torus wraparound links while BFC reduces that to one VC but can only work for VCT
switching. WBFC can avoid deadlock in wormhole-switched torus networks with only
one VC while preserving wormhole’s property of allowing minimally 1-flit-sized buffer
per VC. By using worm-bubbles with an appropriate coloring scheme, WBFC overcomes
the small-buffer problem and the starvation problem that have previously hindered the
extension of BFC from VCT to wormhole. In addition to the theoretical contribution of
WBFC, simulation results show that, compared to optimized Dateline, WBFC avoids
deadlock with lower minimal resource requirements, and achieves higher throughput
when configured with the same amount of buffer resources. Moreover, when configured
minimally, the correctly operational and deadlock-free WBFC can reap significant sav-
ings in router area and power, and allow more resources to be powered off if used in
combination with further power optimizations such as fine-grained power-gating schemes.
Thus, WBFC provides a wider spectrum of cost-performance trade-offs, enabling practi-
cal usefulness.
Chapter 7
Improving Resource Utilization by Interference
Reduction
In addition to reducing the physical resource requirements, NoC resource efficiency
can also be improved by increasing resource utilization. This can lead to higher perfor-
mance when given the same number of resources, or achieve similar performance but
demanding fewer resources and lower power, particularly if the unneeded resources are
powered-off. This chapter discusses the opportunities in improving NoC resource utiliza-
tion in many-core systems. These systems are capable of hosting multiple concurrently
running applications, and the traffic characteristics of NoCs may exhibit new regional
behaviors. By recognizing and exploiting these traffic behaviors, the effectiveness of
NoC interference reduction and the NoC resource utilization can be greatly improved.
However, few works have investigated these regional behaviors and their potential im-
pact on interference, leaving the opportunity largely unexplored. In this chapter, regional
behaviors in NoC are identified and characterized. A region-aware interference reduction
technique, RAIR [21], is presented that not only removes any restrictions on the inter-
region traffic patterns but also captures and exploits regional behavior throughout the
design, thus improving the effectiveness of interference reduction and resource utiliza-
tion.
7.1 Regionalized NoC and Its Opportunities
7.1.1 Formation of Regionalized NoC (RNoC)
A Regionalized Network-on-Chip (RNoC) refers to an on-chip network in which traf-
fic exhibits clustered regional patterns, as if the network is divided into multiple regions.
A conventional NoC can be considered as a special case of RNoC in which the number of
regions equals one. To better illustrate the concept of RNoC, three examples are de-
scribed below, each of which represents a class of techniques having a similar goal of
leveraging non-uniformity in many-core chips.
Example 1: Application-to-core mapping
The first class of techniques that leads to RNoC is application-to-core mapping. This
class of optimizations is based on the observation that, in MPSoCs or CMPs with multi-
ple concurrently running applications, each application may in turn spawn a few parallel
threads working simultaneously on multiple cores. With non-uniform core-to-core dis-
tance in large many-core chips, significant performance improvement can be achieved
when frequently communicating threads belonging to the same application can be placed
closer by mapping them to cores with smaller hop distance (the terms core and node are used interchangeably, as a core in a chip is abstracted to a node in the corresponding NoC). For example, by using mapping policies based on the above intuition instead of randomly assigning threads to cores, over 53% bandwidth savings and 23% delay reduction are achieved in [94] and [117], respectively. This is mainly because most of the previously chip-wide long-distance traffic is now converted to short-distance communications, thereby reducing both traffic vol-
ume and latency. A direct result of these proximity-based optimizations is the formation
of regionalized NoC as, essentially, the multiple collaborating threads of an application
are clustered into physically-close groups through the mapping process, and the majority
of traffic occurs within each application’s region with a small fraction of traffic travers-
ing among regions.
Example 2: Cooperative cache structure
At the system level, there are a number of recent proposals that initially target critical
problems in system components other than the NoC but in fact, indirectly precipitate the
occurrence of RNoC. For instance, several techniques have been proposed to optimize
cache structure for many-core chips [52, 58, 69]. The idea of these approaches is that,
instead of uniformly distributing data in cache banks across the chip, active data that are
needed by an application are adaptively moved closer to the running threads of that appli-
cation. In this way, cache access time can be greatly reduced. In [52], for example, L2
cache access latency is reduced by 33%~54%, most of which comes from the reduced
traffic hops (e.g., request and reply messages). Notice that, as more cached data can be
accessed in the local or nearby nodes, a considerable amount of chip-wide traffic is trans-
formed into regional traffic, essentially resulting in an RNoC configuration.
Example 3: Coherence protocol optimization
Another example is coherence protocol optimization for on-chip server consolidation,
where multiple virtual machines (VMs) run concurrently in a CMP, with each VM having
a designated region. In [86], a two-level coherence hierarchy is proposed to accelerate
coherence transactions based on dynamic home nodes. The basic idea is to select the
home node of cache lines wisely so that the amount of protocol messages that need to
traverse outside their designated region are minimized. As a result, the cycle-per-
transaction is reduced by 15%~65% depending on the applications, indicating that a large
percentage of the protocol transactions have been satisfied within a region. Therefore,
although the intention is to optimize coherence protocols, a side-effect is an increase in
intra-region traffic and a decrease in inter-region traffic, thus indirectly changing the traf-
fic pattern toward RNoC.
7.1.2 Regional Behaviors of RNoC
In the above examples, while the degree of “regionality” may differ, there are some
common influences and behaviors affecting regionality that can be identified and summa-
rized by the following regional behaviors (RBs):
RB-1: Multiple applications run concurrently in a many-core chip, each consisting of
one or more nodes;
RB-2: Nodes belonging to the same application are often clustered into a region;
RB-3: The majority of traffic becomes intra-region, with a small fraction being inter-region; and
RB-4: Different regions may have heterogeneous traffic characteristics (e.g., different intensity).
These regional behaviors raise new challenges and opportunities for traffic interfer-
ence reduction in on-chip networks. For instance, not only can an application running in
one region benefit from interference reduction, but it may also require interference reduc-
tion from applications in other regions. This is particularly true in the above server con-
solidation example where if one VM goes awry or is under malicious attack, the remain-
ing VMs should be minimally affected. However, interference reduction in RNoCs is
much harder to achieve than before as we can no longer strictly confine packets to flow
within a region; otherwise, an application would be unable to access certain shared re-
sources, such as memory controllers located outside its region, to perform normal opera-
tions. In addition, by taking into account the regional behaviors exhibited by RNoCs, it is
possible to achieve more effective interference reduction and, thus, higher performance
improvement for multiple concurrently running applications.
7.1.3 Intra-region Traffic vs. Inter-region Traffic
To facilitate the following discussion, a few terms are defined that are helpful in illus-
trating the traffic properties in RNoC. The terms regional traffic and global traffic refer
to the intra-region traffic portion and inter-region traffic portion of an application, respec-
tively. Using the previous examples, global traffic can be the traffic to and from memory
controllers located outside the region, the requests and replies of cooperative cache data
in other regions due to misses in the local region, and the inter-VM data sharing in server
consolidation. In addition, for application A mapped to region R of the RNoC, native traf-
fic refers to the traffic in region R that belongs to application A, and foreign traffic refers
to the traffic currently traversing region R but belonging to applications not mapped to
that region. For example, global traffic of application B (not mapped to region R) is for-
eign traffic of region R when traversing R.
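These four viewpoints can be made concrete with a small illustrative sketch (the function and argument names are ours, and the region assignments are assumed to come from the application-to-core mapping):

def classify_traffic(src_region, dst_region, app_region, observed_region):
    # scope: regional if the transfer stays inside the application's region,
    # global otherwise; origin: native or foreign from the viewpoint of the
    # region currently being traversed.
    scope = "regional" if src_region == dst_region == app_region else "global"
    origin = "native" if app_region == observed_region else "foreign"
    return scope, origin

# An application mapped to region 0 requesting data from region 2, with the
# packet currently crossing region 1: global traffic, foreign to region 1.
assert classify_traffic(0, 2, 0, 1) == ("global", "foreign")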
In many-core chips, global traffic is usually more critical to performance than region-
al traffic. Figure 7.1 depicts a simple example of instruction execution in a core during a
program’s execution. Suppose the two LOAD instructions on the left miss in the local
cache and, therefore, packets P1 and P2 are sent to request data on other nodes. ADD
cannot be computed until both requested data are returned. Assume P1 is regional traffic
(i.e., the request can be satisfied within the region where the core belongs). Now consider
the request in P2. As shown in the figure, if P2 is also regional traffic, its latency should
be similar to that of P1, and the reply of P2 is back only slightly after the reply of P1. In
other words, the latency of P2 can be largely overlapped with the latency of P1 (in a bet-
ter case, the reply of P2 may be back sooner than that of P1, and the latency is complete-
ly overlapped). However, if P2’ is global traffic (i.e., the request can only be satisfied by
the node in some other region), then the reply of P2’ comes back much later than that for
P1. As a result, a large portion of the latency of P2’ cannot be overlapped, which incurs
additional stall cycles directly on the critical path of the program’s execution. Therefore,
compared with regional traffic, global traffic is more likely to have higher criticality. An-
other factor that affects the criticality of regional and global traffic is load intensity. As
observed in [28], low intensity traffic is usually more critical than high intensity traffic
(details will be discussed in later sections). According to RB-3, global traffic in the RNoC is likely to have lower load than regional traffic, thus making global traffic even more critical.

Figure 7.1: Example of performance criticality for regional and global traffic.
7.2 Problems with Existing Related Techniques
A few techniques have been proposed to tackle interference reduction in on-chip net-
works, but they all have either no or very limited region-awareness, as discussed below.
7.2.1 Region-oblivious Techniques
Region-oblivious techniques are those that do not distinguish the characteristics of
different regions in the on-chip network, thus are inherently unable to exploit regional
behavior to achieve effective interference reduction. The early proposals in this category,
such as round-robin and oldest-first, are both application- and region-oblivious. In these
techniques, due to application-unawareness, packets that are critical to an application are
treated equally with packets that are not critical to another application such that more
performance-critical packets are not accelerated over less critical ones. When applied in
RNoCs, these techniques are even more ineffective as they also treat the regional traffic
equally with global traffic, which is usually not the case considering their differences in
packet latency and traffic intensity.
Recent region-oblivious techniques incorporate application-awareness, which avoids
the disadvantages of the application-oblivious ones by taking application characteristics
into consideration. They are, however, still subject to the limitations caused by their re-
gion-oblivious nature. For example, STC [28] is an application-aware interference reduc-
tion technique for conventional NoCs, based on ranking the relative importance of appli-
cations. Among concurrently running applications, STC prioritizes network non-intensive
applications (in terms of L1 misses per instruction) over network intensive applications,
and for packets belonging to the same application, a round-robin technique is used. The
rationale behind STC is the following: 1) low intensity applications issue requests rela-
tively infrequently, so prioritizing these requests allows the applications to make faster
progress without burdening the network much; 2) low intensity applications are likely to
have low MLP and, hence, their requests are likely to be stall-time critical.
While STC has been shown to be effective for conventional NoCs, it does not consid-
er the regional layout (RB-1 and RB-2 regional behaviors) of RNoC and the resulting
regional and global traffic classification (RB-3). Consequently, if used in RNoCs, the
prioritization of STC is suboptimal both within and among applications. First, within an
application, regional traffic and global traffic are always treated equally in STC when
they should not be because of the different criticalities of each traffic type as mentioned
before. Second, among applications, as Figure 7.2 shows, STC prioritizes both regional
and global traffic of network non-intensive application A over network intensive application B (left scenario). However, because of its region unawareness, STC cannot recognize and achieve prioritizations that differentiate regional and global traffic, such as the one shown in the right scenario. Also, to avoid starvation, packets in STC are divided into batches (e.g., based on time) and older batches always have higher priority than younger batches. In a regionalized NoC, for example, if a VM assigned to a region becomes faulty and injects a considerable amount of packets, batching may unnecessarily prioritize those packets over other normal but younger packets.

Figure 7.2: Prioritization on a per-application basis is possible in STC (left: A.regional and A.global prioritized over B.regional and B.global) but not on a per-region basis (right: A.global and B.global prioritized over A.regional and B.regional).
In summary, without further work to incorporate region-awareness, the above tech-
niques proposed for conventional NoCs would have limited effectiveness when applied to
regionalized NoCs.
7.2.2 Region-aware Techniques
Region-aware techniques aim to reduce interference in regionalized NoCs by taking
into account regional behaviors in the first place. Techniques in this category can be fur-
ther classified based on whether they place restrictions on traffic patterns or not. Restrict-
ed interference reduction techniques assume that certain traffic patterns are disallowed in
the network in order to minimize interference. For example, global traffic may not be
allowed or can only be entered into certain areas of the NoC. In contrast, non-restricted
interference reduction techniques do not place any restrictions on traffic patterns to re-
duce interference, hence have broader application but are also much harder to realize.
Logic-Based Distributed Routing (LBDR) [38, 113] reduces region-aware interfer-
ence by confining packets to traverse only within a designated region via routing re-
strictions. Since no global traffic outside a region is allowed, LBDR is a restricted tech-
nique. As each application must also access the memory controllers (MCs) for on-chip
cache misses, LBDR requires each region to contain at least one MC. For example, the
mapping in Figure 7.3(a) is a viable configuration but the mapping in Figure 7.3(b) is
invalid as the two middle regions cannot access any MC. In fact, with 16 cores, 4 MCs
and 4 applications (each having 4 threads), only about 14% of all possible configurations are allowed, which greatly restricts the opportunity to find
the optimal application-to-core mapping. Moreover, the number of regions that can be
accommodated on the chip is at most the number of MCs. As a reference, Intel’s recent
48-core SCC chip [50] has 4 MCs, supporting only a maximum of 4 regions if LBDR is
used. Therefore, the restrictions placed on the global traffic in LBDR result in severe
limitations.
Figure 7.3: (a) A valid mapping and (b) an invalid mapping in LBDR.
In [46], Kilo-NoC is proposed as a low overhead quality-of-service (QoS) scheme.
Although it is designed for on-chip service guarantees, it can be extended for the purpose
of interference reduction. Kilo-NoC proposes an elaborate design by taking advantage of
certain topologies. QoS rules are imposed only on a few selected routers in the so-called
shared regions (SRs). Nevertheless, the global traffic in Kilo-NoC is restricted. For in-
stance, the global traffic may need to use a single-hop long-distance connection to “by-
pass” the intermediate region and reach the SR. In addition, global traffic may need to
detour through the SR in order to enforce QoS. More importantly, Kilo-NoC greatly de-
pends on MECS-like topologies that can provide rich connectivity. These restrictions that
Kilo-NoC places on the flow of global traffic and on the underlying network topologies
limit its usefulness to only particular RNoC scenarios.
A non-restricted technique, DBAR, was recently proposed by Ma et al. [80]. As
pointed out by the authors, DBAR not only is a mechanism for load-balanced routing but
also serves as a technique to reduce inter-region interference. The novelty is that, in the
selection function, DBAR discards the redundant information generated by other regions,
thus reducing interference among different regions. Figure 7.4 illustrates this idea. Sup-
pose region R0 has low load and regions R1-R3 have high load as given by their shading.
A packet destined to X is currently in router S, so it needs to evaluate the congestion sta-
tus of the east and south directions. Different from a prior congestion-aware technique
[42] that uses congestion information along the entire path from S-to-C and S-to-E,
DBAR only uses the congestion information along the path from S-to-A and S-to-D. In
this way, the high-load status of regions R1 and R2 will not interfere with packets in re-
gion R0. Using the previously defined terms, DBAR successfully avoids interference
between native traffic of different regions (e.g., no interference between the native traffic
in R0 and the native traffic in R1).
However, DBAR cannot reduce interference between the native and foreign traffic of
a given region. Consider a packet that is sourced from S, destined to Y and currently in B
(i.e., it becomes foreign traffic with respect to R1). Now, even with DBAR, regardless of
which direction the packet will take next, it will inevitably be slowed down by the heavy
load condition in R1. Therefore, although inter-region traffic is not strictly disallowed
(i.e., unlike LBDR), DBAR only works best when all packets are intra-regional. As inter-
region traffic must still be supported in RNoCs, a more effective way would be to recog-
nize all the four regional behaviors involving both intra-region and inter-region traffic
and reduce their interference accordingly.

Figure 7.4: DBAR becomes ineffective when packets traverse outside the originating region (more heavily loaded regions have darker shade).

In summary, on the one hand, priority-based region-oblivious techniques do not place any restrictions on traffic patterns, but their inherent unawareness of regional behaviors greatly limits their usefulness in RNoCs. On the other hand, current region-aware techniques are built based on the regional behaviors of RNoCs, but they either place strict
restrictions on traffic patterns, or reduce only part of the possible interference in RNoCs,
which limit their effectiveness in generic RNoCs. A more effective technique is proposed
in the next section.
7.3 Region-aware Interference Reduction
7.3.1 The Basic Idea
We propose RAIR (region-aware interference reduction), which captures the regional
behaviors of RNoCs to minimize interference for generic RNoCs without any restrictions
on traffic patterns. To achieve this, we propose three mechanisms that work cooperatively
to meet the existing challenges. The first mechanism, VC regionalization, answers the
question of how inter-region traffic can traverse freely across the chip while still being
treated differently from intra-region traffic. The second mechanism, multi-stage
prioritization, solves the problem of how to efficiently and effectively reduce traffic
interference in different stages of the pipelined router microarchitecture. The third
mechanism, dynamic priority adaptation, addresses the issue of how to recognize and
utilize load heterogeneity among regions while providing starvation avoidance. All three
mechanisms take advantage of the regional behaviors exhibited in RNoCs, thus
improving the effectiveness of interference reduction.
7.3.2 Removing Restrictions
To allow inter-region traffic to use any physical channel freely in the chip but still be
differentiated from intra-region traffic of other applications, we need some way of sepa-
rating the shared physical resources to reduce interference. In the first mechanism, VC
regionalization, virtual channels (VCs) associated with a physical channel are classified
into regional VCs and global VCs. Both classes of VCs can be used by any of the native
or foreign traffic, but different prioritization is imposed in each class so that native traffic
and foreign traffic are treated differently. Specifically, a 1-bit field is tagged to each VC
to indicate the classification. Figure 7.5 shows one possible example where VC0 and
VC1 are global VCs and VC2 and VC3 are regional VCs. The prioritization policy is that,
in the global VCs, foreign traffic always has higher priority than native traffic to reflect
that foreign traffic is typically more performance critical because of its global nature. In
the regional VCs, however, the priority between native and foreign traffic is configured
dynamically to reflect the intensity difference between native and foreign traffic at
runtime, in addition to the latency factor. This dynamic priority configuration is set by
DPA logic (dynamic priority adaptation) shown in Figure 7.5, which makes the decision
based on the relative criticality among different traffic types (details will be discussed in
Section 7.3.4).
Overall, by classifying virtual channels and adopting dynamic prioritization, VC
regionalization achieves the following advantages. First, inter-region traffic can freely
traverse any physical channel allowed by the routing algorithm. Meanwhile, when
necessary, it can be minimally affected by intra-region traffic by traversing global VCs
(which always give it higher priority). Second, VC classification is realized using priority
instead of strict partitioning, which allows each type of traffic to access any virtual
channel (though with different priority). Hence, no VC resource is wasted even when one
type of traffic is absent. Third, each region only needs to maintain information for two
flows, namely native and foreign traffic. If the foreign traffic consists of global traffic
from multiple applications, round-robin is used within the foreign traffic based on two
insights: 1) if multiple contending flows have light loads, priority-based policies can reap
only marginal benefit, as the contention is minimal in the first place; 2) global traffic
indeed has low load according to RB-3, which results from the initial intention of RNoC
to minimize chip-wide communication. Therefore, RAIR targets reducing the primary
contention between native and foreign traffic, and uses simple fair arbitration within the
foreign traffic to reduce complexity.
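To make these rules concrete, the following C++ sketch illustrates how the 1-bit VC
class tag and the two prioritization policies of VC regionalization could be expressed.
This is a minimal sketch with hypothetical names (VCClass, Traffic, hasPriority,
dpaNativeHigh); it is not taken from an actual implementation of RAIR.

    #include <cstdint>

    // The 1-bit classification tag attached to each virtual channel.
    enum class VCClass : std::uint8_t { Global, Regional };

    // Traffic type of a packet with respect to the current region: native if
    // the packet's application number matches the one tagged on the router.
    enum class Traffic { Native, Foreign };

    // Returns true if traffic type 'a' wins over 'b' when competing for a VC
    // of class 'vc'. 'dpaNativeHigh' is the priority computed by the DPA
    // logic (Section 7.3.4) and applies only to regional VCs.
    bool hasPriority(Traffic a, Traffic b, VCClass vc, bool dpaNativeHigh) {
        if (a == b) return false;          // same type: fall back to round-robin
        if (vc == VCClass::Global)         // global VCs: foreign traffic always
            return a == Traffic::Foreign;  // wins, reflecting its criticality
        // Regional VCs: priority is configured dynamically by DPA.
        return dpaNativeHigh ? (a == Traffic::Native) : (a == Traffic::Foreign);
    }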
Figure 7.5: RAIR router microarchitecture. Light-shaded blocks are added; dark-
shaded blocks are modified.
7.3.3 Enforcing Prioritization
While VC regionalization fulfills the objective of removing restrictions on traffic pat-
terns by making virtual channels “aware” of the existence of regions, it accomplishes
only part of the objective of reducing interference, as VCs are not the only resource
that traffic flows share and compete for. There are multiple arbiters within a router
to allocate different shared resources to consumers. In order to reduce interference effec-
tively among traffic flows, a second mechanism, multi-stage prioritization (MSP) is pro-
posed, which not only enforces prioritization in multiple arbitration steps in the router,
but also takes into account the characteristics of each step and the regional behaviors in
optimizing the prioritization. In a NoC composed of canonical routers, each hop consists
of routing computation (RC), VC allocation (VA), switch allocation (SA), switch tra-
versal (ST) and link traversal (LT). There are four major arbitration steps in these stages,
and the design choices for each step are explained below.
VA input arbitration (VA_in). In general, the routing function may return multiple
valid output VCs for a given input VC. For example, in Figure 7.6(a), input VC0 is al-
lowed by the routing function to request output VC0, 2 and 4. The function of VA_in is
to return one of these requests for each input VC. For example, for input VC0, the request
to output VC0 is granted (solid arrow) and the other two are denied (dotted arrows). Note
that each input VC performs arbitration independently and is free to request any desired
output VC without contention from other input VCs. Therefore, in the proposed MSP, no
change is made to VA_in, as different traffic flows do not yet contend with each other at
this arbitration step. Hence, MSP incurs no additional performance loss for VA_in.
VA output arbitration (VA_out). After VA_in, an output VC may receive requests
from multiple input VCs. For example, in Figure 7.6(a), output VC0 may be requested by
input VC0, 3 and 4. The function of VA_out is to arbitrate among these requests and
grant at most one winner. This is the arbitration step where we implement the priority
policy of VC regionalization. As discussed in the previous subsection, the output VCs
tagged as global VCs are allocated with higher priority to the input VC that contains
packets of foreign traffic, whereas the output VCs tagged as regional VCs are allocated to
the input VCs with priority determined by the DPA logic. These two classes of output
VCs with differentiated priorities implicitly act as two non-strictly separated resources
for native traffic and foreign traffic. Such separation prior to the SA stage greatly reduces
the chances of priority inversion and increases the effectiveness of interference reduction.
Also, in this step of VA_out, MSP still maintains high VC utilization as the prioritization
will not leave an output VC idle if it is requested by any input VC.
Figure 7.6: Arbitration steps in the (a) VA and (b) SA stages.
SA input (SA_in) and SA output (SA_out) arbitration. Now that each requesting
input VC has been allocated a distinct output VC, the SA stage sets up the crossbar.
Each input port has multiple input VCs, so the function of SA_in is to select one request-
ing input VC within an input port. For example, in Figure 7.6(b), input port 0 selects in-
put VC0 among inputs VC0~3. After SA_in, as multiple input ports may request the
same output port, SA_out is used to choose one winner from these requests. For example,
input port 1 is granted to traverse the crossbar to output port 0. During both SA_in and
SA_out arbitrations, MSP chooses either the native or foreign traffic to have higher prior-
ity as determined by DPA logic to enforce prioritization. Again, prioritization in this step
of MSP neither degrades crossbar utilization nor wastes any bandwidth compared with a
round-robin policy, as no resource is left idle if it is requested by any input VC. Note that
the same priority produced by DPA logic is used for VA_out, SA_in and SA_out at a
given time, so the priority is consistent among the different stages.
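As an illustration of how MSP-style prioritization can remain work-conserving, the
following C++ sketch shows one possible VA_out arbiter: it first scans for a request of
the currently preferred traffic type, and otherwise grants any requester round-robin so
that no output VC is left idle. The structure and names (Request, vaOutArbitrate,
foreignHigh) are hypothetical, not taken from an actual RAIR implementation.

    #include <cstddef>
    #include <vector>

    // One request to an output VC from an input VC.
    struct Request {
        std::size_t inputVC;  // id of the requesting input VC
        bool foreign;         // true if the packet is foreign to this region
    };

    // Grant one request (returns its index in 'reqs'; assumes reqs is
    // non-empty). 'foreignHigh' is the effective priority for this output
    // VC: always true for a global VC, or the DPA result for a regional VC.
    // 'rrPtr' is a per-output-VC round-robin pointer, updated on a grant.
    std::size_t vaOutArbitrate(const std::vector<Request>& reqs,
                               bool foreignHigh, std::size_t& rrPtr) {
        const std::size_t n = reqs.size();
        // First pass: prefer the higher-priority traffic type, scanning
        // round-robin for fairness within that type.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t idx = (rrPtr + i) % n;
            if (reqs[idx].foreign == foreignHigh) {
                rrPtr = (idx + 1) % n;
                return idx;
            }
        }
        // No request of the preferred type: grant the next requester, so
        // the output VC never idles while it is requested (work-conserving).
        const std::size_t idx = rrPtr % n;
        rrPtr = (idx + 1) % n;
        return idx;
    }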
7.3.4 Utilizing Load Heterogeneity
We now discuss why it is indispensable to have the DPA logic to dynamically deter-
mine priority between native and foreign traffic, and how the priority can be configured
appropriately. On-chip networks typically exhibit load heterogeneity as long as more than
one application is running simultaneously. Recall that, in conventional NoCs, STC utiliz-
es load heterogeneity by prioritizing network non-intensive applications over network
intensive applications. In regionalized NoCs, while a similar relationship exists between
network intensity and criticality, additional care must be taken to address the new load
heterogeneity across regions (RB-4) to avoid potential abnormalities and perfor-
mance degradation. To achieve this, a third mechanism, dynamic priority adaptation
(DPA) prioritizes native and foreign traffic according to their relative criticality. To fa-
cilitate illustration, we analyze the prioritization of DPA from the viewpoint of any cho-
sen region R with application A assigned to it. Assume currently R also has inter-region
traffic of another application B (i.e., B is assigned to another region). As defined, traffic
belonging to A is the native traffic to R and traffic belonging to B is the foreign traffic to
R. Since the relative performance criticality depends on the nature of the traffic (i.e., re-
gional or global) and on the traffic intensity, there are three different cases:
(1) A and B have similar overall network intensity (i.e., both low load or both high
load). Since the foreign traffic is only a small portion of the overall traffic of B, the load
of foreign traffic in R would be lower than the load of native traffic. Therefore, to benefit
most from the prioritization, DPA gives higher priority to the foreign traffic as it has low-
er intensity (indicating higher criticality) and is global traffic (also indicating higher criti-
cality).
(2) A is more network intensive than B (i.e., A has high load whereas B has low load).
Considering that the foreign traffic is a small portion of the overall traffic of B, the load
of foreign traffic would be much lower than the load of native traffic. Thus, DPA gives
higher priority to the foreign traffic for the similar reasons as in (1).
(3) A is less network intensive than B (i.e., A has low load whereas B has high load).
In this case, determining the relative criticality between native traffic and foreign traffic
is subtle. On the one hand, the low intensity of A would suggest giving higher priority to
the native traffic, similar to the motivation of STC. On the other hand, the global nature
of the foreign traffic would suggest giving higher priority to the foreign traffic. To
balance these two factors while still utilizing the criticality characteristics of global
traffic, DPA gives higher priority to foreign traffic by default, but reverses the priority as
soon as the intensity of foreign traffic exceeds that of native traffic.
Combining (1)~(3), for a given region with DPA, native traffic is given higher priori-
ty only when the relative intensity of foreign traffic is larger than that of native traffic;
otherwise, foreign traffic is given higher priority. As foreign traffic is the minority com-
ponent of total traffic according to RB-3, foreign traffic will have a larger chance of be-
ing higher priority than native traffic, which is consistent with the observation that global
traffic is usually more critical than regional traffic. To estimate the relative intensity, pri-
or work [11] shows that the number of occupied VCs in an input port is a strong indicator
of the load status. Therefore, DPA uses similar VC information to assess intensity, but
with two additional techniques to tolerate variance. First, the status of all VCs in a router
is accounted for in counting the number of occupied VCs for native (OVC_n) and foreign
traffic (OVC_f), instead of only the input port in which the requesting packet resides. This
mitigates the inaccuracy caused by the non-uniform VC status among different ports.
Second, hysteresis is used for priority transition. As shown in Figure 7.7, the priority of
native traffic does not transition from low to high immediately after the ratio r of OVC_f
over OVC_n is greater than 1; instead, there is a hysteresis process in which the priority
only transitions after the ratio is greater than (1+ ∆). Similarly, the priority transits from
high to low only after the ratio becomes smaller than (1 - ∆). As can be observed from
simulation, values of ∆ between 0.1 and 0.3 typically render better performance, with the
best case achieved at around 0.2, which is the value assumed for ∆ in our evaluation.
This hysteresis transition helps to tolerate the temporal variation of VC utilization in a
router.
7.3.5 Avoiding Starvation and Deadlock
The above implementation of DPA already avoids the starvation induced by
prioritization. This is because the relative priority and the ratio form a negative feedback
loop. For example, if native traffic occupies too many resources, as indicated by a very
low ratio, it is switched to low priority. A similar negative feedback loop exists for
foreign traffic as well. Thus, no starvation can occur in DPA due to this self-throttling
attribute.
Regarding deadlock, unlike [38, 46, 80], none of the three proposed mechanisms con-
stituting RAIR places any restrictions on traffic patterns or routing, so virtually any dead-
lock avoidance or recovery routing algorithms can be incorporated in RAIR to achieve
load balance. We evaluate the effectiveness of RAIR with two different adaptive routing
algorithms. When using deadlock-avoidance routing algorithms based on Duato's theory
[33], if a coherence protocol such as MOESI has multiple message classes, each message
class is provided with one additional set of escape VCs. However, all message classes
can share the same set of regional VCs and global VCs.
Figure 7.7: Hysteresis priority transition for native traffic.
7.3.6 Putting It All Together and Router Microarchitecture
The three mechanisms with any deadlock-free routing algorithm compose the pro-
posed RAIR technique to reduce interference for RNoCs. Figure 7.5 shows the modifica-
tions needed to implement RAIR. Each router is tagged with the application number that
is assigned to it, and each packet carries the application number to which it belongs.
When traversing a router, a packet is identified as either native traffic if the above two
application numbers match or foreign traffic if not. The DPA logic keeps track of OVC_n
and OVC_f in two registers and determines the relative priority between native and for-
eign traffic by comparing the two register values in the hysteresis manner. The calculated
priority is used in the VA_out, SA_in and SA_out arbitration steps. Packets then go through
these steps similar to the pipeline of canonical routers but using the region-aware prioriti-
zation rules of VC regionalization and MSP. As RAIR introduces additional control de-
pendence between DPA and VA/SA, to remove the delay of DPA logic from the critical
path, we use the priority calculated from the previous cycle. This is based on the fact that
the intensity difference between two consecutive cycles is insignificant and can largely be
filtered by the hysteresis transition. Overall, RAIR does not impose any particular re-
striction on the traffic patterns and improves the effectiveness of interference reduction in
RNoCs by recognizing and utilizing the regional behaviors throughout the prioritization
process.
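Putting the pieces together at the router, the per-packet classification reduces to a single
comparison, as the brief C++ sketch below suggests (again with hypothetical field
names; this is illustrative only):

    // A packet is native to the region it is traversing iff its application
    // number matches the number tagged on the router (hypothetical fields).
    struct Packet      { unsigned appId; /* ... header, payload ... */ };
    struct RouterState { unsigned appId; bool dpaNativeHighPrev; /* ... */ };

    inline bool isNative(const Packet& p, const RouterState& r) {
        return p.appId == r.appId;
    }
    // VA_out, SA_in and SA_out in the current cycle use dpaNativeHighPrev,
    // the DPA priority computed in the previous cycle.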
7.4 Evaluation
7.4.1 Simulation Methodology
A cycle-accurate interconnection network simulator, GARNET [6], is used to model
the microarchitecture and router pipeline discussed in Section 7.3. A 64-node mesh net-
work configured with 2, 4 and 6 regions is evaluated. To provide fair comparison, all
schemes under evaluation are augmented with adaptive routing algorithms based on Dua-
to’s theory [33].
Both synthetic traffic patterns and application traces from multi-threaded PARSEC
benchmarks [13] are used. For synthetic traffic patterns, the simulator is warmed up for
10K cycles and then the average performance is measured over another 100K cycles.
Uniform random (UR), transpose (TP), bit complement (BC) and hotspot (HS) traffic [26]
are simulated. Packets are uniformly assigned two lengths: short packets are 16B single-
flit while long packets carrying 64B data plus a head flit have 5 flits. For application trac-
es, traffic is obtained from full-system simulation (SIMICS [83] plus GEMS [84] with
sufficient warmup) configured as shown in Table 7.1. The simulation infrastructure sup-
ports all 13 applications in PARSEC 2.0; of these, we examine results for blackscholes,
swaptions, fluidanimate and raytrace as a representative subset containing both low and
high intensity traffic.
In the following subsections, the various mechanisms and techniques composing
RAIR are first evaluated individually. We then combine them into the complete RAIR
and compare it against other region-oblivious and region-aware techniques.
7.4.2 Impact of Multi-stage Prioritization
The contention among multiple concurrently running applications is the combined
effect of a series of interferences: the inter-region traffic of an application interferes with
the native traffic of several regions while, at the same time, its own region experiences
interference from the global traffic of multiple other applications. In order to study the
separate impact of MSP on contention, we start with a simpler scenario consisting of two applica-
tions. As shown in Figure 7.9, App 0 and App 1 are running on the left half and right half
of the chip, respectively. App 0 is configured with low load uniform traffic (10% of its
saturation load) and a certain percentage p of its traffic is inter-region traffic. App 1 is
configured with high load (90% of its saturation load) but all of its traffic is intra-regional.
In this way, the only contention that can occur is between the inter-region traffic of App 0
and the intra-region traffic of App 1. We sweep the inter-region percentage p from 0% to
100% to assess the impact of MSP under varying degrees of contention.
Table 7.1: Full-system simulator configuration for RAIR.
Cores: 64 Sun UltraSPARC III+, 1GHz
Private I/D L1$: 32KB, 2-way, LRU, 1-cycle latency
Shared L2$/bank: 256KB, 16-way, LRU, 6-cycle latency
Memory latency: 128 cycles
Block size: 64 Bytes
Virtual channels: 4 per protocol class, atomic, 5 flits/VC
Link bandwidth: 128 bits/cycle
Figure 7.8: Impact of multi-stage prioritization (average packet latency vs. inter-region
traffic percentage of App 0, for RO_RR, RAIR_VA and RAIR_VA+SA).
Figure 7.9: Two applications with varying percentage of inter-region traffic.
Figure 7.8 plots the average packet latency (APL) of both applications for different
schemes. RO_RR is a region-oblivious technique based on round-robin. RAIR_VA is the
RAIR technique with the region-aware rules of MSP performed only at the VA stage,
whereas in RAIR_VA+SA, the rules of MSP are enforced at both VA and SA stages. It
can be seen that as p increases, all APLs increase due to two reasons: 1) more traffic of
App 0 becomes inter-regional, thereby increasing the average hop count and the packet
propagation delay of App 0; 2) more contention takes place between App 0 and App 1,
increasing the contention delay of both applications.
Compared with RO_RR, RAIR techniques with MSP can reduce the APL of App 0
significantly while incurring little increase in the APL of App 1. This is because MSP
prioritizes the low intensity inter-region traffic of App 0 over the high intensity intra-
region traffic of App 1, so that the contention that App 0 experiences can be greatly re-
duced while the contention that App 1 experiences is not affected much. This effect be-
comes more evident as p increases. For example, when p is 100%, RAIR_VA+SA
reduces APL by 18.9% for App 0 with less than a 3% increase in APL for App 1. In
addition, RAIR_VA+SA is more effective than RAIR_VA across the range of p,
illustrating the necessity of enforcing prioritization in multiple arbitration steps.
Figure 7.10: Impact of routing algorithm (average packet latency of App 0 and App 1
for RO_RR and RAIR, each with local adaptive routing and with DBAR).
7.4.3 Impact of Routing Algorithm
The above evaluation adopts a typical adaptive routing algorithm that uses the infor-
mation available at the local router (e.g., the number of free VCs). To demonstrate that
RAIR is compatible with other routing algorithms, we evaluate RAIR with an enhanced
adaptive routing algorithm, DBAR [80], that leverages both local and non-local
information to improve load balance.
Figure 7.10 presents the average packet latency of RO_RR and RAIR with local
adaptive routing and with DBAR under the same two-application scenario. As can be
seen, by using DBAR, RAIR_DBAR can reduce the APL of both App 0 and App 1 com-
pared to RAIR_Local because of the better load balance. Note that the APL of App 1 in
RAIR_DBAR is even lower than that of RO_RR_Local, indicating that RAIR can
effectively leverage advanced adaptive routing algorithms to recover the slowdown in the
intra-region traffic of App 1. For example, when p is 100%, RAIR_DBAR avoids any latency degrada-
tion compared to RO_RR_Local and reduces APL by 24.8% and 3.3% for App 0 and
App 1, respectively. Figure 7.10 also plots the APL of using DBAR alone on top of
RO_RR. Compared with RO_RR_DBAR, RAIR_DBAR improves the APL of App 0 by
12.8% with only 1.8% degradation in the APL of App 1. This illustrates that, while an
adaptive routing algorithm can reap additional benefits from better route selection, the
largest performance improvement comes from the contention reduction offered by the
RAIR technique.
7.4.4 Impact of Dynamic Priority Adaptation
To validate the need for DPA and examine its effectiveness in utilizing load hetero-
geneity among regions, we consider two contrasting scenarios. As depicted in Figures
7.11(a) and (b), both scenarios consist of four applications, in which App 0 ~ App 2 have
low loads and App 3 has a high load. In (a), 30% of the traffic of App 0 ~ App 2 is inter-
regional and directed toward App 3, whereas all traffic of App 3 is intra-regional. In (b),
all traffic of App 0 ~ App 2 is intra-regional, whereas 30% of App 3's traffic is inter-
regional and randomly directed toward the other applications.
Figure 7.11: Two contrasting scenarios to evaluate DPA.
Figure 7.12: Impact of dynamic priority adaptation (reduction of average packet latency
for RO_RR_Local, RO_RR_DBAR, RAIR_NativeH, RAIR_ForeignH and RAIR_DPA).
Figures 7.12(a) and (b) show the reduction in average packet latency of each application
for different schemes, corresponding to Figures 7.11(a) and (b). RAIR_NativeH
(alternatively, RAIR_ForeignH) denotes the RAIR technique without DPA that statically
sets native traffic (alternatively, foreign traffic) to higher priority for all regions at all
times. It can be
seen that, for scenario (a), RAIR_ForeignH has lower average packet latency than
RAIR_NativeH. This is because, by prioritizing the low intensity foreign traffic from
App 0 ~ App 2 over high intensity native traffic of App 3 in region 3, the APLs of App 0
~ App 2 can be reduced substantially with little performance degradation of App 3. How-
ever, for scenario (b), RAIR_NativeH is actually better as most benefit comes from prior-
itizing the low-intensity native traffic of App 0 ~ App 2 over high-intensity foreign traffic
from App 3. Thus, neither RAIR_NativeH nor RAIR_ForeignH performs well for both
cases, so dynamic priority adaptation is indispensable. The fifth bar of each series shows
the reduction in APL for RAIR with DPA. Overall, RAIR_DPA reduces APL by 12.8%
and 12.2% for case (a) and (b), respectively. Note that there could be a slight improve-
ment of RAIR_DPA over the better of RAIR_NativeH and RAIR_ForeignH, as
RAIR_DPA also dynamically adjusts the relative priority between native and foreign
traffic among the three low load applications (i.e., App 0 ~ App 2). However, this differ-
ence is small due to the light contention among those applications.
7.4.5 Effects of RAIR on Synthetic RNoC
In this subsection, we evaluate the proposed RAIR technique as a whole and compare
it against other schemes in a generic RNoC environment consisting of six concurrently
running applications with differentiated load rates. As shown in Figure 7.13, App 0, 2, 3
and 4 have low to medium loads (10% to 30% of their corresponding saturation loads),
and App 1 and 5 have high loads (90% of their saturation loads). Each application
generates three types of synthetic traffic: 75% intra-region uniform random traffic, 20%
inter-region global traffic with various traffic patterns (explained shortly), and 5% traffic
to and from the 4 corner nodes to mimic memory controller traffic.
Figure 7.13: Six-application scenario with various global traffic patterns (numbers
indicate the application assigned to each node).
Four interference reduction schemes are compared. RO_RR is a region-oblivious
technique based on round-robin. RO_Rank is an optimized version of STC – a region-
oblivious but application-aware prioritization technique. This optimized STC is assumed
to be able to always find the optimal application rankings during a given interval based
on load intensity. RA_DBAR is a region-aware technique built on DBAR (we choose
DBAR, as LBDR cannot allow any global traffic and Kilo-NoC relies on MECS in addi-
tion to other restrictions, which makes DBAR the least restrictive region-aware technique
available so far). Finally, RA_RAIR is the proposed region-aware interference reduction
technique.
Figure 7.14 shows the reduction of average packet latency compared to RO_RR for
different techniques. The synthetic pattern for the global traffic component is uniform
random for the moment. On average, RA_DBAR reduces average packet latency by 3.4%.
Figure 7.14: Average packet latency comparison of different techniques under uniform
random global traffic.
This is mainly because, although RA_DBAR recognizes the regional layout, it reduces
interference only when packets are in their originating regions. Therefore, in this generic
RNoC setting, it cannot reduce interference effectively for global traffic that traverses
unrestrictedly. In contrast, although being region-oblivious, RO_Rank actually performs
better than RA_DBAR. It makes a trade-off by prioritizing low to medium load applica-
tions over high load applications and reduces average packet latency by 5.8%. However,
it does not distinguish between regional and global traffic across regions and is also subject
to batching to avoid starvation, which limits the maximum achievable latency reduction.
The best performance is achieved by RA_RAIR because of its awareness of regional
behaviors. In RA_RAIR, the foreign traffic of App 1 and App 5 can be prioritized over
the native traffic of other applications when DPA determines that the global traffic has
higher relative criticality for that region. As a result, the improvement of APL for App 1
and App 5 is only 1.3% less than RA_DBAR while the improvement in APL for App 0, 2,
3, 4 is 12.4% beyond that of RA_DBAR. Compared with RO_RR, RA_RAIR reduces
APL by 10.1% when averaged over all applications.
Figure 7.15: Reduction of average packet latency under different global traffic patterns
(UR, BC, TP, HS).
7.4.6 Effects on Different Traffic Patterns
To demonstrate that RAIR removes the restrictions on traffic patterns imposed by
other, more restrictive techniques, Figure 7.15 shows the average reduction in APL for
different synthetic global traffic patterns (only the average value is shown; the relative
trend of individual applications is similar to Figure 7.14), with other configurations the
same as in the previous sub-
section. As can be seen, RA_RAIR achieves an average APL reduction of 13.4% over all
traffic patterns compared to RO_RR, indicating that RAIR does not place any implicit
restrictions on the global traffic and can reduce interference effectively for different traf-
fic patterns.
7.4.7 Effects on Applications
Figure 7.16: PARSEC simulation setup for RAIR (blackscholes, swaptions,
fluidanimate, raytrace).
We next examine an important capability of RAIR: protecting normal applications
from adversarial traffic. As depicted in Figure 7.16, four PARSEC applications are
running concurrently on the many-core chip. Malicious traffic (e.g., an elaborate attack,
or simply an OS bug) is modeled by adding uniform chip-wide global traffic with a load rate
of 0.4 flits/cycle/node. Figure 7.17 shows the slowdown of average packet latency that is
experienced by each application when different techniques are used.
As can be seen, RO_RR performs worst with an average slowdown of 1.92 relative to
the no-adversarial-traffic case. RA_DBAR reduces the slowdown to 1.75 through its
limited region-awareness. For RO_Rank, it is assumed that this STC-based technique can opti-
mally rank applications, so all packets from the adversarial traffic have the lowest priority.
However, as all packets are still subject to batching and RO_Rank does not allow the
global traffic of normal low ranking applications to be prioritized over the regional traffic
of high ranking applications, RO_Rank only reduces the slowdown to 1.47, on average.
Finally, RA_RAIR can identify the adversarial traffic as foreign traffic to every region
and assign it a lower priority than the native traffic through dynamic priority adaptation,
thereby accelerating packets of the native traffic. The average slowdown is reduced to
1.18 for RA_RAIR, which is 38%, 32% and 19% better than RO_RR, RA_DBAR and
RO_Rank, respectively.
Figure 7.17: Average packet latency slowdown on PARSEC workloads
with adversarial traffic.
7.4.8 Discussion
Number of Regional and Global VCs
The impact of the relative quantity between regional and global VCs mainly depends
on the traffic patterns. When there are more regional VCs than global VCs, native traffic
will have a larger chance of getting high priority so that if foreign traffic has lower load,
it cannot be effectively accelerated over native traffic. Similarly, when there are more
global VCs than regional VCs, and foreign traffic has much higher load, native traffic
must wait a long time before successfully acquiring high priority. Therefore, to support
generic traffic patterns and to simplify the implementation in practice, the numbers of
regional and global VCs are configured to be roughly equal in RAIR.
Scalability and Overhead
We examine scalability in two dimensions: number of cores and number of regions.
First, the DPA logic can be implemented with a couple of registers and comparators such
that overhead is small and remains constant per router regardless of the NoC size. Also,
unlike STC, RAIR does not require any central control logic for batching or determining
application ranking. Therefore, the number of cores can be easily scaled up without in-
curring much overhead in RAIR. Second, each router in RAIR maintains only two-flow
status instead of per-region or per-application status, so there is no additional overhead
with increased number of regions. Therefore, the number of regions can be as much as
the number of cores on a chip. From the above two aspects, we conclude that RAIR has
good scalability.
Relation to Quality-of-Service
Interference reduction is an important component of QoS, but QoS typically requires
more than that. As the name implies, QoS provides applications with equal or differenti-
ated service guarantees in addition to interference reduction. For example, it is able to
enforce the pre-determined bandwidth allocation set by the OS, or provide end-to-end
delay guarantees. Therefore, QoS has broader objectives and is also more complicated to
implement than interference reduction alone (e.g., QoS policies typically need to record
and update per-flow information). For these reasons, RAIR is compared here with other
interference reduction techniques rather than with QoS mechanisms (such as [45, 46]). It
is possible, however, to integrate RAIR with prior QoS mechanisms to further improve
service quality, which can be investigated in the future.
7.5 Summary
Many-core systems have enabled multiple applications to run concurrently on a chip.
At the same time, interference among traffic from different applications arises due to the
shared nature of on-chip networks. To reduce interference effectively, traffic characteris-
tics exhibited in the NoCs need to be exploited. In this chapter, a case is presented for
interference reduction in regionalized NoCs, which result from a series of recent optimi-
zations that leverage non-uniformity in many-core chips. The formation of RNoC using
three representative examples is analyzed and four common regional behaviors are identi-
fied. To address the limited region awareness in existing techniques, a new region-aware
interference reduction technique (RAIR) is proposed. RAIR cleverly removes restrictions
on inter-region traffic patterns and, therefore, is applicable to generic RNoCs. More im-
portantly, RAIR can dynamically determine the relative criticality between native and
foreign traffic, and take into account the regional traffic characteristics in multi-stage
prioritization and starvation avoidance, thereby improving the effectiveness of interfer-
ence reduction. Simulation results show considerable improvement under both synthetic
traffic patterns and PARSEC benchmarks, as compared to other interference reduction
techniques.
Chapter 8
Conclusions and Future Research
8.1 Conclusions
This dissertation explores the opportunities, challenges and viable solutions at the ar-
chitecture-level in designing low-power and resource-efficient on-chip networks. We
attack these open problems by reducing the minimal resources required for maintaining
correct NoC operation and by dynamically powering on/off resources as needed via ef-
fective power-gating schemes and resource utilization-enhancing schemes for upholding
performance guarantees.
This research reveals that, while the power-gating technique is promising for on-chip
routers, its effectiveness can be severely limited by the intensified breakeven time (BET)
limitation, cumulative wakeup delay and the connectivity problem. We demonstrate that these difficulties can
be addressed effectively by using a minimal alternative router bypass path to decouple
the dependence between processing nodes and routers. This allows an unrestricted set of
routers to be power-gated “off” while maintaining network connectivity to facilitate NoC
energy-proportionality.
This research is also the first known work that exploits the topological characteristics
of indirect networks in optimizing the use of power-gating techniques. The technique of
dynamic traffic steering based on runtime network load conditions illustrates the potential
benefits of co-optimizing applications and the NoC architecture. The proposed MP3 scheme
reduces router static power by nearly half with less than a 1% performance penalty.
This leads us to conclude that power-gating can be a powerful trade-off technique when
backed up by appropriate architecture support.
In addition, it is not only useful to reduce power in a performance-aware manner but
also important to reduce resource requirements. However, we find that the primary
difficulty in minimizing NoC resource requirements is preventing deadlock. Guaranteeing dead-
lock-freedom becomes particularly challenging when only minimal resources are allowed,
such as one virtual channel (VC) per message class, buffer size being smaller than packet
size, and atomic buffer allocation. This research proposes two schemes to solve these
difficulties based on the approach of providing global network information using only
local primitives. The first scheme uses critical bubbles to convey global network infor-
mation implicitly at each local node and requires only one packet-sized buffer per VC in
any torus network that uses virtual cut-through switching. The second scheme achieves
an even more ambitious goal of requiring only one flit-sized buffer per VC (a packet may
consist of multiple flits). Both schemes are applicable to off-chip and on-chip networks to
ensure freedom from routing-induced and protocol-induced deadlock. Moreover, any
VCT- or wormhole-switched topology that contains a ring(s) can use the proposed theo-
retical support and implementation to avoid deadlock within the ring(s). Therefore, this
leads us to conclude that the research outcomes in this dissertation have important theo-
retical value and practical applications.
Finally, this research presents a case for interference reduction in regionalized NoCs.
Four new regional traffic behaviors are identified as a result of recent optimizations that
leverage non-uniformity in many-core chips. One such behavior is the transformation of a
considerable amount of chip-wide traffic into short-range traffic within physically-close
regions, leaving a small amount of traffic needing to traverse more distant regions. These
properties are not only valuable in reducing NoC traffic interference without imposing
any explicit or implicit traffic restrictions as proposed in this work, but also useful in ex-
ploring other NoC optimizations such as reducing power and improving reliability.
8.2 Future Research
Multi-core and many-core processors will continue to proliferate in the next decade
across the entire computing landscape – from embedded and mobile devices to data
centers and high-performance computing systems. As the backbone of these processors,
on-chip networks will demand ever higher power and resource efficiency as the number
of cores on a chip skyrockets. While this research has investigated several facets of
on-chip networks, the proposed groundwork opens up many opportunities worth
exploring. Below are some promising lines of research that can be pursued based on the
work in this dissertation.
Performance Criticality of NoC Routers
The node-router decoupling work [19] touches on the notion of power- and perfor-
mance-centric routers. That is, for certain topologies and constructions of the chip-wide
bypass paths, some routers may have greater impact on performance than others based on
their location and function in the NoC. Therefore, further research can be conducted to
better understand and exploit the performance criticality of NoC routers to optimize
power-performance and resource efficiency. For example, it is interesting to investigate optimal
router classification schemes with comprehensive consideration of topology, traffic pat-
terns, bypass placement, routing algorithm and heterogeneity. In addition, as the paths
that packets take to reach their destinations may change to circumvent powered-off rout-
ers, in some cases, deadlock, livelock and other detour-induced routing anomalies may
arise. These issues all need to be handled correctly and efficiently.
Proactive Power-gating of On-chip Routers
In the proposed MP3 scheme [23], the opportunity of dynamic traffic steering was in-
vestigated. In fact, this technique can be generalized to a broader approach of proactive
power-gating of routers. Most existing power-gating techniques are applied passively
such that routers are powered “on” when packets arrive and powered “off” when routers
become idle. In other words, these techniques passively determine router power states
based on prevailing network traffic. Thus, power-gating effectiveness is limited by the
inherent length of idle periods exhibited in packet flows within the network. In this pas-
sive approach, whenever the router idle time is below the breakeven time as dependent on
traffic behavior, no power savings can be reaped. Therefore, it is possible to explore more
effective ways of applying power-gating for maximal benefit by proactively shaping net-
work traffic patterns, taking into account traffic activity of processor nodes and adaptivity
of the NoC routing algorithm. The objective is to create useful power-gating opportuni-
ties even for scenarios in which router idle time would otherwise be below the breakeven
time. Essentially, this management of NoC resources allows under-utilized routers to be
power-gated by intentionally steering packets around them, draining in-router pack-
ets/flits through bypass paths, and power-gating the router proactively.
One way of implementing proactive power-gating is to leverage the routing function
in routers, as every packet has to go through the route computation (RC) pipeline stage
and much information about the network has already been maintained in RC logic. The
basic principle is to concentrate traffic to a few routers and links as much as possible
when the load is low and gradually spread the traffic uniformly to incrementally pow-
ered-on routers as the load passes through threshold points until saturation when all rout-
ers are powered on.
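As a rough illustration of this principle (a sketch of the stated idea only, not a design
from this dissertation), the C++ fragment below sizes the active router set from the
measured network load using a simple interpolation rule; the RC logic would then steer
packets toward that active subset. All names and the specific policy are hypothetical.

    #include <algorithm>
    #include <cmath>

    // Choose how many routers to keep powered on, given the measured
    // network load (fraction of saturation) and a minimal active set that
    // preserves connectivity. A real design would quantize this into a few
    // threshold points rather than a continuous function.
    unsigned routersToPowerOn(double load, unsigned totalRouters,
                              unsigned minActive) {
        const double f = std::min(std::max(load, 0.0), 1.0);
        const unsigned n = minActive + static_cast<unsigned>(
            std::ceil(f * (totalRouters - minActive)));
        return std::min(n, totalRouters);
    }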
Processor-aware NoC Design
Another line of research that will further improve NoC efficiency is to augment the
NoC with power-performance monitoring information from the processor node. Contem-
porary processors often exhibit varying degrees of heterogeneity at several levels. For
instance, heterogeneous cores can co-exist on the same chip, such as the big.LITTLE
processor from ARM [43]. Also, within a group of homogeneous cores, the cores can be
configured with various frequencies and operating voltages, thus having different compu-
ting capabilities. Additionally, regardless of the hardware differences, processor cores
may run distinct applications, and even within the same application the core running the
critical thread may behave differently than those running non-critical threads (e.g., busy
vs. idle, or fast vs. slow). Conveying these kinds of important processor runtime status
information to the NoC may substantially increase the NoC efficiency. For example, the
processor P-state and C-state are very good indicators of how fast the instructions are
executed and are consuming data, and private L1 cache miss rates are good indicators of
how often data requests need to access the NoC through routers. These kinds of infor-
mation can be used as helpful hints to predict packet arrivals to routers more accurately
and decide whether for certain idle intervals the router should be power-gated off or not,
thereby avoiding unnecessary power-gating energy and performance overhead. Moreover,
with pertinent information from the processor side and the active/idle history of local and
neighboring routers, it is possible to apply prediction to power-gate only the idle periods
that are longer than the breakeven time plus the wakeup time, i.e., routers are guaranteed
to be back in normal state before the next packet comes, thus incurring negligible per-
formance penalty. Our recent work has proposed a core-state-aware NoC power-gating
scheme based on flattened butterfly that starts to utilize the active/sleep state information
of processing cores to improve power-gating effectiveness [119].
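One hedged way to picture such a hint-driven policy (an illustrative sketch of the general
idea only, not the scheme of [119]) is a predicate that power-gates a router only when
core-side hints predict an idle interval longer than the breakeven plus wakeup time. The
names, fields and miss-rate threshold below are all hypothetical placeholders.

    // Hypothetical core-side hints exposed to the local router.
    struct CoreHints {
        bool coreSleeping;    // e.g., core resides in a deep C-state
        double l1MissRate;    // e.g., L1 misses per kilo-instruction
    };

    // Gate the router only when a long idle period is predicted, so the
    // wakeup latency is unlikely to land on the critical path.
    bool shouldPowerGate(const CoreHints& h, unsigned idleCycles,
                         unsigned breakevenCycles, unsigned wakeupCycles) {
        const bool longIdlePredicted =
            h.coreSleeping || (h.l1MissRate < 1.0);  // placeholder threshold
        return longIdlePredicted &&
               idleCycles > breakevenCycles + wakeupCycles;
    }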
Further Reducing NoC Resource Lower-bound
Given the importance of minimizing required resources in on-chip networks, more re-
search is needed to investigate additional approaches that consider a number of other
aspects. One possible direction is how to apply CBS or WBFC to a broader set of
networks that do not necessarily contain topological ring(s), including irregular topologies
and high-radix networks such as flattened butterfly and folded Clos. Preliminary work
has proposed a way to avoid deadlock using one shared VC for all message classes [114].
It also works for irregular networks with slight detours of packets. However, this scheme
assumes VCT switching. Thus, further research is still needed to reduce the minimal re-
source requirements from the current one-packet-sized buffer per VC to the ideal lowest
bound of one-flit-sized buffer per VC and one shared VC in total. One possible approach
that is being investigated is based on the notion of credit caching, which applies the con-
cept of caching in NoCs by caching remote end-to-end flow control credits in the network
to convey quasi-global information. It allows the network to use one virtual channel layer
and minimal endpoint buffers to avoid all deadlocks regardless of the number of message
classes in the system.
NoC Optimizations for Parallel Executions
The computing capability of many-core processors can easily support multiple appli-
cations to run concurrently. As applications share and compete for resources on a chip, it
becomes increasingly important for computing systems to optimize the parallel execution
of multiple applications, each of which may in turn consist of multiple threads. The pro-
posed RAIR technique [21] demonstrates how NoC resource utilization and system per-
formance can be improved by taking into account the emerging application traffic behav-
iors. This exemplifies a broader research approach that optimizes the NoC design by con-
sidering the characteristics of applications and programming models. Besides RAIR,
some recent works have started to show the promise of this approach, such as
considering the application communication patterns in the NoC mapping process [124,
125] and co-optimizing the NoC with transactional memory [122, 123]. These works highlight the sig-
nificant energy savings and performance improvement potential of co-optimizing system
components. Thus, more research along this line is much needed.
Bibliography
[1] P. Abad, V. Puente, P. Prieto, and J. A. Gregorio, "Rotary router: An efficient
architecture for CMP interconnection networks," in 34th Annual International
Symposium on Computer Architecture (ISCA), pp. 116-125, 2007.
[2] D. Abts and D. Weisser, "Age-based packet arbitration in large-radix k-ary n-
cubes," in ACM/IEEE Conference on Supercomputing, 2007.
[3] D. Abts and C. R. Storm, "The Cray XT4 and Seastar 3-D torus interconnect,"
Encyclopedia of Parallel Computing, 2010.
[4] Adapteva. http://www.adapteva.com/epiphanyiv/, 2013.
[5] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P.
Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P.
Vranas, "Blue Gene/L torus interconnection network," IBM Journal of Research
and Development, vol. 49, pp. 265-276, 2005.
[6] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, "GARNET: A detailed on-chip
network model inside a full-system simulator," in International Symposium on
Performance Analysis of Systems and Software (ISPASS), pp. 33-42, 2009.
[7] W. Aiello, S. N. Bhatt, F. R. K. Chung, A. L. Rosenberg, and R. K. Sitaraman,
"Augmented ring networks," IEEE Transactions on Parallel and Distributed
Systems (TPDS), vol. 12, pp. 598-609, 2001.
[8] K. V. Anjan and T. M. Pinkston, "An efficient, fully adaptive deadlock recovery
scheme: DISHA," in 22nd Annual International Symposium on Computer
Architecture (ISCA), pp. 201-10, 1995.
[9] J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in
20th Annual International Conference on Supercomputing (ICS), pp. 187-198, 2006.
[10] E. Baydal, P. Lopez, and J. Duato, "A simple and efficient mechanism to prevent
saturation in wormhole networks," in 14th International Parallel and Distributed
Processing Symposium (IPDPS), pp. 617-622, 2000.
[11] E. Baydal, P. Lopez, and J. Duato, "A family of mechanisms for congestion control
in wormhole networks," IEEE Transactions on Parallel and Distributed Systems
(TPDS), vol. 16, pp. 772-784, 2005.
[12] D. Becker and W. Dally, "Allocator implementations for network-on-chip routers,"
in Conference on High Performance Computing Networking, Storage and Analysis
(SC), 2009.
[13] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite:
Characterization and architectural implications," in 17th International Conference
on Parallel Architectures and Compilation Techniques (PACT), pp. 72-81, 2008.
[14] C. Bienia and K. Li, "Parsec 2.0: A new benchmark suite for chip-multiprocessors,"
in Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and
Simulation, 2009.
[15] T. D. Burd and R. W. Brodersen, "Design issues for Dynamic Voltage Scaling," in
International Symposium on Low Power Electronics and Design (ISLPED), pp. 9-14,
2000.
[16] C. Carrion, C. Izu, J. A. Gregorio, F. Vallejo, and R. Beivide, "Ghost packets: a
deadlock-free solution for k-ary n-cube networks," in Proceedings of the Sixth
Euromicro Workshop on Parallel and Distributed Processing, pp. 133-9, 21-23 Jan.
1998.
[17] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V.
Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue
Gene/Q interconnection fabric," IEEE Micro, vol. 32, pp. 32-43, 2012.
[18] L. Chen, R. Wang, and T. M. Pinkston, "Critical Bubble Scheme: an efficient
implementation of globally-aware network flow control," in 25th IEEE
International Parallel & Distributed Processing Symposium (IPDPS), pp. 592-603,
2011.
[19] L. Chen and T. M. Pinkston, "NoRD: Node-Router Decoupling for Effective
Power-gating of On-Chip Routers," in 45th IEEE/ACM International Symposium
on Microarchitecture (MICRO), pp. 270-281, 2012.
[20] L. Chen, R. Wang, and T. M. Pinkston, "Efficient implementation of globally-
aware network flow control," Journal of Parallel and Distributed Computing
(JPDC), vol. 72, pp. 1412-1422, 2012.
[21] L. Chen, K. Hwang, and T. M. Pinkston, "RAIR: Interference Reduction in
Regionalized Networks-on-Chip," in 27th IEEE International Parallel &
Distributed Processing Symposium (IPDPS), pp. 153-164, 2013.
[22] L. Chen and T. M. Pinkston, "Worm-bubble flow control," in 19th IEEE
International Symposium on High-Performance Computer Architecture (HPCA), pp.
366-377, 2013.
[23] L. Chen, L. Zhao, R. Wang, and T. M. Pinkston, "MP3: Minimizing Performance
Penalty for Power-gating of Clos Network-on-Chip," in 20th IEEE International
Symposium on High-Performance Computer Architecture (HPCA), 2014.
[24] X. Chen and L.-S. Peh, "Leakage power modeling and optimization in
interconnection networks," in International Symposium on Low Power Electronics
and Design (ISLPED), pp. 90-5, 2003.
[25] C. Clos, "A study of non-blocking switching networks," Bell System Technical
Journal, vol. 32, pp. 406-424, 1953.
[26] W. Dally and B. Towles, Principles and Practices of Interconnection Networks:
Morgan Kaufmann Publishers Inc., 2003.
[27] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection
networks," in Design Automation Conference (DAC), pp. 684-689, 2001.
[28] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, "Application-aware prioritization
mechanisms for on-chip networks," in 42nd Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pp. 280-291, 2009.
[29] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, "Aergia: Exploiting packet
latency slack in on-chip networks," in 37th International Symposium on Computer
Architecture (ISCA), pp. 106-116, 2010.
[30] R. Das, S. Narayanasamy, S. K. Satpathy, and R. G. Dreslinski, "Catnap: Energy
Proportional Multiple Network-on-Chip," in 40th International Symposium on
Computer Architecture (ISCA), 2013.
[31] M. Demler, "Processors Fill Up on IP for 64-bit Era," Microprocessor Report, 2013.
[32] R. H. Dennard, F. H. Gaensslen, Y. Hwa-Nien, V. L. Rideout, E. Bassous, and A. R.
LeBlanc, "Design of ion-implanted MOSFET's with very small physical
dimensions," vol. sc-9, pp. 256-68, 1974.
[33] J. Duato, "A new theory of deadlock-free adaptive routing in wormhole networks,"
IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 4, pp. 1320-
31, 1993.
[34] J. Duato, "A necessary and sufficient condition for deadlock-free adaptive routing
in wormhole networks," IEEE Transactions on Parallel and Distributed Systems
(TPDS), vol. 6, pp. 1055-67, 1995.
[35] J. Duato, "A necessary and sufficient condition for deadlock-free routing in cut-
through and store-and-forward networks," IEEE Transactions on Parallel and
Distributed Systems (TPDS), vol. 7, pp. 841-54, 1996.
[36] J. Duato and T. M. Pinkston, "A general theory for deadlock-free adaptive routing
using a mixed set of resources," IEEE Transactions on Parallel and Distributed
Systems (TPDS), vol. 12, pp. 1219-1235, 2001.
[37] C. Fallin, C. Craik, and O. Mutlu, "CHIPPER: A low-complexity bufferless
deflection router," in 17th International Symposium on High Performance
Computer Architecture (HPCA), pp. 144-55, 2011.
[38] J. Flich, S. Rodrigo, and J. Duato, "An efficient implementation of distributed
routing algorithms for NoCs," in 2nd IEEE International Symposium on Networks-
on-Chip (NOCS), pp. 87-96, 2008.
[39] R. W. Floyd, "Algorithm 97: shortest path," Communications of the ACM, vol. 5, p.
345, 1962.
[40] K. Goossens, J. Dielissen, and A. Radulescu, "AEthereal network on chip:
Concepts, architectures, and implementations," IEEE Design and Test of
Computers, vol. 22, pp. 414-421, 2005.
[41] P. Gratz, K. Changkyu, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W.
Keckler, and D. Burger, "On-chip interconnection networks of the TRIPS chip,"
IEEE Micro, vol. 27, pp. 41-50, 2007.
[42] P. Gratz, B. Grot, and S. W. Keckler, "Regional congestion awareness for load
balance in networks-on-chip," in 14th International Symposium on High
Performance Computer Architecture (HPCA), pp. 203-214, 2008.
[43] P. Greenhalgh, "Big.LITTLE processing with ARM Cortex.-A15 & Cortex-A7,"
ARM Whitepaper, 2011.
[44] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Express cube topologies for on-
chip interconnects," in 15th International Symposium on High Performance
Computer Architecture (HPCA), pp. 163-74, 2009.
[45] B. Grot, S. W. Keckler, and O. Mutlu, "Preemptive virtual clock: A flexible,
efficient, and cost-effective QOS scheme for networks-on-chip," in 42nd Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 268-279,
2009.
[46] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Kilo-NOC: a heterogeneous
network-on-chip architecture for scalability and service guarantees," in 38th annual
international symposium on Computer architecture (ISCA), pp. 401-412, 2011.
[47] G.-Y. Wei, J. Kim, D. Liu, S. Sidiropoulos, and M. A. Horowitz, "A variable-
frequency parallel I/O interface with adaptive power-supply regulation," in IEEE
International Solid-State Circuits Conference (ISSCC), 2000.
[48] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz mesh
interconnect for a Teraflops processor," IEEE Micro, vol. 27, pp. 51-61, 2007.
[49] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M.
Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V.
K. De, and R. Van Der Wijngaart, "A 48-Core IA-32 Processor in 45 nm CMOS
Using On-Die Message-Passing and DVFS for Performance and Power Scaling,"
IEEE Journal of Solid-State Circuits, vol. 46, pp. 173-83, 2011.
[50] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H.
Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P.
Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M.
Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der
Wijngaart, and T. Mattson, "A 48-core IA-32 message-passing processor with
DVFS in 45nm CMOS," in 2010 IEEE International Solid-State Circuits
Conference (ISSCC), pp. 108-109, Feb. 2010.
[51] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose,
"Microarchitectural techniques for power gating of execution units," in
International Symposium on Lower Power Electronics and Design (ISLPED), pp.
32-37, 2004.
[52] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, "A NUCA
substrate for flexible CMP cache sharing," IEEE Transactions on Parallel and
Distributed Systems (TPDS), vol. 18, pp. 1028-1040, 2007.
[53] C. Izu, C. Carrion, J. A. Gregorio, and R. Beivide, "Restricted injection flow
control for k-ary n-cube networks," in 10th International Conference on Parallel
and Distributed Computing Systems, pp. 511-518, 1997.
[54] C. Izu, "Restrictive Turning: Deadlock freedom and Congestion Control for
Oblivious Cut-through Networks," in 4th Australasian Computer Architecture
Conference (ACAC) Auckland, pp. 251-262, 1999.
[55] S. A. R. Jafri, Y.-J. Hong, M. Thottethodi, and T. N. Vijaykumar, "Adaptive flow
control for robust performance and energy," in 43rd IEEE/ACM International
Symposium on Microarchitecture (MICRO), pp. 433-444, 2010.
210
[56] N. E. Jerger and L.-S. Peh, "On-Chip networks," Synthesis Lectures on Computer
Architecture, vol. 8, pp. 1-137, 2009.
[57] H. Jiang, M. Marek-Sadowska, and S. R. Nassif, "Benefits and costs of power-
gating technique," in International Conference on Computer Design (ICCD), pp.
559-566, 2005.
[58] C. Jichuan and G. S. Sohi, "Cooperative caching for chip multiprocessors," in 33rd
International Symposium on Computer Architecture (ISCA), pp. 264-275, 2006.
[59] O. Jin and X. Yuan, "LOFT: A High Performance Network-on-chip Providing
Quality-of-service Support," in 43rd Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pp. 409-20, 2010.
[60] Y. Jin, R. Wang, W. Choi, and T. M. Pinkston, "Thread criticality support in on-
chip networks," in 3rd International Workshop on Network on Chip Architectures
(NoCArc), pp. 5-10, 2010.
[61] Y. Jin, E. J. Kim, and T. M. Pinkston, "Communication-aware globally-coordinated
on-chip networks," IEEE Transactions on Parallel and Distributed Systems (TPDS),
vol. 23, pp. 242-254, 2012.
[62] H. Jingcao and R. Marculescu, "Application-specific buffer space allocation for
networks-on-chip router design," in International Conference on Computer Aided
Design (ICCAD), pp. 354-61, 2004.
[63] A. Joshi, C. Batten, Y.-J. Kwon, S. Beamer, I. Shamim, K. Asanovic, and V.
Stojanovic, "Silicon-photonic clos networks for global on-chip communication," in
3rd ACM/IEEE International Symposium on Networks-on-Chip (NOCS), pp. 124-
133, 2009.
[64] A. Joshi, B. Kim, and V. Stojanovic, "Designing energy-efficient low-diameter on-
chip networks with equalized interconnects," in 17th IEEE Symposium on High
Performance Interconnects (HOTI), pp. 3-7, 2009.
[65] A. Kahng, L. Bin, L.-S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC
power and area model for early-stage design space exploration," in Design,
Automation and Test in Europe Conference and Exhibition (DATE), pp. 423-428,
2009.
[66] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor sizing issues and tool for
multi-threshold CMOS technology," in 34th Design Automation Conference (DAC),
pp. 409-14, 1997.
211
[67] Y.-H. Kao, N. Alfaraj, M. Yang, and H. J. Chao, "Design of high-radix Clos
Network-on-Chip," in 4th ACM/IEEE International Symposium on Networks on
Chip (NOCS), pp. 181-188, 2010.
[68] S. Kaxiras and M. Martonosi, Computer architecture techniques for power-
efficiency: Morgan and Claypool Publishers, 2008.
[69] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure
for wire-delay dominated on-chip caches," in 10th International Conference on
Architectural Support for Programming Languages and Operating Systems
(ASPLOS), pp. 211-222, 2002.
[70] E. J. Kim, K. H. Yum, G. M. Link, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, M.
Yousif, and C. R. Das, "Energy Optimization Techniques in Cluster Interconnects,"
in Proceedings of the International Symposium on Low Power Electronics and
Design (ISLPED), pp. 459-464, 2003.
[71] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, and C. R. Das, "A low latency
router supporting adaptivity for on-chip interconnects," in 42nd Design Automation
Conference (DAC), pp. 559-564, 2005.
[72] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. S. Yousif, and C. R. Das, "A
gracefully degrading and energy-efficient modular router architecture for on-chip
networks," in 33rd International Symposium on Computer Architecture (ISCA), pp.
4-15, 2006.
[73] J. Kim, J. Balfour, and W. J. Dally, "Flattened butterfly topology for on-chip
networks," in 40th IEEE/ACM International Symposium on Microarchitecture
(MICRO), pp. 172-182, 2007.
[74] M. Koibuchi, H. Matsutani, H. Amano, and T. M. Pinkston, "A lightweight fault-
tolerant mechanism for network-on-chip," in 2nd ACM/IEEE International
Symposium on Networks-on-Chip (NOCS), pp. 13-22, 2008.
[75] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards
the ideal interconnection fabric," in 34th Annual International Symposium on
Computer Architecture (ISCA), pp. 150-161, 2007.
[76] J. W. Lee, M. C. Ng, and K. Asanovic, "Globally-synchronized frames for
guaranteed quality-of-service in on-chip networks," in 35th International
Symposium on Computer Architecture (ISCA), pp. 89-100, 2008.
[77] K. Lee, S.-J. Lee, and H.-J. Yoo, "SILENT: Serialized low energy transmission
coding for on-chip interconnection networks," in International Conference on
Computer-Aided Design (ICCAD), pp. 448-451, 2004.
212
[78] S. Li, P. Li-Shiuan, and N. K. Jha, "Dynamic voltage scaling with links for power
optimization of interconnection networks," in 9th International Symposium on
High-Performance Computer-Architecture (HPCA), pp. 91-102, 2003.
[79] A. Lungu, P. Bose, A. Buyuktosunoglu, and D. J. Sorin, "Dynamic power gating
with quality guarantees," in International Symposium on Low Power Electronics
and Design (ISLPED), pp. 377-382, 2009.
[80] S. Ma, N. E. Jerger, and Z. Wang, "DBAR: an efficient routing algorithm to support
multiple concurrent applications in networks-on-chip," in 38th annual international
symposium on Computer architecture (ISCA), pp. 413-424, 2011.
[81] S. Ma, N. E. Jerger, and Z. Wang, "Whole packet forwarding: Efficient design of
fully adaptive routing algorithms for networks-on-chip," in 18th IEEE International
Symposium on High Performance Computer Architecture (HPCA), pp. 467-478,
2012.
[82] N. Madan, A. Buyuktosunoglu, P. Bose, and M. Annavaram, "A case for guarded
power gating for multi-core processors," in 17th International Symposium on High-
Performance Computer Architecture (HPCA), pp. 291-300, 2011.
[83] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J.
Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system
simulation platform," IEEE Computer, vol. 35, pp. 50-58, 2002.
[84] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R.
Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general
execution-driven multiprocessor simulator toolset," ACM SIGARCH Computer
Architecture News, vol. 33, pp. 92-99, 2005.
[85] J. F. Martinez, J. Torrellas, and J. Duato, "Improving the performance of bristled
CC-NUMA systems using virtual channels and adaptivity," Proceedings of the
International Conference on Supercomputing (ICS), pp. 202-209, 1999.
[86] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation,"
in 34th Annual International Symposium on Computer Architecture (ISCA), pp. 46-
56, 2007.
[87] H. Matsutani, M. Koibuchi, W. Daihan, and H. Amano, "Run-time power gating of
on-chip routers using look-ahead routing," in 13th Asia and South Pacific Design
Automation Conference (ASP-DAC), pp. 55-60, 2008.
[88] H. Matsutani, M. Koibuchi, D. Wang, and H. Amano, "Adding slow-silent virtual
channels for low-power on-chip networks," in 2nd ACM/IEEE International
Symposium on Networks-on-Chip (NOCS), pp. 23-32, 2008.
213
[89] H. Matsutani, M. Koibuchi, D. Ikebuchi, K. Usami, H. Nakamura, and H. Amano,
"Ultra fine-grained run-time power gating of on-chip routers for CMPs," in 4th
ACM/IEEE International Symposium on Networks on Chip (NOCS), pp. 61-68,
2010.
[90] H. McGhan, "Niagara 2 opens the floodgates," Microprocessor Report, vol. 20, pp.
1-12, 2006.
[91] J. Miguel-Alonso, C. Izu, and J. A. Gregorio, "Improving the performance of large
interconnection networks using congestion-control mechanisms," Performance
Evaluation, vol. 65, pp. 203-211, 2008.
[92] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, "Guaranteed bandwidth using
looped containers in temporally disjoint networks within the nostrum network on
chip," in Design, Automation and Test in Europe Conference and Exhibition
(DATE), pp. 890-5, 2004.
[93] T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks,"
in 36th Annual International Symposium on Computer Architecture (ISCA), pp.
196-207, 2009.
[94] S. Murali and G. De Micheli, "Bandwidth-constrained mapping of cores onto NoC
architectures," in Design, Automation and Test in Europe Conference and
Exhibition (DATE), pp. 896-901, 2004.
[95] C. A. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. R. Das,
"ViChaR: a dynamic virtual channel regulator for network-on-chip routers," in 39th
International Symposium on Microarchitecture (MICRO), p. 12 pp., 2006.
[96] D. Pamunuwa, J. Öberg, L.-R. Zheng, M. Millberg, A. Jantsch, and H. Tenhunen,
"Layout, Performance and Power Trade-Offs in Mesh-Based Network-on-Chip
Architectures," in VLSI-SOC, p. 362, 2003.
[97] L.-S. Peh and W. J. Dally, "Flit-reservation flow control," in 6th International
Symposium on High-Performance Computer Architecture (HPCA), pp. 73-84, 2000.
[98] L. S. Peh and W. J. Dally, "A delay model and speculative architecture for
pipelined routers," in 7th International Symposium on High Performance Computer
Architecture (HPCA), pp. 255-66, 2001.
[99] T. M. Pinkston and S. Warnakulasuriya, "On deadlocks in interconnection
networks," in 24th International Symposium on Computer Architecture (ISCA), pp.
38-49, 1997.
214
[100] V. Puente, C. Izu, R. Beivide, J. A. Gregorio, F. Vallejo, and J. M. Prellezo, "The
adaptive bubble router," Journal of Parallel and Distributed Computing (JPDC),
vol. 61, pp. 1180-208, 2001.
[101] W. Qing, M. Pedram, and W. Xunwei, "Clock-gating and its application to low
power design of sequential circuits," IEEE Transactions on Circuits and Systems I:
Fundamental Theory and Applications, vol. 47, pp. 415-20, 2000.
[102] G. Ravindran and M. Stumm, "Performance comparison of hierarchical ring- and
mesh-connected multiprocessor networks," in 3rd International Symposium on
High-Performance Computer Architecture (HPCA), pp. 58-69, 1997.
[103] A. Samih, W. Ren, A. Krishna, C. Maciocco, C. Tai, and Y. Solihin, "Energy-
efficient interconnect via router parking," in IEEE 19th International Symposium on
High Performance Computer Architecture (HPCA), 23-27 Feb. 2013, pp. 508-19,
2013.
[104] S. Scott, D. Abts, J. Kim, and W. J. Dally, "The BlackWidow high-radix clos
network," in 33rd International Symposium on Computer Architecture (ISCA), pp.
16-27, 2006.
[105] Y. H. Song and T. M. Pinkston, "A progressive approach to handling message-
dependent deadlock in parallel computer systems," IEEE Transactions on Parallel
and Distributed Systems (TPDS), vol. 14, pp. 259-275, 2003.
[106] Y. H. Song and T. M. Pinkston, "Distributed resolution of network congestion and
potential deadlock using reservation-based scheduling," IEEE Transactions on
Parallel and Distributed Systems (TPDS), vol. 16, pp. 686-701, 2005.
[107] V. Soteriou and P. Li-Shiuan, "Design-space exploration of power-aware on/off
interconnection networks," in 2nd International Conference on Computer Design
(ICCD), pp. 510-17, 2004.
[108] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V.
Stojanovic, "DSENT - A tool connecting emerging photonics with electronics for
opto-electronic networks-on-chip modeling," in 6th IEEE/ACM International
Symposium on Networks-on-Chip (NOCS), pp. 201-210, 2012.
[109] C. Svensson, "Optimum voltage swing on on-chip and off-chip interconnect," IEEE
Journal of Solid-State Circuits, vol. 36, pp. 1108-1112, 2001.
[110] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H.
Hoffman, P. Johnson, L. Jae-Wook, W. Lee, A. Ma, A. Saraf, M. Seneski, N.
Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw
215
microprocessor: a computational fabric for software circuits and general-purpose
programs," IEEE Micro, vol. 22, pp. 25-35, 2002.
[111] M. Thottethodi, A. R. Lebeck, and S. S. Mukherjee, "Exploiting global knowledge
to achieve self-tuned congestion control for k-ary n-cube networks," IEEE
Transactions on Parallel and Distributed Systems (TPDS), vol. 15, pp. 257-272,
2004.
[112] Tilera. http://www.tilera.com/products/processors, 2013.
[113] F. Trivino, J. L. Sanchez, F. J. Alfaro, and J. Flich, "Virtualizing network-on-chip
resources in chip-multiprocessors," Microprocessors and Microsystems, vol. 35, pp.
230-245, 2011.
[114] R. Wang, L. Chen, and T. M. Pinkston, "Bubble coloring: Avoiding routing- and
protocol-induced deadlocks with minimal virtual channel requirement," in 27th
ACM International Conference on Supercomputing (ICS), pp. 193-202, 2013.
[115] Z. Wang, J. Xu, X. Wu, Y. Ye, W. Zhang, M. Nikdast, and X. Wang, "Floorplan
Optimization of Fat-Tree Based Networks-on-Chip for Chip Multiprocessors,"
IEEE Transactions on Computers, vol. PP, pp. 1-1, 2012.
[116] S. Warnakulasuriya and T. M. Pinkston, "Formal model of message blocking and
deadlock resolution in interconnection networks," IEEE Transactions on Parallel
and Distributed Systems (TPDS), vol. 11, pp. 212-229, 2000.
[117] Z. Wenbiao, Z. Yan, and M. Zhigang, "An application specific NoC mapping for
optimized delay," in International Conference on Design and Test of Integrated
Systems in Nanoscale Technology, pp. 184-8, 2006.
[118] F. Worm, P. Ienne, P. Thiran, and G. de micheli, "An adaptive low-power
transmission scheme for on-chip networks," in 15th International Symposium on
System Synthesis (ISSS), pp. 92-100, 2002.
[119] S. Yue, L. Chen, D. Zhu, T. M. Pinkston, and M. Pedram, "Smart Butterfly:
Reducing Static Power Dissipation of Network-on-Chip with Core-State-
Awareness," in IEEE/ACM International Symposium on Low Power Electronics
and Design (ISLPED), 2014.
[120] B. Zafar, J. Draper, and T. M. Pinkston, "Cubic Ring networks: A polymorphic
topology for network-on-chip," in 39th International Conference on Parallel
Processing (ICPP), pp. 443-452, 2010.
216
[121] H. Zhang and J. Rabaey, "Low-swing interconnect interface circuits," in
International Symposium on Low Power Electronics and Design (ISLPED), pp.
161-166, 1998.
[122] L. Zhao, W. Choi, L. Chen, and J. Draper, "In-network traffic regulation for
Transactional Memory," in 19th IEEE International Symposium on High
Performance Computer Architecture (HPCA), pp. 520-531, 2013.
[123] L. Zhao, L. Chen, and J. Draper, "Mitigating the Mismatch between the Coherence
Protocol and Conflict Detection in Hardware Transactional Memory," in 28th IEEE
International Parallel & Distributed Processing Symposium (IPDPS), 2014.
[124] D. Zhu, L. Chen, S. Yue, and M. Pedram, "Application mapping for express
channel-based networks-on-chip," in 17th Design, Automation and Test in Europe
(DATE), 2014.
[125] D. Zhu, L. Chen, S. Yue, T. M. Pinkston, and M. Pedram, "Balancing On-Chip
Network Latency in Multi-Application Mapping for Chip-Multiprocessors," in 28th
IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014.
[126] A. Zia, S. Kannan, G. Rose, and H. J. Chao, "Highly-scalable 3D CLOS NOC for
many-core CMPs," in 8th IEEE International NEWCAS Conference, pp. 229-232,
2010.