PERFORMANCE IMPROVEMENT AND POWER REDUCTION TECHNIQUES
OF ON-CHIP NETWORKS
by
Di Zhu (dizhu@usc.edu)
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2016
Copyright 2016 Di Zhu (dizhu@usc.edu)
To my parents, Xu Ke and Xiaoming Zhu
and my fiancé, Siyu Yue.
Acknowledgments
First and foremost, I am deeply indebted to my Ph.D. advisor, Prof. Massoud Pedram. Prof. Pedram
has taught me, both consciously and unconsciously, what it takes to become a good researcher and
teacher. I enjoy having technical discussions with him, as his quick thinking and novel ideas always
enlighten my research and greatly improve the quality and originality of my projects.
I would like to thank the other members of my defense and qualifying exam committees, Prof. Timothy M.
Pinkston, Prof. Murali Annavaram, Prof. Paul Bogdan, and Prof. Aiichiro Nakano, for their time, interest,
and insightful comments. I have been working in collaboration with Prof. Pinkston's group for a few
years, and have received valuable suggestions from him, not only on my research projects but also on my
career path. Special thanks go to Prof. Nazarian for his excellent teaching of the VLSI design course; my
rewarding teaching assistant experience with him taught me how to communicate with students and how
to become an informative and responsible instructor.
In addition, I would like to thank Prof. Naehyuck Chang's group at KAIST. In the first
few years of my Ph.D. studies, I worked in close collaboration with Prof. Chang and his students Younghyun
Kim, Donghwa Shin, Sangyoung Park, and Jaehyun Park. Their enthusiasm and engagement in the
research projects motivated and encouraged me to move forward in my Ph.D. studies.
Much gratitude is owed to the many excellent researchers at USC with whom I have been fortunate to work
over the years, including Prof. Lizhong Chen, Prof. Yanzhi Wang, Siyu Yue, Qing Xie, Xue Lin, and
Shuang Chen. During my time at USC, I have greatly enjoyed being a member of the large SPORT family.
My labmates are all talented, skilled, and determined students, and I cherish the days working with and
learning from them. We hold group meetings almost every Friday, where we present our ideas, improve
them based on comments from Prof. Pedram and other labmates, and share our latest news about not only
research but also everyday life. The SPORT lab is like a big family to me.
Finally, I would like to express my sincerest appreciation to my family for their unconditional love and
support. To my caring, guiding, and supportive parents, Ke Xu and Xiaoming Zhu, who raised me and
educated me with continuous encouragement to follow my dreams. To my loving, inspiring, and patient
fiancé, Siyu, who is always by my side and helps me through all the pressures and challenges. Without the
understanding, encouragement, and sacrifice of my family, this dissertation would not have been possible.
Contents
Acknowledgments iii
List of Tables vii
List of Figures viii
Abstract x
1 Introduction 1
2 Background 5
2.1 Basic Structures of On-Chip Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 On-Chip Packet Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Application Mapping on NoC-Based Multi-Core Platforms . . . . . . . . . . . . . . . . . 6
2.3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 NoC Power Consumption and Power Saving Techniques . . . . . . . . . . . . . . . . . . 7
2.4.1 Power Consumption of On-Chip Networks . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Power Gating Techniques of On-Chip Networks . . . . . . . . . . . . . . . . . . 8
3 Application Mapping in Express Link-Based NoC Topologies 10
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Difference of Zero-Load Latency between the Mesh and Flattened Butterfly . . . . 10
3.1.2 Contention Latency Awareness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Summary of Observations for Application Mapping on Flattened Butterfly . . . . 14
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Network, Application, and Latency . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Proposed Solution: Turn Reduction Algorithm for Mapping . . . . . . . . . . . . . . . . . 16
3.3.1 Contention Latency Penalty Estimation by Router Load . . . . . . . . . . . . . . 16
3.3.2 Contention-Aware Turn-Reduction Mapping Algorithm . . . . . . . . . . . . . . . 17
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.2 Average Packet Latency Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.3 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Balancing NoC Latency in Multi-Application Mapping 27
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Many-Core Chip Multiprocessor Architecture . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Uneven Packet Latencies in Cache and Memory Controller Traffic . . . . . . . . . 28
4.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Difficulty in Utilizing Existing Mapping Algorithms . . . . . . . . . . . . . . . . 30
4.2.2 Variations in Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Problem Statement: Mapping For On-Chip Latency Balancing . . . . . . . . . . . . . . . 32
4.3.1 Selecting Metrics for Latency Balancing . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.3 NP-Completeness of OBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.1 Subproblem: Single Application Mapping . . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Variation-Aware Heuristic Algorithm for OBM . . . . . . . . . . . . . . . . . . . 37
4.4.3 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.4 Dynamic Application Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.2 Impact of Application-Level Assignment Order . . . . . . . . . . . . . . . . . . . 42
4.5.3 Max-APL Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5.4 Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.5 Performance and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.6 Algorithm Runtime Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6.1 Dynamic Application Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6.2 Impact of Memory Controller Placement . . . . . . . . . . . . . . . . . . . . . . 47
4.6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6.4 Impact on Application Execution Time . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Temperature-Aware Application Mapping 49
5.1 Motivation: Thermal Impacts of NoC Routers . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Problem Statement: Temperature-Aware Application Mapping . . . . . . . . . . . . . . . 50
5.3 Proposed Algorithm: Temperature-Aware Partition-and-Placement (TAPP) . . . . . . . . . 50
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4.1 Temperature and Latency Results . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4.2 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Performance Improvement Through Designing Express Link Topologies 55
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.1.2 Impact of Express Links on Latency . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Problem Statement: Optimal Express Link Topology . . . . . . . . . . . . . . . . . . . . 56
6.3 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3.1 Cross-Section Link Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3.2 Reduction from 2D to 1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.4 System Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.4.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.4.2 PARSEC Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4.3 Synthetic Traffic Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4.4 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7 Smart Butterfly: Proactive Power Gating on Flattened Butterfly 66
7.1 Motivation: Connectivity Analysis of Mesh and Express Link-Based Topologies . . . . . . 66
7.2 Problem Statement: Proactive Power Gating on Flattened Butterfly Topology . . . . . . . 67
7.3 Proposed Solutions: Exact Cost and Merit Value-Based Approaches . . . . . . . . . . . . 67
7.3.1 Exact Cost-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3.2 Merit Value-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8 Trade-Off between Dynamic and Static Power 72
8.1 Motivation: Trade-offs Between Dynamic and Static Power . . . . . . . . . . . . . . . . . 72
8.2 Problem Statement: Minimum Static Power, Minimum Dynamic Power, and Minimum
Overall Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.3 Proposed Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3.1 Minimal Active Router Set to Maintain Connectivity . . . . . . . . . . . . . . . . 74
8.3.2 Minimum Active Router Set to Minimize Hop Count . . . . . . . . . . . . . . . . 75
8.3.3 Active Router Set to Minimize Overall NoC Power . . . . . . . . . . . . . . . . . 77
8.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.4.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.4.2 Comparison of the Power Gating Schemes . . . . . . . . . . . . . . . . . . . . . 80
8.4.3 Trade-off between Static Power and Dynamic Power . . . . . . . . . . . . . . . . 81
8.4.4 Impact on Application Performance and NoC Energy . . . . . . . . . . . . . . . . 82
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9 Conclusions and Future Research 83
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2 Recommendation for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.2.1 Thermally aware application mapping and express link placement . . . . . . . . . 84
9.2.2 Thread migration and multi-threaded cores in application mapping . . . . . . . . . 84
9.2.3 Routing implementation for core state-aware power gating . . . . . . . . . . . . . 84
Reference List 85
List of Tables
3.1 Percentage of Traffic that Needs to Make Turns. . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Average packet latencies (in cycles) of toybox under different pipeline stages. . . . . . . . 25
4.1 Exacerbated imbalance by Global. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Key parameters in the evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Application configurations for the evaluation. . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Dev-APLs for the ten configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 PARSEC benchmark configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 TGFF benchmark configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1 Simulated network configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.1 Key Parameters in Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
List of Figures
1.1 Some many-core chips: (a) MIT RAW chip; (b) Intel Teraflop; (c) SCORPIO. . . . . . . . 1
1.2 Normalized system runtime as a function of NoC latency. . . . . . . . . . . . . . . . . . . 2
1.3 Mesh and flattened butterfly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 NoC static power ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 4x4 NoC-based many-core system structure. . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 A canonical router structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Dynamic and static power consumption in a NoC router. . . . . . . . . . . . . . . . . . . 8
2.4 Router power gating block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 On-chip networks without express channels: (a) and (b); and with express channels: (c). . 11
3.2 Head-flit latencies of packets with source at tile $l_1$. The larger number on a tile stands
for the head-flit latency, and the smaller number stands for the destination tile number.
Darker tiles mean higher head-flit latencies from tile $l_1$. . . . . . . . . . . . . . . . . . 12
3.3 Latency breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Contention latency with increasing injection rates. . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Self-load $L_0$ and forwarding load $L_F$. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6 Packet turns add to forwarding traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.7 The first and second steps of CALM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.8 The first and second steps of CALM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.9 Task communication graph of mpeg and toybox. . . . . . . . . . . . . . . . . . . . . . . . 22
3.10 Mapping results of CALM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.11 Percentage of forwarding traffic associated with high-load routers. . . . . . . . . . . . . . 22
3.12 Average packet latency results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.13 Router dynamic power consumption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.15 Latency results for different traffic loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.14 8x8 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.16 MECS topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.17 Results on MECS topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 A typical 64-core NoC-based CMP architecture. . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Physical memory address breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Uneven packet latencies exhibited in cache and memory controller traffic. . . . . . . . . . 29
4.4 Mapping results of C1 and C9 under Global. . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Average cache access and memory access rates of applications in the PARSEC benchmark
suite. (MPKI = misses per kilo instruction). . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6 Different memory controller configurations on an 8x8 mesh network [3]. Shaded tiles
represent a memory controller co-located with the core/cache structure. . . . . . . . . . . 33
4.7 Comparison of two optimal mapping results. . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.8 Application-level assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.9 Two assignment orders: (a) poor balancing and (b) good balancing. . . . . . . . . . . . . . 38
4.10 Results with different application assignment orders. . . . . . . . . . . . . . . . . . . . . 43
4.11 Results of C1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.12 Max-APLs for the ten configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.13 g-APLs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.14 Dynamic power consumption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.15 Runtime comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.16 Dynamic mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.17 Different memory controller placements. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.18 Speedup of hOBM over Global in C1 and C9. . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 (a) Power densities of different components on a tile in Scorpio. (b) Thermal map of one
tile in Scorpio (the core has three parts due to the rectangular restriction in HotSpot). . . . 49
5.2 First two steps of TAPP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 PARSEC benchmark results (average latency of Random: 28.7 cycles) and TGFF bench-
mark results (average latency of Random: 30.7 cycles). . . . . . . . . . . . . . . . . . . . 53
6.1 (a) Connection matrix. A solid dot means the two links of both sides are connected as
one and a hole means disconnected. (b) The corresponding express link placement. From
top to bottom, the blue, green, and red express links are denoted by the three layers of
corresponding colors in the connection matrix, respectively. . . . . . . . . . . . . . . . . . 60
6.2 Router implementation example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 Average packet latencies as a function of link limit C. . . . . . . . . . . . . . . . . . . . . 65
6.4 Average packet latency results on 8x8 network. . . . . . . . . . . . . . . . . . . . . . . . 65
6.5 Router power consumption comparison on an 8x8 network. . . . . . . . . . . . . . . . . . 65
6.6 Router power consumption comparison on an 8x8 network. . . . . . . . . . . . . . . . . . 65
7.1 Trade-off curves between overall NoC power and average packet latency. . . . . . . . . . 69
7.2 4x4 results: (1-3) Mesh BB, Mesh RPA, Mesh RPC, (4-5) FB RPA, FB RPC, (6-8) FB BB,
FB EC, FB MV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.3 (1-2) Mesh RPA, Mesh RPC, (3-4) FB RPA, FB RPC, (5-6) FB EC, FB MV . . . . . . . . 71
8.1 Mesh Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.2 Application-level assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.3 Two assignment orders: (a) poor balancing and (b) good balancing. . . . . . . . . . . . . . 81
8.4 Full system simulation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.5 x264 result with 16 active cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Abstract
On-chip networks (a.k.a. networks on-chip or NoCs) have become the key communication media for
modern many-core platforms. With tens to hundreds of cores integrated onto current and future many-
core processors, a scalable NoC design with high performance and low power consumption is crucial to
better utilizing the on-chip cores and achieving an efficient computing system. First and foremost,
NoC performance improvement is of paramount importance. The performance of a NoC mainly refers to
its average packet latency, which greatly influences the actual runtime of the attached many-core system.
Therefore, on-chip packet latency is always among the principal criteria of NoC design. Another
NoC design factor that is equally important, if not more so, is the NoC power consumption. Recent studies
as well as real-chip experiments show that NoC components can draw a substantial percentage of the overall
chip power. Furthermore, this power consumption percentage of traditional on-chip networks increases
with the growing core count and technology scaling.
This research aims at providing performance improvement and power consumption reduction solutions
for on-chip networks. First, several methods to reduce average on-chip packet latencies through applica-
tion mapping are proposed, which intelligently assign running threads onto physical tiles to improve NoC
performance, considering various scenarios and constraints of many-core platforms. Second, this research
presents a novel NoC topology design methodology by selectively inserting express links onto mesh-based
networks, aiming at minimizing on-chip latency for general-purpose many-core processors with no power
overhead. These two topics focus on NoC performance improvement. Third, this research provides a
proactive power gating method for on-chip routers. Specifically, a core state-aware power gating method
for express link-based NoCs is first proposed, which utilizes the rich connectivity of express link-based
networks as well as the knowledge of currently sleeping cores to selectively power gate routers, reduc-
ing NoC static power consumption with minimum latency overheads. Finally, a generic analysis of NoC
proactive power gating is conducted for mesh-based NoCs, which identifies the trade-off between dynamic
power consumption and static power consumption. Three efficient heuristic algorithms are then proposed
for mesh-based NoC power gating, targeting the situations of minimum active routers, high-performance
applications, and minimum overall power consumption, respectively.
Chapter 1
Introduction
In recent years, the number of cores on a single die has been increasing, and quite a few commercial
and research chips integrating multiple cores have been fabricated. To name a few, the MIT RAW chip
has a prototype with 16 cores [69], SCORPIO has 36 Freescale e200 cores at 45nm [23], Intel's Teraflops
integrates 80 cores [35], and Tilera's TILE-Gx100 integrates 100 general-purpose cores at 40 nm in
its latest developed system [2]. Figure 1.1 depicts the layouts of several popular chips.
The advent of the many-core era has raised high communication demand among the on-chip cores.
Traditional buses, point-to-point wire connections, and crossbar-based interconnects are unlikely to satisfy this
demand due to their limited scalability. The bus structure has very limited throughput, therefore introducing
large congestion latency even with a small amount of traffic. The area required by point-to-point and
crossbar-based interconnect structures increases quadratically with the number of on-chip cores, and is
therefore not scalable. As a result, on-chip networks (a.k.a., OCNs or NoCs) [17] have been proposed as
a scalable substrate to efficiently deliver data between memory arrays, registers, and arithmetic units.
Several crucial issues of on-chip networks have attracted the majority of NoC research attention. First
and foremost, on-chip networks are among the most latency-sensitive subsystems on chip. One of the
most important goals of communication structures is to provide fast message delivery. The on-chip packet
latency has considerable influence on the overall system performance (i.e., overall application runtime),
as shown in Figure 1.2. While a lot of research efforts have been made to accelerate processing cores,
the performance of on-chip networks should keep pace as well, to avoid becoming the system
performance bottleneck.
Researchers have explored several possible ways to help improve NoC performance. One of them is
to improve the quality of application mapping onto NoC-based multiprocessors. Application mapping,
which assigns running threads of one or more applications onto physical tiles, is an inevitable component
in the design of NoC-based multiprocessor systems.
Figure 1.1: Some many-core chips: (a) MIT RAW chip; (b) Intel Teraflop; (c) SCORPIO.
Figure 1.2: Normalized system runtime as a function of NoC latency.
MPSoC or CMP applications such as video encoders/decoders typically consist of many tasks (threads) that work collaboratively to perform
certain functions. By mapping frequently or heavily communicating tasks to physically close tiles, the
average packet delay can be greatly reduced. However, the appropriate mapping strategy may differ considerably
across situations and constraints. First, this research discusses an application mapping method for
express link-based topologies, in particular the high-performance flattened butterfly topology [42] shown in Figure
1.3(b), which utilizes the rich connectivity provided by express links that directly connect two non-neighboring
tiles. Second, this research demonstrates the importance of on-chip latency balancing when
multiple applications are executing on a single chip, and then elaborates a heuristic algorithm that
balances the average on-chip latencies of the applications with negligible overall NoC performance
degradation. Third, this research considers application mapping under a maximum on-chip temperature
constraint, and presents a mapping algorithm that minimizes on-chip latency while ensuring
no maximum-temperature violation.
Another potential method to improve NoC performance is to optimize the NoC topology. The
most traditional and popular NoC topology, in both research and real chips, is the mesh network. It is
area-efficient, easy to implement in physical design, achieves deadlock-free routing with XY routing, and
provides acceptable scalability in area, power consumption, and latency [18]. While the mesh topology has
traditionally been used for tile-based NoCs, packets in mesh networks must be forwarded hop by hop,
which exposes the router delay (e.g., 3 to 4 cycles) and link delay (e.g., 1 cycle) at every hop to the packet
latency. To mitigate this latency increase, particularly for large networks, recent studies show the
promise of adding express channels on top of concentration to accelerate packet transfer [19][42][29];
one popular example is the above-mentioned flattened butterfly (Figure 1.3(b)). While these
proposals highlight the promise of adding physical express links, they represent only a few specific
example schemes, without considering whether these schemes are optimal or how optimal placements
can be found. If flexibility in placing these express links is allowed, one may achieve higher performance
by optimizing the link placement.
While performance improvement is crucial, another issue in NoC design, the NoC power consumption,
has also attracted considerable attention, as on-chip designs are highly constrained by the tight power
budget. Recent research as well as real chip testing show that on-chip networks consume a substantial
percentage of overall chip power (e.g., 28% in Intel’s Teraflops [35], 36% in MIT’s RAW chip [69],
and 19% in SCORPIO [23]). An increasing number of cores on chip leads to more frequent on-chip
communication and therefore introduces higher dynamic power consumption of on-chip networks.
Another important aspect is the static power consumption of on-chip networks. Advances in
manufacturing technology and the difficulty of continuing to scale down the supply voltage have resulted in a
growing proportion of static power consumption, as shown in Figure 1.4.
(a) 4x4 Mesh (b) 4x4 Flattened Butterfly
Figure 1.3: Mesh and flattened butterfly.
Figure 1.4: NoC static power ratio.
An effective approach to reduce NoC static power consumption is to apply power gating techniques,
which cut off the supply voltage to a router when it is idle to save static power. Most state-of-the-art
NoC power gating schemes are traffic-oriented: routers are put into a sleep state when there is
no traffic that needs to go through them [53, 54, 13]. However, traffic-oriented power gating
strategies use only traffic information and are unable to take full advantage of inactive cores. For example,
even if a core is in a sleep state and thus generates no incoming or outgoing packets, its attached router
cannot stay powered off for long, because the router must be awoken intermittently to
forward the passing packets that are communicated among other active cores. In typical applications, the
length of router idle periods is on the order of tens to hundreds of cycles, short enough to cause frequent
wakeups and the associated energy overhead. On the contrary, this overhead may be entirely avoided if
we selectively turn off some routers that are connected to inactive cores and keep these routers power
gated during the core sleep periods, since cores can be inactive for several milliseconds. With express
link-based NoC topologies, fewer routers need to remain active to keep latency penalties small, which
enables higher power savings.
In this research, we present design methods for high-performance and low-power on-chip net-
works. First, several methods to improve NoC performance through application mapping are proposed,
considering different situations and environments of many-core platforms. Second, a novel topology
design based on express link insertion is presented, aiming at minimizing on-chip latency for general-
purpose many-core processors with no power overhead. Finally, a core state-aware power gating method is
proposed, which utilizes the periods when cores sleep to proactively turn off routers and reduce static power.
On-chip networks have become a hot research topic in recent years. This research distinguishes itself
from other research efforts with the following contributions.
• Identified several key problems and situations in NoC application mapping, namely mapping on
express link-based topologies, latency balancing between multiple applications, and mapping with
awareness of on-chip hotspots, and presented efficient algorithms for these different environments;

• Utilized the flexibility of express link-based topologies to improve NoC performance with bandwidth
as the primary constraint, and proposed an optimized express link design that improves NoC
performance with little area/power overhead;

• Discovered the potential for NoC static power savings with proactive power gating on express link-
based topologies, and presented a power gating algorithm with quadratic time complexity to support
runtime execution;

• Analyzed the trade-off between dynamic and static power consumption in proactive NoC power gating,
and proposed three algorithms to achieve a minimum number of active routers, maximum
performance, and minimum overall NoC power, respectively.
The remainder of the dissertation is organized as follows. Background knowledge about NoCs is
provided in Chapter 2. Chapters 3, 4, and 5 present three aspects of NoC performance improvement
through application mapping, namely efficient application mapping on flattened butterfly networks, latency
balancing in multi-application CMPs, and a thermally aware application mapping algorithm. Chapter 6
elaborates how express links can be optimally placed to improve NoC performance. For NoC power
reduction techniques, Chapter 7 proposes efficient core state-aware power gating algorithms by choosing
active routers on flattened butterfly topologies, and Chapter 8 presents a detailed analysis of the trade-
offs between dynamic and static power followed by three heuristic solutions for proactive power gating.
Finally, Chapter 9 concludes this research and presents promising directions for future work.
Chapter 2
Background
This chapter provides background knowledge about typical on-chip network structures and general
on-chip latency models.
2.1 Basic Structures of On-Chip Networks
A typical NoC-based many-core platform with a 4x4 mesh network is shown in Figure 2.1. Each
processing element (e.g., a core with its cache) is attached to an on-chip router to communicate with other
cores. The 16 routers connect to each other through links to form the 4x4 mesh.
Figure 2.2 depicts the internal architecture of one router, assuming credit-based flow control and a
canonical four-stage pipeline. A flow control digit (a.k.a., flit) is the smallest unit of information delivered
in NoCs, and on-chip packets are serialized into several flits to get transferred on chip. A flit travels
through the four stages of the router pipeline, which consist of route compute (RC), VC allocation (VA), switch
allocation (SA), and switch traversal (ST). Link traversal (LT) takes another cycle.
2.2 On-Chip Packet Latency
The on-chip latency of a packet from router $i$ to router $j$ is comprised of two components [18]:

$$T(i,j) = T_D + T_S = \left( H(i,j)\,t_R + D(i,j)\,t_L \right) + S/b \qquad (2.1)$$

where the first component, $T_D = H(i,j)\,t_R + D(i,j)\,t_L$, is the head latency, representing the time required
for the first flit of a packet to traverse the network. $H(i,j)\,t_R$ captures the overall delay in router
pipelines, where $H(i,j)$ is the number of hops that the packet goes through. Note that a hop can be either a
bidirectional local link that connects two adjacent routers or a bidirectional express link that connects
non-adjacent routers.
Figure 2.1: 4x4 NoC-based many-core system structure.
Figure 2.2: A canonical router structure.
The router delay $t_R$ is the number of cycles that a packet takes to pass through a
router. The overall link delay, $D(i,j)\,t_L$, is proportional to the lengths of the network links traversed: $D(i,j)$ is the total
Manhattan distance, in unit-length links, that a packet traverses from source to destination, and the
unit-length link delay $t_L$ is the time for a flit to travel over one local link (one cycle).
The second component, $T_S$, is the serialization latency, representing the time for the rest of the packet
to complete transmission at the destination after the arrival of the first bit. The serialization latency is
calculated as $S/b$, where $S$ is the packet size in bits and $b$ is the link width (or the flit size).
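As a worked illustration of Eq. (2.1), the sketch below evaluates the zero-load latency of one packet; the function name and the example numbers are ours, not from the dissertation.

```python
def packet_latency(hops, manhattan_dist, packet_bits, link_width_bits,
                   t_router=3, t_link=1):
    """Zero-load packet latency per Eq. (2.1): T = T_D + T_S."""
    head_latency = hops * t_router + manhattan_dist * t_link  # T_D
    serialization = packet_bits // link_width_bits            # T_S = S/b
    return head_latency + serialization

# Example: 4 router traversals, Manhattan distance 3, a 512-bit packet
# on 128-bit links: T = (4*3 + 3*1) + 512/128 = 19 cycles.
print(packet_latency(hops=4, manhattan_dist=3,
                     packet_bits=512, link_width_bits=128))
```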
This section presented the general router structure and NoC latency model. However, the latency
model may differ under different NoC system setups and/or constraints, as shown in the following
chapters.
2.3 Application Mapping on NoC-Based Multi-Core Platforms
In order to fully utilize the capabilities of multiple cores, mapping methodologies are
required to efficiently map complex applications onto them. Applications executed on many-core systems
typically consist of many tasks that work collaboratively to perform certain functions. By mapping
frequently or heavily communicating tasks to physically close tiles, the average packet latency and power
consumption can be greatly reduced.
2.3.1 Related Work
As the number of cores continues to grow with increased non-uniformity in on-chip latencies within
the same chip, the importance of application mapping has been rising rapidly and gaining increasing
attention [62]. Prior art of application mapping mostly targets NoC-based MPSoCs. Murali et al. focus on
overall latency minimization under minimum routing and traffic splitting for SoCs [58]. Hu et al. address
energy consumption in mapping tasks for tile-based MPSoC architectures [37]. Hansson et al. present a
combined mapping and routing approach for MPSoCs [32]. Singh et al. focus on accelerating algorithms
for run-time mapping [65]. Jang et al. propose mapping solutions for various chip layouts [39], and
Shafique et al. consider the situation of mixed-critical tasks in MPSoCs [1]. While the above works are
very effective in achieving their corresponding objectives for MPSoC systems, these algorithms are not
able to distinguish the differences in tile communication latency between mesh and high-radix networks.
There have been a few mapping solutions for NoC-based CMPs. Chen et al. present a set of
comprehensive mechanisms that optimize the mapping of one application onto CMPs [11]. Das et al.
introduce a memory controller traffic-aware application mapping method [20]. In contrast, our work
aims to address the multi-application mapping problem in CMPs to optimize overall NoC performance
and simultaneously balance NoC latencies among applications.
Many techniques have been proposed to provide quality-of-service support for various system compo-
nents including cache, memory, and on-chip networks [27][21][26][47][30][33]. This line of research has
very different objectives from NoC latency balancing and is, therefore, orthogonal and complementary to
our work. In fact, it is possible to integrate the approach developed in this work with previous mechanisms
to further improve the quality of service.
2.4 NoC Power Consumption and Power Saving Techniques
Compared with traditional buses, the relatively complex NoC architecture with routers and links can
draw a substantial percentage of the chip’s power (e.g., 28% in Intel Teraflop [35] and 19% in Scorpio
[23]).
2.4.1 Power Consumption of On-Chip Networks
The power consumption of an on-chip component is comprised of two parts, namely dynamic power
and static power. Figure 2.3 shows the power consumption of a NoC router at 3GHz under different
technologies and injection rates (flits per cycle), obtained by the DSENT NoC power model [68]. It can
be seen that dynamic power increases linearly with the injection rates of routers. This is because the
dynamic power consumption is proportional to the component’s activity, which is the router load in this
case. In contrast, static power consumption is independent of the router activity and strongly relates to
manufacturing technology. As shown in Figure 2.3(b), the average static power percentage of NoC routers
increases considerably from 45.37% in 45nm to 56.14% in 32nm when the injection rate r is 0.1 flits per
cycle. This urges researchers to find effective techniques to reduce not only the dynamic power but also
the static power consumption of on-chip networks.
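As a rough numerical illustration of this dynamic/static split, router power can be modeled as a fixed leakage term plus an activity-proportional term. The coefficients below are invented placeholders, not DSENT outputs.

```python
P_STATIC = 0.05        # watts of leakage (invented placeholder)
P_DYN_PER_FLIT = 0.4   # watts per unit injection rate (invented)

def router_power(injection_rate):
    """Toy router power: a constant static term plus a dynamic term
    proportional to activity (injection rate in flits/cycle)."""
    return P_STATIC + P_DYN_PER_FLIT * injection_rate

for r in (0.05, 0.10, 0.15):
    total = router_power(r)
    print(f"r={r:.2f}: total={total:.3f} W, "
          f"static share={P_STATIC / total:.1%}")
```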
Figure 2.3: Dynamic and static power consumption in a NoC router.
Figure 2.4: Router power gating block.
2.4.2 Power Gating Techniques of On-Chip Networks
One way that can dramatically reduce static power is to apply power gating techniques to each
NoC router. It is implemented by inserting appropriately sized header (or footer) transistor(s) with high
threshold voltage between the power supply and each targeted block (or between the block and GND), as
shown in Figure 2.4. By controlling the sleep signal to turn off the header transistor, the supply voltage
to the router is cut off, thus avoiding static power by removing the leakage currents in both subthreshold
conduction and reverse-biased diodes.
With static power increasing as a result of technology scaling, recent research has applied
power gating techniques to NoC routers to reduce static power. A popular method is to turn the
NoC routers on and off reactively [55, 13, 22, 14], i.e., router states
are determined by incoming traffic, and idle routers that have no forwarded or injected traffic are powered
off. In these traffic-oriented strategies, router idle periods are inherently limited to what the traffic patterns
allow, and the lengths of the idle periods tend to be fragmented by intermittent packet arrivals to only
tens to hundreds of cycles in practice.
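A minimal sketch of such a reactive, traffic-driven policy is shown below; the idle threshold, class name, and per-cycle hook are our own assumptions, not a published design.

```python
class ReactiveGating:
    """Reactively power-gate a router after it has seen no injected or
    forwarded flits for `idle_threshold` consecutive cycles (a sketch)."""

    def __init__(self, idle_threshold=8):
        self.idle_threshold = idle_threshold
        self.idle_cycles = 0
        self.powered_on = True

    def tick(self, flits_this_cycle):
        if flits_this_cycle > 0:
            # Traffic arrived: wake up (a real design would also pay a
            # wakeup delay and energy penalty) and reset the idle counter.
            self.powered_on = True
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_threshold:
                self.powered_on = False  # cut the supply to save leakage
```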
Another type of NoC power gating method takes advantage of under-utilized cores and proactively
puts selected routers to sleep, detouring any traffic that might need to go through these routers [64].
Many-core processors often exhibit low core utilization (typically 15% to 55% [7]), with some cores sent
to deep sleep states periodically, making the NoC power percentage even higher if it is not powered off
proportionally to the core usage. In particular, a considerable percentage of the NoC power is contributed
by the static power component, and this proportion will continue to grow as manufacturing technology
scales further. This core state-aware power gating approach allows idle and under-utilized routers to
be powered off for longer periods of time, thus offering more savings in static power. However, detoured
packets may cause a larger overall hop count and increased router activity, leading to higher dynamic
power that may potentially offset the static power savings. Merely turning off as many routers as possible
does not necessarily lead to the maximum total power savings. Therefore, it is imperative to take into
account the penalty of dynamic power consumption during power gating in order to achieve the goal of
power minimization.
Reactive Power Gating of On-Chip Networks
Researchers have explored multiple ways to conduct reactive power gating for on-chip routers.
Matsutani et al. propose a look-ahead technique for NoC power gating to reduce the runtime latency
penalty [56, 55]. Chen et al. equip a mesh-based NoC architecture with a ring-based bypass channel,
which achieves high power saving with small latency penalty [13]. The same group presents a predictive
method to wake up sleeping routers to minimize latency overhead [14]. Das et al. use multiple narrow
networks to achieve different levels of NoC power gating [22]. These works all perform traffic-oriented power
gating and do not exploit the long idle periods of sleeping cores.
Proactive Power Gating of On-Chip Networks
Proactive power gating utilizes core idleness by using a central control unit to collect traffic information
and make decisions on which routers are power gated. Core status usually changes at the time scale of
milliseconds [5], which gives sufficient time for calculation and exchange of router status and routing
information. Along this line, Samih et al. propose Router Parking on the mesh topology, with an implementation
to support proactive power gating [63]. These works mainly target minimizing static power, with no or
limited consideration of the dynamic power penalties brought by packet detours.
Chapter 3
Application Mapping in Express Link-Based
NoC Topologies
With the emergence of many-core multiprocessor system-on-chips (MPSoCs), on-chip networks are
facing serious challenges in providing fast communication for various tasks and cores. One promising
solution shown in recent studies is to add express channels to the network as shortcuts to bypass inter-
mediate routers, thereby reducing packet latency. However, this approach also greatly changes the packet
delay estimation and the traffic behavior of the network, neither of which has been accounted for in existing
mapping algorithms. In this chapter, we explore the opportunities in optimizing application mapping
for express channel-based on-chip networks [73]. Specifically, we derive a new delay model for this type
of network, identify its unique characteristics, and propose an efficient heuristic mapping algorithm
that increases bypassing opportunities by reducing unnecessary turns that would otherwise impose the
entire router pipeline delay on packets. Simulation results show that the proposed algorithm achieves a
3.1 Motivation
This section highlights the differences in designing application mapping solutions for mesh and flattened
butterfly networks. The differences with regard to zero-load latency and contention latency are
discussed separately, along with the challenges and opportunities that the flattened butterfly presents. Two
observations on how application mapping should be optimized for the on-chip flattened butterfly are
summarized at the end.
3.1.1 Difference of Zero-Load Latency between the Mesh and Flattened Butterfly
The rich link connectivity in flattened butterfly networks, compared to that in the mesh, leads to a different
tile communication latency model. In mesh or CMesh networks, as long as two tiles (e.g., A and B in
Figure 3.1b) have the same Manhattan distance from a source tile (e.g., S in Figure 3.1b), the head-
flit latencies of packets from the source tile are the same; whereas in flattened butterfly networks, a
destination tile requiring fewer turns has a shorter head-flit latency (e.g., 7 cycles from
S to A in Figure 3.1c) than a tile requiring more turns (e.g., 11 cycles from S to B in Figure 3.1c).
Precisely, the head-flit latency of a packet initiated at tile $l_i$ and destined for tile $l_j$ on the mesh topology is
calculated by

$$T^{mesh}_H(i,j) = \left( M(i,j) + 1 \right) t_R + M(i,j)\,t_L \qquad (3.1)$$

where $M(i,j)$ is the Manhattan distance between tiles $l_i$ and $l_j$, and $t_R$ and $t_L$ are the per-hop router
latency and per-unit-length link latency, respectively. The number of routers encountered by a packet is
$M(i,j) + 1$, as the packet needs to travel through both the pipeline of the source router for injection and
the pipeline of the destination router for ejection (e.g., packets go through 3 routers from S to A in Figure
3.1b).
(a) Mesh (b) CMesh (c) Flattened Butterfly
Figure 3.1: On-chip networks without express channels: (a) and (b); and with express channels: (c).
In contrast, the head-flit latency on the flattened butterfly is calculated by

$$T^{FB}_H(i,j) = \left( 2 + d(i,j) \right) t_R + M(i,j)\,t_L. \qquad (3.2)$$

$$d(i,j) = \begin{cases} 0, & i - (i \bmod n) = j - (j \bmod n) \\ 0, & i \equiv j \pmod{n} \\ 1, & \text{otherwise} \end{cases} \qquad (3.3)$$
where $n$ is the number of routers in each row. The turn function, $d(i,j)$, identifies whether
packets sent from tile $l_i$ to tile $l_j$ need to make a turn assuming XY routing, i.e., if tiles $l_i$ and $l_j$ are on
the same row or column, packets travelling between them experience two router traversals ($2t_R$); otherwise they go
through three routers.
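The two latency models above can be written down directly; the sketch below assumes 0-based, row-major tile indices (the text numbers tiles from 1) and the same $t_R = 3$, $t_L = 1$ used in Figure 3.2.

```python
def d_turn(i, j, n):
    """Turn function d(i, j) of Eq. (3.3), with 0-based tile indices:
    0 if i and j share a row or column, 1 if a turn is needed."""
    same_row = (i - i % n) == (j - j % n)
    same_col = (i % n) == (j % n)
    return 0 if (same_row or same_col) else 1

def manhattan(i, j, n):
    return abs(i // n - j // n) + abs(i % n - j % n)

def head_latency_mesh(i, j, n, t_r=3, t_l=1):
    """Eq. (3.1): every hop pays the full router pipeline delay."""
    m = manhattan(i, j, n)
    return (m + 1) * t_r + m * t_l

def head_latency_fb(i, j, n, t_r=3, t_l=1):
    """Eq. (3.2): 2 or 3 router traversals, depending on turns."""
    return (2 + d_turn(i, j, n)) * t_r + manhattan(i, j, n) * t_l
```

With these parameters the functions reproduce the Figure 3.2 values, e.g., 9 cycles from tile $l_1$ to $l_4$ (0-based indices 0 and 3) but 12 cycles from $l_1$ to $l_7$, matching the 33% gap discussed below.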
Figure 3.2 exemplifies the head-flit packet latency from tile $l_1$ to all the other tiles in a CMesh-based
NoC and a flattened butterfly-based NoC, assuming $t_L = 1$ and $t_R = 3$ (the 3-cycle router follows a canonical
pipeline design consisting of virtual channel allocation, switch allocation, and switch traversal, with the
optimization of look-ahead routing to hide route computation). Figure 3.2 highlights why algorithms
proposed for the CMesh-based NoC are less effective when applied directly to the flattened butterfly-based NoC.
In the CMesh latency model, tiles $l_4$, $l_7$, $l_{10}$, and $l_{13}$ are considered to have the same $T_0$ for the
packets from $l_1$, whereas in the latency model of the flattened butterfly, $l_7$ and $l_{10}$ have 33% larger delays
compared to the other two. It can be seen that packet latency can be reduced if the source and destination
routers are on the same row or column of a flattened butterfly network, even when the Manhattan distances
are the same. Therefore, a mapping algorithm that minimizes the number of turns taken by packets
is much needed in order to better utilize the rich link connectivity of the flattened butterfly and reduce the
average head-flit latency.
In addition to the difference in the head-flit component of zero-load latency, the serialization latency
component is also different in a flattened butterfly and is usually higher than that of a mesh. This is because the
total bisection bandwidth of a NoC is limited (e.g., due to chip dimension, manufacturing technology, and
energy constraints [60]), and adding express links results in a smaller per-link width. For example, the
link width of the flattened butterfly network shown in Figure 3.1(c) is one fourth that of the mesh network
in Figure 3.1(a), as the 4x4 flattened butterfly requires four times the number of cross-section links of the
4x4 mesh. This important factor should also be accurately reflected when designing mapping algorithms
for flattened butterfly networks.
(a) CMesh (b) Flattened butterfly
Figure 3.2: Head-flit latencies of packets with source at tile $l_1$. The larger number on a tile stands for the
head-flit latency, and the smaller number stands for the destination tile number. Darker tiles mean
higher head-flit latencies from tile $l_1$.
3.1.2 Contention Latency Awareness
Intuitively, mapping solutions for flattened butterfly networks should result in lower average packet
latency owing to the utilization of express channels to bypass routers. However, this is not necessarily
the case if the mapping algorithm is not designed judiciously. This is because the smaller link width
in the flattened butterfly not only leads to a larger serialization latency $T_S$, but also results in a larger number
of flits and potentially higher contention latency, which may in turn outweigh the reduction in head-flit
latency. Figure 3.3 illustrates this effect by comparing the latency breakdown of two mapping schemes:
SA_cm employs a simulated annealing-based approach to generate a mapping solution for CMesh, and
SA_fb uses a similar simulated annealing algorithm for the flattened butterfly. Although
SA_fb is able to reduce the head-flit latency $T_H$ by saving hops, the increases in serialization
latency $T_S$ and contention latency $T_C$ make the overall packet latency even higher than that of SA_cm.
To address this issue, we first analyze contention latency in more detail and then discuss how application
mapping can be designed to reduce it.
Contention Latency Analysis
The contention latency experienced by a packet is the sum of all delays during which the packet is
blocked, waiting for a busy NoC component to become available. Contention latency $T_C$ consists of
two parts: 1) source contention latency $T_{C,s}$, which occurs when packets are queued in the NI (network
interface) buffer waiting to be injected into the network, and 2) in-network contention latency $T_{C,n}$, which
is caused by congestion when packets actually traverse routers in the network.

Both types of contention latency can be observed when the traffic load is high, but the source
contention latency is much larger due to the use of network flow control mechanisms (e.g., credit-based
flow control), which indirectly restrict the number of packets that can be injected into the network.
Figure 3.3: Latency breakdown.
Figure 3.4: Contention latency with increasing injection rates.
Specifically, when the receiving buffer in the downstream router is full, packets cannot be forwarded
(e.g., no credit is available) and have to be stalled and accumulated in the current router. If the buffer
of the current router becomes full, packets in the upstream router cannot be forwarded to the current
router, which in turn causes packets to be stalled in the upstream router, and so forth. When
this backpressure propagates to the source node, packets stop being injected and are queued up
in the NI buffer, resulting in source contention latency. As the on-chip environment is area-constrained,
buffers in NoC routers are usually limited to a size of around a few flits. Consequently, when the
traffic load is very high, the majority of packets are stored in the NI buffers of source nodes,
and source contention latency becomes the dominant part of contention latency. Figure 3.4 shows
the simulated contention latency for uniform random traffic using the cycle-accurate NoC
simulator GARNET [4] (the evaluation methodology is described in Section 3.4.1). It can be seen that,
as the network injection rate increases, the in-network contention latency $T_{C,n}$ increases slowly while the source
contention latency $T_{C,s}$ rises rapidly. This trend is also observed in all other traffic patterns that were tested.
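The backpressure chain described above can be sketched with a few bounded queues; the buffer depth, chain length, and function below are invented for illustration only.

```python
from collections import deque

BUF_SIZE = 4                          # router buffer depth (assumed)
routers = [deque() for _ in range(4)]  # a linear chain of 4 routers
ni_queue = deque()                     # source NI buffer (unbounded)

def cycle(inject_packet):
    """Advance the chain one cycle under credit-based flow control."""
    if routers[-1]:
        routers[-1].popleft()          # eject at the destination
    # Move flits downstream only where the next buffer has a free slot;
    # full buffers stall upstream routers (backpressure).
    for hop in range(len(routers) - 1, 0, -1):
        if routers[hop - 1] and len(routers[hop]) < BUF_SIZE:
            routers[hop].append(routers[hop - 1].popleft())
    if inject_packet:
        ni_queue.append("flit")
    # When router 0 is full, new flits wait in the NI queue: this
    # waiting time is the source contention latency T_{C,s}.
    if ni_queue and len(routers[0]) < BUF_SIZE:
        routers[0].append(ni_queue.popleft())
```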
Contention Latency Reduction
As contention latency is primarily determined by source contention latency, it is important for
mapping methods to reduce source contention latency effectively. There are two kinds of traffic load at
a router which may affect source contention, namely the self-load $L_0$ and the forwarding load $L_F$, as shown in
Figure 3.5 (express channels are omitted for simplicity, and four cores share one NI and router as the
result of concentration). The self-load $L_0$ refers to the amount of traffic that is initiated by the core cluster
connected to the NI. The forwarding load $L_F$ refers to the amount of traffic forwarded by the router,
which increases source contention latency by competing with and delaying the self-load to be injected.
Application mapping methods can only change the amount of forwarding traffic $L_F$ so as to reduce
average packet latency, because the number of packets generated at a core and, thus, the self-load $L_0$
are fixed for given applications. Section 3.1.1 discussed that application mapping for
flattened butterfly networks should minimize the number of turns to reduce head-flit latency. Interestingly,
reducing turns also helps to reduce the forwarding traffic. This is because, in flattened butterfly
networks, a router only forwards a packet when it is the turning point between the packet's source
and destination routers. For example, in the flattened butterfly with 4x4 routers shown in Figure 3.6, a
packet sent from tile $l_5$ to tile $l_8$ makes no turns, thus adding only to the self-load from Router 5 to
Router 8. In contrast, a packet sent from $l_5$ to $l_{15}$ makes a turn at Router 7, thus adding to the forwarding
load of Router 7.
Figure 3.5: Self-load $L_0$ and forwarding load $L_F$.
Figure 3.6: Packet turns add to forwarding traffic.
Therefore, a mapping method that reduces the percentage of packets making turns not only results in a smaller zero-load latency but also potentially mitigates contention.
In addition to reducing turns, other methods to address $L_F$ are also needed, as it may be impossible to completely avoid forwarding traffic through turns, e.g., when taking a turn is the only path to reach the destination. In such cases, to further minimize the source contention latency $T_{C,s}$, we should minimize the forwarding traffic of the routers with higher self-load. The reason is two-fold. First, the average source contention latency per packet $T_{C,s}$ increases superlinearly with the overall router load. As observed from Figure 3.4, $T_{C,s}$ grows slowly from 2 to 3 cycles at small injection rates, but increases dramatically when the injection rate approaches the network throughput. For the same amount of forwarding traffic, adding the forwarding load to routers with small $L_0$ may yield a much smaller $T_{C,s}$ penalty than assigning it to heavy routers with high self-load. Second, a router with higher self-load $L_0$ generates a larger number of packets, so reducing the $T_{C,s}$ of this router (by assigning less forwarding traffic to it) benefits more packets and yields a higher performance gain on average, compared to assigning less $L_F$ to routers with lower $L_0$.
3.1.3 Summary of Observations for Application Mapping on Flattened Butterfly
Based on the above discussion, we can summarize the following two observations for designing application mapping algorithms for flattened butterfly networks.
1. Application mapping should reduce the amount of traffic that needs to make turns. This reduces
head-flit latency and also helps to mitigate contention latency by reducing the forwarding traffic of
a router.
2. Application mapping should assign less forwarding traffic to the routers with higher self-load. This
can result in a smaller average contention latency.
These observations, based on the characteristics of flattened butterfly networks, provide important insights for searching for and evaluating good mapping solutions. The following sections present how we design a mapping algorithm that utilizes these insights and achieves the main objective of reducing the overall average packet latency.
3.2 Problem Statement
3.2.1 Network, Application, and Latency
Definition 1 - Network Topology:
An $n \times n$ CMesh network has a network size of $N = n^2$ tiles.
The concentration degree $c$ is the number of processing elements (PEs) that can be placed on one tile. Therefore, an $n \times n$ CMesh-based MPSoC with a concentration degree of $c$ can hold at most $n^2 c$ PEs.
Definition 2 - Application:
An application contains a set of tasks $\{t_i\}$, each executed on one PE. Tasks communicate with each other during execution to exchange data, maintain coherency, etc.
A task cluster $tc_i$ is a set of tasks that are grouped together to be placed on one tile of a CMesh network. A concentration degree of $c$ indicates that a task cluster $tc_i$ contains at most $c$ tasks.
Since the partitioning of tasks into task clusters greatly depends on the specific functionalities and restrictions of each task in a particular application, in this section we assume the task clusters $\{tc_i\}$ are given for an application, and focus on the main problem of mapping the task clusters onto the tiles of the NoC.
In order to give a formal definition of average packet delay, we define the communication graph of an
application and the tile delay graph of a given NoC topology as follows.
Definition 3 - (Communication graph). A communication graph $G_C(V_C, E_C)$ is a directed graph in which each vertex $v_i^C$ represents a task cluster $tc_i$ and each edge $e_{ij}^C = (v_i^C, v_j^C)$ denotes the communication from $tc_i$ to $tc_j$. The weight associated with edge $e_{ij}^C$ denotes the communication rate $r_{ij}$, i.e., the average number of packets sent from $tc_i$ to $tc_j$ per unit time.
Definition 4 - (Tile delay graph). A tile delay graph $G_L(V_L, E_L)$ is a complete directed graph in which each vertex $v_i^L$ represents a tile $l_i$. There is an edge $e_{ij}^L = (v_i^L, v_j^L)$ between any two vertices (tiles). The weight associated with edge $e_{ij}^L$ represents the delay $T(i, j)$ for a packet to travel from tile $l_i$ to tile $l_j$.
Definition 5 - (Application mapping). An application mapping solution is a permutation $\pi(i) = j$, with $i, j \in \{1, 2, \ldots, N\}$, such that task cluster $tc_i$ is mapped to tile $l_j$.
Definition 6 - (Average packet latency). The average packet latency of an application is calculated by
$$\bar{T} = \bar{T}_0 + \bar{T}_C = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}\, T_0(\pi(i), \pi(j))}{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}} + \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}\, T_C(\pi(i), \pi(j))}{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}}. \qquad (3.4)$$
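To make Definition 6 concrete, the following minimal Python sketch (an illustration, not code from this work) evaluates Equation (3.4) for a candidate mapping; the rate matrix r, the permutation perm, and the latency callbacks T0 and TC are assumptions supplied by the caller:

def average_packet_latency(r, perm, T0, TC):
    # Rate-weighted mean of zero-load latency T0 plus contention
    # latency TC over all cluster pairs (Eq. 3.4); task cluster i
    # is mapped to tile perm[i].
    N = len(perm)
    total_rate = sum(r[i][j] for i in range(N) for j in range(N))
    weighted = sum(r[i][j] * (T0(perm[i], perm[j]) + TC(perm[i], perm[j]))
                   for i in range(N) for j in range(N))
    return weighted / total_rate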
3.2.2 Problem Formulation
With the above definitions and delay models, we can formulate the application mapping problem as
follows.
Given:
1. an express channel-based network containing $N = n^2$ tiles;
2. the application communication graph $G_C(V_C, E_C)$, with communication rate $r_{ij}$ as the edge weight; and
3. the tile delay graph $G_L(V_L, E_L)$, with delay $T(i, j)$ as the edge weight;
Find: a mapping $\pi(i) = j$, where $i, j \in \{1, 2, \ldots, N\}$, and
Minimize: the average packet latency $\bar{T}$.
The above formulated problem has the form of a Quadratic Assignment Problem (QAP). A general QAP is NP-hard [57]. Enumerating all $(n^2)!$ possible solutions is costly even for a simple 4x4 NoC, not to mention larger networks. However, the special characteristics of the tile delay model of express-channel networks may offer insights for designing effective heuristic algorithms.
3.3 Proposed Solution: Turn Reduction Algorithm for Mapping
In this section, we first elaborate on how we estimate the penalty due to contention latency in the objective function, and then propose an efficient polynomial-time algorithm, Contention-Aware Latency-Minimal mapping (CALM), for application mapping in flattened butterfly networks.
3.3.1 Contention Latency Penalty Estimation by Router Load
The objective function $\bar{T}$ cannot be calculated exactly due to the unpredictable contention latency $\bar{T}_C$. As a result, we cannot directly optimize to minimize the contention latency. To solve this issue, we introduce a penalty value $P$, calculated by Equation (3.5), to indirectly reflect the NoC performance degradation caused by contention. Specifically, as discussed earlier in Section 3.1, for a task cluster $tc_i$ (which is mapped onto $\pi(i)$), $P(i)$ should first be positively related to the forwarding traffic load $L_F$, because the source contention latency $T_{C,s}$ of a router increases as its forwarding traffic load $L_F$ increases. This is reflected in the second factor on the right-hand side of Equation (3.5). Furthermore, $P(i)$ should be positively related to the self-load of $tc_i$, because a mapping solution should incur a larger penalty if forwarding traffic is assigned to routers with higher self-load $L_0$ instead of those with lower $L_0$. This is reflected in the first factor on the right-hand side of Equation (3.5). By taking both factors into consideration, $P(i)$ is positively related to the source contention latency at $\pi(i)$, which is the dominant component of contention latency. Expressed mathematically, the penalty $P(i)$ of each packet initiated from a task cluster $tc_i$ is calculated by
$$P(i) = L_0(\pi(i)) \cdot L_F(\pi(i)), \qquad (3.5)$$
where $L_0(\pi(i)) = \sum_{j=1}^{N} r_{ij}$ is the self-load of router $\pi(i)$, i.e., the sum of all the communication rates initiated at task cluster $tc_i$, which is mapped onto $\pi(i)$, and $L_F(\pi(i))$ is the amount of forwarding traffic of router $\pi(i)$, calculated by
$$L_F(\pi(i)) = \sum_{k \neq i} \sum_{j \neq i} r_{jk}\, \rho_{\pi(i)}(\pi(j), \pi(k)), \qquad (3.6)$$
where $\rho_{\pi(i)}(\pi(j), \pi(k))$ equals one only if router $\pi(j)$ sends packets to $\pi(k)$ through router $\pi(i)$, and zero otherwise.
$L_0(\pi(i))$ is introduced as a weight on $L_F(\pi(i))$ so that routers with high self-load incur a higher penalty for accepting the same amount of forwarding traffic than routers with low self-load. This leverages the superlinearity of $T_{C,s}$ as a function of router load. Note that $L_0(\pi(i))$ is constant for a particular task cluster $tc_i$, whereas $L_F(\pi(i))$ is adjusted by the mapping algorithm to reduce the overall penalty.
With $P(i)$ being the penalty incurred by every packet initiated at task cluster $tc_i$, the following equation calculates the overall penalty for all packets, where $L_0(\pi(i))$ is the number of packets generated at $tc_i$ per unit time and $\sum_{i=1}^{N} L_0(\pi(i)) = \sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}$ is the total amount of traffic:
$$\bar{P} = \frac{\sum_{i=1}^{N} L_0(\pi(i)) \cdot P(i)}{\sum_{i=1}^{N} L_0(\pi(i))}. \qquad (3.7)$$
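As an illustration of Equations (3.5) through (3.7), the sketch below (a simplified Python rendering, not the actual implementation) computes the overall penalty on an n x n flattened butterfly. It assumes minimal XY-style routing, in which a packet whose source and destination differ in both row and column turns exactly once, at the router in the source row and destination column:

import numpy as np

def overall_penalty(r, perm, n):
    # r[j, k]: communication rate from cluster j to cluster k;
    # perm[j]: tile index (0..n*n-1) that cluster j is mapped to.
    N = n * n
    L0 = r.sum(axis=1)                # self-load of each cluster
    LF = np.zeros(N)                  # forwarding load of each tile
    for j in range(N):
        for k in range(N):
            src, dst = perm[j], perm[k]
            if src // n != dst // n and src % n != dst % n:
                turn = (src // n) * n + dst % n   # turning-point router
                LF[turn] += r[j, k]
    P = np.array([L0[i] * LF[perm[i]] for i in range(N)])   # Eq. 3.5
    return (L0 * P).sum() / L0.sum()                        # Eq. 3.7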
3.3.2 Contention-Aware Turn-Reduction Mapping Algorithm
The proposed algorithm CALM is designed around the observations summarized in Section 3.1.3. In addition, since link delay depends linearly on the Manhattan distance between source and destination tiles for the flattened butterfly according to Equation 3.2, it is still beneficial to put task clusters as close to each other as possible, similar to the mapping methods on CMesh networks. The main steps of CALM are described as follows.
Step 1. Partition the $n^2$ task clusters into $n$ sets and place each set on one row of the flattened butterfly network.
The partitioning step is based on the Kernighan-Lin (KL) algorithm, an efficient heuristic for graph partitioning problems [41]. It attempts to partition a graph into two sets of equal size such that the sum of the edge weights between vertices in different sets is minimized (min-cut).
We invoke the KL algorithm in a hierarchical fashion until we obtain $n$ sets, each with $n$ task clusters, as shown in Figure 3.7(a). After each two-way partitioning, we use a heuristic to determine the placement of the two resulting sets. Take the N/8 partitioning stage in Figure 3.7(a) as an example. We call each pair of sets a KL section (i.e., the KL sections are labeled 1 to 4). The order among these four KL sections was decided at the previous stage, and KL has finished the partitioning within the current four KL sections. The orders of the pair of sets within each KL section remain to be determined. Consider KL section 2, which contains the third and fourth sets. Let $m_3^{high}$ denote the total communication rate between the third set and all the sets above KL section 2 (i.e., section 1), and $m_3^{low}$ denote the total communication rate between the third set and all the sets below section 2 (i.e., sections 3 and 4). Similarly, we define $m_4^{high}$ and $m_4^{low}$ for the fourth set. We calculate and compare the differences between the high and low communication rates, i.e., $g_3 = m_3^{high} - m_3^{low}$ and $g_4 = m_4^{high} - m_4^{low}$, and then place the set with the higher $g$ in the third row and the other in the fourth row, so that the heavier communication is placed closer to the outside of the KL section. The orders in the other sections are determined similarly. The complete pseudocode for Step 1 is shown in Algorithm 1.
Figure 3.7: The first and second steps of CALM: (a) Step 1, row placement (hierarchical partitioning over stages h = 1, 2, 3); (b) Step 2, column arrangement (placing the k-th row given the first k-1 rows).
ALGORITHM 1: Step 1. Row partitioning.
Input: $n^2$ task clusters, communication rates $\{r_{kl}\}$
Output: $n$ sets of task clusters, each placed on one row
for $h = 1$ to $\log_2 n$ do
  (the current number of sections is $2^{h-1}$; this iteration produces $2^h$ sets)
  for $i = 1$ to $2^{h-1}$ do
    in the current section $S_i$, call KL to obtain the new $(2i-1)$-th and $(2i)$-th sets;
    $m^{high}_{2i} = \sum r_{kl}$ ($k \in$ the $(2i)$-th set, $l \in S_j$, $j < i$);
    $m^{high}_{2i-1} = \sum r_{kl}$ ($k \in$ the $(2i-1)$-th set, $l \in S_j$, $j < i$);
    $m^{low}_{2i} = \sum r_{kl}$ ($k \in$ the $(2i)$-th set, $l \in S_j$, $j > i$);
    $m^{low}_{2i-1} = \sum r_{kl}$ ($k \in$ the $(2i-1)$-th set, $l \in S_j$, $j > i$);
    $g_{2i} = m^{high}_{2i} - m^{low}_{2i}$;
    $g_{2i-1} = m^{high}_{2i-1} - m^{low}_{2i-1}$;
    if $g_{2i} > g_{2i-1}$ then place the $(2i)$-th set at the $(2i-1)$-th row and the $(2i-1)$-th set at the $(2i)$-th row;
    else place the $(2i)$-th set at the $(2i)$-th row and the $(2i-1)$-th set at the $(2i-1)$-th row;
  end
end
The time complexity of the KL algorithm is $O(N^3)$ since the graph has $N$ vertices. Calculating $m^{high}$ and $m^{low}$ takes $O(n^4) = O(N^2)$ operations. Therefore, the time complexity of Step 1 is $O(N^3)$ according to the master theorem [16].
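As a rough illustration of Step 1, the following Python sketch implements the ordering heuristic; kl_bipartition is a hypothetical helper (e.g., any off-the-shelf Kernighan-Lin implementation) that returns a min-cut bipartition of a section, and r is the communication rate matrix:

def row_partition(clusters, r, kl_bipartition):
    # Hierarchically bipartition the n^2 clusters into n row sets
    # (Algorithm 1); 'sections' holds the ordered sets of one stage.
    n = int(len(clusters) ** 0.5)
    sections = [list(clusters)]
    while len(sections) < n:
        next_stage = []
        for i, sec in enumerate(sections):
            a, b = kl_bipartition(sec, r)            # min-cut split
            above = [c for s in sections[:i] for c in s]
            below = [c for s in sections[i + 1:] for c in s]
            def g(part):                             # m_high - m_low
                hi = sum(r[k][l] for k in part for l in above)
                lo = sum(r[k][l] for k in part for l in below)
                return hi - lo
            # the set communicating more with the sections above goes on top
            next_stage.extend([a, b] if g(a) >= g(b) else [b, a])
        sections = next_stage
    return sections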
Step 2. Distribute the task clusters in each set to the columns of the network, considering both zero-load latency and contention latency.
The first step determines the positions of the rows, whereas the order of the task clusters within each row remains unfixed. In Step 2, as depicted in Figure 3.7(b), we iteratively distribute the task clusters within each row to the columns.
ALGORITHM 2: Step 2. Column partitioning.
Input: $n$ sets of task clusters, each placed on one row of $n$ tiles
Output: a mapping $\pi(i) = j$, with $i, j \in \{1, \ldots, N\}$
randomly assign the task clusters in the first row to the columns;
for $k = 2$ to $n$ (the $k$-th row) do
  calculate the $n \times n$ cost matrix $\{c_{ij}\}$;
  call the Hungarian algorithm with $\{c_{ij}\}$ as input;
  assign the task clusters in the $k$-th row to the columns according to the assignment result;
end
The order of the task clusters in the first row is randomly assigned; the possible performance loss from this random choice can be recovered in Step 3. At the $k$-th iteration, with the task clusters in the first $(k-1)$ rows already placed, the placement of the task clusters of the $k$-th set is determined so as to minimize the average packet delay, considering the communication rates between the current row and the first $(k-1)$ rows, as shown in Figure 3.7(b). The problem at each iteration is an assignment problem: in the cost matrix $\{c_{ij}\}$, $c_{ij}$ denotes the average packet latency contributed by $tc_i$ when placed at the $j$-th column. This cost is calculated by the following equation, which takes into consideration both the zero-load latency and the contention latency:
$$c_{ij} = \sum_{m=1}^{n(k-1)} \left( r_{im}\, T_0^{FB}(i, m) + r_{mi}\, T_0^{FB}(m, i) + \alpha \cdot P(m) \right). \qquad (3.8)$$
The first two terms under the summation calculate the weighted zero-load latency to and from the $n(k-1)$ already-placed task clusters. In the third term, the contention latency penalty $P(m)$ takes into account the forwarding traffic caused by the routers on the first $k$ rows. As the penalty is an estimate that is proportional to the contention latency, a parameter $\alpha$ is introduced so that the penalty value can be added directly to the other terms. $\alpha$ can also be seen as the degree of sensitivity to contention latency, which mostly depends on the topology configuration. In other words, for a given flattened butterfly network size $n$, $\alpha$ can be fixed, and it is not necessary for a designer to fine-tune the value of $\alpha$ for each application. In our simulations, we find that the optimal value of $\alpha$ is approximately linear in the network size $n$, and that the proposed algorithm produces satisfactory mapping results for different applications over a wide range of $\alpha$, e.g., from 100 to 500 on a 4x4 network, assuming $T$ is measured in cycles, $L_0$ in flits/cycle, and $P(m)$ in (flits/cycle)^2. We use $\alpha = 200$ in the evaluation of the 4x4 network and $\alpha = 400$ in the evaluation of the 8x8 network for various types of applications, and the proposed algorithm provides satisfactory results for all of them, as shown in Section 3.4.
With $c_{ij}$ calculated, the above assignment problem is solved optimally by the Hungarian algorithm [44]. The pseudocode for Step 2 is shown in Algorithm 2.
The Hungarian algorithm has a time complexity of $O(n^3)$. Calculating the cost matrix $\{c_{ij}\}$ has a time complexity of $O(n^4)$. Therefore, the time complexity of Step 2 is $O(n^5)$.
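A compact sketch of one iteration of Step 2 is shown below; it builds the cost matrix of Equation (3.8) and uses SciPy's linear_sum_assignment, which solves the assignment problem optimally (matching the Hungarian algorithm's result). The helpers t0_fb (zero-load latency between two tiles) and pen (the aggregate contention-penalty term induced by placing a cluster on a given tile) are assumptions standing in for the terms of Equation (3.8):

import numpy as np
from scipy.optimize import linear_sum_assignment

def place_row(row_clusters, row_tiles, placed, r, t0_fb, alpha, pen):
    # placed maps already-placed clusters to their tiles.
    n = len(row_clusters)
    cost = np.zeros((n, n))
    for a, tc in enumerate(row_clusters):
        for b, tile in enumerate(row_tiles):
            c = sum(r[tc][m] * t0_fb(tile, t) + r[m][tc] * t0_fb(t, tile)
                    for m, t in placed.items())
            cost[a, b] = c + alpha * pen(tile)        # Eq. 3.8
    rows, cols = linear_sum_assignment(cost)           # optimal assignment
    return {row_clusters[i]: row_tiles[j] for i, j in zip(rows, cols)}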
Step 3. Rearrange the columns to minimize the link delay of the communication traffic on horizontal links.
Figure 3.8: The third and fourth steps of CALM: (a) Step 3, column rearrangement; (b) Step 4, sliding-window swapping.
The rows of the task clusters are determined in the first step, and the task clusters are distributed into the columns in the second step. However, the order of these columns can be optimized to further reduce latency. The optimization process is similar to Step 1, except that each column is treated as a node in the input graph of the KL algorithm, as shown in Figure 3.8(a). The time complexity of Step 3 is $O(n^3)$ as there are $n$ columns in total.
Step 4. Contention-aware fine-tuning by swapping certain task cluster assignments.
We restrict swaps to occur within a sliding window and greedily choose the best result for fine-tuning. The mapped tiles are conceptually organized as a list. Each window contains four tiles. All 24 possible permutations (the factorial of 4) of the mappings for these four tiles are explored, and we choose the permutation that leads to the minimum sum of the overall zero-load latency and penalty, calculated by
$$\bar{T}_0 + \alpha \bar{P} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}\, T_0^{FB}(i, j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}} + \alpha\, \frac{\sum_{i=1}^{N} L_0(\pi(i)) \cdot P(i)}{\sum_{i=1}^{N}\sum_{j=1}^{N} r_{ij}}. \qquad (3.9)$$
The window starts with a step size of 1, i.e., four consecutive tiles are picked, and slides from
the beginning to the end of the tile list. Then we increase the step size of the window and perform
window-sliding again, as depicted in Figure 3.8(b).
The number of windows generated is $O(N^2)$. During the processing of each window, there are 24 permutations, each requiring one calculation of the latency-penalty sum, which finishes within $O(N)$ time. The overall swapping thus has a time complexity of $O(N^2) \cdot O(N) = O(N^3)$.
Taking all four steps into account, the total time complexity of the proposed algorithm is $O(N^3)$.
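The sliding-window pass of Step 4 can be sketched as follows (a simplified Python illustration; objective(perm) stands for the latency-penalty sum of Equation (3.9) and is assumed to be provided by the caller, and the range of window strides is illustrative):

from itertools import permutations

def swapped(perm, idx, order):
    # Copy of perm with positions idx re-filled by the values
    # previously held at positions 'order'.
    out = list(perm)
    for pos, src in zip(idx, order):
        out[pos] = perm[src]
    return out

def sliding_window_swap(perm, objective, window=4):
    # Greedy fine-tuning: for each window of 'window' tiles, try all
    # window! reorderings and keep the best (Eq. 3.9, lower is better).
    perm = list(perm)
    for step in range(1, len(perm) // window + 1):
        for start in range(len(perm) - (window - 1) * step):
            idx = [start + k * step for k in range(window)]
            best = min(permutations(idx),
                       key=lambda order: objective(swapped(perm, idx, order)))
            perm = swapped(perm, idx, best)
    return perm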
3.4 Evaluation
3.4.1 Simulation Setup
As a mesh network without concentration has much higher latency than the other structures, to provide a fairer comparison we use CMesh as the baseline. The following application mapping schemes on a CMesh and a flattened butterfly network are compared; their names are formed by the algorithm name in capital letters (e.g., MC) followed by the topology name in lower-case letters (e.g., cm).
1. MC_cm (the baseline): the Monte Carlo method on the CMesh, which picks the mapping with the smallest latency among a large number of randomly generated mapping solutions on the CMesh;
2. SA_cm: the simulated annealing algorithm for the CMesh;
3. SA(cm)_fb: the mapping solution generated by simulated annealing assuming the mesh latency model but applied to the flattened butterfly;
4. CHOI_fb: the mapping solution generated by the genetic algorithm-based technique proposed by Choi [15], applied to the flattened butterfly;
5. SA_fb: the simulated annealing algorithm on the flattened butterfly network;
6. CALM_fb: our proposed Contention-Aware Latency-Minimal mapping approach.
Since Monte Carlo and simulated annealing are algorithms that trade off runtime against solution quality, for a fair comparison the runtimes of both algorithms are configured to be roughly the same as the runtime of our proposed algorithm.
A mixture of real benchmarks and traffic patterns generated by TGFF [24] is chosen to avoid biasing the workload in favor of a particular topology. This includes the traces of four real applications, namely mpeg4, toybox, vopd, and mms [10], as well as four random task graphs generated by TGFF, referred to as tgff_r1, tgff_r2, tgff_sp1, and tgff_sp2. Among them, tgff_r1 and tgff_r2 are random task graphs, while tgff_sp1 and tgff_sp2 are series-parallel graphs formed recursively by joining two sub-graphs in series and in parallel, mimicking the stressed behaviors of multithreaded applications. Figure 3.9 depicts the communication rate graphs of mpeg4 and toybox. Each node denotes a task cluster, and the edge width indicates the relative magnitude of the communication rate. Cores that communicate heavily with each other are surrounded by a dotted ellipse and highlighted with different colors. Collectively, these eight inputs comprise a representative set of MPSoC scenarios. A 64-task configuration with a concentration degree of four is simulated for the majority of the evaluation. In addition, a 256-task configuration is evaluated in the scalability discussion.
We adopt the cycle-accurate NoC simulation module GARNET [4] for detailed timing and contention simulation of the CMesh and flattened butterfly networks. We also integrate the latest DSENT NoC power tool [68] into GARNET, which allows us to obtain runtime network activity statistics and estimate router power consumption more accurately. DSENT is configured with a 45nm bulk CMOS technology. The flit size of the CMesh topology is 256 bits, and each input buffer in a router has a depth of 5 flits. The flattened butterfly is configured to have the same total amount of buffering as the CMesh. The router pipeline latency $t_R$ and unit-length link latency $t_L$ are set to 3 and 1 cycles, respectively. The design of the 3-cycle router follows the pipeline of virtual channel allocation, switch allocation, and switch traversal, with the optimization of look-ahead routing to hide the cycle of routing computation.
Figure 3.9: Task communication graphs of (a) mpeg4 and (b) toybox.
3.4.2 Average Packet Latency Reduction
We first evaluate the effectiveness of CALM in reducing the number of turns taken by all the packets. Table 3.1 compares the percentage of communication traffic that needs to make turns in the flattened butterfly for the different algorithms. Similar to SA_cm, SA(cm)_fb only minimizes Manhattan distances and hence has a high proportion of packets making turns. The proposed CALM achieves an average of 3.4X and 2.5X reduction in this percentage compared to SA(cm)_fb and SA_fb on the flattened butterfly network, respectively. Figure 3.10 depicts the mapping results of CALM for mpeg4 and toybox. The arrows that represent light communication rates are omitted for clarity. It is clearly shown that the cores communicating heavily (colored nodes) are placed on the same row or column to avoid turns. It is worth noting that, while the proposed algorithm reduces the number of turns, most of the heavily communicating tasks (indicated by wider edges) are also mapped close to each other, such as task clusters $tc_2$ and $tc_6$ in Figure 3.10(a) and task clusters $tc_1$ and $tc_7$ in Figure 3.10(b).
Figure 3.10: Mapping results of CALM for (a) mpeg4 and (b) toybox.
Figure 3.11: Percentage of forwarding traffic associated with high-load routers (SA_fb vs. CALM_fb on mpeg4, toybox, tgff_r1, and tgff_sp1).
Table 3.1: Percentage of traffic that needs to make turns (%).
Scheme      mpeg4  toybox  vopd   mms    tgff_r1  tgff_r2  tgff_sp1  tgff_sp2  Average
SA(cm)_fb   38.74  22.37   31.74  20.84  48.43    49.07    42.84     43.05     37.14
SA_fb       25.51  14.91   19.62  11.78  40.73    40.21    36.88     27.42     27.13
CALM_fb     11.72   4.35    4.00   0.19  19.33    19.50    17.85     12.09     11.13
Figure 3.12: Average packet latency results for the eight benchmarks under MC_cm, SA_cm, SA(cm)_fb, SA_fb, and CALM_fb, broken down into $T_H$, $T_S$, $T_{C,n}$, and $T_{C,s}$.
We then show how the proposed algorithm mitigates contention. As mentioned in Sections 3.3 and 5.1, contention latency reduction requires more than reducing the number of turns or the forwarding traffic: the routers with higher load should be assigned less forwarding traffic. Figure 3.11 compares the percentage of forwarding traffic associated with the eight routers that have the highest self-load under SA_fb and CALM_fb. Without considering contention latency, almost half of the forwarding traffic is assigned to high-load routers, whereas CALM_fb reduces this percentage to 16.34% on average.
The closer physical distances, reduced turns, and decreased forwarding traffic at high self-load routers together result in a considerable improvement in packet latency. Figure 3.12 plots the average packet delay results for the eight test cases. Compared to the baseline system, the proposed CALM algorithm reduces the overall average packet latency by 14.12% on average, which is 7.04% more savings than SA_fb. This improvement over SA_fb is two-fold: compared to SA_fb, CALM_fb reduces the head-flit latency by an average of 5.95% and the contention latency by 24.8%. This indicates that CALM_fb not only optimizes the mapping to shrink head-flit latency, but also addresses the contention latency issue. Moreover, the CALM algorithm achieves 11.39% lower latency compared to another state-of-the-art genetic algorithm-based approach, CHOI_fb, further proving its advantages over search-based algorithms.
Figure 3.13: Router dynamic power consumption (W) under MC_cm, SA_cm, SA(cm)_fb, CHOI_fb, SA_fb, and CALM_fb.
In addition, SA(cm)_fb yields mapping results with much higher latency than SA_fb and CALM_fb. This means that a mapping solution generated for a CMesh network is not optimal when applied to flattened butterfly networks, demonstrating the need for a different latency model when designing mapping algorithms for flattened butterfly networks.
3.4.3 Power Consumption
Although the primary objective is to reduce packet delay, the proposed mapping algorithm also slightly reduces power consumption as a side effect, because it reduces the number of routers and links through which packets travel. Figure 3.13 shows the dynamic power consumption of the different mapping algorithms on the eight applications. The proposed CALM achieves 12.9% less power consumption than MC_cm, and also saves 2.1% and 1.9% more than CHOI_fb and SA_fb, respectively. Even though CALM does not target power reduction, it achieves the lowest dynamic power consumption among all the schemes.
3.4.4 Discussion
Impact of Pipeline Stages
So far we have assumed a three-stage router pipeline, which is an optimized version of the canonical 4-stage router. Equations 3.1 and 3.2 show that the number of router pipeline stages affects the zero-load packet latency. To assess this impact, Table 3.2 compares the mapping results of the toybox benchmark for SA_cm, SA_fb, and CALM_fb while varying the number of pipeline stages from 2 to 4. As can be seen, the flattened butterfly in general has a smaller advantage over CMesh when the number of pipeline stages is smaller (and SA_fb is worse than SA_cm for the 2-stage router). This is because a smaller number of router pipeline stages means lower head-flit latency savings for the flattened butterfly, which are overshadowed by its increase in serialization latency. However, the proposed CALM_fb restores the advantages of the flattened butterfly for the 2-stage router, and has the lowest average packet latency across different numbers of pipeline stages. This illustrates that CALM can be useful in a wide range of networks built from more aggressive or more conservative router architectures.
Figure 3.15: Latency results for different traffic loads (tgff_ld1 through tgff_ld4) under SA_fb and CALM_fb, broken down into $T_H$, $T_S$, $T_{C,n}$, and $T_{C,s}$.
Figure 3.14: Latency results on the 8x8 network (tgff_r3 and tgff_sp3) under SA_cm, SA_fb, and CALM_fb, broken down into $T_H$, $T_S$, $T_{C,n}$, and $T_{C,s}$.
Table 3.2: Average packet latencies (in cycles) of toybox under different pipeline stages.
Stages  SA_cm  SA_fb  CALM_fb
2       14.02  14.72  13.27
3       16.69  16.20  15.28
4       18.91  17.69  16.83
Scalability
The previous evaluation uses 64-task configurations with a concentration degree of 4 on a 4x4 NoC. To further illustrate the scalability of the proposed algorithm, we generate two TGFF configurations of 256 tasks with the same concentration degree, namely tgff_r3 and tgff_sp3. As plotted in Figure 3.14, compared with SA_cm and SA_fb, CALM_fb reduces the average packet delay by 9.75% and 17.57%, respectively, under the same runtime. We notice that CALM on the 8x8 network achieves a higher head-flit latency reduction relative to SA_fb (26.54% on average) than CALM_fb on the 4x4 network does (5.28%). This is because the search space of mapping solutions increases exponentially with the network diameter, which makes simulated annealing require significantly longer runtime to achieve satisfactory solutions. This demonstrates that the proposed CALM achieves higher improvement for larger networks, indicating its good scalability.
Impact of Different Traffic Load Levels
As mentioned in Section 3.1.2, the on-chip network may suffer performance degradation due to contention under high traffic load. We use TGFF to generate four traffic load patterns, tgff_ld1, tgff_ld2, tgff_ld3, and tgff_ld4, whose average packet injection rates are around 0.5, 0.8, 1, and 1.3 times the average injection rate of tgff_r1 in Figure 3.12. The resulting comparison of SA_fb and CALM_fb is shown in Figure 3.15. Both the in-network contention latency $T_{C,n}$ and the source contention latency $T_{C,s}$ increase with increasing traffic load on the NoC. At a higher traffic load, $T_{C,s}$ dominates the average packet latency, and the proposed CALM_fb reduces $T_{C,s}$ by 23.71% and the overall average packet latency by 12.95% compared to SA_fb.
Applicability to Express Link-Based Topologies
The proposed algorithm is generally applicable to any mesh-based NoC topology with express channels, and is not limited to the flattened butterfly topology, for the following reasons. First, the proposed algorithm aligns heavily communicating cores on the same rows or columns to increase the express link utility. Second, Steps 2, 3, and 4 also put heavily communicating cores close to each other in Manhattan distance, which further reduces the zero-load latency. Third, the contention latency is greatly reduced by the proposed bypass traffic reduction technique.
We applied the proposed CALM to Multidrop Express Channels (MECS) [29], which utilize a point-to-multipoint communication fabric to achieve bandwidth efficiency while providing high connectivity. The MECS topology is shown in Figure 3.16. The average packet latency results of mpeg4 and tgff_r1 are shown in Figure 3.17.
Figure 3.16: MECS topology.
Figure 3.17: Results on the MECS topology (mpeg4 and tgff_r1), broken down into $T_H$, $T_S$, $T_{C,n}$, and $T_{C,s}$.
3.5 Summary
Express channel-based NoC topologies such as the flattened butterfly have been proposed in recent studies as a promising approach to support fast on-chip communication for current and future many-core MPSoCs. However, the characteristics of these new topologies have not been fully exploited by existing application mapping algorithms. In this chapter, we propose an efficient heuristic to explore the application mapping opportunities for flattened butterfly-based NoCs. The proposed algorithm maps tasks with large communication rates close to each other, aligns heavily communicating tasks to the same rows or columns to reduce unnecessary turns, and effectively mitigates contention by minimizing the forwarding traffic of selected routers. Simulation results show a considerable reduction in the average packet latency of the generated mapping solutions.
Chapter 4
Balancing NoC Latency in Multi-Application Mapping
As the number of cores continues to grow in chip multiprocessors (CMPs), application-to-core
mapping algorithms that leverage the non-uniform on-chip resource access time have been receiving
increasing attention. However, existing mapping methods for reducing overall packet latency cannot meet
the requirement of balanced on-chip latency when multiple applications are present.
In this chapter, we address the looming issue of balancing on-chip packet latency among applications while preserving performance in the multi-application mapping of CMPs [74]. Specifically, this chapter presents a balanced application-to-core mapping algorithm that aims to minimize the maximum on-chip packet latency over all running applications. The chapter starts by formulating the balanced mapping problem for CMPs and proving its NP-completeness. It then presents an efficient heuristic algorithm for solving this problem, which utilizes the characteristics of on-chip cache and memory accesses in CMPs and takes into account the workload variations among applications. Simulation results on the PARSEC benchmark suite show that the proposed algorithm reduces the maximum average packet latency of all applications by 11% while cutting the standard deviation of on-chip packet latencies by 99%. This is achieved with very little overhead in terms of the overall packet latency and power consumption averaged over all packets.
4.1 Motivation
4.1.1 Many-Core Chip Multiprocessor Architecture
Fig. 4.1 shows the typical structure of a 64-tile CMP with a mesh-based NoC. Each tile comprises a processing core, a private L1 cache, and a slice of the shared L2 cache. The shared L2 cache is distributed among all the tiles on the chip. In most commercial CMPs, when a data block is fetched from memory, the L2 cache bank in which to place the block is determined by hashing the lower-order bits of the data address [67, 43, 38]. Routers are interconnected to form a mesh network, and tiles are connected to routers via network interfaces (NIs). A typical placement of memory controllers attaches one memory controller to each of the four corner tiles (shown as shaded tiles in Fig. 4.1) in addition to the regular core/cache structure. Several other placements of on-chip memory controllers are also popular (as discussed in Section 4.2.2).
The basic data access procedure in a NoC-based CMP is as follows. When a processing core has a data request, no network packet is needed if the request hits in its private L1 cache. Otherwise, one of two types of traffic may be generated depending on where the data block is located, namely cache traffic (i.e., the data block is on chip) or memory controller traffic (i.e., the data block needs to be fetched from the off-chip main memory). The cache traffic includes (i) packets initiated at the requesting core destined for an L2 cache bank, (ii) the checking/forwarding packets from the L2 cache bank to other private L1 caches, and (iii) the reply packets from the L2 cache bank to the requesting processor core. In all of these cases, either the source tile or the destination tile is an L2 cache bank.
Figure 4.1: A typical 64-core NoC-based CMP architecture.
Note that no packet is generated if the destination tile is the same as the source tile. For memory controller traffic, a request packet is generated and then forwarded to one of the tiles with memory controllers through the on-chip network (e.g., the four gray tiles in the corners of Fig. 4.1). The packet forwarding typically follows the proximity principle [6], i.e., the packet is sent to the nearest memory controller tile. Data are then fetched from the main memory and returned to the memory controller after a fixed number of cycles.
4.1.2 Uneven Packet Latencies in Cache and Memory Controller Traffic
As more and more cores are integrated on a chip, the non-uniformity of on-chip packet latencies among different tiles continues to increase, not only within each of the above two traffic types but also between the two traffic types. This is the fundamental cause of the imbalanced latencies among concurrently running applications. To enable further study, this subsection presents packet latency models that analyze the phenomenon mathematically.
We first introduce the tile numbering rule used throughout this chapter. The number of a tile $k$ is determined by
$$k = (i_k - 1) \cdot n + j_k, \qquad (4.1)$$
where $i_k$ and $j_k$ are the row number and column number of the tile, respectively, and $n$ is the number of tiles in a row. For example, in Fig. 4.1 (where $n = 8$), the tile located at the fourth row (from the top) and fifth column (from the left) is numbered 29.
We calculate the on-chip latency $T(k, k')$ of a packet generated at the $k$-th tile and heading for the $k'$-th tile on a mesh network as follows, based on [18]:
$$T(k, k') = H(k, k') \cdot (l_r + l_w + l_c) + l_s, \qquad (4.2)$$
where $H(k, k')$ is the number of hops through which the packet travels, and $l_r$, $l_w$, and $l_c$ are the per-hop latencies for the router, wire, and in-network contention, respectively. The serialization latency $l_s$ is determined by the ratio of the packet length to the channel bandwidth, which is fixed for a given packet format and NoC structure. To avoid deadlock, dimension-order routing (e.g., XY routing) is adopted to minimize design effort and implementation cost [18].
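A minimal sketch of Equations (4.1) and (4.2) is given below, assuming minimal XY routing so that the hop count equals the Manhattan distance between tiles; the default per-hop and serialization latencies are placeholders, not the values used in the evaluation:

def hops_xy(k, kp, n):
    # Hop count H(k, k') under XY routing on an n x n mesh,
    # with 1-based tile numbers per Eq. (4.1).
    ri, ci = (k - 1) // n + 1, (k - 1) % n + 1
    rj, cj = (kp - 1) // n + 1, (kp - 1) % n + 1
    return abs(ri - rj) + abs(ci - cj)

def on_chip_latency(k, kp, n, lr=3, lw=1, lc=1, ls=1):
    # T(k, k') per Eq. (4.2): per-hop router, wire, and contention
    # latencies plus the serialization latency.
    return hops_xy(k, kp, n) * (lr + lw + lc) + ls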
Figure 4.2: Physical memory address breakdown (tag, cache index, block offset; MSB to LSB).
Figure 4.3: Uneven packet latencies exhibited in cache and memory controller traffic: (a) average L2 cache latency; (b) average memory controller latency.
As mentioned, the hashing for the shared L2 cache banks uses the cache index in a physical address, as shown in Fig. 4.2. Take the 64-core CMP in Fig. 4.1 as an example. The shared L2 cache is separated into 64 slices, one distributed to each tile. If the size of one data block (i.e., one cache line) in the L2 cache is 32B, the lowest 5 bits (Bit 0 to Bit 4) are reserved for the block offset. The next lowest 6 bits, Bit 5 to Bit 10, are the cache index used to hash and decide on which of the 64 tiles the block is located. Hence, any consecutive chunk of 64 cache blocks (2KB in total) is uniformly distributed across all the L2 cache banks. For a typical application running on a CMP, it is therefore reasonable to assume that the destination tile of cache traffic is statistically equally likely to be any tile on the CMP (including the source tile). Therefore, for a CMP with $N = n^2$ tiles, the average number of hops $H_k^C$ over all cache traffic packets generated at the $k$-th tile depends only on the tile's location, and is calculated by
$$H_k^C = \frac{1}{N} \sum_{i=1}^{N} H(k, i). \qquad (4.3)$$
With Equation (4.2), the average cache traffic latency for packets generated at the $k$-th tile, $T_k^C$, is calculated by
$$T_k^C = H_k^C \cdot (l_r + l_w + l_c) + l_s. \qquad (4.4)$$
The value of $H_k^C$ is smaller for tiles in the chip center and larger for tiles in the corners. For example, on the CMP shown in Fig. 4.1, $H_1^C = 7$ for the tile numbered 1 (a corner tile), and $H_{29}^C = 4$ for the tile numbered 29 (a central tile). Consequently, the threads mapped onto central tiles have smaller $T_k^C$ than the threads mapped closer to the chip perimeter, as shown in Fig. 4.3(a), where darker areas indicate tiles with larger cache request packet latencies.
For memory controller traffic, the average number of hops, $H_k^M$, is determined by the memory access behavior as well as the memory controller placement. For the popular four-corner memory controller configuration shown in Fig. 4.1, the chip is divided into four quadrants relative to the center of the chip. All the memory request packets generated by the tiles in one quadrant are sent to the memory controller in that quadrant. Precisely, the average number of hops for a memory controller request packet generated at the $k$-th tile is calculated by
$$H_k^M = \min\{i_k - 1,\; n - i_k\} + \min\{j_k - 1,\; n - j_k\}. \qquad (4.5)$$
The average memory request latency for packets generated at the $k$-th tile, $T_k^M$, is then calculated by $T_k^M = H_k^M \cdot (l_r + l_w + l_c) + l_s$, based on Equation (4.2). As shown in Fig. 4.3(b), with this four-corner memory controller placement, tiles close to the corners have smaller average on-chip latency for memory controller traffic than tiles close to the center. This behavior is the opposite of that of cache traffic, which further complicates the problem of balancing on-chip latency.
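To make the two hop-count profiles concrete, the following sketch (an illustration, not the simulator's code) tabulates $H_k^C$ per Equation (4.3) and $H_k^M$ per Equation (4.5) for every tile of an n x n mesh, using the 1-based tile numbering of Equation (4.1):

def hop_profiles(n):
    N = n * n
    def rc(k):                      # tile number -> (row, col), Eq. (4.1)
        return (k - 1) // n + 1, (k - 1) % n + 1
    HC, HM = {}, {}
    for k in range(1, N + 1):
        ik, jk = rc(k)
        # Eq. (4.3): average Manhattan distance to every tile
        HC[k] = sum(abs(ik - rc(t)[0]) + abs(jk - rc(t)[1])
                    for t in range(1, N + 1)) / N
        # Eq. (4.5): distance to the nearest corner memory controller
        HM[k] = min(ik - 1, n - ik) + min(jk - 1, n - jk)
    return HC, HM

# On an 8x8 mesh this reproduces the text: HC[1] == 7.0, HC[29] == 4.0.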
4.2 Challenges
4.2.1 Difficulty in Utilizing Existing Mapping Algorithms
As mentioned, on-chip latency balancing is an important design requirement in CMPs to guarantee quality of service when multiple users are present, provide uniform on-chip access to the cache and memory system, and eliminate the overhead of hardware support for latency balancing. However, traditional application mapping algorithms that target minimizing the overall packet latency of all threads are potentially counter-optimal in terms of balancing latencies. The primary reason is that, to be most productive towards minimizing the overall packet latency, these algorithms map threads with higher data access rates to tiles with smaller average on-chip latencies, while threads with lower NoC traffic rates are mapped to large-latency tiles. Consequently, the latencies of low traffic-load applications are greatly increased, leading to significant imbalance in the per-application average packet latency, or APL for short.
To quantify the imbalance, we apply a mapping algorithm that minimizes the overall packet latency, referred to as Global, to the PARSEC 2.0 benchmarks [8] (details of the simulation setup are given in Section 4.5.1). Global optimally solves the problem of finding the minimum overall packet latency for all threads, as explained in Section 4.4.1. Five different configurations (i.e., sets) of applications are tested on an 8x8 mesh network, denoted as C1, C2, C4, C5, and C9 (the numbering of the configurations is nonconsecutive here in order to be consistent with the numbering in Section 4.5). C1 and C2 contain four 16-thread (or 16T for short) applications, C4 and C5 have sixteen 4T applications, and C9 has two 4T applications, one 24T application, and one 32T application. Besides Global, we also evaluate Random, the average of a large number (about $10^4$) of random mappings. The Random result represents the expected result achieved by an arbitrary mapping method.
We compare the mapping results from three aspects: (i) the overall average latency of all threads, or g-APL, calculated as the sum of all packet latencies divided by the total communication volume; (ii) the maximum APL over all applications, or max-APL (the APL of each application is calculated first, and the largest APL among all applications is the max-APL); and (iii) the standard deviation of the APLs of all applications, or dev-APL. A larger max-APL or dev-APL indicates more severe imbalance among the applications. The results for C1, C2, C4, C5, and C9 are listed in Table 4.1.
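For clarity, the three metrics can be computed from the per-application APLs and traffic volumes as in this small sketch (a simplified illustration, not the evaluation scripts):

import statistics

def balance_metrics(apls, volumes):
    # apls[i]: APL of application i; volumes[i]: its total traffic.
    g_apl = sum(a * v for a, v in zip(apls, volumes)) / sum(volumes)
    return g_apl, max(apls), statistics.pstdev(apls)   # g-APL, max-APL, dev-APL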
Although Global reduces g-APL by 7.32% on average compared to Random, the max-APL is increased by 15.62% and the dev-APL is about three to seven times that of the random average result.
Table 4.1: Exacerbated imbalance by Global.
Config  App sizes         g-APL (Random / Global)  max-APL (Random / Global)  dev-APL (Random / Global)
C1      Four 16T          22.64 / 21.35            22.75 / 25.15              0.530 / 2.09
C2      Four 16T          22.60 / 21.63            22.73 / 24.63              0.544 / 1.63
C4      Sixteen 4T        22.52 / 20.28            22.90 / 28.04              0.899 / 6.42
C5      Sixteen 4T        22.65 / 20.49            22.78 / 28.53              0.964 / 6.10
C9      4T, 4T, 24T, 32T  22.59 / 20.98            22.80 / 25.41              0.868 / 4.75
Figure 4.4: Mapping results of C1 and C9 under Global: (a) mapping result of C1; (b) mapping result of C9.
This highlights that Global improves the overall performance at the cost of making the APLs of one or more applications dramatically larger.
We show two of the mapping results of Global in Fig. 4.4 to further elucidate the imbalance issue. All the applications have a small percentage of memory requests (around 10%, as shown in Fig. 4.5), making cache accesses the dominant contributor to on-chip traffic. Application 1 in both C1 and C9 has the lightest cache traffic. As depicted in Fig. 4.4, the threads of Application 1 are assigned to tiles close to the corners, whose cache access on-chip latencies ($T_k^C$) are large, as highlighted with shaded patterns in Fig. 4.4. The APL of Application 1 is 25.15 cycles in C1, 17.8% higher than the overall average g-APL of 21.35 cycles, and 25.41 cycles in C9, 21.12% higher than the g-APL of 20.98 cycles.
These mapping and APL results demonstrate that a mapping algorithm that solely aims at reducing g-APL may intensify the imbalance in packet latencies between different applications and, thus, cannot be utilized directly.
4.2.2 Variations in Applications
Yet another challenge in providing balanced mapping results is the potentially large variation among the applications being executed at the same time. There are several sources of this variation. First, applications may vary in their level of parallelism, i.e., the number of threads they have. Second, they differ in their average cache and memory access rates (and hence the resulting traffic loads).
Figure 4.5: Average cache access and memory access rates of applications in the PARSEC benchmark suite (MPKI = misses per kilo-instruction).
The average access rate of the threads in one application can be several times larger or smaller than that of another application, as shown in Fig. 4.5. Third, although cache traffic accounts for the majority of the on-chip traffic, applications may have quite different percentages of memory controller traffic. For example, the ratio of the memory controller access rate to the cache access rate, referred to as the memory-to-cache ratio hereinafter, ranges from 0.108 to 0.607 over the applications in the PARSEC 2.0 benchmarks. Fourth, besides the four-corner memory controller placement, several other memory controller placements have also been studied (e.g., [3]), as shown in Fig. 4.6, which change the memory controller traffic behavior considerably. All the above factors significantly impact APLs and need to be considered in order to achieve balanced mapping among applications. Furthermore, differences in the runtimes of applications place additional requirements on the mapping algorithms. Some applications may finish earlier than others, leaving some of the tiles on the chip idle. In order to utilize these idle tiles, new applications may be introduced and executed on them, which requires the mapping algorithm to have a sufficiently low time complexity to allow dynamic mapping of the new threads.
4.3 Problem Statement: Mapping For On-Chip Latency Balancing
4.3.1 Selecting Metrics for Latency Balancing
An ideal multi-application mapping algorithm minimizes the imbalance in the APLs of different applications while keeping the overall APL low. To design such an algorithm, we need an appropriate metric that quantifies the degree of balance. Besides max-APL, two other popular metrics of balance are the standard deviation of the APLs of the applications (dev-APL) and the ratio of the minimum to the maximum APL (min-to-max ratio) [70]. However, because dev-APL and min-to-max ratio only gauge the relative differences among applications, they both suffer from a weakness if used as the objective function: optimizations based on these two objectives cannot ensure overall NoC performance (in terms of minimizing packet latency) and may produce a solution that makes the APLs of the applications close to each other but larger than necessary.
Figure 4.6: Different memory controller configurations on an 8x8 mesh network [3]: (a) row0_7; (b) col0_7; (c) row2_5; (d) diagonal X; (e) diamond; (f) checkerboard. Shaded tiles represent a memory controller co-located with the core/cache structure.
We use an example to illustrate the potential problems with using dev-APL or min-to-max ratio as the objective function. Assume there are four 4T applications, totaling 16 threads to be mapped onto the 16 tiles of a 4x4 mesh network with one thread per tile. Suppose the four threads of each application have L2 cache access rates of 0.1, 0.2, 0.3, and 0.4, respectively. For simplicity, suppose all the applications require zero memory accesses. Assume a router latency per hop of $l_r = 3$, a wire latency per hop of $l_w = 1$, and a serialization latency of $l_s = 1$. An optimal mapping solution is easily found, as shown in Fig. 4.7(a), which achieves the overall minimum APL as well as exactly equal APLs (10.3375 cycles) among the four applications. However, if we choose dev-APL or min-to-max ratio as the objective function, we find that Fig. 4.7(b) is also one of the 'optimal' mapping results, since it has zero dev-APL and a min-to-max ratio equal to one, both optimal values for these objective functions. In this case, although all the applications have the same APL, they all experience large latencies (11.5375 cycles). Therefore, although dev-APL and min-to-max ratio are good metrics for quantifying balance, neither is suitable as the objective function for a mapping algorithm that must achieve balanced APLs while also minimizing the overall packet latency.
To avoid the drawbacks of dev-APL and min-to-max ratio as objective functions, we adopt max-APL, the maximum APL over all the applications, as the metric. By minimizing max-APL, the mapping method takes into consideration both the overall NoC performance and the balance among individual applications, as it prevents any application from having a significantly large latency.
Figure 4.7: Comparison of two optimal mapping results: (a) both globally optimal and equal APLs for each application; (b) optimal in terms of dev-APL or min-to-max ratio.
4.3.2 Problem Statement
We first derive the mathematical expression for the APL of one application. For simplicity, we assume that one physical tile runs no more than one thread at a time, which means the total number of threads is at most $N$. If the number of threads $N'$ is less than $N$, we add one application of $N - N'$ pseudo-threads with zero communication rate to make $N$ threads in total. Given an $N$-tile NoC-based CMP and a set of applications $\{a_i\}$, $1 \le i \le A$, with $N$ threads in total, a mapping solution is a permutation of $N$ elements, i.e., $\pi(j) = k$, denoting the mapping of the $j$-th thread onto the $k$-th tile. Each thread has two parameters, namely its shared cache request rate, $c_j$, and its memory controller request rate, $m_j$. We index the threads in the following way: the threads of the $i$-th application, $a_i$, are indexed from $N_{i-1} + 1$ to $N_i$, with $N_0 = 0$ and $N_A = N$. Note that the numbers of threads of the applications are not necessarily the same. With $T_k^C$ and $T_k^M$ as defined in Section 4.1.2, the APL of application $a_i$ under mapping $\pi(j)$ is calculated by
$$d_i = \frac{\sum_{j=N_{i-1}+1}^{N_i} \left( c_j\, T^C_{\pi(j)} + m_j\, T^M_{\pi(j)} \right)}{\sum_{j=N_{i-1}+1}^{N_i} (c_j + m_j)}, \qquad (4.6)$$
where $c_j T^C_{\pi(j)}$ is the total latency of cache request packets when thread $j$ is mapped to tile $\pi(j)$, and similarly $m_j T^M_{\pi(j)}$ is the total latency of memory request packets. Therefore, the goal of mapping for latency balancing is to minimize the max-APL $d_{max}$, the maximum APL over all applications.
Formally, we formulate the On-chip latency Balancing Mapping (OBM) problem as follows:
Given:
1. the number of tiles $N$ and the number of applications $A$;
2. the application set $\{a_i\}$, where $i = 1, 2, \ldots, A$, and the $i$-th application $a_i$ consists of the threads indexed from $N_{i-1} + 1$ to $N_i$, with $N_0 = 0$ and $N_A = N$;
3. the L2 cache communication rate array $C = \{c_j\}$, where $c_j$ is the L2 cache request rate of the $j$-th thread;
4. the memory controller communication rate array $M = \{m_j\}$, where $m_j$ is the memory controller request rate of the $j$-th thread; and
5. the arrays $\{T_k^C\}$ and $\{T_k^M\}$, denoting the APLs of packets generated at the $k$-th tile to the distributed L2 cache and to the memory controllers, respectively;
Find: a thread-to-tile mapping $\pi(j) = k$, where $j, k \in \{1, 2, \ldots, N\}$, and
Minimize: the max-APL
$$d_{max} = \max_{i = 1, 2, \ldots, A} d_i, \qquad (4.7)$$
where $d_i$ is the APL of the $i$-th application defined in (4.6).
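For reference, the max-APL objective of Equations (4.6) and (4.7) can be evaluated as in the following sketch (an illustration under the stated indexing, with bounds = [N_0, N_1, ..., N_A]):

def max_apl(perm, bounds, c, m, TC, TM):
    # perm[j]: tile of thread j; c[j], m[j]: cache / memory request
    # rates; TC[k], TM[k]: per-tile average latencies (Section 4.1.2).
    apls = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]     # threads of application i
        num = sum(c[j] * TC[perm[j]] + m[j] * TM[perm[j]] for j in range(lo, hi))
        den = sum(c[j] + m[j] for j in range(lo, hi))
        apls.append(num / den)                # Eq. (4.6)
    return max(apls)                          # Eq. (4.7)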
4.3.3 NP-Completeness of OBM
Theorem 4.3.1. The OBM problem is NP-complete (NPC).
Proof Sketch.
In order to prove the NP-completeness of OBM, we first transform it into a decision problem.
Definition 4.3.1 (Decision version of OBM (DOBM)). With the conditions given in Section 4.3.2, does there exist a mapping solution that makes the APL of each application no larger than a given value $\tau$?
Part 1. Prove DOBM $\in$ NP. This means that one can verify in polynomial time whether a given mapping solution satisfies the DOBM condition. It takes $O(N)$ calculations to obtain the APLs of all the applications, and there are $A$ applications, requiring $O(A)$ comparisons of their APLs against the threshold value $\tau$. Therefore, DOBM $\in$ NP.
Part 2. For a known NPC problem $\Gamma$, prove $\Gamma \le_P$ DOBM. We adopt the well-known set-partition NPC problem as $\Gamma$, which is stated as follows: given a set of numbers $S = \{s_k\}$, $k \in \{1, 2, \ldots, N\}$, do there exist two sets $A_1$ and $A_2$ of equal size, satisfying $A_1 \cup A_2 = \{1, 2, \ldots, N\}$ and $A_1 \cap A_2 = \emptyset$, such that $\sum_{k \in A_1} s_k = \sum_{k \in A_2} s_k$? [16]
Assume we have a subroutine $\hat{D}$ that solves DOBM, i.e., $\hat{D}$ returns whether there exists a mapping such that the APLs of all applications are no larger than $\tau$. In order to solve the above problem $\Gamma$, we set up a DOBM instance of the following form. Build an $N$-tile chip such that the set of L2 cache access APLs of the tiles equals $S$, i.e., $T_k^C = s_k$ for all $k \in \{1, 2, \ldots, N\}$. There are two applications of equal size, $a_1$ and $a_2$, each with $N/2$ threads; furthermore, $T_k^M = 0$ and $c_j = 1$ for all $j$ and $k$. In this setup, the APLs of $a_1$ and $a_2$ are calculated as
$$d_1 = \frac{\sum_{j=1}^{N/2} s_{\pi(j)}}{N/2}, \qquad d_2 = \frac{\sum_{j=N/2+1}^{N} s_{\pi(j)}}{N/2}. \qquad (4.8)$$
We then call the subroutine $\hat{D}$ to determine whether there exists a mapping $j \to \pi(j)$ such that the APL of each application is no larger than $\tau$, where
$$\tau = \frac{1}{N} \sum_{k=1}^{N} T_k^C. \qquad (4.9)$$
Note that $\tau$ is constant for a given chip layout.
If $\hat{D}$ holds, i.e., $d_1 \le \tau$ and $d_2 \le \tau$, we have
$$\frac{1}{2} N d_1 \le \frac{1}{2} \sum_{k=1}^{N} T_k^C \quad \text{and} \quad \frac{1}{2} N d_2 \le \frac{1}{2} \sum_{k=1}^{N} T_k^C.$$
Since $\frac{1}{2} N d_1 + \frac{1}{2} N d_2 = \sum_{k=1}^{N} T_k^C$, we conclude that $\frac{1}{2} N d_1 = \frac{1}{2} N d_2 = \frac{1}{2} \sum_{k=1}^{N} T_k^C$. Therefore, if $\hat{D}$ returns yes, it means that
$$\exists\, \pi(j) \;\; \text{s.t.} \;\; \sum_{j=1}^{N/2} s_{\pi(j)} = \sum_{j=N/2+1}^{N} s_{\pi(j)} = \tau \cdot \frac{N}{2} = \frac{1}{2} \sum_{k=1}^{N} s_k. \qquad (4.10)$$
Conversely, if such a partition exists, the corresponding mapping yields $d_1 = d_2 = \tau$, so $\Gamma$ holds if and only if $\hat{D}$ holds. The solutions to the two subsets are $A_1 = \{\pi(j) \mid j = 1, \ldots, N/2\}$ and $A_2 = \{\pi(j) \mid j = N/2 + 1, \ldots, N\}$. The subroutine $\hat{D}$ is called once, thus proving $\Gamma \le_P$ DOBM.
Therefore, the NP-completeness of DOBM is proved, and equivalently the OBM problem is NPC.
4.4 Proposed Solution
The NP-completeness of the OBM problem precludes a polynomial-time optimal solution. Prior art on NoC mapping problems has tried general neighborhood search algorithms such as simulated annealing [49] and genetic algorithms [39]. These algorithms, however, are too time-consuming to reach a satisfactory solution.
In this section, we present an efficient heuristic to solve the OBM problem. The algorithm not only utilizes the traffic characteristics of NoC-based CMPs but also takes into account the variations among applications to increase mapping effectiveness. The proposed algorithm consists of two steps, namely application-level assignment, which assigns tiles to applications to balance cache traffic latencies, and fine-tuning, which refines the mapping result to further minimize max-APL by swapping tile-to-thread mappings across applications.
4.4.1 Subproblem: Single Application Mapping
Before presenting the algorithm to solve OBM, we first introduce its sub-procedure, namely single application mapping (SAM). Given $N_a$ tiles and an application a with $N_a$ threads, the SAM sub-procedure derives an optimal tile-to-thread mapping such that the APL of a is minimized. We formulate the SAM problem as follows.
Given:
1. the number of tiles (threads) $N_a$;
2. the L2 cache communication rates $C = \{c_j\}$ and the memory controller communication rates $M = \{m_j\}$; and
3. the tile APLs $\{T^C_k\}$ and $\{T^M_k\}$, denoting the average packet latency from the k-th tile to the distributed L2 cache and to the memory controller, respectively;
Find: a thread-to-tile mapping $\pi_a(j) = k$, where $j, k \in \{1, 2, \ldots, N_a\}$, and
Minimize: the APL of application a:
$$d_a = \frac{\sum_{j=1}^{N_a} \bigl( c_j T^C_{\pi(j)} + m_j T^M_{\pi(j)} \bigr)}{\sum_{j=1}^{N_a} \bigl( c_j + m_j \bigr)}. \quad (4.11)$$
Note that the Global algorithm mentioned in Section 4.2.1 is a special form of SAM. Global minimizes the APL of all the N threads on the chip, i.e., the g-APL. If only one application with N threads is running on the CMP, i.e., $N_a = N$, then minimizing the APL of a is equivalent to minimizing the g-APL of all the threads.
As discussed in Section 4.1.2, the APL of thread j assigned to tile k depends on the communication rates $c_j$ and $m_j$ and the tile APLs $T^C_k$ and $T^M_k$. In the calculation of $T^C_k$ given by (4.4), $l_r$, $l_w$, and $l_s$ are fixed by the NoC design. The in-network contention latency, $l_c$, is approximated as a constant in the proposed problem solution for the following reasons. First, on-chip networks typically have wide links (e.g., 128-bit or 256-bit) with multiple virtual channels per link [45], making the in-network contention latency relatively small (less than one cycle per hop, on average, for injection rates up to as high as 0.15 packets per cycle). Second, due to the backpressure resulting from flow control mechanisms (e.g., credit-based flow control), the majority of packets are queued at the source nodes when the traffic load is high, so the contention in the network is often limited. With $l_r$, $l_w$, $l_s$, and $l_c$ all constant, the APL of thread j assigned to tile k is determined by $c_j$, $m_j$ and $H^C_k$, $H^M_k$, independent of the mapping results of other threads. In other words, the cost $cost_{jk}$ (latency) of assigning thread j to a certain tile k is fixed once $\pi(j) = k$, regardless of which tiles the other threads are mapped to.
Given the cost function $cost_{jk}$, the SAM problem is hence an instance of the combinatorial assignment problem, solvable in polynomial time. One efficient solution to such assignment problems is the Hungarian algorithm, which runs in cubic time [44]. The detailed SAM solution is shown in Algorithm 3.
ALGORITHM 3: HungarianSAM
Input: An application a, its number of threads $N_a$ (or tiles), tile latency arrays $\{T^C_k\}$ and $\{T^M_k\}$, and thread communication rates $\{c_j\}$ and $\{m_j\}$.
Output: The mapping result $\pi_a(j)$ with minimum APL $(d_a)_{min}$.
1 Generate the $N_a \times N_a$ cost matrix $cost_{jk}$ based on the tile latencies and communication rates;
2 Call the Hungarian algorithm HUNGARIAN($cost_{jk}$);
3 Return the optimal permutation $\pi_a(j)$ of $1, 2, \ldots, N_a$ and the minimal overall cost $(d_a)_{min}$.
The overall complexity of Algorithm 3 is $O(N_a^3)$, because the first step of generating the cost matrix has $O(N_a^2)$ complexity and the second step of calling the Hungarian algorithm has $O(N_a^3)$ complexity.
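To make Algorithm 3 concrete, the following is a minimal Python sketch of HungarianSAM, assuming SciPy's linear_sum_assignment as the Hungarian-algorithm solver; the array names (t_cache, t_mem, c, m) are illustrative and not from the dissertation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_sam(t_cache, t_mem, c, m):
    """Map N_a threads onto N_a tiles so that the application's APL is minimized.

    t_cache[k], t_mem[k]: average packet latency from tile k to the L2 cache
                          and to the memory controller, respectively.
    c[j], m[j]:           L2 cache and memory controller communication rates
                          of thread j.
    """
    # cost[j, k] is the latency cost of assigning thread j to tile k; it is
    # fixed once pi(j) = k, independent of where the other threads are mapped.
    cost = np.outer(c, t_cache) + np.outer(m, t_mem)
    rows, cols = linear_sum_assignment(cost)              # Hungarian solver, O(N_a^3)
    apl = cost[rows, cols].sum() / (c.sum() + m.sum())    # Eq. (4.11)
    return cols, apl                                      # cols[j] = tile of thread j

Called once per application, this sketch mirrors the cubic-time behavior discussed above.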
4.4.2 Variation-Aware Heuristic Algorithm for OBM
With the HungarianSAM sub-procedure in place, we develop the complete heuristic OBM (hOBM) algorithm as follows.
The first step is to perform an application-level assignment based on L2 cache traffic characteristics, as previously mentioned. To implement this, all the tiles are sorted according to their L2 cache access latencies (i.e., $\{T^C_k\}$). We then assign a set of tiles to each application in such a way that tiles with large cache latencies and tiles with small cache latencies are equally distributed among the different applications.

Figure 4.8: Application-level assignment.

Figure 4.9: Two assignment orders: (a) poor balancing and (b) good balancing.
Specifically, to assign tiles to an application a with $N_a$ threads, the sorted tile list is divided into $N_a$ sections with an equal number of tiles, and the median tile of each section is selected for application a, as shown in Fig. 4.8. The assignment is then followed by calling HungarianSAM to map the $N_a$ threads of application a onto these selected tiles so as to achieve the minimum APL for this application. Every tile is assigned to exactly one application in this manner.
A crucial factor in the first step is the order in which applications are assigned. As applications may exhibit large variations, the effectiveness of the first step can be greatly influenced by the application assignment order. Take the mapping of six threads onto six tiles as an example. Assume the six tiles have been sorted according to their cache access latencies, denoted $l_1$ to $l_6$. One 4T application $a_1$ and two 1T applications $a_2$ and $a_3$, with six threads in total, are to be mapped onto the six tiles. If $a_1$ is assigned tiles first and $a_2$ next, as shown in Fig. 4.9(a), then when $a_3$ is assigned last, only one tile remains in the list. This leads $a_3$ to have a large APL and eventually results in a large max-APL and, therefore, severe imbalance among the three applications. In general, with multiple applications, a shorter available tile list in the later assignment phases makes it easy to end up with a more imbalanced solution.
To solve this problem, we assign tiles to smaller applications first, i.e., the applications with fewer threads. This maximizes the length of the remaining tile list and mitigates the impact of application variation. Returning to the six-thread example, the better solution depicted in Fig. 4.9(b) assigns tiles to the smaller applications $a_2$ and $a_3$ first. At the last step, since four tiles still remain, $a_1$ obtains a more averaged packet latency, and the max-APL of the three applications is hence much lower than in (a). In conclusion, the application-level assignment should follow the ascending order of application sizes in terms of the number of threads.
The second step of the proposed hOBM algorithm performs fine-tuning by swapping certain thread-to-tile mappings across applications. This swapping is based on two observations. First, some applications are more memory-intensive than others, such as raytrace and swaptions in the PARSEC 2.0 benchmark suite, as shown in Fig. 4.5. Second, threads within the same application can also have quite different memory access rates. Take the bodytrack benchmark (body tracking of a person) as an example: when it is parallelized into 16 threads, the L2 cache miss rate of each thread ranges from a minimum of 0.859 MPKI (misses per kilo instructions) to a maximum of 2.35 MPKI. These observations on inter- and intra-application variations prompt us to perform the following swaps to further optimize the mapping result.
After the first step, every thread of each application has been mapped onto a tile. In the second step, the threads of all the applications are first sorted in descending order of their memory-to-cache ratios to obtain a sorted list $\{t_m\}$ (if the memory-to-cache ratios of two threads are very close, they are ordered by cache access rate). The rationale is to adjust the tile mapping for threads that have relatively high memory controller traffic but were not mapped onto tiles with small memory access latencies. To implement the adjustment, for each thread $t_i$ in the first half of $\{t_m\}$ (i.e., those with higher memory-to-cache ratios), we find all the tiles that have smaller memory controller latencies than the tile where $t_i$ is currently mapped, and greedily swap $t_i$ to the one of those tiles that yields the largest latency reduction for the two threads involved (i.e., thread $t_i$ and the thread on the other tile before the swap). Finally, after all the swapping is done, the algorithm calls HungarianSAM once more for each application to reduce its APL, thereby possibly reducing the overall max-APL further. The pseudo code is shown in Algorithm 4.
ALGORITHM 4: Heuristic OBM (hOBM)
Input: Application set $\{a_i\}$ of size A, number of threads (tiles) N, cache and memory latencies $\{T^C_k\}$ and $\{T^M_k\}$, cache and memory access rates $\{c_j\}$ and $\{m_j\}$.
Output: The minimized max-APL $d_{max}$ and mapping $\pi(j)$.
/* Step 1. Application-level assignment. */
1 Sort the tiles based on $\{T^C_k\}$ to get the sorted list $\{l_k\}$;
2 Sort the applications in ascending order of their sizes $\Delta N_i = N_i - N_{i-1}$ to get the sorted list $\{\hat{a}_i\}$;
3 for $\hat{a}_i$ from $\hat{a}_1$ to $\hat{a}_A$ do
4     Divide the tile list $\{l_k\}$ into $\Delta \hat{N}_i$ sections of equal length;
5     Pick the median tile of each section from $\{l_k\}$;
6     Call ALGORITHM 3 to assign these $\Delta \hat{N}_i$ tiles to the threads of $\hat{a}_i$ so that the APL of $\hat{a}_i$ is minimized;
7     Remove the assigned tiles from the list $\{l_k\}$;
8 end
/* Step 2. Fine-tuning by swapping based on memory-to-cache ratio. */
9 Sort the threads in descending order of memory-to-cache ratio to get the sorted list $\{t_m\}$;
10 for $t_m$ from $t_1$ to $t_{\lfloor N/2 \rfloor}$ do
11     The memory access latency of the current tile $l_{\pi(m)}$ is $T^M_{\pi(m)}$;
12     Initialize the APL reduction $R_{max} = 0$; $n_{max} = -1$;
13     for $t_n$ from $t_1$ to $t_N$ do
14         if $T^M_{\pi(n)} < T^M_{\pi(m)}$ then
15             Current APLs of $t_m$ and $t_n$ are $d_m$ and $d_n$;
16             Calculate the new APLs of $t_m$ and $t_n$ as $d'_m$ and $d'_n$;
17             if $d_m + d_n - d'_m - d'_n > R_{max}$ then
18                 $R_{max} = d_m + d_n - d'_m - d'_n$; $n_{max} = n$;
19             end
20         end
21     end
22     if $R_{max} > 0$ then
23         Swap $\pi(n_{max})$ and $\pi(m)$;
24     end
25 end
26 for $a_i$ from $a_1$ to $a_A$ do
27     Call ALGORITHM 3 to remap the current $\Delta N_i$ threads of $a_i$ to minimize its APL.
28 end
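For concreteness, the following is a minimal Python sketch of the Step-2 fine-tuning pass (lines 9-25 of Algorithm 4). It is a simplified rendering rather than the dissertation's code: pi maps thread indices to tile indices, ratio holds the memory-to-cache ratios, t_mem holds the tile memory latencies, and apl_of(thread, tile) is an assumed helper returning a thread's APL on a given tile.

def fine_tune(pi, t_mem, ratio, apl_of):
    n = len(pi)
    # Sorted list {t_m}: threads in descending memory-to-cache ratio.
    order = sorted(range(n), key=lambda j: ratio[j], reverse=True)
    for tm in order[: n // 2]:                 # first half: memory-heavy threads
        best_gain, best_tn = 0.0, None
        for tn in range(n):
            # Only tiles with smaller memory controller latency are candidates.
            if tn != tm and t_mem[pi[tn]] < t_mem[pi[tm]]:
                old = apl_of(tm, pi[tm]) + apl_of(tn, pi[tn])
                new = apl_of(tm, pi[tn]) + apl_of(tn, pi[tm])
                if old - new > best_gain:      # track the largest joint APL reduction
                    best_gain, best_tn = old - new, tn
        if best_tn is not None:                # greedily apply the best swap, if any
            pi[tm], pi[best_tn] = pi[best_tn], pi[tm]
    return pi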
4.4.3 Time Complexity
The overall time complexity of the proposed hOBM algorithm is $O(N^3)$, as each of its two steps takes $O(N^3)$ time.
Step 1. Sorting the tiles and the applications takes $O(N \log N)$ time. There are A applications, each requiring a one-time assignment. In each assignment, selecting the $\Delta N_i$ tiles has $O(\Delta N_i^3)$ time complexity, the HungarianSAM subroutine has $O(\Delta N_i^3)$ time complexity, and deleting the assigned members from the tile list $\{l_k\}$ takes $O(N)$ time. Altogether, the first step has a time complexity of $O(N \log N) + \sum_i O(\Delta N_i^3) = O(N_{i,max}^3)$, where $N_{i,max} = \max_{i=1,2,\ldots,A} N_i$. Since in the worst case $N_{i,max} = N$, when only one application is running, the complexity of the first step is $O(N^3)$.
Step 2. Sorting the N threads has $O(N \log N)$ time complexity. In the nested loop from Line 11 to 24, there are N iterations of the inner loop and N/2 iterations of the outer loop. Each iteration of the inner loop calculates and compares APLs in $O(1)$ time, since the arrays $\{T^C_k\}$ and $\{T^M_k\}$ are pre-calculated. The inner loop therefore takes $O(N)$ time in total, which is the time complexity of each iteration of the outer loop. With the final HungarianSAM adjustments having $\sum_i O(\Delta N_i^3)$ complexity, which in the worst case is $O(N^3)$ as in Step 1, the fine-tuning step has a time complexity of $O(N^3)$.
4.4.4 Dynamic Application Mapping
As mentioned in Section 4.2, a desirable variation-aware application mapping method should also be able to perform run-time mapping of new threads when applications are dynamically added or removed (completed) on the CMP. Owing to its low computational complexity, the proposed hOBM is applicable in these scenarios, as application changes happen at a much coarser time granularity. We collect the statistics of $\{c_j\}$ and $\{m_j\}$ of the new applications over a certain interval at runtime, and then solve the OBM problem to determine the new mapping, which is used until the next application change occurs on the chip.
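As a hedged illustration of this run-time usage (the function names collect_rates and hobm are placeholders, not the dissertation's interfaces), the mapping could be refreshed as follows:

def on_application_change(active_apps, collect_rates, hobm):
    # Gather {c_j} and {m_j} statistics over a measurement interval,
    # then re-solve OBM; the result is used until the next change.
    c, m = collect_rates(active_apps)
    return hobm(active_apps, c, m)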
4.5 Evaluation
4.5.1 Evaluation Setup
We evaluate the effectiveness of the mapping algorithms using traces gathered from running multi-threaded PARSEC 2.0 benchmarks [8] in full-system simulation with Simics [51]. The GEMS [52] and GARNET [4] simulators are integrated with Simics for detailed timing of the memory system and the on-chip network, respectively. The NoC power model DSENT [68] is adopted for power estimation under 45 nm technology and a 1 V power supply.
Key parameters are listed in Table 4.2. We assume a canonical credit-based wormhole router with a 3-stage pipeline and look-ahead routing optimization. With 128-bit links, short 16-bit packets are single-flit, while long packets carrying 512-bit data plus a head flit have 5 flits.
We compare the following five algorithms:
1. global optimization (Global), which minimizes the overall average latency (g-APL) of all the
threads;
2. the Monte Carlo method (MC) for the OBM problem, which selects the mapping with the minimum max-APL from a large number (on the order of $10^4$) of random mappings;
3. a simulated annealing-based algorithm for the OBM problem (SA), in which a random move is defined as swapping the mappings of two randomly selected threads;
4. the heuristic for the OBM problem using descending order in the application assignment (hOBM_desc); and
5. the proposed heuristic for the OBM problem using ascending order in the application assignment (hOBM).

Table 4.2: Key parameters in the evaluation.
Network topology: 8x8 mesh
Router setup: 3-stage, 2GHz
Input buffer: 5-flit depth
Link bandwidth: 128 bits per cycle
Core setup: Sun UltraSPARC III+, 2GHz
Private I/D L1$: 32KB, 2-way, LRU, 1-cycle access latency
Shared L2$ per bank: 256KB, 16-way, LRU, 6-cycle access latency
Cache block size: 64B
Virtual channels: 3 VCs per protocol class
Coherence protocol: MOESI
Memory controllers: Seven placement schemes
Memory latency: 128 cycles

Table 4.3: Application configurations for the evaluation.
Config | App sizes | Cache Avg | Cache Std | Memory Avg | Memory Std
C1 | Four 16T | 7.008 | 9.40 | 0.899 | 3.14
C2 | Four 16T | 1.886 | 4.19 | 0.381 | 1.49
C3 | Four 16T | 10.88 | 10.60 | 1.51 | 4.29
C4 | Sixteen 4T | 6.221 | 10.62 | 0.737 | 1.25
C5 | Sixteen 4T | 2.643 | 3.324 | 0.543 | 1.42
C6 | Sixteen 4T | 9.589 | 11.56 | 1.26 | 3.97
C7 | 8T, 12T, 20T, 24T | 6.457 | 4.50 | 0.85 | 0.93
C8 | 4T, 8T, 20T, 32T | 10.99 | 7.97 | 1.38 | 1.22
C9 | 4T, 4T, 24T, 32T | 7.633 | 5.89 | 0.790 | 1.00
C10 | 1T, 1T, 2T, 20T, 40T | 4.134 | 6.22 | 0.963 | 0.87
Note that hOBM_desc is introduced to demonstrate the importance of awareness of application variation, and therefore we include its results only in the comparisons of Sections 4.5.2-4.5.3. The remaining sections analyze the results of the other four mapping algorithms.
As different applications have various intensities of network load (i.e., the sum of shared cache requests
and memory controller requests), we construct ten different configurations with varying loads and appli-
cation sizes in the evaluation, as shown in Table 4.3.
4.5.2 Impact of Application-Level Assignment Order
As discussed in Section 4.4.2, the order of applications in the application-level assignment step makes a significant difference. Fig. 4.10 compares the APL results after the application-level assignment step with descending (hOBM_desc) and ascending (hOBM) orders in terms of the number of threads in an application. The four applications in (a) have 4, 4, 20, and 32 threads, respectively, and the five applications in (b) have 1, 1, 2, 20, and 40 threads, respectively. The max-APL of the descending-order assignment of C9 shown in Fig. 4.10(a) is 25.26 cycles, which is 12.6% higher than the ascending-order result, and the max-APL of C10 increases to 28.56 cycles, which is 27.2% higher than the ascending-order result in Fig. 4.10(b). This confirms that the application-level assignment should follow the ascending order of application thread counts to improve the mapping results in the first step of hOBM.
Figure 4.10: Results with different application assignment orders.
Figure 4.11: Results of C1.
4.5.3 Max-APL Comparison
Fig. 4.11(a) shows the hOBM mapping results of C1, where smaller application numbers indicate lower overall cache access rates. Application 1 (A1), which generates the lightest on-chip traffic, is no longer placed in the four corners of the chip, whereas in the mapping results shown in Fig. 4.4, the conventional Global assigns the corner tiles to the threads of A1. Fig. 4.11(b) compares the APLs of the four applications in C1. The imbalance between applications is almost negligible with the proposed hOBM mapping algorithm: it reduces the max-APL to 22.31 cycles, an 11.29% decrease, making the APLs of the four applications nearly the same.
Fig. 4.12 compares the max-APL results of the five mapping methods applied to all ten configurations. The proposed hOBM achieves the best latency balancing among applications, reducing the max-APL by 15.45% on average compared to Global over the ten configurations. MC and SA achieve results close to hOBM (still 2.6% and 1.7% higher max-APLs than hOBM, respectively), but, like all search-based algorithms, they require long runtimes. The hOBM_desc algorithm has results similar to hOBM for C1-C6, where the applications have the same sizes, but it yields significantly worse balance when applications have different sizes, as reflected in the 15.32% increase in max-APL compared to hOBM for C7-C10. This demonstrates the importance of ordering in the application assignment step. The evaluation analysis in the following sections no longer includes hOBM_desc and only compares the other four mapping schemes.
Fig. 4.12 also shows that the imbalance between applications becomes even more severe under Global with higher numbers of smaller applications: the Global max-APL of mapping the 4T applications in C4 to C6 increases, on average, by 13.60% compared to that of mapping the 16T applications in C1 to C3. This is because applications with smaller sizes have a high probability of being assigned all high-latency tiles or all low-latency tiles.
Figure 4.12: Max-APLs for the ten configurations.
Table 4.4: Dev-APLs for the ten configurations.
Config Global MC SA hOBM
C1 2.094 0.089 0.026 0.006
C2 1.63 0.162 0.02 0.005
C3 1.877 0.143 0.035 0.007
C4 6.419 0.25 0.118 0.016
C5 6.091 0.358 0.049 0.007
C6 9.787 1.179 0.32 0.006
C7 4.225 0.837 0.73 0.009
C8 6.409 0.919 0.94 0.016
C9 4.746 0.356 0.078 0.015
C10 4.363 0.203 0.143 0.019
4.5.4 Standard Deviation
Although the standard deviation of APLs (dev-APL) is unsuitable as the objective function for latency-balancing mapping algorithms, as mentioned in Section 4.3.1, it is still a direct and well-acknowledged indicator of the variance among multiple values. Table 4.4 lists the dev-APLs of the four mapping methods Global, MC, SA, and hOBM for the ten configurations. Global has the largest dev-APL among the four mapping algorithms. Both MC and SA moderately reduce dev-APL compared to Global. The proposed hOBM algorithm reduces dev-APL significantly, by 99.75%, 92.91%, and 80.60% compared to Global, MC, and SA, respectively, demonstrating its superior ability to balance latencies among multiple applications.
4.5.5 Performance and Power
As mentioned above, hOBM is a performance-aware latency-balancing mapping algorithm. Although a balanced mapping may increase the overall on-chip packet latency, the proposed problem formulation uses max-APL as its metric, which allows it to achieve balance with much less performance loss than other criteria such as standard deviation. Fig. 4.13 plots the overall average APLs (g-APLs) of the four mapping methods. As expected, Global has the minimal g-APL, since its sole objective is to minimize g-APL. The performance loss of each of the other three algorithms is within 9%, because minimizing max-APL is their optimization objective in the proposed problem formulation. Among them, hOBM increases g-APL over Global only slightly, by up to 6.02%, which is less than SA (7.91% increase) and MC (8.84% increase). This shows that the latency-balancing benefits of the proposed hOBM do not introduce large penalties in overall packet latency.

Figure 4.13: g-APLs.

Figure 4.14: Dynamic power consumption.
In addition to NoC performance, we also evaluate the NoC power consumption of the proposed algorithms. While the static power is approximately the same for the different schemes, the dynamic NoC power depends on the total number of packets injected into the network per unit time and the average number of hops packets travel, both of which are affected by the application mapping results. Fig. 4.14 depicts the dynamic power comparison. As can be observed, the proposed hOBM algorithm has almost negligible power overhead (2.59% on average) compared to Global, which is also the smallest overhead among the three max-APL minimization algorithms MC, SA, and hOBM. This highlights that the latency-balancing feature of the proposed mapping does not significantly penalize NoC power.
4.5.6 Algorithm Runtime Comparison
Search-based algorithms such as simulated annealing typically trade runtime for solution quality. Fig. 4.15 plots the max-APL results of SA when it is allowed different amounts of CPU time. The runtime is normalized to that of hOBM and plotted on a logarithmic scale. To reduce the impact of randomness in SA, we show the average max-APL over configurations C1-C10. As can be seen from the figure, hOBM outperforms SA even when SA's runtime is 100X that of hOBM. In addition, while the max-APL difference between SA and hOBM is not very large, Table 4.4 shows that the dev-APL of the two methods differs by around 7X on average when SA is allowed the same runtime as hOBM.
Figure 4.15: Runtime comparison.

Figure 4.16: Dynamic mapping.
4.6 Discussion
4.6.1 Dynamic Application Mapping
The low computational complexity of the proposed hOBM allows the algorithm to be applied to dynamic mapping scenarios. We evaluate the mapping algorithms when some applications finish early and new applications become available to be mapped onto the idle tiles (assuming the unfinished applications/threads do not change their mapped tiles²). In Fig. 4.16, the dots labeled "Original" denote the original static mapping results of C1 and C4. C1_new represents the case where two out of the four applications in C1 finish and are replaced by two new applications, and C4_new represents the case where eight out of the sixteen applications in C4 finish and are replaced by eight new 4T applications. The three max-APL minimization algorithms, MC, SA, and hOBM, are allowed the same runtime. The max-APL results of the four algorithms after the new applications are added increase slightly, by 2.67%, 2.09%, 1.92%, and 0.77%, respectively. This is because, in the updated configurations with the new applications mapped, the mapping of the existing, unchanged applications is likely no longer optimal. Nevertheless, the proposed hOBM achieves the minimum max-APL increase among the four algorithms, showing its capability to balance latency under dynamic application mapping.
² The proposed algorithm is also applicable when thread migration is used, but the evaluation of the overall costs and gains of thread migration is out of the scope of this research.
Figure 4.17: Different memory controller placements.
4.6.2 Impact of Memory Controller Placement
We also evaluate the proposed solution under different memory controller placements. As shown previously, there are various memory controller placements, resulting in different on-chip memory request latencies. The max-APLs achieved by the four mapping algorithms are compared in Fig. 4.17. "Four corners" represents the default case adopted in the previous evaluations, where four memory controllers are placed in the four corners. The other six placements each have 16 memory controllers, as shown in Fig. 4.6. The APLs of the six new configurations are reduced compared to "Four corners" as a result of the increased number of memory controllers. With the memory traffic-aware swapping in the proposed algorithm, hOBM maintains a good balance of APLs among applications and performs consistently better than the other algorithms across the different memory controller placements.
4.6.3 Scalability
As the network size increases, the imbalance between applications becomes increasingly aggravated under Global, which only targets minimizing the overall APL. For a 16x16 mesh network that runs all the applications in C1 to C4 together (i.e., 16 applications, each having 16 threads), Global results in a max-APL of 55.70 cycles, a 35.1% increase over its g-APL result. This is more severe than the 26.8% increase in the 8x8 network. In contrast, the proposed hOBM provides very good scalability and is suitable for large networks, as its time complexity is only $O(N^3)$. Other search-based algorithms such as MC and SA need much more runtime to provide satisfactory results on larger networks with an exponentially growing search space. For the above-mentioned example where sixteen 16T applications are mapped onto a 16x16 network, hOBM achieves a max-APL of 45.81 cycles, which is 17.8% lower than Global, 9.54% lower than MC, and 7.25% lower than SA. These results highlight the need for latency balance-aware mapping algorithms and the importance of this work for future large on-chip networks.
4.6.4 Impact on Application Execution Time
We conduct full-system simulations with the PARSEC benchmark suite and investigate the impact of hOBM on application execution time. As hOBM balances the average packet latency, some applications experience higher APLs while others experience lower APLs. We demonstrate the usefulness of the balanced APLs by showing that the changes in APLs have a sizable influence on application speedup.

Figure 4.18: Speedup of hOBM over Global in C1 and C9.
Fig. 4.18 plots the speedup of hOBM over Global for each application in C1 and C9. Take C1, whose APL results are shown in Fig. 4.11, as an example. Out of the four applications of C1, hOBM reduces the APL of A1 by 9.87%, which results in a 5.49% increase in speedup, whereas the APL of A4 increases by 8.62% under hOBM, which translates to a 4.43% decrease in speedup. This indicates that the reduced gap in APLs helps to reduce the gap in application performance. Meanwhile, the average speedup of hOBM over Global for the four applications in C1 is 1.01 (and 0.994 for C9), indicating that the reduced gap in hOBM is not achieved at the cost of much average speedup.
4.7 Summary
This chapter addresses the important issue of balancing on-chip network latency in multi-application
mapping for chip multiprocessors. We formulate the problem of on-chip latency balanced mapping (OBM)
for multiple concurrently running applications. After proving the NP-completeness of the OBM problem,
we propose an efficient heuristic-based algorithm that leverages the characteristics of shared cache and
memory controller traffic as well as variations among applications, while taking into consideration the
overall NoC performance. Simulation results show that the proposed algorithm achieves an average reduction of 11.29% in maximum average packet latency and a 340-fold reduction in its standard deviation, with only 2.60% additional power consumption. This demonstrates the viability of exploiting thread-to-tile mapping to balance the on-chip latencies among different applications while incurring little overhead in NoC performance.
Chapter 5
Temperature-Aware Application Mapping
Application mapping, with its ability to spread out high-power components, can potentially be a good approach to mitigating the looming issue of hotspots in many-core processors. However, very few works have explored effective ways of making trade-offs between temperature and network latency. Moreover, on-chip routers, which have high power density and may become hotspots, are not considered in these works. In this chapter, we propose TAPP (Temperature-Aware Partitioning and Placement) [72], an efficient application mapping algorithm that reduces on-chip hotspots while sacrificing little network performance. The algorithm spreads high-power cores and routers across the chip by performing hierarchical bi-partitioning of the cores while concurrently placing them onto tiles, achieving high efficiency and superior scalability. Simulation results show that the proposed algorithm reduces the temperature by up to 6.80°C with minimal latency increase compared to a latency-oriented mapping solution.
5.1 Motivation: Thermal Impacts of NoC Routers
NoC routers may have a large impact on thermal issues due to their higher power-to-area ratio compared to other on-chip components. To illustrate, Figure 5.1(a) plots this ratio for the main components of Scorpio, a recently fabricated 36-core CMP [23]. With 10% of the chip area and 19% of the chip power, the NoC has the highest power-to-area ratio. Figure 5.1(b) is the thermal map of Scorpio, obtained by feeding the key chip parameters to HotSpot [66]. The figure shows that the NoC component can potentially be the hotspot of each tile. However, reducing the NoC temperature through application mapping requires placing high-power cores far from each other, whereas the goal of reducing on-chip latency may require these active cores to be placed as close as possible. To address this conflict, a new temperature-aware mapping scheme is much needed, as proposed in this work.
Figure 5.1: (a) Power densities of different components on a tile in Scorpio. (b) Thermal map of one tile in Scorpio (the core has three parts due to the rectangular restriction in HotSpot).
5.2 Problem Statement: Temperature-Aware Application Mapping
The goal of application mapping is to find a permutation $\pi(j)$, now subject to an on-chip maximum temperature constraint. We first calculate the on-chip temperature. The power consumption of a router is calculated by
$$P_R = P_{R,sta} + P_{R,dyn}(C_R), \quad (5.1)$$
where $P_{R,sta}$ is the static power consumption of the router and $P_{R,dyn}$ is the dynamic power consumption, a function of the traffic $C_R$ (in flits per cycle) going through this router, calculated as the sum of the $c_{ij}$ for which the communication between cores i and j passes through this router. This depends on the locations of cores i and j; therefore, $C_R$ depends on the core mapping results.
As network latency is among the most important criteria for on-chip networks, the temperature-aware mapping method should ensure as little latency increase as possible while reducing the on-chip temperature. The overall average packet latency L and the maximum on-chip temperature $T_M$ are calculated by
$$L = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} r_{ij} \, d_{\pi(i)\pi(j)}}{\sum_{i=1}^{N} \sum_{j=1}^{N} r_{ij}}, \quad \text{and} \quad (5.2)$$
$$T_M = \max_i \, T(x_{\pi(i)}, y_{\pi(i)}). \quad (5.3)$$
Therefore, the optimization goal of the proposed temperature-aware mapping algorithm is a balance between the maximum on-chip temperature $T_M$ and the overall average packet latency L, expressed as $\phi L + \psi T_M$. The units of the coefficients $\phi$ and $\psi$ are 1/cycle and 1/K, respectively, and the objective function can be adjusted by varying these two coefficients.
Unfortunately, this optimization problem is hard. Even with $\psi = 0$ (i.e., latency-minimizing mapping), the simplified problem has the form of a Quadratic Assignment Problem (QAP), which is NP-hard. The scalability of exact algorithms is thus greatly restricted by their high complexity, and an efficient heuristic algorithm is needed.
5.3 Proposed Algorithm: Temperature-Aware Partition-and-Placement (TAPP)
In traditional circuit partitioning algorithms, the primary goal is to find the min-cut of a graph of gates, i.e., to minimize the total weight of the interconnections that cross the cutline. We apply a similar approach to application mapping based on the Kernighan-Lin (KL) algorithm [41], considering two factors in the interconnection weight. First, the weight needs to reflect the overall traffic (communication rate) between two partitions A and B, i.e., $C^{par}_{AB} = \sum_{i \in A, j \in B} c_{ij}$. Second, since higher power density regions usually lead to hotspots, the on-chip power consumption should be spread out evenly in order to reduce the maximum chip temperature. Therefore, we add the difference in power consumption between the two partitions to the total cost $C^{par}_{AB}$ of a cutline as a second factor, in addition to the traffic cost. In this way, minimizing $C^{par}_{AB}$ means concurrently balancing power and minimizing latency during partitioning.

Figure 5.2: First two steps of TAPP: (a) Step 1; (b) Step 2.
Specifically, the partition cost $C^{par}_{AB}$ of a cutline that divides the current graph into blocks A and B is calculated by
$$C^{par}_{AB} = \hat{\phi} \sum_{i \in A, j \in B} c_{ij} + \hat{\psi} \, \Bigl| \sum_{i \in A} \hat{P}_i - \sum_{j \in B} \hat{P}_j \Bigr|, \quad (5.4)$$
where the power $\hat{P}_i$ of each tile i in a partition is calculated as the sum of the core power $P_i$, the NoC router static power, and the part of the router dynamic power caused by all the communication generated at or destined for this tile. Note that the part of the router dynamic power caused by forwarding bypassing traffic is not included, as it depends on the mapping results and is not known at this step; it is accounted for in the final temperature and latency calculation.
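A minimal Python sketch of this cost follows, assuming comm holds the pairwise communication rates $c_{ij}$ and p_hat the per-tile powers $\hat{P}_i$; all names and the coefficient handling are illustrative, not the dissertation's code.

def partition_cost(A, B, comm, p_hat, phi_hat, psi_hat):
    # Traffic crossing the cutline between blocks A and B.
    traffic = sum(comm[i][j] for i in A for j in B)
    # Power imbalance between the two blocks.
    imbalance = abs(sum(p_hat[i] for i in A) - sum(p_hat[j] for j in B))
    return phi_hat * traffic + psi_hat * imbalance   # Eq. (5.4)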
We then propose the Temperature-Aware Partition-and-Placement (TAPP) algorithm. The key idea behind TAPP is that block placement is carried out concurrently within each iteration of the hierarchical bi-partitioning. TAPP consists of three steps.
The first step conducts horizontal partitioning with placement until the size of each partition equals one row of the mesh network, based on the partitioning cost $C^{par}_{AB}$ in Equation 5.4. Each iteration includes the partitioning and placement of all the current blocks, so the number of blocks doubles after each iteration, as shown in Figure 5.2(a). As soon as the min-cut partitioning is finished within a partition, the placement order of the two resulting blocks is determined right away. This step has $O(N^3)$ complexity.
With the cores of each row determined after the first step, the second step performs vertical bi-partitioning with placement hierarchically until each block contains one core. In each iteration, the bi-partitioning is performed bottom to top and left to right, as shown in Figure 5.2(b). Each current block is partitioned with minimum cost as in Equation 5.4, and the order of the two blocks after partitioning is determined similarly to the horizontal placement. For example, in Figure 5.2(b), after partitioning blocks A and B, we calculate the temperature and latency costs imposed on all the placed blocks (including the placed blocks in the same row) by the two possible placements, i.e., A on the left or B on the left, and choose the one with the smaller cost. This step has $O(N^{2.5})$ complexity.
Table 5.1: PARSEC benchmark configurations.
No. | PARSEC Benchmarks | P_i Avg | P_i Std dev
P1 | blackscholes, bodytrack, canneal, ferret | 0.4564 | 0.1678
P2 | bodytrack, canneal, ferret, vips | 0.5036 | 0.1028
P3 | blackscholes, dedup, ferret, fluidanimate | 0.3938 | 0.1779
P4 | canneal, fluidanimate, swaptions, x264 | 0.3717 | 0.0757
Before reaching the final placement solution, the last step conducts a local adjustment of the assignment generated in Step 2 as fine-tuning. A 2x2 window is slid from bottom left to top right; the 24 permutations of the four cores inside the window are evaluated, and we greedily pick the permutation with minimum cost to assign them to the four tiles. This step has $O(N^2)$ complexity.
To sum up, the time complexity of TAPP is $O(N^3)$.
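To illustrate the third step, here is a minimal Python sketch of the sliding 2x2-window adjustment, assuming a user-supplied cost(grid) function that evaluates the combined latency/temperature objective; grid and the helper trial_cost are illustrative, not the dissertation's code.

from itertools import permutations

def slide_window(grid, cost):
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - 1):                  # slide from bottom left to top right
        for c in range(cols - 1):
            cells = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
            cores = [grid[x][y] for x, y in cells]
            best = min(permutations(cores),    # evaluate all 24 orderings
                       key=lambda p: trial_cost(grid, cells, p, cost))
            for (x, y), core in zip(cells, best):
                grid[x][y] = core              # greedily keep the cheapest one
    return grid

def trial_cost(grid, cells, perm, cost):
    old = [grid[x][y] for x, y in cells]       # place tentatively, evaluate, restore
    for (x, y), core in zip(cells, perm):
        grid[x][y] = core
    value = cost(grid)
    for (x, y), core in zip(cells, old):
        grid[x][y] = core
    return value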
5.4 Evaluation
The proposed problem and schemes are evaluated with both CMP traces and MPSoC traces. The CMP traces are generated by the cycle-accurate gem5 simulator [9] and the McPAT power modeling framework [48] with the multi-threaded PARSEC benchmarks [8]. The MPSoC task graphs and power traces are generated by TGFF [24]. The interconnect power consumption is calculated by DSENT [68].
We compare the following five schemes: (1) Random, the average latency and temperature of randomly generated mapping results; (2) MinLat_SA, a simulated annealing algorithm aiming to minimize the overall average packet latency, used as the baseline; (3) CoreOnly_SA, a temperature-aware simulated annealing algorithm that considers only core power when generating the mapping (the final temperature calculation includes routers); (4) CoreNoC_SA, a temperature-aware simulated annealing algorithm solving the proposed temperature-aware mapping problem, i.e., considering both core power and NoC power; and (5) CoreNoC_TAPP, the proposed heuristic algorithm for the temperature-aware mapping problem.
5.4.1 Temperature and Latency Results
We test the five schemes on an 8x8 mesh network with a grid size of 1 mm. For the CMP traces, we use the four groups P1, P2, P3, and P4 listed in Table 5.1, each containing four 16-thread PARSEC benchmarks. For the MPSoC traces, we use TGFF to generate four application graphs for the 64 cores to be placed, as listed in Table 5.2. The wide range of configurations in these two tables represents a representative set of workloads.
The CMP and MPSoC results are plotted in Figure 5.3 as trade-off curves of temperature and latency. The latency achieved by Random is around 28.7 cycles for CMP and 30.7 cycles for MPSoC, and the Random temperature result is shown as text. Note that the CMP results show less than 20 cycles of delay because the actual communication graph of each of the four groups is comprised of four disconnected subgraphs, due to the lack of inter-application communication.
Table 5.2: TGFF benchmark configurations.
Testbench | P_i Avg | P_i Std dev
tgff1 | 0.8346 | 0.4036
tgff2 | 0.4259 | 0.3027
tgff3 | 0.8130 | 0.1514
tgff4 | 0.2502 | 0.1010
Figure 5.3: PARSEC benchmark results (average latency of Random: 28.7 cycles) and TGFF benchmark
results (average latency of Random: 30.7 cycles).
We have the following three observations.
First, MinLat_SA, which solely minimizes the average network latency, already reduces the on-chip temperature compared to the Random mapping, because MinLat_SA greatly reduces the NoC communication power.
Second, CoreOnly_SA, which considers only the core power consumption during mapping, decreases the hotspot temperature but still averages 1-2°C higher than the two NoC-power-aware solutions. This demonstrates that the proposed problem formulation, which takes the NoC power into account, better describes the real thermal distribution on the chip.
Third, CoreNoC_SA and TAPP achieve comparable results. For the four CMP benchmarks, CoreNoC_SA and TAPP reduce the maximum temperature by 1.36°C and 1.25°C with 2.97% and 2.17% average latency penalties compared to MinLat_SA, respectively; the maximum penalty of TAPP is 3.30%, in PARSEC 1. For the four MPSoC benchmarks, they reduce the temperature by 3.66°C and 4.40°C with 1.87% and 2.32% average latency penalties; the maximum penalty of TAPP is 3.40%, in TGFF 2. However, the results of TAPP are obtained within a much smaller execution time: the SA results shown in Figure 5.3 are each given $10^5$ iterations, taking an average of 437.4 seconds, a 112X execution time compared to TAPP, which finishes in 3.9 seconds.
5.4.2 Power Consumption
The power overhead associated with temperature-aware mapping, both in prior work and in this work, comes from the extra dynamic NoC power due to higher network activity. However, since the TAPP algorithm incurs a minimal latency penalty, the activity factor of the NoC increases only slightly, introducing very small power overhead. According to the McPAT and DSENT models, TAPP has an average chip power overhead of 0.21% for the PARSEC benchmarks and 0.96% for the TGFF benchmarks.
5.5 Summary
This chapter addresses the important issue of temperature-aware application mapping in many-core processors. We analyze the thermal impact of NoC routers and formulate a mapping problem that takes into consideration the power of both cores and routers, as well as the trade-off between latency and temperature. An efficient temperature-aware partitioning and placement (TAPP) algorithm is proposed to mitigate on-chip hotspots while sacrificing little network performance. Simulation results on 8x8 mesh networks using PARSEC and TGFF benchmarks demonstrate the effectiveness of TAPP in both mapping quality and algorithm execution time.
Chapter 6
Performance Improvement Through Designing
Express Link Topologies
With the integration of up to a hundred cores in recent general-purpose processors, it is critical to design scalable, low-latency NoCs to support various on-chip communications. An effective way to reduce on-chip latency and improve network scalability is to add express links between pairs of non-adjacent routers. However, increasing the express link count may result in smaller bandwidth per link due to the limited total bisection bandwidth on the chip, thus leading to higher serialization latencies of packets in the network. Unlike previous works on application-specific designs or fixed express link placements, this research aims to find the optimal placement of express links for general-purpose processors, considering all possible placement options. We formulate the problem mathematically and propose an effective algorithm that utilizes an initial-solution generation heuristic and an enhanced candidate generator in simulated annealing. Evaluation on 4x4, 8x8, and 16x16 networks using multi-threaded PARSEC benchmarks and synthetic traffic patterns shows significant reduction in average packet latency over previous works.
6.1 Motivation
6.1.1 Related Work
With tens to hundreds of cores integrated onto multi-core processors, the performance and scalability of on-chip networks have become one of the primary challenges for NoC designers. Mainstream mesh topologies are easy to implement and more scalable than ring or bus topologies, but the average on-chip latency still increases linearly with the network diameter.
Adding express links to existing mesh-based NoCs is a promising way to improve the scalability of NoCs: express links reduce the average number of hops traversed by network packets, thereby reducing on-chip communication latency. Researchers have proposed many application-specific designs that improve the NoC topology by exploiting application characteristics (e.g., [25, 34, 59]). For example, Ogras et al. enhance mesh networks with additional long-range links between frequently-communicating routers determined by the traffic patterns of certain applications [59], and Dumitriu et al. present a NoC topology generation process that provides high performance for given applications [25]. However, the application-specific nature of these designs may lead to non-optimal solutions when used in general-purpose processors.
For general-purpose many-core platforms, there are mainly two categories of express link-based approaches that are equally competitive, namely virtual express link approaches and physical express link approaches [12]. The virtual approach utilizes virtual express channels to allow certain packets to bypass the first few pipeline stages of intermediate routers [46]. This approach does not require additional physical links, but packets cannot fully bypass the intermediate router stages (specifically, packets still need to go through the switch traversal and link traversal stages in most designs), so it yields a limited reduction in packet latencies.
In contrast, the physical approach deploys physical express links between routers [19, 42, 29],
enabling maximum bypass effects but consuming additional bisection bandwidth. For example, Dally
proposes a hierarchical express-link placement technique [19]. Kim et al. present flattened butterfly,
which adds express links to form full connectivity between the (concentrated) routers in each row and
column [42]. The flattened butterfly design successfully achieves low-diameter NoC topology. A variant
of flattened butterfly is a multi-drop express channel topology that achieves high connectivity between
routers but needs extra multiplexing logic [29]. While these proposals highlight the promise of adding
physical express links, they represent only a few specific example schemes without considering whether
these schemes are optimal and how optimal placements can be found.
This research distinguishes itself from existing studies by aiming to find the best express link placement for on-chip networks in general-purpose many-core processors. We identify the chip bisection bandwidth as a key constraint influencing the overall packet latency, based on which the placement problem is formulated and solved.
6.1.2 Impact of Express Links on Latency
As mentioned in Section 2.2, the on-chip latency of a packet is comprised of two components, $L = L_D + L_S$: the head latency and the serialization latency. Although deploying express links can reduce the number of hops for a packet, it does not always result in reduced overall packet latency. This is because the total bisection bandwidth B of a NoC is limited by many factors, including chip dimensions, manufacturing technology, and energy constraints, to name a few [60]. Assume that the link count at the cross-section of two adjacent routers is c. For an n-by-n mesh network, if we add express links such that each row or column has c links at the cross-section of two adjacent routers, the link width b needs to be adjusted in order to stay within the network bisection bandwidth constraint, i.e., $b \cdot c \cdot n \le B$.
The above analysis indicates that there is a large design space in express link-based NoCs that needs further exploration. We not only need to determine the appropriate link width to balance head latency and serialization latency under the bisection bandwidth constraint, but also need to find the optimal placement of express links at a given link width so as to minimize the average hop count. In the next section, we formulate this optimization problem mathematically, and then present, in Section 6.3, several algorithms that produce optimal or near-optimal placement results.
6.2 Problem Statement: Optimal Express Link Topology
We focus on the design of an express link-based on-chip network for general-purpose many-core platforms. The goal is to reduce the average NoC packet latency in the presence of multiple packet types with different sizes (e.g., short packets for read requests or write acknowledgments, and long packets for read replies or write requests). To reflect general-purpose computing, packet latency is averaged over all source-destination pairs to avoid unfairness to any particular communication pattern. Various synthetic traffic patterns as well as application traffic are simulated in the final evaluation section.
Problem: Find the optimal number and placement of express links to be added to an n-by-n mesh
network under a given bisection bandwidth constraint, B.
Objective: Minimize the average on-chip packet latency $L_{avg}$,
$$L_{avg} = L_{D,avg} + L_{S,avg} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} L_D(i, j)}{N \cdot N} + \frac{\sum_k p_k S_k}{b}, \quad (6.1)$$
where $N = n^2$ is the total number of routers in the network; $L_D(i, j)$ is the packet head latency from router i to router j, determined by the express link placement; $p_k$ and $S_k$ are the percentage and size of the k-th type of packets; and b is the link width in bits.
Constraints: The number of links c at the cross-section between any two adjacent routers (including both local and express links) must not exceed a link limit C, calculated as
$$c \le C = \frac{B}{b \cdot n}, \quad \forall c. \quad (6.2)$$
The above two equations imply that, for a specific value of C (and thus fixed b and $L_{S,avg}$), the overall average packet latency $L_{avg}$ is determined by the average head latency $L_{D,avg}$, which in turn is determined by the placement of express links.
6.3 Proposed Solution
To solve the above problem, our overall approach is to first determine all the possible values of C and then, for each C, determine the optimal express link placement that minimizes $L_{D,avg}$. The minimal average packet latency is then found by comparing the results across all values of C.
6.3.1 Cross-Section Link Limit
For a regular n-by-n mesh network, C has the minimum value of 1, meaning that neighboring routers are connected by one bidirectional link. When all routers in the same row (or column) are fully connected, C reaches its maximum value $C_{full}$ at the cross-section between the two routers in the middle of a row or column, given by
$$C_{full} = (n/2) \cdot (n/2) = n^2/4, \quad (6.3)$$
meaning that each router on one side of the center line is connected bidirectionally to all the routers on the other side. For example, $C_{full} = 4$ for a 4x4 network, where the maximum link count occurs between the second and third routers, and $C_{full} = 16$ for an 8x8 network, where the maximum link count occurs between the fourth and fifth routers.
Since the possible flit sizes are very limited (the flit size, i.e., the link width in bits, is typically a divisor of the packet size and a power of 2), there are only a few possible values of C. For example, the value of C can be 1, 2, or 4 for 4x4 networks and 1, 2, 4, 8, or 16 for 8x8 networks.
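The enumeration of feasible (link width, link limit) pairs can be sketched in a few lines of Python, assuming the flit size must be a power of two; the bandwidth value in the example comment is an illustrative assumption, not a figure from the dissertation.

def feasible_configs(B, n):
    c_full = (n // 2) ** 2                     # Eq. (6.3)
    configs = []
    b = 1
    while b <= B // n:                         # candidate link widths (powers of 2)
        C = B // (b * n)                       # link limit from Eq. (6.2)
        if 1 <= C <= c_full:
            configs.append((b, C))
        b *= 2
    return configs

# Example: with an assumed bisection bandwidth B = 2048 bits/cycle on an
# 8x8 mesh, the feasible (b, C) pairs are (16, 16), (32, 8), (64, 4),
# (128, 2), and (256, 1), matching the C values listed above.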
In what follows, we use P(n, C) to denote the express link placement problem that minimizes $L_{D,avg}$ on an n-by-n mesh NoC for a specific link limit value of C.
6.3.2 Reduction from 2D to 1D
As shown by empirical findings as well as recent studies, the on-chip traffic load in general-purpose chip multiprocessors is typically very low and does not exhibit much contention. Consequently, deterministic dimension-order routing (DOR) is as effective as adaptive routing most of the time. In fact, most taped-out commercial and research many-core chips adopt DOR in practice (e.g., Intel Teraflop [35], Intel SCC [36], TRIPS [28], Scorpio [23]). Therefore, to increase the applicability of this work to practical designs, we follow the same design choice and assume DOR in this work.
Under DOR (e.g., XY routing), given the source and destination routers, the routing path of a packet is comprised of a horizontal path component and/or a vertical path component. The on-chip traffic is thus separated into horizontal traffic and vertical traffic. Express links can be added to directly connect two non-adjacent routers in the same row or column. Two routers that are not in the same row or column can be connected by using a horizontal express link and a vertical express link. To tackle the express link placement problem, we start with a key theorem.
Theorem: For a specific value of C, the express link placement problem P(n, C) on a two-dimensional n-by-n mesh with XY routing is reducible to the problem of one-dimensional express link placement on a row (or column) of n routers that minimizes the average head latency $L_D$ among these n routers.
Proof. In the XY routing scheme, the routing path of a packet from router i to router j consists of a horizontal path and/or a vertical path. Hence, we have $L_D(i, j) = L_D(i, v_{ij}) + L_D(v_{ij}, j)$, where $v_{ij}$ is the turning point, i.e., the router in the same row as i and the same column as j. Since i and $v_{ij}$ are in the same row, $L_D(i, v_{ij})$ is solely determined by the link placement in that row. Similarly, $L_D(v_{ij}, j)$ is determined by the link placement in the column shared by routers j and $v_{ij}$. The average packet latency of the head flit is then expressed by
$$L_{D,avg} = \frac{\sum_{r=1}^{n} \Bigl( \sum_{i \in r} \sum_{j=1}^{n \cdot n} L_D(i, v_{ij}) \Bigr) + \sum_{c=1}^{n} \Bigl( \sum_{j \in c} \sum_{i=1}^{n \cdot n} L_D(v_{ij}, j) \Bigr)}{N \cdot N} = \frac{\sum_{r=1}^{n} \Bigl( n \sum_{i \in r} \sum_{v \in r} L_D(i, v) \Bigr) + \sum_{c=1}^{n} \Bigl( n \sum_{j \in c} \sum_{v \in c} L_D(v, j) \Bigr)}{N^2} = n \Bigl( \sum_{r=1}^{n} n^2 L_{D,r} + \sum_{c=1}^{n} n^2 L_{D,c} \Bigr) / N^2, \quad (6.4)$$
where $L_{D,r}$ and $L_{D,c}$ denote the average packet latency in the r-th row and the c-th column, respectively. This equation shows that the average latency can be minimized by minimizing the packet latency in each row and column individually. Therefore, the optimal express link placement onto the mesh network is determined by (i) solving the one-dimensional link placement problem for a row of n routers and (ii) replicating the result of this sub-procedure n times for the n rows and another n times for the n columns and combining them into the final result.
Hereinafter, we use $\hat{P}(n, C)$ to denote the one-dimensional express link placement problem on an n-router row with the link limit of C. Due to the geometric symmetry of general-purpose CMPs, $\hat{P}(n, C)$ only needs to be solved once to minimize the pairwise average packet latency of the n routers in one dimension, and the solution is duplicated for the n rows and n columns.
6.3.3 Proposed Algorithm
The solution space of $\hat{P}(n, C)$ has size $O(2^{n \cdot n})$, corresponding to the combinations of any number of links between every router pair. However, not all combinations are valid. First, a valid combination must contain all the local links between adjacent routers. Second, at the cross-section between any two routers, the link count cannot exceed C. Nevertheless, the solution space is still a super-exponential function of n, which renders a brute-force solution impractical. Therefore, a properly designed heuristic is required for large networks and/or higher link limits.
We propose an efficient simulated annealing-based algorithm to solve the one-dimensional link placement problem $\hat{P}(n, C)$.
A general simulated annealing procedure starts from an initial solution and performs a sufficient number of probabilistic searches of the solution space. Its basic components include the initial solution, the solution space, a candidate generator, a cooling schedule, and an acceptance probability function. We adopt an exponential function for the acceptance probability and a linear function for cooling. Since the number of searches needed for simulated annealing to locate a good solution, i.e., the efficiency of the algorithm, is greatly affected by the quality of both the initial solution and the neighboring states produced by the candidate generator in each iteration, we describe below how to choose a good initial solution and how to design a good candidate generator.
Step 1. Initial solution based on divide-and-conquer (D&C).
The heuristic for initial solution generation should be both efficient and effective. D&C is very fast when the problem can be divided into sub-problems directly and the overall solution can be combined efficiently from the solutions to the sub-problems. In the proposed algorithm, we divide $\hat{P}(n, C)$ into two sub-problems $\hat{P}(n/2, C-1)$ and $\hat{P}(n/2, C-1)$. The combination step adds one express link between the solutions to the sub-problems, which is fast to implement and also a good estimate of the optimal solution. Note that for one-dimensional link placement problems with a small network dimension ($n \le 4$), the optimal solution can be quickly located by enumeration methods such as branch and bound.
Step 2. Solution state and candidate generator using a connection matrix.
An efficient candidate generator is key to the efficiency of simulated annealing. A naive generator adds,
deletes, stretches, or shortens a randomly selected link in each move. However, a new candidate solution
generated this way is highly likely to fall outside the feasible solution space. More precisely, it might
be missing some local links or exceed the link limit C. This greatly degrades the efficiency of the
candidate generator. To tackle this problem, we define an equivalent search space that excludes illegal
link placements without losing any valid solutions, as follows.
For $\hat{P}(n, C)$, we define a binary matrix M of size $(n-2) \times (C-1)$, representing whether the
two links on the two sides of a router are connected, referred to as the connection matrix hereinafter.
Figure 6.1 illustrates how to construct M for $\hat{P}(8, 4)$ as an example. Since C = 4, there are three
layers for express links, as one layer is reserved for the local links between adjacent routers. In each
of the three express-link layers, the binary values at the six connection points denote how the links are
placed. For example, in the first (top) layer in Figure 6.1(a), the connection points at Routers 3 and 5-7
are connected, meaning there is an express link from Router 2 to Router 4 and a long express link from
Router 4 to Router 8, as shown in Figure 6.1(b).

Figure 6.1: (a) Connection matrix. A solid dot means the two links on both sides are connected as one,
and a hole means disconnected. (b) The corresponding express link placement. From top to bottom, the
blue, green, and red express links are denoted by the three layers of corresponding colors in the
connection matrix, respectively.
Based on the connection matrix, the candidate generator randomly picks one connection point and flips
its value to form the new candidate. Compared to annealing in the search space of all links, the new
procedure always conducts valid moves that satisfy the constraints, and it can be proven that all possible
solutions are probabilistically reachable. As an example, the placement of express links shown in
Figure 6.1 is the best solution to $\hat{P}(8, 4)$ given by the proposed algorithm.
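A sketch of this move, assuming the annealing state is the 0/1 connection matrix itself; because local
links are implicit and each express layer contributes at most one link per cross-section, any flip yields
a valid placement:

import random

def neighbor(matrix):
    """Flip one random connection point of the binary connection matrix.

    Local links are implicit and each express layer contributes at most
    one link per cross-section, so every reachable state is a valid
    placement within the link limit C.
    """
    candidate = [row[:] for row in matrix]      # copy the current state
    r = random.randrange(len(candidate))
    c = random.randrange(len(candidate[0]))
    candidate[r][c] ^= 1                        # flip one connection point
    return candidate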
6.3.4 System Implementation
Deadlock-Free Routing
As express links are added on top of the mesh network, the routing paths need to be modified to take
advantage of these express links while ensuring deadlock freedom. Deadlock avoidance is achieved by
forcing packets to traverse each dimension unidirectionally and by disallowing U-turns. For instance, in
Figure 6.1(b), a packet at Router 1 may use a mixed set of local and express links to traverse from left
to right to reach Router 6, but the packet cannot use an express link to reach Router 7 or 8 first and
then come back to Router 6. Enforcing this rule eliminates any cyclic dependencies between the channels.
In addition to ensuring deadlock freedom, the routing algorithm should also minimize the path latency
over links and routers. To achieve this, we first compute the directional shortest paths between all
router pairs within each row (or each column) offline by applying the Floyd-Warshall algorithm twice, once
for each direction. For example, within a row, the first round of the algorithm calculates the shortest
paths from router i to j, where $i \in \{1, \ldots, n\}$ and $j \geq i$, i.e., the paths of packets sent
from left to right. To implement this, all edges from j to i (from right to left) are set to infinite
weight. The second round then calculates the shortest paths from j to i (still $j \geq i$) by setting all
edges from i to j to infinite weight. The routing computation returns a look-up table for each router that
stores the next-hop router number on the same row/column (more details in the next subsection).
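A sketch of this two-pass computation, assuming adj[i][j] holds the latency of the direct (local or
express) link from router i to j within one row, or infinity if no such link exists:

INF = float('inf')

def directional_shortest_paths(adj):
    """Run Floyd-Warshall twice, once per direction, on one row of routers.

    adj[i][j]: latency of the direct link i -> j, or INF if absent.
    Returns (dist_lr, dist_rl) for left-to-right and right-to-left packets.
    """
    n = len(adj)

    def floyd_warshall(allowed):
        # Keep only edges permitted in this direction.
        dist = [[adj[i][j] if allowed(i, j) else INF for j in range(n)]
                for i in range(n)]
        for i in range(n):
            dist[i][i] = 0
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if dist[i][k] + dist[k][j] < dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
        return dist

    dist_lr = floyd_warshall(lambda i, j: i < j)   # left-to-right pass
    dist_rl = floyd_warshall(lambda i, j: i > j)   # right-to-left pass
    return dist_lr, dist_rl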
The above step is performed to populate the routing tables in each router. This routing calculation is
also executed within the proposed simulated annealing-based algorithm each time a newly generated
placement needs to be evaluated.
Router Implementation
Figure 6.2(a) depicts the router implementation. Compared to a typical mesh router, it has more input
and output ports but a narrower link width for each port. When a packet arrives at a router, the router
first uses DOR to determine the destination or the X-to-Y turning point, and then uses the aforementioned
look-up table (routing table) to find the next-hop router.
Two routing tables, one for the X dimension and one for the Y dimension, are associated with each router.
Figure 6.2(b) shows a routing table example for the first router in the first row based on the optimal
$\hat{P}(8, 4)$ solution in Figure 6.1. For simplicity, we exclude the local ports to/from network
interfaces in the links and the routing table. Router 1 has three connections in each dimension, thus six
output ports in total for the X and Y dimensions, as shown in Figure 6.2(a). Its routing table records the
output port number for each next-hop router in the X or Y dimension (as well as its network interface(s),
not shown here). For example, if a packet with destination 63 is currently at Router 1, the turning-point
router is Router 7 according to DOR. The packet is then routed to Outport 3 based on the third entry in
the X-direction routing table, which directs the packet to the next-hop Router 4. Each routing table has
at most 2(n−1) entries, so the routing table overhead is minimal (e.g., less than 0.5% of the overall
router hardware in the example 8x8 network).
6.4 Evaluation
6.4.1 Evaluation Methodology
We evaluate the proposed express link placement algorithm on three different network sizes, namely
4x4, 8x8, and 16x16. A canonical 3-stage credit-based wormhole router is assumed. The flit size of the
baseline mesh network is 256 bits, so the bisection bandwidth increases proportionally with the network
size. Based on previous findings [50], the ratio of long packets (512 bits) to short packets (128 bits) is
set to 1:4 to reflect the characteristics of real applications.
We evaluate the proposed algorithms by running multi-threaded benchmarks on the cycle-accurate
full-system simulator gem5 [9] with GARNET [4] for detailed timing of the on-chip network. The evaluated
PARSEC 2.0 benchmarks [8] include emerging recognition, mining, and synthesis (RMS) applications to
represent a wide range of general-purpose computing applications. The latest DSENT [68] NoC power
simulator is integrated into GARNET to estimate the NoC power consumption.
We compare the following topologies/schemes in this section:
- A mesh network (baseline system),
- The hybrid flattened butterfly (HFB) as proposed in [42],
- Mesh with express link placement given by the proposed simulated annealing with random initial
placement (OnlySA), and
- Mesh with express link placement given by the proposed simulated annealing with D&C-based initial
placement (D&C SA).
The hybrid flattened butterfly (HFB) given in [42] is proposed as an approach to scale the on-chip
flattened butterfly beyond a 4x4 router network. It divides the network into four quadrants, each having
a fully-connected flattened butterfly, and then connects them with local links.
6.4.2 PARSEC Benchmarks
Figure 6.3 and Figure 6.4 show the packet latency results for the PARSEC benchmarks. Figure 6.3 plots the
average packet latency as a function of the link limit C, averaged over the ten PARSEC benchmarks, as well
as the head latency $L_D$ and the serialization latency $L_S$ of the proposed D&C SA on the 4x4, 8x8, and
16x16 networks. The best link placement corresponds to the lowest point on the D&C SA curve. Note that in
Figure 6.3, Mesh and HFB are represented only as single design points as they are fixed designs, whereas
the various placements of express links offer a wide range of design options. Figure 6.4 compares the
average packet latency of Mesh, HFB, and the proposed D&C SA on an 8x8 network. It can be seen that as the
link count increases, the serialization latency starts to cancel out the savings from the head latency.
The performance improvement of D&C SA over the fixed topologies Mesh and HFB becomes much larger on the
8x8 and 16x16 networks due to the higher placement flexibility. On the small 4x4 network, the proposed
D&C SA reduces latency by 8.07% compared to Mesh and matches the latency of HFB. The latency savings over
Mesh and HFB increase to 23.50% and 8.03%, respectively, on the 8x8 network, and to 36.40% and 20.13%,
respectively, on the 16x16 network.
The difference between D&C SA and OnlySA also grows with the network size. The two proposed express link
placement schemes OnlySA and D&C SA are allowed the same runtime in Figure 6.3. OnlySA achieves the same
latency results on the 4x4 network. However, on the 8x8 network, OnlySA results in 7.37% higher latency
than D&C SA, and this gap increases to 9.41% on the 16x16 network. Detailed runtime and result comparisons
are shown in the next subsection.
6.4.3 Synthetic Traffic Patterns
Besides PARSEC benchmarks, we also evaluate the proposed scheme on various representative
synthetic traffic patterns, including uniform random (UR), transpose (TP), and bit-reverse (BR). We omit
the results of OnlySA because D&C SA is more effective and less time-consuming.
Figure 6.5(a) shows the average packet latency on an 8x8 network. The proposed topology achieves
an average of 24.41% and 16.85% latency reduction compared to Mesh and HFB, respectively.
Because real applications typically impose very low network load, on-chip networks are more sensitive to
latency than to throughput. Adding express links is, in general, a way to leverage this fact by reducing
latency at the cost of lowered throughput, as reported in prior work [29]. Adding express links may lead
to insufficient utilization of the overall bandwidth between two routers. For example, in Figure 6.1(b),
there are only three links between Routers 1 and 2 and between Routers 7 and 8, while the link limit is
four, meaning that the bandwidth is not fully utilized. Figure 6.5(b) compares the throughput of the three
topologies. Mesh has the highest throughput. The use of express links in HFB results in less than half of
the Mesh throughput, mainly because of the bottleneck links between the 2-D flattened butterfly blocks. In
contrast, the proposed D&C SA effectively recovers part of the bandwidth left unused by HFB. D&C SA has
higher throughput than HFB in all three traffic patterns, 63.71% higher on average, and restores more than
three quarters of the Mesh throughput.
6.4.4 Power Consumption
Figure 6.6 shows the dynamic and static power consumption of the PARSEC benchmarks for Mesh, HFB, and the
proposed scheme. As can be seen from the results, static power accounts for about two-thirds of the
overall power consumption, confirming the relatively low average activity of NoC components. The proposed
express link placement algorithm reduces the total router power consumption by 10.44% compared to Mesh and
0.63% compared to HFB, on average, for an 8x8 network. The power saving mainly comes from the reduction in
dynamic power due to the reduced packet forwarding activity. The dynamic power of D&C SA is 15.10% and
6.64% lower than that of Mesh and HFB, respectively. It can also be seen in Figure 6.6 that the static
power consumption of the three topologies is very similar.
6.5 Summary
This research investigates the opportunities in express link placement for general-purpose many-core
platforms. In order to minimize average packet latency under bisection bandwidth constraints, we need to
find not only a balance between the number of bisection express links and the serialization latency but
also an optimal placement of the express links. To achieve that, we effectively transform the design space
of the express link placement problem from two dimensions to one dimension and propose an efficient
simulated annealing-based algorithm. The algorithm adopts divide-and-conquer to increase the effectiveness
of the initial solution and an optimized search space that removes invalid placements so as to speed up
the simulated annealing procedure. Evaluation results demonstrate the effectiveness of the proposed
scheme: it achieves 23.50% and 8.03% average packet latency reduction on the 8x8 network and 36.40% and
20.13% on the 16x16 network compared to the traditional mesh topology and the hybrid flattened butterfly,
respectively.
Figure 6.2: Router implementation example. (a) Express links and outport numbering of Router 1 on the 8x8
network. (b) Structure and routing table of Router 1 (input ports, switch, output ports, VC and switch
allocators, routing unit, and the X- and Y-direction routing tables mapping each next-hop destination to
an outport number).
Figure 6.3: Average packet latencies as a function of the link limit C on (a) 4x4, (b) 8x8, and (c) 16x16
networks (curves: D&C SA, OnlySA, HFB, Mesh, and the $L_D$ and $L_S$ components of D&C SA).
Figure 6.4: Average packet latency results on the 8x8 network (Mesh, HFB, and D&C SA across the ten
PARSEC benchmarks and their average).
Figure 6.5: (a) Average packet latency and (b) throughput of Mesh, HFB, and D&C SA under the UR, TP, and
BR synthetic traffic patterns on an 8x8 network.
Figure 6.6: Router power consumption comparison on an 8x8 network (static (s) and dynamic (d) power of
Mesh, HFB, and D&C SA across the PARSEC benchmarks).
Chapter 7
Smart Butterfly: Proactive Power Gating on
Flattened Butterfly
While proactive power gating is a promising technique to reduce the static power consumption of
network-on-chip (NoC), its effectiveness is often hindered by the requirement of maintaining network
connectivity and the limited knowledge of traffic behaviors. In this research, we present Smart Butterfly
[71], a core-state-aware NoC power-gating scheme based on flattened butterfly that utilizes the active/sleep
state information of processing cores to improve power-gating effectiveness. Smart Butterfly exploits the
rich connectivity of the flattened butterfly topology to allow more on-chip routers to be power-gated when
their attached cores are asleep. We present two heuristic algorithms to determine the set of routers to be
turned on to maintain connectivity and allow trade-offs between power consumption and average packet
latency. Simulation results show an average of 42.85% and 60.48% power reduction of Smart Butterfly
over prior art on 4x4 and 8x8 networks, respectively.
7.1 Motivation: Connectivity Analysis of Mesh and Express Link-
Based Topologies
A key challenge to enable core-state-aware power gating is to ensure the connectivity of all active
cores while minimizing the set of routers that need to be powered on. Although the Router Parking work
shows that the mesh is a viable candidate for powering off some routers without disconnecting cores,
flattened butterfly networks require far fewer powered-on routers due to their rich connectivity.
We first identify the minimal number of routers that need to be additionally powered on in a flattened
butterfly to ensure full connectivity of a given set of active cores.
Theorem: In a flattened butterfly network in which the set of routers attached to the active cores forms
K connected components, full connectivity of all K components can be maintained only if a minimum of
(K−1) additional routers are powered on.
Proof sketch: 1) Full connectivity ⟹ at least (K−1) additional routers. In a flattened butterfly, all
active routers in a row (or in a column) are already connected and are in the same component. A sleeping
router on the i-th row and j-th column can merge at most two components if turned on: one comprising all
the active routers on the i-th row and the other formed by all the active routers on the j-th column. This
means that turning on one sleeping router can reduce the number of components by no more than one.
Therefore, by induction, at least (K−1) additional routers must be in the powered-on state to connect K
components and hence maintain full connectivity of all active cores.
2) The boundary case of (K−1) additional routers is achievable. We start by connecting two components.
Assume A and B are two components. There exists at least one sleeping router which, if turned on, connects
A and B. We turn on this router to merge A and B. Repeating this step (K−1) times connects all K
components with (K−1) additional routers.
While this theorem states that a selected set of (K−1) additional routers is needed, there are different
ways of choosing these routers, leading to different paths and varying packet latency among cores.
Moreover, it may be beneficial to turn on extra routers beyond these (K−1) routers to further reduce the
average packet latency. Therefore, we focus on this more complex but more important problem and present
two efficient algorithms to solve it.
7.2 Problem Statement: Proactive Power Gating on Flattened But-
terfly Topology
Before presenting the problem formulation, we first describe the network topology as well as the
packet latency model used.
An n by m flattened butterfly network is defined as a grid with n rows and m columns. There is a core with
a router attached to it at each grid point, making $N \triangleq n \cdot m$ cores (and routers) in total.
All the routers on the same row or column are directly connected by a physical link. Each core in the
network can be either active or in the sleep state. Let $c_i = 1$ denote that core i is active and
$c_i = 0$ otherwise. The set $C \triangleq \{i \mid c_i = 1\}$ is the set of all active cores in the
network. Similarly, let $s_i = 1$ denote that router i (which is attached to core i) is active and
$s_i = 0$ otherwise, and $S \triangleq \{i \mid s_i = 1\}$ is the set of all active routers in the network.
As a router cannot be put to sleep if the corresponding core is active, the following constraint must hold:

$$s_i \geq c_i, \quad \forall i \in \{1, \ldots, N\}. \qquad (7.1)$$
Therefore, the objective is to minimize the average packet latency L as given in Equation 2.1, where H and
$D_M$ are functions of the set S, subject to a maximum number of active routers $S_{max}$, i.e.,
$|S| \leq S_{max}$. It is impractical to enumerate all possible solutions of the above problem when N is
large. We therefore propose two efficient heuristic algorithms, both having near-optimal performance and
polynomial-time complexity.
7.3 Proposed Solutions: Exact Cost and Merit Value-Based
Approaches
7.3.1 Exact Cost-Based Approach
The first algorithm is an exact cost-based approach, which starts from the state in which only the routers
connected to active cores are ON, and then turns on other routers one by one as needed. At each step, it
chooses to turn on the router that minimizes the APL. We assume $t_c$ is a small fixed value when
computing the APL in designing the heuristics. It is also worth mentioning that we define $L_{i,j}$ to be
a large finite number (e.g., $10^4$) instead of $\infty$ in the algorithm if routers i and j are not yet
connected to each other.
Computing the APL in the algorithm involves solving an all-pairs shortest path problem. Our implementation
has a runtime of $O(N^4)$ by bookkeeping $L_{i,j}$ over the active router set: when router s is added to
the current active router set, we can update $L_{i,j}$ of the active router set in $O(N^2)$ time.
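A sketch of this bookkeeping step, assuming L is the all-pairs latency matrix over the powered-on routers,
adj gives direct-link latencies, links are bidirectional, and disconnected pairs hold the large finite
constant mentioned above:

def add_router(L, adj, s, active):
    """O(N^2) update of the all-pairs latency matrix after turning on s.

    L[i][j]: current shortest latency between powered-on routers i, j
             (the large finite constant when they are not yet connected);
    adj[u][v]: direct-link latency, or the same constant if no link.
    Diagonal entries of L are assumed 0 and links bidirectional.
    """
    # Step 1: shortest latency between s and every active router, using
    # some already-active router j adjacent to s as the last hop.
    for i in active:
        L[i][s] = min([adj[i][s]] + [L[i][j] + adj[j][s] for j in active])
        L[s][i] = L[i][s]
    L[s][s] = 0
    # Step 2: relax every active pair through the new router s.
    for i in active:
        for j in active:
            if L[i][s] + L[s][j] < L[i][j]:
                L[i][j] = L[i][s] + L[s][j]
    active.add(s)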
7.3.2 Merit Value-Based Approach
In case the $O(N^4)$ time complexity is still too slow for online implementation, we propose a faster
algorithm, namely the merit value-based algorithm with $O(N^2)$ time complexity. Similar to the exact
cost-based approach, the merit value-based approach turns on routers one by one. At each step of deciding
which router to turn on, we first consider the routers that can connect two components, with the help of a
disjoint set. If there is a tie (i.e., either multiple or no routers can connect two components), we use a
pre-computed merit value associated with each router as the tie-breaker. The merit value of a router
serves as a rough approximation of the reward of turning on that router. The merit value of router s is
computed as the sum of the communication rates $r_{i,j}$ over pairs where router i and router j are not
directly connected to each other but are both directly connected to router s (so that they become
connected if router s is turned on). When a sleeping router is turned on, the merit values of the other
routers are updated accordingly.
We use an array-based disjoint-set implementation whose find operation takes O(1) time and whose union
operation takes O(n+m) time. As computing and updating the merit values takes $O(N^2)$ time, the overall
time complexity is $O(N^2)$, which is much lower than that of the exact cost-based approach.
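The following sketch illustrates both ingredients, assuming neighbors(s) lists the routers directly linked
to s in the flattened butterfly, rate[i][j] holds $r_{i,j}$, and directly_connected(i, j) tests for a
direct link (all three are illustrative names):

class DisjointSet:
    """Array-based disjoint set: O(1) find, O(n) relabeling union."""
    def __init__(self, n):
        self.comp = list(range(n))       # comp[i]: component id of i

    def find(self, i):
        return self.comp[i]              # single array lookup

    def union(self, a, b):
        ca, cb = self.comp[a], self.comp[b]
        if ca != cb:                     # relabel one whole component
            self.comp = [cb if c == ca else c for c in self.comp]

def merit(s, neighbors, rate, directly_connected):
    """Merit of sleeping router s: the sum of r_ij over pairs i, j that
    are both directly linked to s but not directly linked to each other,
    i.e., the pairs that s would newly bridge if turned on."""
    value = 0.0
    nbrs = neighbors(s)
    for i in nbrs:
        for j in nbrs:
            if i != j and not directly_connected(i, j):
                value += rate[i][j]
    return value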
7.4 Evaluation
7.4.1 Evaluation Setup
In the simulation, the proposed Smart Butterfly is evaluated on both 4x4 and 8x8 flattened butterfly (FB)
networks with real application traces, including four MPSoC traces (namely mms2, mpeg4, toybox, and
vopd t) and eight CMP traces (referred to as spec1 to spec4 for the 4x4 network and spec5 to spec8 for the
8x8 network). The MPSoC traces are collected from 12- to 16-core real applications [10]. For the 4x4
networks, the application traces are concentrated onto 3- to 4-core traces to form the active core set.
The CMP traces are synthesized based on the memory and cache access traffic from a subset of the SPEC
benchmarks. The cores of all the test traces are randomly mapped to the NoC tiles.
We compare the following eight schemes, including both the aggressive and conservative algorithms proposed
in the Router Parking work [63] (these algorithms can be applied to FB as well):
- Mesh BB: a branch and bound algorithm on the mesh network that minimizes APL for a given maximum number
of ON routers
- Mesh RPA: Router Parking - Aggressive on the mesh network
- Mesh RPC: Router Parking - Conservative on the mesh network
- FB BB: a branch and bound algorithm on FB that minimizes APL for a given maximum number of ON routers
- FB RPA: Router Parking - Aggressive on FB
- FB RPC: Router Parking - Conservative on FB
- FB EC: the proposed exact cost-based approach on FB
- FB MV: the proposed merit value-based approach on FB
Table 7.1: Simulated network configurations.

Network Type      | Mesh (4x4) | FB (4x4) | Mesh (8x8) | FB (8x8)
Link width (bits) | 512        | 128      | 512        | 32
Average T_S       | 1          | 1.6      | 1          | 4.8
T_R               | 3          | 3        | 3          | 3
T_L               | 1          | 1        | 1          | 1
Router radix      | 5          | 7        | 5          | 15
Figure 7.1: Trade-off curves between overall NoC power and average packet latency for FB BB, FB EC, FB MV,
FB RPC, and FB RPA on four test cases (4x4: toybox, 4x4: spec3, 8x8: toybox, 8x8: spec7).
The network configurations used in the simulation are listed in Table 7.1. Based on a previous study [50],
the on-chip traffic is composed of approximately 80% short packets and 20% long packets. For a fair
comparison, both the mesh and flattened butterfly networks have the same total buffer size in bits and the
same total bisection bandwidth (so the individual link width of FB is narrower than that of mesh). For
each test case, the per-hop contention latency $t_c$ is acquired by feeding the trace into GARNET, a
cycle-accurate NoC simulator [4]. NoC power (comprising router power and link power) is calculated by the
NoC power model DSENT [68] with 32nm technology.
7.4.2 Evaluation Results
Power-Latency Trade-Offs
Figure 7.1 shows the simulation results of the five algorithms on the flattened butterfly (the aforesaid
Schemes 4-8) in the form of trade-off curves between overall NoC power and APL. Only two 4x4 and two 8x8
test cases are shown due to lack of space.
As shown in the figure, the trade-off curves of the two proposed heuristic algorithms are close to the
optimal curve of the branch and bound-based algorithm on the 4x4 network. The branch and bound result is
not shown for the 8x8 network because it did not finish within a reasonable time. Compared with Router
Parking, at the same level of power consumption, FB EC and FB MV achieve 13% and 12% lower APL on average,
respectively, compared to FB RPA. At the same level of APL, FB EC and FB MV save 28% and 27% of the NoC
power consumption on average, respectively, compared to FB RPC. It is worth mentioning that the proposed
FB EC and FB MV can produce a range of power-latency trade-off points that can be used by system operators
under different constraints and scenarios.
Figure 7.2: 4x4 results (NoC power bars, dynamic and static, with APL curves): (1-3) Mesh BB, Mesh RPA,
Mesh RPC; (4-5) FB RPA, FB RPC; (6-8) FB BB, FB EC, FB MV, on the toybox, mpeg4, mms2, vopd t, and
spec1-spec4 test cases.
Comparison of Eight Schemes
Figure 7.2 and Figure 7.3 compare the minimal NoC power consumption (left y-axis and the bars)
that can be achieved by each of the eight schemes (x-axis), and the corresponding average packet latency
(right y-axis and the curves) for different test cases. As can be seen, the power and latency results for
FB EC and FB MV are very similar to those of FB BB, demonstrating the effectiveness of the two
proposed heuristic algorithms.
In addition, the proposed FB EC and FB MV schemes are considerably better than the mesh-based
schemes in both power and latency. For example, FB EC achieves on average 42.85% and 60.48% less
NoC power consumption compared to the best results on 4x4 and 8x8 mesh network, respectively. This
advantage mainly comes from two aspects. First, each router in the flattened butterfly (higher radix but
narrower width) consumes less power compared to mesh routers (around 16% and 29% on 4x4 and 8x8
networks at the same injection rate, respectively). Second, the proposed algorithms can utilize the express
channels in the flattened butterfly to maintain connectivity of active cores while allowing more routers to
be powered off, thus having more power savings and less detours than the mesh. Note that although the
serialization latency in flattened butterfly is slightly higher than in mesh, the reduced detours and the use
of express channels in FB lead to much lower average packet latency than mesh.
Figure 7.3: 8x8 results (NoC power bars, dynamic and static, with APL curves): (1-2) Mesh RPA, Mesh RPC;
(3-4) FB RPA, FB RPC; (5-6) FB EC, FB MV, on the toybox, mpeg4, mms2, vopd t, and spec5-spec8 test cases.

7.5 Summary
In this chapter, we propose Smart Butterfly, an effective NoC power-gating scheme that applies
core-state-awareness to flattened butterfly networks. Smart Butterfly exploits the rich connectivity of
flattened butterfly networks and selectively powers off routers attached to sleeping cores to save more
power. Furthermore, it achieves a wide range of power-latency trade-offs by adjusting the number of ON
routers. We propose two heuristic algorithms to implement Smart Butterfly with different complexity and
performance. Simulation results show that the two heuristic algorithms are able to achieve near-optimal
solutions with low complexity, resulting in 42.85% and 60.48% less power consumption, on average, on 4x4
and 8x8 networks compared to a recently proposed mesh-based technique, respectively.
Chapter 8
Trade-Off between Dynamic and Static Power
Recent research has proposed to minimize NoC static power by proactively power-gating selected
routers when not all the connected cores are active. However, as more routers are powered off, on-chip
packets are forced to take detours more frequently, resulting in a higher hop count and increased dynamic
power that may potentially offset the static power savings. This chapter investigates such a trade-off
between static and dynamic power in detail, and explores how the overall NoC power consumption can
be minimized through proactive power-gating. Three efficient and effective algorithms are proposed to
minimize NoC static power, dynamic power, and overall power consumption, respectively. Evaluation
results based on PARSEC benchmarks demonstrate the importance of the trade-offs and show a substantial
improvement in total NoC power savings of the proposed algorithms compared with previous work that
did not give full consideration to both static and dynamic power.
8.1 Motivation: Trade-offs Between Dynamic and Static Power
The NoC static power, $P_S$, is determined by the number of active routers:

$$P_S = \gamma \cdot |R|, \qquad (8.1)$$

where $\gamma$ is the static power consumption of a single router and R is the set of active routers. Note
that R includes routers that are directly connected to active cores (referred to as anchor routers
hereinafter, e.g., Routers 2, 4, 9, and 11 in Figure 8.1), as well as additional powered-on routers that
maintain the connectivity among all the active cores (e.g., Routers 3, 6, and 10 in Figure 8.1). In order
to achieve the maximum static power savings, we can power off as many routers as possible under the
condition that full connectivity of the anchor routers is maintained. Unfortunately, doing so may increase
the dynamic power consumption.

Figure 8.1: Mesh example (a 4x4 grid of routers numbered 1 to 16).

As mentioned at the beginning of this chapter, the dynamic power consumption of router i is proportional
to the router load $l_i$, which can be characterized as the number of packets processed by router i per
unit time. With only the minimal set R to keep connectivity, the paths between the anchor routers are
limited, forcing packets between some router pairs to detour and be forwarded through more routers. This
increase in hop count means that some routers have to process a larger number of packets during the same
time span, i.e., they have a higher load $l_i$, leading to a dynamic power penalty.
Specifically, the total NoC dynamic power, $P_{dyn}$, is proportional to the sum of all router loads,
$\sum_i l_i$. Notice that this load sum is the total number of packets processed by all the routers per
unit time, which is equivalent to the total number of hops traveled by all the packets during that unit
time (e.g., a packet traveling 3 hops is essentially processed three times). Thus, if H denotes the
overall packet hop count per unit time (referred to as the overall hop count hereinafter), we have

$$\sum_i l_i = H = \sum_i \sum_j h_{ij} \cdot r_{ij}, \qquad (8.2)$$

where $h_{ij}$ denotes the number of hops a packet needs to travel from anchor router i to j, which is
determined by the currently active routers, and $r_{ij}$ denotes the communication rate, i.e., the number
of packets sent from i to j per unit time, collected by the aforesaid central controller in each execution
epoch. The total NoC dynamic power $P_{dyn}$ is therefore determined by

$$P_{dyn} = \rho \cdot H, \qquad (8.3)$$

where $\rho$ is the dynamic power consumed by a single packet traversing one router hop. Once the NoC
design is given (i.e., router radix, link width, frequency, etc.), $\rho$ can be considered a fixed value
at the granularity targeted in this chapter.
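As a small worked sketch of Equations (8.2) and (8.3), assuming hops[i][j] holds $h_{ij}$ under the
current active-router set and rate[i][j] holds $r_{ij}$:

def dynamic_power(hops, rate, rho):
    """Evaluate Equations (8.2) and (8.3): H = sum_ij h_ij * r_ij and
    P_dyn = rho * H. A packet traveling 3 hops contributes 3 times its
    rate to H, since it is processed by three routers."""
    n = len(hops)
    H = sum(hops[i][j] * rate[i][j] for i in range(n) for j in range(n))
    return rho * H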
In conclusion, there exists a trade-off between static and dynamic power consumption in proactive
power gating, and the optimal solution to minimize the overall power must take into account both factors.
8.2 Problem Statement: Minimum Static Power, Minimum
Dynamic Power, and Minimum Overall Power
Based on the above analysis, we can formulate the proactive power gating problem as follows.
Given:
1. an n×n mesh on-chip network consisting of $N = n^2$ routers (each router is connected to a core);
2. a set of anchor routers $R_0$, such that each active core is connected to one anchor router in $R_0$;
3. the communication rates $r_{ij}$ between the active cores, where the i-th and j-th routers are among
the anchor routers in $R_0$;
4. the static power consumption per router $\gamma$, and the coefficient $\rho$ for the dynamic power
consumption per packet per hop.
Find:
1. The minimum active router set $R_S \supseteq R_0$ that maintains the connectivity between the anchor
routers. As mentioned above, $R_S$ provides the minimum static power consumption because it turns on the
least possible number of routers, i.e., $|R_S|$ is minimum, but with a potentially high overall hop count
and high dynamic power consumption.
2. The minimum active router set $R_D \supseteq R_0$ that minimizes the hop count. $R_D$ denotes the
active router set with a minimum number of routers that still ensures all packets travel through a
shortest path (i.e., the smallest number of hops) between any two anchor routers, minimizing the dynamic
power consumption. In other words, we aim to find the minimum number of routers that provides minimum
paths for all the packets. Turning on more routers than $|R_D|$ does not further reduce the overall hop
count H.
3. The active router set $R_P \supseteq R_0$ that minimizes the overall NoC power consumption. Since the
NoC static power is minimized when the $R_S$ routers are powered on and the dynamic power is minimized
when $|R_D|$ routers are powered on, the size of $R_P$ for the minimum overall NoC power consumption
satisfies $|R_S| \leq |R_P| \leq |R_D|$.
Note that, in the selection of active routers for $R_P$, it is important (and challenging) to take into
account the communication rates $r_{ij}$ between the active cores, as $r_{ij}$ may greatly affect the
overall hop count and thus the dynamic power consumption.
8.3 Proposed Solutions
8.3.1 Minimal Active Router Set to Maintain Connectivity
We first show that the problem of finding $R_S$ is equivalent to the well-known Rectilinear Steiner
Minimum Tree (RSMT) problem, which is proven to be NP-complete [61]. First, the routers in $R_S$ form a
rectilinear Steiner tree for $R_0$. Assuming every link on the mesh NoC has length 1, the total length of
the spanning tree over $R_S$ is equal to $|R_S| - 1$ because of the cycle-free nature of tree structures.
Therefore, finding the minimum number of routers $|R_S|$ is equivalent to computing the length of the
RSMT for $R_0$.
According to Hanan's theorem [31], there exists an RSMT with Steiner points chosen from a candidate set
$S_{Cand}$, which consists of the intersection points of all horizontal and vertical lines drawn through
the initial points. Therefore, in this chapter, $S_{Cand}$ is the set of intersection points of all
horizontal and vertical lines drawn through the anchor routers in $R_0$. For the $R_0 = \{2, 4, 9, 11\}$
example in Figure 8.1, $S_{Cand} = \{1, 3, 10, 12\}$. Note that $R_S$ consists of all the routers on
$MST(S \cup R_0)$, and the Steiner point set of the RSMT is a subset of the additionally powered-on
routers.
With the above analysis, we propose an algorithm named Communication-Aware Iterative 1-Steiner (CAIS).
It is based on the Iterative 1-Steiner heuristic for RSMT, which has a worst-case 3/2 approximation ratio
[40, 61]. The basic procedure is as follows: CAIS starts with a minimum spanning tree (MST) of $R_0$,
denoted as $MST(R_0)$, and adds Steiner points one at a time from $S_{Cand}$ until no Steiner point can
further reduce the length of the current MST, $L(MST(R_0))$. At each iteration, CAIS selects the next
Steiner point that maximizes the reduction in the length of the current MST. The MST length reduction
$\Delta L(s, R_0)$ of a Steiner point s is calculated by

$$\Delta L(s, R_0) = L(MST(R_0)) - L(MST(R_0 \cup \{s\})). \qquad (8.4)$$
To further take the communication rates into consideration and minimize the overall hop count of $R_S$,
CAIS uses the overall hop count H defined in (8.2) as the tiebreaker, because H weighs the paths by the
communication rates. More precisely, whenever there is a tie in choosing the next Steiner point, CAIS
selects the one that results in a smaller H. The pseudo code is shown in Algorithm 5.
ALGORITHM 5: Communication-Aware Iterative 1-Steiner (CAIS)
1  Initialize S = ∅ and MST(R_0);  // S is the Steiner point set
2  while S_Cand ≠ ∅ and ΔL(s, R_0 ∪ S) > 0 do
3      find all s ∈ S_Cand which maximize ΔL(s, R_0 ∪ S);
4      if there are multiple solutions of s then
5          find s with minimum overall hop count H
6      end
7      if ΔL(s, R_0 ∪ S) > 0 then
8          S = S ∪ {s};  // add s to set S
9      end
10     Remove s from S_Cand;
11 end
12 Return R_S: the set of all the routers on MST(R_0 ∪ S)
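A compact Python sketch of CAIS under the Manhattan metric, assuming routers are given as (x, y) grid
coordinates and hop_count is a caller-supplied evaluator of the overall hop count H used as the
tiebreaker; extracting the routers lying on the final MST edges (which constitute $R_S$) is left to the
caller:

from itertools import product

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mst_length(points):
    """Length of a minimum spanning tree over the points under the
    Manhattan metric (Prim's algorithm on the complete graph)."""
    points = list(points)
    in_tree = {points[0]}
    total = 0
    while len(in_tree) < len(points):
        d, p = min((min(manhattan(p, q) for q in in_tree), p)
                   for p in points if p not in in_tree)
        total += d
        in_tree.add(p)
    return total

def cais(anchors, hop_count):
    """Algorithm 5 sketch: iteratively add the Hanan-grid Steiner point
    with the largest MST-length reduction, breaking ties by the smaller
    overall hop count H."""
    anchors = set(anchors)
    xs = {x for x, _ in anchors}
    ys = {y for _, y in anchors}
    cand = set(product(xs, ys)) - anchors      # Hanan candidate set
    steiner = set()
    while cand:
        base = mst_length(anchors | steiner)
        gains = {s: base - mst_length(anchors | steiner | {s}) for s in cand}
        best_gain = max(gains.values())
        if best_gain <= 0:
            break                              # no candidate helps any more
        ties = [s for s in cand if gains[s] == best_gain]
        chosen = min(ties, key=lambda s: hop_count(anchors | steiner | {s}))
        steiner.add(chosen)
        cand.discard(chosen)
    # The routers on the MST edges over anchors + steiner form R_S.
    return anchors | steiner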
8.3.2 Minimum Active Router Set to Minimize Hop Count
The router set $R_D$ allows all packets to use shortest paths (i.e., minimizes the overall hop count) with
a minimal number of active routers. While keeping all routers powered on obviously achieves the minimal
hop count, some of them may be powered on unnecessarily. As shown in the example in Figure 8.1, where
$R_0 = \{2, 4, 9, 11\}$, some routers are not receiving or forwarding any packets (e.g.,
$\{13, 14, 15, 16\}$), and some other routers can be powered off while still allowing packets to be routed
through a minimum path (e.g., if we power off Routers 1 and 5, the packets from Router 2 to Router 9 can
be routed through Routers 6 and 10). Therefore, the minimum router set $R_D$ should exclude the
unnecessary routers and maximize the "sharing" of routers among multiple minimum paths.
The problem of finding $R_D$ has an exponential search space similar to the $R_S$ problem. In order to
develop an efficient algorithm for $R_D$, we identify one crucial property, namely the memoryless property
of the $R_D$ solution on the mesh topology. To be specific, assume s is a router within the rectangle
defined by routers i and j (e.g., Router 6 in the rectangle defined by Routers 4 and 9 in Figure 8.1). A
shortest path $p^{min}_{ij}$ can be derived by combining the shortest path $p^{min}_{is}$ between i and s
with the shortest path $p^{min}_{sj}$ between s and j, i.e.,
$p^{min}_{ij} = p^{min}_{is} \cup p^{min}_{sj}$. For example, a minimum path between Routers 4 and 6 is
$\{4, 3, 2, 6\}$, and one between Routers 6 and 9 is $\{6, 10, 9\}$. Combining these two paths, we get a
minimum path $\{4, 3, 2, 6, 10, 9\}$ between Routers 4 and 9.
This suggests the following way to find $p^{min}_{ij}$: after choosing a suitable router s and building
$p^{min}_{is}$, we only need to find $p^{min}_{sj}$ to construct the shortest path between i and j. In the
example of Routers 4 and 9, once we choose Router 6 as the via node and derive the path from 4 to 6, we
only need to build the path from Router 6 to 9 to obtain the minimum path between 4 and 9.
Inspired by this memoryless property, we propose an algorithm named Constructing Along sIngle Dimension
(CAID), which progressively constructs the shortest paths between anchor routers along one dimension (from
top to bottom in this chapter). When processing $row_i$, all the shortest paths among the anchor routers
from $row_1$ to $row_{i-1}$ have already been constructed and remain unchanged. CAID only constructs the
horizontal paths within $row_i$ and the vertical paths from $row_i$ to $row_{i+1}$, based on the relative
positions of the active routers on $row_i$ with respect to the rows below (explained next). Here,
constructing a path means turning on all the routers on that path. In this way, we reduce the solution
search space to one row in each iteration.
In each iteration, the CAID algorithm first decides on the paths from the current row to the next row
(vertical paths) and then connects all the routers in the current row (horizontal paths). In order to
ensure minimum paths between any pair of routers, CAID constructs the vertical paths for each router based
on the following four basic patterns, where $(x_s, y_s)$ is the current active router ($y_s = row_i$), and
$x_{min}$ and $x_{max}$ denote the smallest and largest x coordinates of the remaining anchor routers,
i.e., the anchor routers from $row_{i+1}$ to the last row. We elaborate the four patterns and their
corresponding decisions in the following.
Pattern A (Outsider): When s is outside the left or right boundary of the remaining routers (i.e.,
$x_{min} > x_s$ or $x_s > x_{max}$), CAID constructs an XY path from s to the corner router
$(x_{min}, row_{i+1})$ or $(x_{max}, row_{i+1})$.
Pattern B (Uncovered Half): When $x_{min} < x_s < x_{max}$ and no routers in the current row are active on
one side of s (i.e., in $[x_{min}, x_s)$ or $(x_s, x_{max}]$), CAID constructs the path from s to
$(x_s, row_{i+1})$, i.e., to the router right below s.
Pattern C (Sink): When $x_{min} < x_s < x_{max}$ and there are active routers in the current row on both
sides of s, if there exists a remaining anchor router between columns $x_{lhs}$ and $x_{rhs}$, where
$x_{lhs} = \max_x \{x < x_s\}$ and $x_{rhs} = \min_x \{x > x_s\}$ (x ranging over the x coordinates of the
active routers on $row_i$), CAID constructs the path from s to $(x_s, row_{i+1})$.
Pattern D (Bridge): When $x_{min} < x_s < x_{max}$ and there is no remaining anchor router between columns
$x_{lhs}$ and $x_{rhs}$, CAID constructs no new vertical path.
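A small Python sketch of this per-router classification, assuming x_s is the router's column, x_min and
x_max bound the remaining anchor routers below, row_active lists the columns of active routers on the
current row, and remaining_between(a, b) reports whether any remaining anchor router lies in columns a
through b (all names are illustrative):

def vertical_decision(x_s, x_min, x_max, row_active, remaining_between):
    """Classify router (x_s, row_i) into Patterns A-D; return the target
    column of its vertical path to row_{i+1}, or None for Pattern D."""
    if x_s < x_min or x_s > x_max:
        # Pattern A (Outsider): XY path to the nearer corner router.
        return x_min if x_s < x_min else x_max
    left = [x for x in row_active if x < x_s]
    right = [x for x in row_active if x > x_s]
    if not left or not right:
        # Pattern B (Uncovered Half): go straight down.
        return x_s
    x_lhs, x_rhs = max(left), min(right)
    if remaining_between(x_lhs, x_rhs):
        # Pattern C (Sink): a remaining anchor lies between the two
        # same-row neighbors, so go straight down.
        return x_s
    # Pattern D (Bridge): no new vertical path needed.
    return None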
The pseudo code of CAID is shown as follows.

ALGORITHM 6: Constructing Along sIngle Dimension (CAID)
1  Initialize R_D = R_0;
2  sort the routers in R_0 based on their row numbers {row_i};
3  for row_i in {row_i} do
4      if row_i is the last row with anchor routers then
5          R_D = R_D ∪ {s | x_min^{row_i} ≤ x_s ≤ x_max^{row_i}, y_s = row_i};  // construct horizontal paths
6          break;
7      end
8      find x_min and x_max of the remaining rows;
9      for active router s in row_i do
10         if Pattern A == true then
11             R_D = R_D ∪ XYpath(s, (x_m, row_{i+1}))^a;  // routers from s to corner
12         else if Pattern B or C == true then
13             R_D = R_D ∪ XYpath(s, (x_s, row_{i+1}));  // routers from s to row_{i+1}
14         else if Pattern D == true then
               ;  // do nothing
15     end
16     R_D = R_D ∪ {s | x_s ∈ [x_min^{row_i}, x_max^{row_i}], y_s = row_i};  // construct horizontal paths
17 end
18 Return R_D

a: The function XYpath(a, b) returns the set of routers on the path following XY routing from a to b.
CAID has n iterations (one per row), each examining up to n routers. For each router, $x_{min}$,
$x_{max}$, and whether any remaining anchor router lies between columns $x_{lhs}$ and $x_{rhs}$ are
determined in $O(\log N)$ time ($N = n^2$) by maintaining a binary search tree. The step of adding
horizontal paths for each row takes O(n). Overall, CAID has a complexity of
$O(n) \cdot (O(n) \cdot O(\log N) + O(n)) = O(N \log N)$.
8.3.3 Active Router Set to Minimize Overall NoC Power
With $R_S$ achieving the minimal static power and $R_D$ achieving the minimal dynamic power, the active
router set $R_P$ that achieves the minimal overall NoC power can be found by powering on additional
routers on top of $R_S$. We adopt a basic flow similar to CAIS to find $R_P$, i.e., we iteratively turn on
routers until a satisfactory result is achieved. However, choosing a single router to power on in each
iteration may make no difference in the hop count, as a new path often requires more than one router.
Therefore, instead of adding one router at a time, we propose the Communication-Aware Iterative 1-Path
(CAIP) algorithm, which chooses one path in each iteration and turns on all the routers on that path. In
this way, we can explore the full range of the static-dynamic power trade-off, because the starting point
is $R_S$, and we can reach up to $R_D$ since the overall hop count H can always be reduced as long as
there exist two routers with no minimum path between them.
Similar to the CAIS algorithm, a criterion for selecting a new path is needed for the CAIP algorithm. This
criterion should reflect not only the cost of turning on a path but also the reward of doing so. The cost
equals the static power increase, whereas the reward comprises two parts, namely (i) the dynamic power
savings due to the hop count decrease brought by the new path, and (ii) the potential of the newly
powered-on routers on the new path to further reduce dynamic power in future iterations as
part of other minimum paths. More precisely, some powered-off routers may belong to the minimum paths of
more than one router pair, and the future reward of turning on the other paths through these "shared"
routers should also be considered. For example, using the numbering in Figure 8.1, assume no minimum path
is available between Routers 2 and 11 or between Routers 9 and 11. Then, when adding a minimum path
$\{9, 10, 11\}$ between Routers 9 and 11, the reward calculation should include both its own contribution
to the hop count reduction and the potential to further reduce the hop count between Routers 2 and 11.
Based on the above analysis, we give the criterion for selecting the paths as follows. We first define the
weighted hop count reduction $\Delta H_{ij}$ for each router pair, which equals the decrease in the
overall hop count H if we connect routers i and j with a minimum path:

$$\Delta H_{ij} = (|p_{ij}| - M(i, j)) \cdot r_{ij}, \qquad (8.5)$$

where $p_{ij}$ is the current path from router i to j, and M(i, j) is the Manhattan distance, which is the
length of a minimum path on mesh networks. The difference between $|p_{ij}|$ and M(i, j) is weighted by
$r_{ij}$ to reflect the actual decrease in H.
With $\Delta H_{ij}$ calculated, we define the overall reward function g(s), i.e., the potential power
savings of turning on a powered-off router s, as

$$g(s) = \rho \sum_{p^{min}_{ij} \ni s} \Delta H_{ij} - \gamma, \quad \forall s \notin R_S, \qquad (8.6)$$

where the subtracted $\gamma$ accounts for the static power cost of powering on the router. The reward
part sums $\Delta H_{ij}$ over all router pairs i and j that have a minimum path going through s
($p^{min}_{ij} \ni s$).
Finally, the gain function of turning on a minimum path $p^{min}_{ij}(k)$ between routers i and j is
defined by

$$g(p^{min}_{ij}(k)) = \sum_{s \in p^{min}_{ij}(k)} g(s). \qquad (8.7)$$

Here, $p^{min}_{ij}(k)$ denotes the k-th minimum path between i and j, as there could be more than one
minimum path.
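A short sketch of Equations (8.5) through (8.7) as pure functions, assuming cur_len[i][j] holds the
current path length $|p_{ij}|$, manh computes the Manhattan distance M(i, j), pairs_through(s) enumerates
the router pairs with a minimum path through s, and gain maps a router to its current g(s) (all
illustrative names):

def delta_H(cur_len, manh, rate, i, j):
    """Weighted hop-count reduction of giving pair (i, j) a minimum
    path (Equation 8.5)."""
    return (cur_len[i][j] - manh(i, j)) * rate[i][j]

def router_gain(s, pairs_through, dH, rho, gamma):
    """g(s) from Equation (8.6): the dynamic-power reward summed over
    all pairs whose minimum paths pass through s, minus the static cost
    gamma of keeping s powered on."""
    return rho * sum(dH[i][j] for (i, j) in pairs_through(s)) - gamma

def path_gain(path, gain):
    """g(p_min_ij(k)) from Equation (8.7): total gain of powering on
    every router on one candidate minimum path."""
    return sum(gain(s) for s in path)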
The CAIP algorithm first sorts all the anchor router pairs in $R_S$ in descending order of
$\Delta H_{ij}$. When processing each router pair i and j, a minimum path with the maximum gain
$g(p^{min}_{ij}(k))$ is selected. CAIP then powers on the path and updates g(s) by subtracting
$\Delta H_{ij}$ from the gain functions of all the other powered-off routers on the minimum paths between
i and j (because this gain $\Delta H_{ij}$ has already been utilized). The algorithm continues for the
rest of the router pairs until (i) $\Delta H_{ij} = 0$ for all pairs, i.e., each anchor router pair has a
minimum path between them, or (ii) $P(R_P) \geq P(R_D)$, i.e., $R_P$ has reached $R_D$, where turning on
more routers would not reduce H further.
The pseudo code of CAIP is shown as follows.

ALGORITHM 7: Communication-Aware Iterative 1-Path (CAIP)
1  Initialize R_P = R_S;
2  Sort the router pairs based on ΔH_{ij};
3  for ΔH_{ij} in the sorted list {ΔH_{ij}} do
4      if P(R_P) ≥ P(R_D) (reaches the end of the R_S-to-R_D range) then
5          return the router set R_P = R_D;
6      end
7      find p^{min}_{ij}(k) s.t. g(p^{min}_{ij}(k)) ≥ g(p^{min}_{ij}(l)), ∀l;
8      R_P = R_P ∪ p^{min}_{ij}(k);  // turn on all the routers on the path
9      calculate the overall power P(R_P) = P_sta(R_P) + P_dyn(R_P);
10     if P(R_P) < P_min then
11         P_min = P(R_P); R_{P,min} = R_P;
12     end
13     for router s on a minimum path between i and j do
14         g(s) = g(s) − ΔH_{ij}, ∀s ∉ R_P;
15     end
16 end
17 Return the router set R_{P,min};
8.4 Evaluation
8.4.1 Evaluation Setup
We evaluate the proposed power gating schemes on an 8x8 mesh-based NoC, with the multi-threaded
PARSEC benchmarks [8] running on the cycle-accurate full-system gem5 simulator [9]. We experiment
with 8, 16, and 32 active cores, and these active cores are randomly selected from the 64 cores on the 8x8
mesh. The following results are the average of ten randomly generated active core sets unless otherwise
specified. The NoC power consumption is estimated by the latest NoC power model DSENT [68] with
32nm bulk CMOS LVT technology. Given the light communication rates in real applications and thus
the low likelihood of deadlock, we adopt a deadlock recovery approach where the up/down routing is
used together with a dedicated channel to drain packets and recover from deadlock. Table 8.1 lists the key
parameters in the evaluation.
We compare the following schemes:
1. Baseline: no power gating applied (NoPG);
2. The two Router Parking schemes proposed in [64]: aggressive (RPA) and conservative (RPC). RPA powers
off as many routers as possible to minimize static power, which has a goal similar to that of our CAIS,
whereas RPC powers off a subset of routers to minimize the hop count increase¹, which has a goal similar
to that of our CAID.

¹The paper claims RPC minimizes latency overhead, which is equivalent to minimizing the overall hop count
per unit time in their setup.
Table 8.1: Key Parameters in Evaluation.
Network topology 8x8 mesh
Input buffer depth 5-flit for data VC, 1-flit for control VC
Link bandwidth 128 bits/cycle
Router 3-stage, 3GHz
Private I/D L1 cache 32KB, 2-way, LRU, 1-cycle latency
Shared L2 cache per bank 256KB, 16-way, LRU, 6-cycle latency
Cache block size 64B
Virtual channel 2 VCs/VN, 3 VNs
Coherence protocol Two-level MESI
Memory controllers 4, located one at each corner
Memory latency 128 cycles
Figure 8.2: NoC power consumption (dynamic and static) and average hop count of NoPG, RPA, CAIS, RPC,
CAID, and CAIP under (a) 8, (b) 16, and (c) 32 active cores.
3. The three proposed schemes: CAIS, CAID, and CAIP, which find $R_S$, $R_D$, and $R_P$, respectively
(for clarity, the algorithms are named such that the last letter matches the objective of minimizing
Static power, Dynamic power, or overall Power).
8.4.2 Comparison of the Power Gating Schemes
Figure 8.2 shows the NoC power consumption and hop count of the six schemes under different numbers of
active cores, averaged over the PARSEC benchmarks. For readability, the hop count shown in the figure is
averaged per packet (i.e., $H / (\sum_i \sum_j r_{ij})$) instead of H itself, which would otherwise be a
less intuitive value that depends on the packet rates.
Among the five power gating algorithms, RPA and CAIS aim at minimizing static power while maintaining full
connectivity. Compared to RPA, the proposed CAIS achieves an average of 24.44% less static power
consumption with almost the same dynamic power consumption, indicating that CAIS is preferred in cases
where a minimum router set is desired.
Similarly, both RPC and CAID minimize the hop count increase between the anchor routers. CAID achieves an
average of 20.37% power savings compared to NoPG while guaranteeing the same minimal hop count as NoPG. In
comparison, the RPC solutions require 23.86% more routers to be powered on than CAID on average while
still incurring a 2.63% hop count increase over NoPG. RPC also achieves 11.71% less static power savings
than CAID.
Finally, compared to NoPG, the proposed CAIP achieves 43.35% overall NoC power savings under 8 active
cores, and 33.91% and 34.09% under 16 and 32 active cores, respectively. As can be seen, CAIP has neither
the minimal hop count (an 8.65% increase compared to NoPG) nor the minimum number of active routers, but
it achieves the minimal overall NoC power among all the schemes. This verifies that the minimal overall
power configuration may differ from the minimal static or dynamic power configurations.
8.4.3 Trade-off between Static Power and Dynamic Power
To further study the trade-off between static and dynamic power, we examine the results of one of the
benchmarks, x264, in detail. The other PARSEC benchmarks exhibit similar trends, although the specific
values differ. Figure 8.3 plots the static, dynamic, and overall NoC power consumption as the number of
active routers increases, for a randomly generated set of 16 active cores. The static power is linearly
proportional to the number of active routers. The solution generated by CAIS corresponds to the left-most
point, at which the dynamic power accounts for as much as 71.31% of the overall NoC power due to the large
hop count caused by detours. As new minimum paths are added, the dynamic power consumption decreases
because of the smaller hop count between the active cores. The dynamic power gradually approaches its
minimum as the active router set approaches CAID. The overall NoC power consumption is minimized at 36
active routers with 3.92 W, of which 56.01% comes from dynamic power and 43.99% from static power.
Figure 8.3: Static, dynamic, and overall router power as a function of the number of active routers
(x264, 16 active cores).
Figure 8.4: Full system simulation results: (a) application execution time percentage and (b) NoC energy
percentage for NoPG, RPA, CAIS, RPC, CAID, and CAIP.
Figure 8.5 depicts the solutions generated by the three proposed algorithms for the above example. The
anchor routers are shown as white blocks. The CAIS solution in Figure 8.5(a) needs to power on 11
additional routers (green blocks) to ensure full connectivity, whereas CAID turns on 31 additional routers
(red blocks) to minimize the hop count. Finally, the CAIP algorithm achieves the minimum power consumption
by turning on 20 additional routers (blue blocks), i.e., 9 more routers on top of CAIS.
Figure 8.5: x264 result with 16 active cores: routers turned on by (a) CAIS, (b) CAID, and (c) CAIP
(legend: anchor routers, routers turned on by each algorithm, and power-gated routers).
8.4.4 Impact on Application Performance and NoC Energy
We also investigate the impact of the proposed power gating schemes on application execution time
and NoC energy (i.e., the product of the application execution time and the NoC power).
Figure 8.4(a) and (b) compare the total execution time and NoC energy, averaged over the PARSEC
benchmarks, under the six schemes. RPA and CAIS take longer to run, with 6.69% and 7.55% execution time
overhead, respectively, but CAIS saves 16.69% NoC energy compared to NoPG, which is 9.31% more than RPA.
RPC and CAID have negligible performance overhead because of their minimal impact on hop count, and they
save 12.43% and 16.34% NoC energy compared to NoPG, respectively. This indicates that CAID can be adopted
to save NoC energy when high NoC performance is required at runtime. Lastly, among the six schemes, the
minimal-power solution CAIP achieves 31.48% NoC energy savings with only 3.68% performance overhead
compared to NoPG.
8.5 Summary
This chapter identifies and explores the important trade-off between static and dynamic power in proactive
NoC power gating schemes. Keeping a minimum number of active routers to minimize static power does not
necessarily lead to the maximum NoC power savings, because dynamic power can increase due to packet
detours. We present three efficient algorithms to find the active router sets that minimize static power,
dynamic power, and overall NoC power consumption, respectively. Simulation results show the advantage of
the proposed algorithms over prior art. In particular, the algorithm that targets overall NoC power
minimization achieves 33.91% NoC power savings and 31.48% NoC energy savings on average, with only 3.68%
overhead in application execution time.
Chapter 9
Conclusions and Future Research
9.1 Conclusions
This dissertation explores the opportunities and feasible solutions for reducing packet latency and power
consumption in on-chip networks (a.k.a. networks-on-chip, or NoCs). The main contributions of this work
include improving NoC performance through application mapping under different scenarios, exploring latency
reduction opportunities through physical express link placement, and designing power minimization schemes
that identify the trade-off between dynamic and static power consumption in core-state-aware power gating.
Average packet latency is among the most important performance criteria of on-chip networks. First, this
research discusses application mapping methods for three different scenarios to achieve minimum NoC
latency. It reveals how the additional express links of the flattened butterfly NoC topology influence the
latency models and thereby degrade the effectiveness of existing application mapping algorithms, and then
proposes an effective mapping method for flattened butterfly NoCs based on a partitioning algorithm from
computer-aided design. Second, this work considers the situation in which multiple applications run on a
single chip multi-processor (CMP), where balancing the on-chip packet latencies of these applications can
be an important criterion. This work addresses this balancing requirement without performance degradation
by wisely choosing the NoC balancing metric and proposing an effective heuristic mapping algorithm. Third,
this work explores an application mapping method with minimum latency increase under a tight on-chip
thermal budget, i.e., a maximum temperature limit for the multi-core chip.
The introduction of express links offers another way to reduce average packet latency. This work explores customized placement of express links between non-adjacent cores to improve overall NoC performance. A trade-off between head-flit latency and serialization latency is identified, and a detailed framework for finding the optimal express link placement is proposed, together with an effective heuristic solution, a detailed router structure implementation, and an analysis of the additional power and area overhead.
This dissertation then investigates power gating opportunities for on-chip routers. As the share of static power in NoC routers keeps increasing with technology scaling, power gating has become a promising method to reduce NoC power consumption. One way to exploit potential router idleness is core state-aware power gating, or proactive power gating, i.e., selectively shutting down routers connected to sleeping cores. This research first focuses on core state-aware power gating for flattened butterfly NoC topologies and proposes heuristic algorithms that minimize the number of active routers while limiting NoC performance degradation. It then analyzes the overall power consumption of proactive power gating on a generic mesh-based NoC topology and identifies the trade-off between dynamic and static power consumption. Three heuristic algorithms are proposed to address the different power and performance requirements of the running applications.
9.2 Recommendations for Future Research
With the advent and proliferation of multi-core processors, on-chip networks have become a key structure supporting communication between cores. The frameworks and algorithms proposed in this dissertation can be extended in the following ways.
9.2.1 Thermally aware application mapping and express link placement
As mentioned in Chapter 5, thermal issues raise when two power hungry cores stay too close to
each other on multi-core platforms. Placing hot cores away from each other can eliminate any potential
hotspots, but may also result in NoC packet latency increase if these two hot cores are heavily commu-
nicating with each other. Chapter 5 proposes a trade-off solution between hotspot elimination and NoC
performance. As mentioned in Chapter 6, express links introduce additional connectivity to the NoC by
adding links to non-adjacent tiles. Combining this idea with TAPP in Chapter 5, we find that if we map
two heavily communicating hot cores at two ends of an express link, the hotspot can be eliminated with
minimum performance degradation. Therefore, for a many-core application with tight thermal budget, one
method to achieve high performance is to combine the methods proposed in Chapter 5 and Chapter 6, i.e.,
to place the threads of the application onto a NoC-based many-core platform with express links.
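As a concrete starting point, the selection rule could resemble the sketch below. It is an illustrative assumption rather than an algorithm from Chapter 5 or Chapter 6; the link list, tile set, and distance function are hypothetical inputs.

def pick_tiles_for_hot_pair(express_links, free_tiles, min_separation, dist):
    """Pick tiles for two hot, heavily communicating threads.

    express_links  -- iterable of (tile_a, tile_b) endpoint pairs
    free_tiles     -- set of tiles not yet assigned a thread
    min_separation -- minimum physical distance needed to avoid a hotspot
    dist(a, b)     -- physical (e.g., Manhattan) distance between tiles
    """
    candidates = [(a, b) for a, b in express_links
                  if a in free_tiles and b in free_tiles
                  and dist(a, b) >= min_separation]
    if candidates:
        # Farthest-apart endpoints give the most thermal slack, while
        # the express link keeps the pair's communication latency low.
        return max(candidates, key=lambda ab: dist(*ab))
    return None  # fall back to the regular TAPP placement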
9.2.2 Thread migration and multi-threaded cores in application mapping
This research focuses on design-time application mapping in various platforms, as discussed in
Chapter 3, Chapter 4, and Chapter 5. Considering the fact that some applications may have different
communication patterns at different stages, the performance of NoC can be further improved if the
application is separated into different sections and the mapping gets updated for each section. In the next
decision epoch, if the optimal mapping solution differs from the current mapping, thread migration might
be needed to switch to the optimal solution. However, thread migration bring extra costs resulted from
cache flushing, context switching, thread re-running, etc. It is desired to compare the estimated costs
and the performance improvement brought by thread migration at each decision epoch for the proposed
environments in Chapter 3, Chapter 4, and Chapter 5.
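One plausible form of this per-epoch comparison is the cost-benefit test sketched below; the cost and benefit terms are illustrative assumptions that, in practice, would come from the latency models of Chapters 3 through 5 and from measured migration overheads.

def should_migrate(cur_latency, opt_latency, packets_per_epoch,
                   migration_cost_cycles):
    """Switch to the newly computed optimal mapping only if the
    predicted saving over the next epoch outweighs the one-time
    migration cost (cache flushing, context switching, re-running).

    cur_latency, opt_latency -- average packet latency in cycles under
    the current mapping and the newly computed optimal mapping
    """
    predicted_saving = (cur_latency - opt_latency) * packets_per_epoch
    return predicted_saving > migration_cost_cycles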
In addition, the application mapping work in this dissertation assumes that each core executes a single thread. For multi-threaded cores, where each core can run more than one thread, the application mapping methods in this dissertation require some modifications to handle such situations.
9.2.3 Routing implementation for core state-aware power gating
In the proposed core state-aware power gating methods, we assume that the packets between any two
cores can go through the minimum paths to achieve an estimation of the overall NoC power consumption.
However, minimum path is not guaranteed without a carefully designed routing algorithm. On the other
hand, the routing algorithm which can achieve minimum paths between any two active cores may be
too complicated and require too much resource to implement, and a trade-off solution has to be located
between the router complexity and the NoC packet latency. Therefore, one interesting, challenging, and
meaningful extension of the proposed core state-aware power gating is to implement a satisfying routing
scheme which can leverage the NoC performance and the NoC design complexity.
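As a first step, whether a given active router set even admits minimal paths can be checked offline by comparing hop counts on the active subnetwork against Manhattan distances on the full mesh. The sketch below is an analysis aid under assumed (x, y) mesh coordinates, not a router implementation; it flags the communicating pairs that any routing algorithm would be forced to detour.

from collections import deque

def detoured_pairs(active, pairs):
    """Return the (src, dst) pairs that cannot take a minimal path.

    active -- set of (x, y) coordinates of powered-on mesh routers
    pairs  -- iterable of ((x, y), (x, y)) communicating router pairs
    A pair is flagged when its shortest path through active routers
    exceeds its Manhattan distance, or when no path exists at all.
    """
    def hops(src, dst):
        # Breadth-first search restricted to the active routers.
        seen, frontier = {src: 0}, deque([src])
        while frontier:
            cur = frontier.popleft()
            if cur == dst:
                return seen[cur]
            x, y = cur
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt in active and nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    frontier.append(nxt)
        return None  # dst unreachable through the active set

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    flagged = []
    for src, dst in pairs:
        h = hops(src, dst)
        if h is None or h > manhattan(src, dst):
            flagged.append((src, dst))
    return flagged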
Abstract
On-chip networks (a.k.a. networks-on-chip, or NoCs) have become the key communication media for modern many-core platforms. With tens to hundreds of cores integrated onto current and future many-core processors, a scalable NoC design with high performance and low power consumption is crucial to better utilize on-chip cores and achieve an efficient computing system. First and foremost, NoC performance improvement is of paramount importance. The performance of a NoC mainly refers to its average packet latency, which greatly influences the actual runtime of the attached many-core system; on-chip packet latency is therefore always among the principal criteria of NoC design. Another NoC design factor, equally important if not more so, is power consumption. Recent studies as well as real-chip experiments show that NoC components can draw a substantial percentage of the overall chip power, and this percentage grows in traditional on-chip networks with increasing core counts and technology scaling.

This research aims at providing performance improvement and power reduction solutions for on-chip networks. First, several methods to reduce average on-chip packet latency through application mapping are proposed, which intelligently assign running threads to physical tiles to improve NoC performance under various scenarios and constraints of many-core platforms. Second, this research presents a novel NoC topology design methodology that selectively inserts express links onto mesh-based networks, aiming to minimize on-chip latency for general-purpose many-core processors with no power overhead. These two topics focus on NoC performance improvement. Third, this research provides a proactive power gating method for on-chip routers. Specifically, a core state-aware power gating method for express link-based NoCs is first proposed, which utilizes the rich connectivity of express link-based networks as well as knowledge of currently sleeping cores to selectively power gate routers, reducing NoC static power consumption with minimal latency overhead. Finally, a generic analysis of proactive NoC power gating is conducted for mesh-based NoCs, which identifies the trade-off between dynamic and static power consumption. Three efficient heuristic algorithms are then proposed for mesh-based NoC power gating, targeting minimum active routers, high-performance applications, and minimum overall power consumption, respectively.