NOVEL GRAPH REPRESENTATION OF PROGRAM ALGORITHMIC FOUNDATIONS FOR
HETEROGENEOUS COMPUTING ARCHITECTURES
by
Yao Xiao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2024
Copyright 2024 Yao Xiao
Acknowledgements
First, I would like to express my sincere gratitude to my advisors and mentors, Prof. Paul Bogdan and
Prof. Shahin Nazarian, for their constant support and invaluable guidance throughout my studies at the University
of Southern California. Prof. Paul’s exceptional expertise, relentless dedication to research, and insightful
feedback have not only steered the trajectory of my research but have also significantly impacted my
growth as a researcher. He consistently encourages me to expand the reach of my research, embrace
collaborative opportunities, and explore uncharted research territory. His mentorship has played a vital
role in broadening my horizons and enriching my personal and academic growth. Under their guidance,
I’ve had the privilege of not only presenting my work to broader audiences but also engaging in fruitful
collaborations, uncovering new dimensions of learning and exploration. I feel incredibly fortunate to have
had these experiences. Prof. Shahin’s belief in my abilities has provided me with the motivation and confidence
needed to overcome obstacles and pursue academic accomplishments. Without his invaluable support and
guidance, the completion of this thesis would not have been possible.
Second, I would like to offer my sincere gratitude to my dissertation committee, Prof. Sandeep Gupta
and Prof. Jyotirmoy Deshmukh at the University of Southern California for their valuable insights and constructive feedback on my work. I would also like to express my appreciation to Prof. Massoud Pedram at the University of Southern California, who served on my qualifying exam committee, for the valuable suggestions that contributed to the improvement of my final dissertation.
I would also like to thank my friends from high school, undergraduate and graduate school, and internships for the meaningful conversations as well as the memorable times we shared during our travels and culinary adventures over these years.
Last but not least, I would like to express my deepest gratitude to my parents, Qingguo Xiao and Yaolin
Fang, my grandparents, Maosheng Xiao, Yan Cao, and all my family members. Their unconditional love
and support ever since my childhood have been the most precious gifts and have inspired me to come this far. It
hasn’t been easy being away from home, and not being able to spend much time with them. Our weekly
video calls on Friday and Saturday nights have become a cherished routine over the years, bringing us
happiness and a sense of togetherness despite the physical distance.
Finally, I would like to express my gratitude to the agencies including National Science Foundation,
Defense Advanced Research Projects Agency and University of Southern California that contributed to the
funding of my research.
Thank you to everyone who has been part of this incredible journey.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2: Optimal Parallelization Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Spatio-Temporal Modeling of Computations and Communications . . . . . . . . . . . . . . 7
2.2 Mathematical Optimization Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Topological Sort Based Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Simulation Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Complex Network and Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Effects of λ1 and λ2 on Load Balancing and Cluster Count . . . . . . . . . . . . . . 16
2.4.4 Comparisons With Sequential Execution and Thread-based Execution . . . . . . . 17
2.4.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 3: The Prometheus Framework in Processing-In-Memory . . . . . . . . . . . . . . . . . . 22
3.1 Application Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Community-to-vault Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Scalable PIM System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 Simulation Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.4 NoC Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.5 Energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 4: End-to-end Programmable Computing Systems . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.1 Problem formulation and framework overview . . . . . . . . . . . . . . . . . . . . 43
4.3.2 Dynamic dependency used in PGL is effective in representing code as graphs. . . . 46
4.3.3 The interdependence between advanced software code optimally executed on
heterogeneous hardware exhibits a complex multifractal and universal behavior. . 52
4.3.4 Graph auto-encoders can exploit network universality properties for partitioning
large software into small kernels mapping them onto heterogeneous computing
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 5: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Tables
2.1 Configuration parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Benchmarks and descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 The main properties of DADGs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Benchmarks and descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Comparison of the state-of-the-art techniques on the NVIDIA dataset (left) and AMD
dataset (right). The F1 score is the harmonic mean of the precision and recall. . . . . . . . 50
4.2 Comparison of different graph partitioning algorithms on the 17 applications . . . . . . . . 57
4.3 Comparison of different frameworks on the 17 applications . . . . . . . . . . . . . . . . . . 58
List of Figures
1.1 Flow chart of my Ph.D. works. The high-level goal is to provide application performance
improvement. First, we start by transforming applications into finer-grained graphs via
LLVM IR. Next, with a subgoal of identifying the optimal parallelization degree, we
partition the graphs into clusters to be mapped on either the multicore system (ICCAD
2017) or PIM system (DATE 2018). For the heterogeneous systems including GPUs
and CPUs, we aim to provide flexibility and programmability by finding specialized
clusters such as FFT or matrix multiplication to be accelerated on GPUs. Then, taking
into consideration reducing the burdens of programmers writing pragmas for HLS
applications, we propose a three-step optimization of graphs to be synthesized into RTL.
Finally, we exploit the SIMD capability for autovectorization by proposing RL with graph
neural networks to learn VF and IF factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Overview of the framework. A: Input C/C++ applications. B: A compiler based parser
obtains the dynamic LLVM IR instructions from C/C++ applications. C: Construct a
DADG by analyzing the dependencies between LLVM IR instructions. The DADG nodes
represent IR instructions; edges represent dependencies between instructions; weights are
collected by instrumenting a lightweight function rdtsc() and some inlined code to find
the latency and data size for memory operations. Weights for the rest of instructions are
set to 1. D: We develop a mathematical optimization model to partition the DADG into
several clusters considering maximum intra-cluster edge weights, load balancing, and
availability of hardware resources. E: We map clusters onto NoC based on Topological
Sort for parallel execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 The flow chart of constructing a DADG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Comparison among different partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Mapping clusters onto NoC for parallel execution. First, we convert the representation of
the cluster graph into the ordering graph by Topological Sort. Topological Sort, essentially,
reorders a directed acyclic graph (DAG) based on the rule that for every directed edge eij
between nodes i and j, i comes before j in the ordering graph. Then, we map nodes
with no incoming edges in the ordering graph onto NoC, making sure nodes and their
neighbors should be adjacent to each other to reduce the transmission distance. . . . . . . 14
2.5 The three DADGs representing MM.1, Dijkstra, and Blackscholes respectively . . . . . . . . 16
2.6 Load balancing and cluster count. Lower is better. . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Speedup on the 32-core NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Scalability. FFT: left column; qSort: right column . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Overview of the Prometheus framework. The Prometheus framework follows three
steps: In step 1, we transform an application into a two-layered graph, one layer representing a model
of communication where nodes denote memory operations, i.e., load and store, and the
other representing model of computation where nodes denote non-memory operations
such as xor and zext. This transformation is performed through code modification, LLVM
IR conversion, dynamic trace generation, reduction, profiling, and graph generation.
In step 2, we propose an optimization model to better partition the graph into highly
connected communities to minimize the energy consumption caused by data access to
another community. In step 3, we add a router into the logic layer to form a scalable and
efficient NoC substrate and perform community-to-vault mapping. . . . . . . . . . . . . . 23
3.2 HMC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Overview of application transformation. First, we convert a C program to LLVM IR
instructions. Second, we profile and execute the instructions in order to collect dynamic
traces including computations, the amount of time and data size for CPUs to finish
each memory operation. Third, we remove control IR statements by identifying a series
of patterns. Fourth, we analyze data and control dependencies between instructions
and construct a two-layered graph. Black dotted lines represent memory dependencies
detected by alias analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Building on the generated two-layered graph, we partition the graph into interdependent
communities representing a set of IR instructions to be executed sequentially. . . . . . . 27
3.5 Community-to-vault mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Simulation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Speedup comparison among DDR3, HMC+METIS, and Prometheus . . . . . . . . . . . . . 32
3.8 NoC traffic with different parallelism approaches . . . . . . . . . . . . . . . . . . . . . . . 33
3.9 The comparison of normalized energy consumption among DDR3, HMC, and Prometheus 33
4.1 Autonomous heterogeneous computing system. The recent advance of technologies
enables the fast progress of autonomous cars and unmanned aerial vehicles (a). However,
with the commonly used system components such as the controller and convolutional
neural networks for image recognition (b), parallelization and communication overhead
become inevitable concerns for programmers as the complicated and ever-changing
software needs to be parallelized and executed on a heterogeneous system (c). The
proposed framework makes the manual process autonomous without human intervention
by profiling applications (d), constructing dynamic execution graphs (e), and mapping
kernels onto the platform via machine learning models (f). . . . . . . . . . . . . . . . . . . 36
4.2 Overview of the proposed Programmable Graph Learning framework (PGL). PGL
constructs a dynamic execution graph for each input software program via low-level
virtual machine (LLVM) intermediate representation (IR). PGL then utilizes a novel feature
extraction algorithm based on random walks and multi-fractal analysis to construct node
features that capture the topological dependencies and structures in dynamic execution
graphs. These features are further used by a graph autoencoder (GAE) to partition the
graph into clusters (i.e., software kernels) and a graph neural network (GNN) model such
as graph convolutional networks (GCN) and multilayer perceptrons (MLP) to predict the
best hardware device for each kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Dynamic execution graphs and multifractal properties. Panel (a), (b), and (c) shows basic
graph patterns in graphs where the code contains either loops or sequential statements.
Panel (d), (e), and (f) shows the constructed code graphs for sequence alignment, signal
processing, and convolutional neural network, respectively. The graphs are a hybrid
of fundamental graph patterns in (a-c). Panel (g) shows the multifractal spectrum and
some definitions such as α0 and spectrum width w. Panel (h) shows a generalized fractal
dimension for a graph. Panel (i) shows three multifractal spectra (green, red, and blue
lines) for (d-f) to demonstrate multifractal spectrum can identify the heterogeneous graph
structures in different dynamic execution graphs. . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Convergence of normalized accuracy with different percentages of training steps in the
NVIDIA (a) and AMD (b) datasets. Each color line indicates normalized accuracy for
a given framework and each color shading associated with a line shows the standard
deviation for the framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Example of code with its graph representation and box counting algorithm used to analyze
the multifractal properties. (a) A loop kernel called example6 in red; (b) The dynamic
execution graph with initialization in a blue rectangle and wrap-up in a green rectangle
with a zoom-in view on one iteration of the loop; (c) The box counting algorithm by
varying the size of a box r to count the number of boxes N(B). . . . . . . . . . . . . . . . 54
4.6 Multifractal analysis can characterize the universal power-law relationship between
multifractal properties and system-level metrics. Network multifractal properties are used
as inputs to fit a power-law model ax^b to find the relationship between network properties
and system-level metrics. Panel (a-f) shows the parallelization degree of code graphs in
terms of generalized fractal dimension (a-b), spectrum width (c), spectrum height (d), α0
(e), and complexity (f). Panel (g-i) shows the communication overhead for spectrum width
(g), α0 (h), and complexity (i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.7 The breakdown of the execution time of each application in the standard dataset running
on different frameworks. The execution time, measured in clock cycles, is roughly divided
into two parts: communication and computation in (a). We also report communication
overhead that is calculated by clock cycles in communication divided by the total clock
cycles in (b). As we can see, PGL, compared to the other frameworks, has the smallest
communication overhead. It is because PGL has an optimization model that partitions the
graph into different clusters to minimize inter-cluster communication. . . . . . . . . . . . 60
4.8 The breakdown of the execution time of each application in the real-life dataset running
on different frameworks. The execution time, measured in clock cycles, is roughly divided
into two parts: communication and computation in (a). We also report communication
overhead that is calculated by clock cycles in communication divided by the total clock
cycles in (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 Experimental results. Panel (a) shows comparison of different partitioning algorithms.
We compare the graph partitioning GAE with different traditional algorithms. Panel (b)
shows comparison of different frameworks. We compare PGL with different frameworks
in terms of application performance. We conclude that our approach can achieve 2.02x
better compared to the state-of-the-art techniques. . . . . . . . . . . . . . . . . . . . . . . 61
4.10 PGL Limitations. (a). It is not beneficial for small code to pay for the overhead of PGL
while mapping a few instructions onto cores. (b). Random memory accesses from pointer
manipulation are not beneficial in PGL because there will be thousands of false memory
dependencies due to LLVM alias analysis. This may increase the communication overhead. 64
Abstract
The recent technological advances have significantly contributed to a rapid increase in algorithmic
complexity of various applications, from digital signal processing to autonomous aerial, ground and underwater systems. In order to control and manage this increased algorithmic complexity, heterogeneous
computing systems require intelligent, flexible and highly efficient programming strategies to provide high
performance while minimizing energy costs. However, the current monolithic programming models and
task mapping to compute engines do not fully exploit the recent architectural innovations and can exacerbate the load imbalance and communication inefficiencies.
In order to fully utilize the capabilities of hardware platforms, the compilation of parallel programs
requires expert heuristics to decide how many threads to spawn and how to schedule them onto heterogeneous computing systems. Due to workload imbalance, synchronization overhead, and resource sharing
contention, the resulting execution may be sub-optimal. Therefore, it is crucial for programmers to decide which code segments run on a specific processor (e.g., CPU or GPU).
In this dissertation, we develop a novel programming model for heterogeneous computing platforms.
Specifically, we first collect the representative dynamic trace generated from executing a program. This
trace contains a sequence of low level virtual machine (LLVM) intermediate representation (IR) instructions
to be executed. Then, for each instruction, we check if data, control, and memory dependencies exist and
insert a directed edge to construct the graph. By developing this framework, we are able to partition
the graph into clusters which will be later mapped to heterogeneous platforms or processing-in-memory
(PIM). Experimental results demonstrate system performance improvements over state-of-the-art
techniques in the field.
Chapter 1
Introduction
Edge, fog, and exascale computing (EC) are essential for validating scientific theories and developing revolutionary biological, nano- or neuro-technologies to tackle 21st century challenges (e.g., precision
medicine, energy crisis, climate change, and smart, safe and secure cities). Advanced scientific and engineering investigations call for revolutionary EC design approaches that break the hardware-software
boundary and propose breakthroughs for heterogeneous scalable computing platforms and memory storage. For example, graph analytics decodes the links among heterogeneous data streams and offers insights
to ensure accurate prediction and decision-making. Inferring causal and higher-order complex relationships from unstructured time-varying data calls for a paradigm shift in EC design.
Today’s general-purpose manycore computing systems are widely used in industry and research. However, due to synchronization overhead, load imbalance, and resource sharing, parallel programming models
such as pthreads and OpenMP can degrade application performance, making it difficult to design scalable systems. Moreover, these platforms are based primarily on the stored-program computer concept and
are implemented using von Neumann computing architecture, which separates the computation and storage into two distinct components connected by an off-chip bus. Data is frequently exchanged between the
computing units (e.g., CPUs/GPUs) and the memory units. Due to the disparity between processing and
memory technologies, the latency and energy of data movement are generally a few orders of magnitude
higher than the computation in both servers [1] and mobile devices [2]. Hence, data communication cost
dominates computation cost and becomes the major bottleneck for improving the overall system performance and execution efficiency, which is a phenomenon widely known as the "memory wall". The situation
is more severe when dealing with big-data applications (e.g., training of deep neural networks) with intense
demand in computation and memory accesses [3–5].
Edge, Fog, and Exascale Computing Challenges. EC desiderata can be achieved via a cross-layer
understanding of performance and energy inefficiency from high-level algorithms (including programming
languages, compilers, and software libraries) to lower-level hardware and operating system. Towards this
end, the EC challenges coming from the application-algorithm side are: (1) How to decide and design the
EC (deep machine) hierarchy in terms of hardware partitioning (how many cores per tile, how to interconnect tiles, blocks, units, sockets, modules, racks) and heterogeneous memory allocation (e.g., DRAM,
SRAM, MRAM)? How to determine the data layouts and execution schedules on this hierarchical EC architecture for running applications/programs? (2) How to couple the concurrency structure and dynamic
behavior of applications with opportunities for dynamic voltage and frequency scaling (DVFS) on die?
Can we develop new models of computation (MoC) that enable the computing architectures and applications to self-program and self-manage to achieve high performance and energy efficiency? (3) How can
we construct a MoC for emerging applications (from scarce or possibly poorly written software code) to
better understand the concurrency structure (e.g., data and inter-task dependencies) and dynamics (e.g.,
data reuse, data movement, fine-grained synchronization)? Can this MoC enable the identification of true
computation and communication requirements (management of memory and data movement)? Can the
MoC help in minimizing the data movement and I/O operations?
Figure 1.1 shows the high-level flow chart of my Ph.D. works. The overarching goal is to improve
application performance, i.e., reduce latency. Based on this goal, we propose different frameworks for
different hardware platforms such as multicore systems and heterogeneous systems and objectives such
Figure 1.1: Flow chart of my Ph.D. works. The high-level goal is to provide application performance
improvement. First, we start by transforming applications into finer-grained graphs via LLVM IR. Next,
with a subgoal of identifying the optimal parallelization degree, we partition the graphs into clusters to
be mapped on either the multicore system (ICCAD 2017) or PIM system (DATE 2018). For the heterogeneous systems including GPUs and CPUs, we aim to provide flexibility and programmability by finding
specialized clusters such as FFT or matrix multiplication to be accelerated on GPUs. Then, taking into
consideration reducing the burdens of programmers writing pragmas for HLS applications, we propose a
three-step optimization of graphs to be synthesized into RTL. Finally, we exploit the SIMD capability for autovectorization by proposing RL with graph neural networks to learn VF and IF factors.
as parallelization degree and energy. In general, for each application to be considered, we are using the
LLVM compiler to get intermediate representation (IR) and analyze data / control / memory dependencies to construct the finer-grained graph. Then for the ICCAD 2017 and DATE 2018 papers, we propose
an optimization model to find the optimal parallelization degree with constraints of load balancing and
cluster count for multicore systems and processing-in-memory (PIM). We then propose a framework to
provide flexibility and programmability for heterogeneous systems by finding specialized clusters that can
be accelerated on GPUs and reinforcement learning (RL) mapping to take into account the dynamic nature of traffic between cores for each application. Next, with the sub-goal of reducing the programmer’s burden of writing pragmas for high-level synthesis (HLS), we propose a three-step optimization of graphs to be synthesized into RTL. Finally, we exploit the SIMD capability
of executing one instruction for multiple data by proposing RL with graph neural networks to learn the
optimal vector factor (VF) and interleaving factor (IF) for each application to be vectorized.
We therefore propose several state-of-the-art design methodologies and architectures to address the
afore-mentioned issues. Chapter 2 describes mathematical and algorithmic tools for building dynamic
MoCs that support complex reasoning about the nature and type of computations and the optimal parallelization degree. Chapter 3 discusses the Prometheus framework in the emerging memory technology,
that is, processing-in-memory. Chapter 4 proposes a self-optimizing and self-programming, end-to-end programmable computing system framework for heterogeneous platforms to improve system performance and energy efficiency. Chapter 5 concludes the dissertation.
Chapter 2
Optimal Parallelization Degree
The tight power and thermal constraints call for fine-grained exploration of chip multiprocessors
(CMPs) and data-center-on-a-chip [6] to provide performance improvement in exascale computing. To
make use of CMPs, software paradigms have shifted from sequential programming to multi-threading.
However, three fundamental inefficiency issues can appear if threads are spawned without careful consideration of the underlying hardware.
(1) Non-negligible on-chip communication overhead. With applications being randomly partitioned, plenty of inter-core communications are generated, leading to many flits injected into the network. Those flits are unwanted, as none of them would exist if the application ran on only one core.
Therefore, intelligent application partitioning is required to minimize inter-core communication overhead.
(2) Limited off-chip memory bandwidth. Due to limited off-chip memory bandwidth, the performance of data-intensive multi-threaded programs is negatively affected by frequent updates to main memory. A critical thread may be delayed by contention when multiple threads access main memory simultaneously. Increasing the number of threads to the point of off-chip bandwidth saturation increases power consumption with no performance gain.
(3) Increased critical sections. Locks are used to prevent multiple threads from writing to shared
variables simultaneously to guarantee the correctness of a multi-threaded program. In other words, due
Figure 2.1: Overview of the framework. A: Input C/C++ applications. B: A compiler based parser obtains
the dynamic LLVM IR instructions from C/C++ applications. C: Construct a DADG by analyzing the
dependencies between LLVM IR instructions. The DADG nodes represent IR instructions; edges represent
dependencies between instructions; weights are collected by instrumenting a lightweight function rdtsc()
and some inlined code to find the latency and data size for memory operations. Weights for the rest
of instructions are set to 1. D: We develop a mathematical optimization model to partition the DADG
into several clusters considering maximum intra-cluster edge weights, load balancing, and availability of
hardware resources. E: We map clusters onto NoC based on Topological Sort for parallel execution.
to synchronization, the serial portions of the program increase. According to Amdahl’s Law, speedup is
limited by the sequential parts. Therefore, as more threads are spawned, thread-based execution could
potentially become slower compared to sequential execution.
In this chapter, the goal is to design a novel methodology to automatically parallelize complex programs without increasing the programmer’s effort. Considering the three pitfalls mentioned previously,
we propose a complex network based parallelization framework to partition applications into highly interdependent clusters of tasks representing communities in a graph rather than threads such that the
amount of data transferred among communities is minimized. As shown in Figure 2.1, we first construct
the weighted dynamic application dependency graph where the nodes denote individual low level virtual
machine (LLVM) intermediate representation (IR) [7] instructions generated by Clang compiler, and edges
represent data dependencies between different instructions on the same virtual registers. Edge weights
represent latency (L1 hits, L1 misses, or L2 misses assuming there is a shared L2 cache among cores) and
data sizes (1, 2, 4, 8 bytes, cache line, or memory page). Second, based on the constructed graph of IR
instructions we present the mathematical optimization model to detect community structures ensuring
that (1) the number of inter-cluster communication flits is minimal; (2) communities reach approximately equalized execution times; (3) the number of communities is smaller than or equal
to the number of cores. Third, having calculated the optimal communities and their dependencies, we
construct a cluster graph where nodes indicate communities. We then use topological sort to map the
clusters onto the NoC, while ensuring that the clusters at the same depth are executed in parallel. In case
the number of clusters is smaller than the core count, the rest of the cores are shut off using power gating.
There are three primary issues that cause performance degradation in parallel computing: (1) load
imbalance, (2) resource sharing, (3) synchronization. Our framework mitigates these three bottlenecks by a robust real-time aware optimization approach that (1) prevents the difference of execution
times between two consecutive clusters from being too large considering cache miss cycles and data sizes
for memory instructions, (2) confines most of data movement within each cluster as the framework tries
to partition the dependency graph into clusters with maximized intra-cluster communications, making
better use of caches. Moreover, the mesh-based NoC is used to route flits efficiently for cases where a
core requires variables stored in another core’s caches, and (3) applies pipeline parallelism to parallelize
sequential applications rather than multi-threading to reduce synchronization overhead caused by threads
with locks and barriers.
2.1 Spatio-Temporal Modeling of Computations and Communications
To describe the spatio-temporal interdependencies between the computations and memory operations,
we adopt an architecture independent LLVM IR [7][8]. The rationale for adopting this compiler framework
is that it is a language-independent type-system that exposes the primitives used to implement high-level
language (HLL) features. It includes an instruction for typed address arithmetic, and a mechanism for
implementing the exception handling HLL features. Furthermore, IR is crucial in LLVM. It is an abstract
machine language which mimics the basic computations, memory operations, and branch instructions
with unlimited virtual registers to prevent register spilling. Therefore, backends can easily produce machine code from IR suited for any target platform, whether ARM in portable mobile devices or x86 in laptops
and high-end servers. As shown in Figure 2.1, there are several features for our approach:
IR instructions are collected dynamically. Static compilation has several drawbacks. (1) Dependencies between memory operations are difficult to detect statically, which could potentially increase communication overhead if we map dependent memory operations onto different cores. (2) The number of iterations in one loop
sometimes cannot be statically determined. Depending on how many iterations one loop has, load imbalance appears between different clusters. Therefore, rather than static compilation, dynamic execution
traces are collected to reflect true dependencies and break one loop into several iterations executing sequentially, increasing the chances of grouping different iterations into clusters.
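As a small, hedged illustration of this point (the snippet below is ours, not taken from the dissertation), the trip count of the loop is known only at run time, so a static view sees a single loop body while the dynamic LLVM IR trace records every executed iteration, each of which can then be placed in its own cluster:

#include <cstdio>
#include <cstdlib>

// Illustrative only: the trip count n is unknown at compile time, so only a
// dynamic trace exposes the n executed copies of the loop body for clustering.
int main(int argc, char **argv) {
    int n = (argc > 1) ? std::atoi(argv[1]) : 8;   // determined at run time
    long sum = 0;
    for (int i = 0; i < n; ++i)                    // the trace contains n iterations
        sum += static_cast<long>(i) * i;
    std::printf("%ld\n", sum);
    return 0;
}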
Memory operations are instrumented to get correct values for latency and data sizes. The
store and load instructions have different execution times and data sizes depending on whether the requested data reside in
L1, L2, L3, or main memory. Taking into account those values could potentially group computations and
memory operations with the same registers into one cluster, leading to more efficient use of caches and
less communication overhead. In this way, load balancing is achieved by explicitly formulating weight
constraints in an optimization model.
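To make the instrumentation concrete, the following is a minimal sketch of the kind of inlined timing code described here; it uses the __rdtsc() intrinsic as a stand-in for the rdtsc() helper mentioned in the text, and the structure, function names, and the idea of multiplying latency by data size are our illustrative assumptions rather than the framework's actual implementation.

#include <x86intrin.h>  // __rdtsc()
#include <cstdio>

// Sketch only: time one load with rdtsc and pair the measured cycles with the
// transferred data size, mirroring how edge weights for memory operations are
// described in the text (weight = latency x data size).
struct MemOpWeight {
    unsigned long long cycles;  // latency T(eij)
    unsigned dataSize;          // bytes D(ni)
};

static MemOpWeight timedLoad(const int *addr) {
    unsigned long long start = __rdtsc();
    volatile int value = *addr;            // the instrumented memory operation
    unsigned long long end = __rdtsc();
    (void)value;
    return { end - start, sizeof(int) };
}

int main() {
    int x = 42;
    MemOpWeight w = timedLoad(&x);
    std::printf("latency = %llu cycles, size = %u bytes, weight = %llu\n",
                w.cycles, w.dataSize, w.cycles * w.dataSize);
    return 0;
}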
The parser collects C/C++ essential instructions within an outer loop and constructs a DADG
from dynamic traces generated by the compiler. Figure 2.2 shows the parser where we maintain three
hash tables called source table, destination table, and dependency table respectively. The source/destination
tables are used to keep track of source/destination registers with keys being source or destination registers
and values being the corresponding line number. The dependency table is to store dependencies between
nodes with keys being the line number for current instruction, and values being clock cycles, data sizes
and line numbers of previous instructions dependent on the same virtual register. The parser instruments
Figure 2.2: The flow chart of constructing a DADG
Figure 2.3: Comparison among different partitions
the lightweight rdtsc function and some inlined code to collect the attributes of memory operations, i.e.,
data sizes and clock cycles as edge weights.
For example, Figure 2.3 illustrates twelve IR instructions generated by the compiler Clang. When the
parser reads the first instruction, it checks source registers as indicated in Figure 2.2. Since this instruction
does not have a source register, only the destination register is hashed into the destination table with
keys being %1 and values being 1. Instructions 2 and 3 follow the same procedure as the first instruction.
When the parser reads the fourth instruction, it checks whether the source registers in the instruction
match with any destination registers in previous instructions. In this case, the source register %1 matches
with the same destination register in node 1. Thus, this is hashed into the dependency table with keys being
4 (the line number of the current instruction), values being 1 (the line number of the previous instruction
which depends on the source register %1), and weights being 1 (non-memory operations). The dependency
table can be regarded as a DADG in which keys represent nodes and key-value pairs indicate directed edges.
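The fragment below is a compact sketch of this table-driven construction, written by us for illustration: the trace format, the field names, and the omission of the separate source table from Figure 2.2 are simplifying assumptions, not the parser's actual code.

#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: build a DADG adjacency ("dependency table") from a dynamic trace by
// matching each instruction's source registers against earlier destinations.
struct TraceEntry {
    std::string dest;               // destination register ("" if none)
    std::vector<std::string> srcs;  // source registers
    unsigned long long weight;      // 1, or latency x data size for memory ops
};

int main() {
    // Hypothetical trace fragment mirroring the example in Figure 2.3.
    std::vector<TraceEntry> trace = {
        {"%1", {},     1},   // line 1: only a destination register
        {"%2", {},     1},   // line 2
        {"%3", {},     1},   // line 3
        {"%4", {"%1"}, 1},   // line 4: source %1 matches the definition at line 1
    };

    std::unordered_map<std::string, int> destTable;  // register -> defining line
    std::unordered_map<int, std::vector<std::pair<int, unsigned long long>>> dadg;

    for (int line = 1; line <= static_cast<int>(trace.size()); ++line) {
        const TraceEntry &e = trace[line - 1];
        for (const std::string &src : e.srcs) {
            auto it = destTable.find(src);           // earlier writer of this register?
            if (it != destTable.end())
                dadg[line].push_back({it->second, e.weight});  // edge: line -> previous line
        }
        if (!e.dest.empty())
            destTable[e.dest] = line;                // most recent definition wins
    }

    for (const auto &kv : dadg)
        for (const auto &edge : kv.second)
            std::printf("edge %d -> %d (weight %llu)\n", kv.first, edge.first, edge.second);
    return 0;
}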
2.2 Mathematical Optimization Model
In order to propose a rigorous mathematical strategy for parallelizing applications, we build on our
architecture independent spatio-temporal DADG representation of an application and formulate a novel
community detection problem that seeks to: (1) determine a strongly connected subcomponent of the
DADG that encapsulates strong causal dependencies between computations and memory operations dependent on registers in computations yet is complex enough to require localized specialized processing
elements (functional units in cores or accelerators) and corresponding memory systems; (2) perform load
balancing by distributing computations and related memory operations among the identified computational communities to improve system performance under uncertain inputs on average; (3) minimize the
deviations between the number of strongly connected communities (subgraphs of the DADG) and the
hardware resources (cores or accelerators). To make the discussion more concrete, we introduce a series
of definitions that help us construct the community detection problem as follows.
Definition 1: A dynamic application dependency graph (DADG) is a weighted directed graph G = G(ni, eij, wij | i, j ∈ N), where each node ni represents one LLVM IR instruction, and each edge eij, associated with a weight wij, characterizes a dependency from the current node ni to a previous node nj, or control flow such as a jump or branch, to guarantee the strict program order.
Definition 2: A weight wij between nodes i and j is defined as the latency function T(eij) times the data size D(ni). The latency function T(eij) calculates the latency from node i to node j based on the timing information for memory operations provided by the compiler. Likewise, the data size D(ni) is the number of bytes node i requires to transfer from one location to another (possible locations are disk, main memory, caches, and processor registers).
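As a hedged numerical illustration (the latency and transfer size below are assumed values, not measurements from this work), a load whose data resides in the shared L2 cache might incur T(eij) = 20 cycles while transferring a 64-byte cache line, D(ni) = 64, giving the edge weight

w_{ij} = T(e_{ij}) \times D(n_i) = 20 \times 64 = 1280.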
Definition 3: A quality function Q for DADGs is an indicator of how good a partition of clusters for
the parallel execution is based on load balancing, available hardware resources, and data movement.
Using these definitions, the mathematical optimization model in terms of intelligently partitioning
DADGs can be formulated as follows:
Given a graph G, find nc non-overlapping clusters which maximize a quality function Q:

Q = \sum_{c=1}^{n_c} \left[ \frac{W^{(c)}}{W} - \left( \frac{S^{(c)}}{2W} \right)^{2} \right] - R_1 - R_2    (2.1)

R_1 = \frac{\lambda_1}{W^2} \sum_{c=1}^{n_c} \left[ W^{(c)} - W^{(\mathrm{neighbor}(c))} \right]^{2}    (2.2)

R_2 = \frac{\lambda_2}{n_c^2} (n_c - N)^2 H(n_c - N)    (2.3)

where nc denotes the number of clusters; N denotes the core count; W(c) denotes the sum of the weights of all edges connected within cluster c (W(c) = \sum_{i \in c} \sum_{j \in c} w_{ij}); W is the sum of the weights of all edges (W = \sum_{i} \sum_{j} w_{ij}); S(c) is the sum of the weights of all the edges adjacent to cluster c; λ1 and λ2 are regularization parameters; neighbor(c) denotes the clusters connected to cluster c; H(x) is the Heaviside step function (H(x) = \int_{-\infty}^{x} \delta(s) ds); and δ(x) is the Dirac delta function.
Intuitively, the first term in equation (2.1) aims to maximize the intra-cluster weights and reduce the communication requirements among different clusters. The second and third terms aim to balance the computational (processing) requirements for each core and account for the limited number of hardware resources imposed by pipeline parallelism. Similar to the prevention of over-fitting in machine learning, R1 and R2 are regularization terms, and the factors 1/W^2 and 1/nc^2 are used to make sure those terms have the same units:
(1) The first term in equation (2.1) limits data movement almost entirely within each cluster. It measures the
difference between the sum of edge weights in a cluster and the sum of edge weights adjacent to the cluster.
Through maximization of this term, we try to find partitions where data movement is constrained.
(2) R1 is used for load balancing. By measuring the sum of the deviation squared between the total
weights in a cluster c and its neighbors, R1 is magnified and the value of Q is reduced if clusters have
unbalanced weights. Therefore, by maximizing Q, R1 is minimized by equalizing the work in cluster c and its neighbors. In this way, we balance the weights/work between different clusters, making stages in pipeline parallelism have roughly equalized execution times.
(3) R2 is used to ensure the number of clusters does not exceed the core count. The factor (nc − N)^2 means that if the number of clusters nc differs from the number of available cores N in a system, Q is further reduced. However, H(nc − N) takes a value of 0 until nc equals N, and then has a value of 1 once nc is greater than N. Hence, R2 is large only when nc exceeds the number of cores in the system. If nc is less than N, H(nc − N) = 0 and the remaining idle cores are turned off to save energy while providing the best performance. Therefore, in order to maximize Q, R2 should be minimized by making sure that nc (the number of communities) is at most N (the core count).
(4) Both regularization parameters λ1 and λ2 can be adjusted at run-time. If λ1 = λ2 = 0, the quality function Q reduces to a standard model that considers neither balanced load nor available resources. If λ1 and λ2 are very large, the first term in equation (2.1) can be ignored, and the model tries to detect communities such that balanced load and nc ≤ N are achieved without maximizing the communication kept within each community. Therefore, the values of λ1 and λ2 should lie somewhere in between.
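To make the objective function concrete, the following minimal sketch (our illustration, not the dissertation's implementation) evaluates Q from equations (2.1)-(2.3) for a fixed cluster assignment; the edge list, the toy weights, and the per-neighbor form of the R1 sum are assumptions.

#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

struct Edge { int src, dst; double w; };

// Sketch: evaluate the quality function Q for a given cluster assignment.
double quality(const std::vector<Edge> &edges, const std::vector<int> &cluster,
               int nc, int N, double lambda1, double lambda2) {
    double W = 0.0;
    std::vector<double> Wc(nc, 0.0), Sc(nc, 0.0);
    std::vector<std::set<int>> nbr(nc);              // neighboring clusters of each cluster
    for (const Edge &e : edges) {
        int cs = cluster[e.src], cd = cluster[e.dst];
        W += e.w;
        if (cs == cd) {
            Wc[cs] += e.w;                           // intra-cluster weight W(c)
        } else {
            Sc[cs] += e.w;  Sc[cd] += e.w;           // weight adjacent to the cluster, S(c)
            nbr[cs].insert(cd);  nbr[cd].insert(cs);
        }
    }
    double Q = 0.0, R1 = 0.0;
    for (int c = 0; c < nc; ++c) {
        Q += Wc[c] / W - std::pow(Sc[c] / (2.0 * W), 2.0);
        for (int n : nbr[c])                         // load-balancing penalty term
            R1 += std::pow(Wc[c] - Wc[n], 2.0);
    }
    R1 *= lambda1 / (W * W);
    double over = static_cast<double>(nc - N);       // Heaviside: penalize only nc > N
    double R2 = (nc > N) ? lambda2 / (static_cast<double>(nc) * nc) * over * over : 0.0;
    return Q - R1 - R2;
}

int main() {
    // Toy graph: 5 nodes in 2 clusters; edge (2,3) crosses the cluster boundary.
    std::vector<Edge> edges = { {0, 1, 4.0}, {1, 2, 4.0}, {2, 3, 1.0}, {3, 4, 4.0} };
    std::vector<int> cluster = { 0, 0, 0, 1, 1 };
    std::printf("Q = %f\n", quality(edges, cluster, 2, 4, 0.5, 0.5));
    return 0;
}

With this toy input, most weight stays inside the two clusters, so the first term dominates and both penalty terms remain small.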
The advantages of applying the mathematical optimization model to partition sequential programs are:
(1) minimal programmer effort to write parallel programs that exploit the speedup provided by multi-core chips; (2) easy detection of independence at the granularity of IR instructions and a balanced load among clusters; (3) limited communication overhead in the NoC, leading to a small chance of congestion. As shown in Figure 2.3, mapping the entire application onto one core would mean no communication overhead among cores, but this approach cannot improve performance as the other core is idle all the time. The second method is to partition the graph randomly. However, this random partitioning can cause significant communication overhead among cores, making cache utilization
and performance poor. The last one can group many instructions into clusters such that the number of
inter-cluster flits is minimized. Data movement is restricted by keeping data locally as much as possible
to save energy and improve performance.
2.3 Topological Sort Based Mapping
Determining the optimal number of clusters raises the question of how to map them onto the NoC such that (1) the hop count of flits communicated among clusters is minimized and (2) independent clusters
can be executed in parallel.
A cluster graph (CG) is constructed where nodes represent clusters and edges indicate data dependencies. There are two properties associated with CG. (1) CG is directed: As there should be an order
in which tasks are executed due to program sequential semantics, one task waits for data provided by other tasks before it can execute, leading to a directed graph. (2) CG is acyclic: One cluster depends on data generated by its previous clusters. Based on this directed acyclic graph, we sort CG topologically to ensure that for any directed edge (v, w) ∈ E(CG), v precedes w in the ordering, which
can be expressed as an Ordering Graph (OG). Based on OG, clusters are mapped into NoC for pipeline
parallelism. In conclusion, we propose the following algorithm which is a combination of topological sort
and mapping. Algorithm 1 exploits parallelism and pipelining. We define the depth of cluster v in OG
as the number of edges from v to its root, and level of cluster v as the number of clusters at the same depth
as v. Therefore, (1) the depth i represents stage i + 1 in pipelining. In Figure 2.4, clusters 2 and 3 in OG at a depth of 1 represent the 2nd stage, while cluster 4 at a depth of 2 represents the 3rd stage. Moreover, cluster 4 cannot be executed before clusters 2 and 3 as it waits for data generated by the 2nd stage. (2) Different levels at the same depth i represent the number of clusters which can be executed in parallel. In Figure 2.4, clusters 2 and 3 at stage 2 can be executed in parallel as they both only depend
Figure 2.4: Mapping clusters onto NoC for parallel execution. First, we convert the representation of the
cluster graph into the ordering graph by Topological Sort. Topological Sort, essentially, reorders a directed acyclic graph (DAG) based on the rule that for every directed edge eij between nodes i and j, i comes
before j in the ordering graph. Then, we map nodes with no incoming edges in the ordering graph onto
NoC, making sure nodes and their neighbors should be adjacent to each other to reduce the transmission
distance.
on availability of data produced by cluster 1. After mapping, if there are still idle cores, to save power
consumption, they are turned off using power gating.
Algorithm 1: Mapping Algorithm
Counter = 0
while CG is not empty do
    Vpartial = No_Incoming_Edges(CG)
    if Counter == 0 then
        Map Vpartial to (0, 0)
    else
        Map Vpartial to their nearest parent clusters based on a greedy heuristic
    Running_In_Parallel(Vpartial)
    Delete Vpartial from CG
    Counter++
if there still exist idle cores C in the NoC then
    Power_Gating(C)
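The loop below is a hedged sketch of the levelized mapping in Algorithm 1 using Kahn-style topological peeling; the example cluster graph mirrors Figure 2.4, and the greedy nearest-parent placement and power gating are deliberately omitted (noted only in comments), so this is an illustration rather than the actual implementation.

#include <cstdio>
#include <vector>

int main() {
    // Hypothetical CG mirroring Figure 2.4: cluster 1 feeds 2 and 3, which feed 4.
    int numClusters = 4;
    std::vector<std::vector<int>> succ = { {1, 2}, {3}, {3}, {} };  // 0-indexed successors
    std::vector<int> indeg(numClusters, 0);
    for (const auto &s : succ)
        for (int v : s) indeg[v]++;

    std::vector<bool> mapped(numClusters, false);
    int stage = 0, remaining = numClusters;
    while (remaining > 0) {
        std::vector<int> level;                       // clusters with no incoming edges
        for (int c = 0; c < numClusters; ++c)
            if (!mapped[c] && indeg[c] == 0) level.push_back(c);
        std::printf("stage %d (executed in parallel):", ++stage);
        for (int c : level) {
            std::printf(" cluster %d", c + 1);        // greedy placement near parents omitted
            mapped[c] = true;
            --remaining;
            for (int v : succ[c]) --indeg[v];         // remove edges of mapped clusters
        }
        std::printf("\n");
    }
    // Any cores left without a cluster would be power-gated here.
    return 0;
}

Running the sketch prints stage 1 = {cluster 1}, stage 2 = {clusters 2, 3}, stage 3 = {cluster 4}, matching the pipeline stages described above.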
2.4 Evaluation
In this section, we provide simulation configurations and experimental results to demonstrate the validity of our framework.
2.4.1 Simulation Configurations
We simulate a symmetric CMP with all out-of-order cores in NoC with the parameters shown in Table
2.1. Three different types of execution are considered and compared: sequential execution where all instructions are executed in one core, thread-based execution where the number of threads spawned is equal
Table 2.1: Configuration parameters
CPU
    Cores: OOO, 2-wide issue, 16 MSHRs
    L1 private caches: 64 KB, 4-way associative, 32-byte blocks
    L2 shared caches: 256 KB, distributed across nodes
Network
    Topology: Mesh
    Routing algorithm: XY routing
    Flow control: Virtual channel, flit-based
Table 2.2: Benchmarks and descriptions
Benchmark Description Source
Mandelbrot Calculate Mandelbrot Set OmpSCR[9]
MM.1 Simple matrix multiplication
Stencil 2D nine point stencil operation SHOC[10]
MD Simulate molecular dynamics OmpSCR[9]
FFT Compute Fast Fourier Transform OmpSCR[9]
Dijkstra Find the shortest path MiBench[11]
Blackscholes Calculate European options PARSEC[12]
FFT6 Compute 1D FFT OmpSCR[9]
MM.2 Strassen’s matrix multiplication
qSort Quicksort algorithm OmpSCR[9]
to the core count, and optimization-based execution discussed in this chapter. Thread-based and optimization-based executions are evaluated on a 32-core NoC, as all of our workloads can be configured to 32 communities if a proper λ2 is applied. Table 2.2 shows the simulated workloads.
2.4.2 Complex Network and Basic Properties
Figure 2.5 shows the DADG connectivity structure for several applications (e.g., MM.1, Dijkstra, and
Blackscholes). Table 2.3 summarizes their main attributes.
In Figure 2.5, DADGs can be classified into three categories: high, medium, and low discernibility. Highly discernible DADGs are clearly seen as interconnected clusters; one example is MM.1. Medium-discernibility DADGs may be seen as regular patterns by humans; one example is Dijkstra. These applications can be parallelized by programmers without much effort. What high- and medium-discernibility DADGs have in common is that they contain patterns visible to humans. The reason is that an array declaration in C/C++ programs corresponds to one IR instruction called "getelementptr". Therefore, all array-related operations depend on this instruction. Such a node becomes central and its betweenness is
Table 2.3: The main properties of DADGs
Benchmark Nodes Edges Avg degree Avg path length
Mandelbrot 1,045,696 1,315,168 2.549 7.718
MM.1 1,489,656 1,957,420 2.701 14.517
Stencil 2,107,098 2,847,856 2.549 13.703
MD 1,498,210 2,070,557 2.571 19.307
FFT 610,011 799,933 2.486 13.425
Dijkstra 398,674 550,462 2.433 11.179
Blackscholes 1,236,128 1,516,241 2.35 23.296
FFT6 338,920 458,273 2.513 21.753
MM.2 1,514,622 2,063,665 2.422 16.33
qSort 1,118,977 1,442,563 2.353 29.312
Figure 2.5: The three DADGs representing MM.1, Dijkstra, and Blackscholes respectively
very high due to array operations. One cluster should be centered on at least one of those nodes. As we can see in Figure 2.5, we can infer that we have at least three distributed arrays and that operations on one array barely depend on those on another array. In contrast, low-discernibility DADGs are difficult for humans to see as clearly interconnected clusters. Such applications are hard for programmers to parallelize; one example is Blackscholes. Therefore, we apply the parallelization framework to those applications to reduce the programmer’s effort, making it practical to execute them in parallel.
2.4.3 Effects of λ1 and λ2 on Load Balancing and Cluster Count
To illustrate the effects of different values of λ1 and λ2 on load balancing and cluster count, results
regarding the 10 applications are shown in Figure 2.6. |R1| is the second term in equation (2.1), i.e., the regularization term defined in equation (2.2): the sum of the squared differences of loads between neighboring clusters. The higher |R1|, the worse the load balancing. In Figure 2.6, from the figures in the first column, |R1| has its peak when λ1 = 0. However, λ1 can be
adjusted to fine-tune the load difference among clusters, making tasks more balanced. After λ1 approaches
some threshold, |R1| plummets to 0 as R1 becomes the dominant term in equation (2.1) compared to the first term. It is natural that, to balance load among clusters perfectly, one node is assigned to each cluster and |R1| = 0. Therefore, there is no point in increasing the value of λ1 beyond the threshold. In the rest of the figures, since in this case N is set to 0, we want to make sure that all clusters generated by the framework can be fully mapped onto the NoC by increasing λ2. All applications, although some have large cluster counts at the beginning, can be confined within the core count if a proper λ2 is applied.
2.4.4 Comparisons With Sequential Execution and Thread-based Execution
We simulated all 10 applications on the 32-core NoC. In optimization-based execution, we set both λ1
and λ2 to be 0.5. In Figure 2.7, for embarrassingly parallel programs such as Mandelbrot, the speedup given
by the thread-based execution is 10% higher than that of the optimization-based execution, as threads are independent of each other whereas communities still have load imbalance and inter-community communications. In such a case, each thread can be mapped to a different core and execute faster due to no communication overhead. Nevertheless, for most non-embarrassingly parallel programs, optimization-based execution generally achieves better performance. Locks and barriers are applied to ensure the correctness of multi-threaded programs, giving rise to more synchronization overhead. Besides, on-chip inter-thread or intra-thread communication appears to cause congestion, slowing down the entire program. The optimization-based execution speedup varies from 10.2% to 131.82% when compared to thread-based execution. The degree of data movement significantly influences the overall speedup.
2.4.5 Scalability
We evaluate the cluster count and cycle counters for the slowest and fastest stages for two applications
FFT and qSort based on different input sizes. In Figure 2.8, input size is linearly proportional to the cycle
counter in the slowest stage and inversely proportional to the counter in the fastest stage. But when
adjusting λ1 to be 0.5, the cycle counter in the slowest stage is reduced by nearly 10% as increasing λ1
reorganizes the graph by distributing some nodes in a dense cluster into a sparse cluster, taking into
consideration the difference between loads in clusters. In terms of cluster count, although the input size varies, the cluster count only changes slightly, making it easy for λ2 to confine it within the core count.
2.5 Conclusions
The main goal of this chapter is to run a sequential program on a NoC such that we can gain the
maximum speedup in multi-core systems. Considering the deficiencies of multi-threading, we propose a
complex network inspired parallelization framework to efficiently partition sequential programs. We first
construct the DADG of an application where nodes indicate LLVM IR instructions and edges represent
dependencies on virtual registers. Next, we formulate the optimization model by considering not only
the inter-cluster edges, but also load balancing and cluster size. We can adjust the load among different
clusters and cluster size by fine-tuning the regularization parameters λ1 and λ2 used in equation (2.1). In
order to save energy and prevent NoC congestion, data communications are mainly constrained in each
cluster. Finally, we construct a CG where nodes denote clusters and edges indicate data dependencies.
Having constructed the CG, we propose an algorithm based on topological sort, to identify clusters for
parallel execution and map them onto the NoC. Our evaluation with 10 workloads performed on a 32-core
NoC shows that load imbalance can be alleviated and the cluster count confined within the core count by the framework under various input
sizes. The overall speedup of most applications is 10.2% to 131.82% higher compared to the thread-based
execution.
Figure 2.6: Load balancing and cluster count. Lower is better.
Figure 2.7: Speedup on the 32-core NoC
Figure 2.8: Scalability. FFT: left column; qSort: right column
Chapter 3
The Prometheus Framework in Processing-In-Memory
The era of big data enables programmers to write memory-intensive applications. However, traditional systems are unable to handle such large volumes of data with fast response times, as they are designed to execute computations. Therefore, once a last-level cache miss is generated, data has to be fetched from main memory via off-chip links. Memory bandwidth becomes a bottleneck for those applications. One technique to address this issue is to bring processing units close to main memory [13]. This was proposed decades ago, but never succeeded due to design complexity. Nowadays, processing-in-memory (PIM) has regained its popularity because 3D-stacking technologies allow memory layers to be stacked upon each other and connected via TSVs (through-silicon vias). The hybrid memory cube (HMC) provided by Micron [14] is an example of commercial PIM systems. As shown in Figure 3.2, according to the HMC 2.1 specification, inside one cube,
there are eight memory layers and one logic layer stacked on top of each other with 32 partitions, which
are also called vaults.
However, there are two key challenges required to be addressed to exploit the benefits of PIM systems:
(1) Where should data reside among different vaults to reduce data movement and utilize internal memory
bandwidth? [15] reported that performing 512-way multi-constraint graph partitioning improves performance of the PIM accelerator due to reduced off-chip network traffic. (2) PIM systems should be scalable
Figure 3.1: Overview of the Prometheus framework. The Prometheus framework follows three steps. In step 1, we transform an application into a two-layered graph, one layer representing the model of communication, where nodes denote memory operations (i.e., load and store), and the other representing the model of computation, where nodes denote non-memory operations such as xor and zext. This transformation is performed through code modification, LLVM IR conversion, dynamic trace generation, reduction, profiling, and graph generation. In step 2, we propose an optimization model to partition the graph into highly connected communities so as to minimize the energy consumption caused by data accesses to other communities. In step 3, we add a router to the logic layer to form a scalable and efficient NoC substrate and perform community-to-vault mapping.
Figure 3.2: HMC Architecture
to hundreds of vaults in the future. We address these challenges by (1) formulating the first challenge as an optimization problem and partitioning the graph to minimize inter-vault communication; and (2) designing a scalable PIM system with an NoC to efficiently route packets to the destination vault.

The goal of this section is to find an approach to wisely partition data across the different vaults of HMC-based systems so as to exploit the high intra-vault memory bandwidth while improving performance and reducing energy consumption. To this end, we propose the Prometheus framework, which takes into account the interactions among computations and communications. First, by adopting the LLVM intermediate representation, dynamic trace generation, trace reduction, code profiling, and graph representation, we describe a C/C++ application as an interdependent two-layer weighted graph, where nodes denote LLVM IR instructions and
Figure 3.3: Overview of application transformation. First, we convert a C program to LLVM IR instructions. Second, we profile and execute the instructions in order to collect dynamic traces, including the computations and the amount of time and data size required for the CPU to finish each memory operation. Third, we remove control IR statements by identifying a series of patterns. Fourth, we analyze the data and control dependencies between instructions and construct a two-layered graph. Black dotted lines represent memory dependencies detected by alias analysis.
edges represent the data and control dependencies among LLVM instructions. Moreover, the weights associated with the edges represent the amount of time required for specific computational processes to
wait for one another to complete their work. Consequently, one layer represents a model of computation where nodes denote computation operations such as add and mul while the other layer represents a
model of communication where nodes denote memory operations, i.e., load and store. Second, we propose
an optimization framework that partitions the two-layer network into highly interacting groups of nodes
(clusters) such that the energy consumption required for data movement and accesses is minimized. Third,
we introduce a community-to-vault mapping strategy which maps each highly interconnected cluster onto
a vault while exploiting the NoC communication infrastructure and the high internal memory bandwidth
provided by TSVs.
3.1 Application Transformation
Figure 3.3 shows a high-level logic diagram on how an input C/C++ application is transformed into
a two-layer network. First, each input C/C++ application is converted to LLVM IR instructions by Clang.
Then, we modify and use Contech [8] to collect dynamic traces of the application and latency for memory
operations. Next, we remove all IR instructions corresponding to control statements in C such as if-else, for, and while. Finally, we construct a two-layer network by analyzing data and control dependencies to preserve strict program order and functionality.

LLVM IR Conversion. We transform each C/C++ application into its corresponding LLVM IR instructions using the Clang compiler: clang -emit-llvm -S.

Profiling. We profile the program by instrumenting the lightweight function rdtsc() and some inline code before and after each memory operation to obtain the number of clock cycles (T) and the data size (D). The weight associated with an edge in the two-layered network is the product of T and D. The rationale for this weighted two-layer network representation is motivated by our goal of partitioning dependent memory operations into the same vault in order to minimize data movement. This profiling is architecture-independent, but the results can indicate the underlying memory hierarchy: the larger T and D, the farther the data is from the cores (possibly in the LLC or main memory), since data in memory has to be fetched via off-chip links, which is time-consuming. Therefore, we encode data placement and the memory hierarchy into the weights used in the graph.
Dynamic Trace Generation & Trace Reduction. We utilize Contech to collect dynamic IR traces. Like full loop unrolling, dynamic traces capture how many iterations each loop executes, leading to fine-grained load balancing when traces are partitioned into clusters. Furthermore, because the traces are dynamic, the execution flow of the application is already resolved, and there is no need to store IR instructions corresponding to control statements in C such as if-else, for, and while. Therefore, we perform trace reduction to lower execution overhead by identifying patterns associated with control statements and removing them. For example, if statements have the following structure: the first instruction is a load, and the second instruction, dependent on the first, is an icmp. Whenever we find such a pattern in a basic block consisting of only these two instructions, we remove the basic block. As illustrated in Figure 3.3, lines 2 and 3 in the third file correspond to the for statement. We check whether this basic block has only two instructions, one a load and the other an icmp that depends on the first load instruction. If all requirements are satisfied, we remove lines 2 and 3, as indicated in the fourth file without the green-colored text.
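As a concrete illustration of this reduction rule, the following minimal sketch filters a dynamic trace represented as a list of basic blocks; the trace encoding (opcode plus operand registers per instruction) is a simplified assumption for illustration, not Contech's actual trace format.

def is_control_pattern(block):
    # A two-instruction basic block whose first instruction is a load and whose
    # second is an icmp consuming the loaded register is treated as loop/if
    # control bookkeeping and can be dropped from the trace.
    if len(block) != 2:
        return False
    first, second = block
    return (first["op"] == "load"
            and second["op"] == "icmp"
            and first["dst"] in second["srcs"])

def reduce_trace(blocks):
    # Remove basic blocks matching the control-statement pattern.
    return [b for b in blocks if not is_control_pattern(b)]

# Toy trace: the first block is the "for" bookkeeping (load i, icmp i < n).
trace = [
    [{"op": "load", "dst": "%i", "srcs": ["%i.addr"]},
     {"op": "icmp", "dst": "%cmp", "srcs": ["%i", "%n"]}],
    [{"op": "load", "dst": "%0", "srcs": ["%arr"]},
     {"op": "add",  "dst": "%1", "srcs": ["%0", "%i"]},
     {"op": "store", "dst": "%arr", "srcs": ["%1"]}],
]
print(len(reduce_trace(trace)))  # 1 block remains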
Graph Generation. We encode communications, computations, and their interconnected dependencies by constructing a two-layered graph in which one layer represents the model of communication and the other represents the model of computation. Nodes in one layer denote computations, whereas nodes in the other denote communications. Edges in the communication layer are detected by alias analysis∗, whereas the remaining edges are derived from data and control dependencies. A formal description of the two-layer network representation of an application is provided in Section 3.2 (see Definition 1). As shown in Figure 3.3, except for lines 8 and 11, which are computation nodes, the rest are nodes in the communication layer. We analyze dependencies in two phases. In the first phase, we analyze data and control dependencies among instructions, which correspond to the non-dotted edges in Figure 3.3. In the second phase, we perform alias analysis to connect the different subcomponents of the communication layer into one graph. The black dotted edges in Figure 3.3 illustrate this connection.
3.2 Community Detection
Community detection in networks is a technique for finding groups of vertices that have a higher probability of connecting with each other than with vertices in other groups. In this work, it is applied to partition the two-layered graph into interconnected communities to be mapped to vaults. We then build a community graph, similar to a task graph, where nodes represent communities comprising a series of instructions executed sequentially, while edges and their weights represent dependencies and communication costs between communities, thereby encoding concurrent interactions. The goal of this section is therefore to formulate a mathematical optimization model and partition the graph into communities while balancing the load among them.
∗ In LLVM, we perform alias analysis using -basicaa -aa-eval -print-all-alias-modref-info.
Figure 3.4: Building on the generated two-layered graph, we partition the graph into interdependent communities, each representing a set of IR instructions to be executed sequentially.
Before formulating the optimization model, we introduce two formal definitions for the input and output graphs.

Definition 1: A two-layered graph $TG$ is a weighted directed graph $G_1 = TG(n_i, l_i, e_{ij}, w_{ij} \mid i, j \in \{1, \ldots, N\};\ l_i \in \{1, 2\})$, where nodes $n_i$ in layer $l_i$ (1 or 2) represent memory or non-memory instructions; edges $e_{ij}$ represent dependencies found by alias analysis in the memory layer ($l_i = 1$) as well as data/control dependencies; and edge weights $w_{ij}$ represent latency times data size for memory instructions.

Definition 2: A community graph $CG$ is a weighted directed graph $G_2 = CG(V_i, d_i, E_{ij}, W_{ij} \mid i, j \in \{1, \ldots, N\})$, where nodes $V_i$ represent series of IR instructions to be executed sequentially, which are called communities; edges $E_{ij}$ represent dependencies between communities; and edge weights $W_{ij}$ represent the communication cost from node $i$ to node $j$. The depth $d_i$ represents the largest number of hops node $i$ takes to the root, which is considered the starting point†.
Based on these definitions, we formulate an optimization model as follows: given a two-layered graph $TG$, find the communities that maximize the following objective:

$$\max \; F = Q - R \tag{3.1}$$
†Note that the depth of node 2 should be 2 rather than 1 because the longest path is {2, 3, 1}. The depth can be found using
levelization.
$$Q = \frac{1}{2W} \sum_{ij} \left[ W_{ij} - \frac{s_i s_j}{2W} \right] \delta(C_i, C_j) \tag{3.2}$$

$$R = \frac{\alpha}{2W} \sum_{\substack{1 \le u, v \le n_c \\ u \ne v}} |W_u - W_v| \, \delta(d_u, d_v) \tag{3.3}$$

where $W$ is the sum of the total weights in $TG$; $W_i$ is the sum of the weights in community $i$; $W_{ij}$ is the weight between nodes $i$ and $j$; $s_i$, the strength of node $i$, is the sum of the weights of the edges adjacent to node $i$; $C_i$ is the community to which node $i$ belongs; $n_c$ is the number of communities; $d_i$ is the depth of community $i$; $\delta(i, j)$ equals 1 when $i = j$; and $\alpha$ controls the importance of load balancing.
The function $Q$ measures the difference between the sum of weights within a community, $W_{in} = \sum_{ij} W_{ij}\, \delta(C_i, C_j)$, and the expected weight adjacent to the community, $W_{adj} = \sum_{ij} \frac{s_i s_j}{2W}\, \delta(C_i, C_j)$. By maximizing $F$, $Q$ is also maximized; therefore $W_{in}$, which represents the workload within a community, increases, while $W_{adj}$, which represents the communication cost, decreases. As a result, data movement is confined almost entirely within each community. The function $R$ quantifies the load balancing at any given depth. As shown in Figure 3.4, communities at the same depth can be executed in parallel, so the loads of those communities should be balanced to reduce the overhead of cores idling while waiting. $W_u$ computes the load of community $u$, and $\delta(d_u, d_v)$ ensures that communities $u$ and $v$ are at the same depth. Hence, to maximize $F$, $R$ should be minimized, forcing $W_u$ and $W_v$ to be nearly equal at the same depth ($\delta(d_u, d_v) = 1$).
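To make the objective concrete, the sketch below evaluates $F = Q - R$ for a given community assignment on a weighted graph; this is a minimal sketch, and the adjacency structure, depth assignment, and toy values are assumptions made purely for illustration.

from itertools import combinations

def objective(weights, community, depth, alpha=1.0):
    # Evaluate F = Q - R (Eqs. 3.1-3.3).
    # weights: dict {(i, j): w_ij} of edge weights;
    # community: dict node -> community id; depth: dict community id -> depth.
    W = sum(weights.values())
    nodes = sorted(community)
    strength = {n: 0.0 for n in nodes}
    for (i, j), w in weights.items():
        strength[i] += w
        strength[j] += w
    # Q (Eq. 3.2): intra-community weight minus its null-model expectation.
    Q = sum(weights.get((i, j), 0.0) - strength[i] * strength[j] / (2 * W)
            for i in nodes for j in nodes
            if community[i] == community[j]) / (2 * W)
    # Per-community load: sum of weights of intra-community edges.
    load = {c: 0.0 for c in depth}
    for (i, j), w in weights.items():
        if community[i] == community[j]:
            load[community[i]] += w
    # R (Eq. 3.3): pairwise load imbalance between communities at the same depth.
    R = alpha / (2 * W) * sum(abs(load[u] - load[v])
                              for u, v in combinations(depth, 2)
                              if depth[u] == depth[v])
    return Q - R

edges = {(0, 1): 4.0, (1, 2): 3.0, (2, 3): 1.0, (3, 4): 5.0}
print(objective(edges, community={0: 0, 1: 0, 2: 0, 3: 1, 4: 1},
                depth={0: 0, 1: 0}))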
3.3 Community-to-vault Mapping
In this section, based on the community graph, we try to build a scalable PIM system and map communities to vaults.
Figure 3.5: Community-to-vault mapping
3.3.1 Scalable PIM System
Some memory-intensive applications require more memory to store huge amounts of data. Therefore, in order to increase the memory capacity of PIM systems, multiple HMCs are utilized and connected via high-speed Serializer/Deserializer (SerDes) links to provide high memory bandwidth. However, SerDes links consume almost half of an HMC's power [15][16][17]. To save the power wasted on SerDes links, we propose a scalable PIM system with an NoC in the logic layer that efficiently routes packets to the destination vault, instead of the crossbar used in the HMC, as shown in Figure 3.1. Hence, to provide more memory, we simply add extra vaults with routers to this design instead of connecting additional HMCs via SerDes links, thereby reducing energy consumption.
3.3.2 Mapping
We propose Algorithm 2 to map the detected communities to the available vaults in the scalable PIM system. First, we rank the communities by assigning higher priorities to communities at lower depths. For example, in Figure 3.5, the starting community at depth 0 gets the highest priority. If communities are at the same depth, the one with the higher communication cost receives the higher priority. For example, if the communication costs of communities 3, 4, and 5 at depth 1 are 102, 12, and 27 respectively, then the priority order at this depth is 3 > 5 > 4. After priority assignment, we map communities onto the NoC in a greedy fashion: more important communities (with higher priorities) take up the better locations, namely the center of the chip, as shown in Figure 3.5.
Algorithm 2: Community-to-vault Mapping Algorithm
/* Priority Assignment */
PriorityQueue = ()
for nodes at each depth (in increasing order of depth) do
    sort nodes by communication cost in descending order; PriorityQueue.append(nodes)
/* Community-to-vault Mapping */
Mapping = ()
for node in PriorityQueue do
    if node is the starting point then place = the center of the mesh-based NoC
    else place = the free vault closest to the parent node it depends on (greedy)
    Mapping.append((node, place))
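A minimal runnable version of this priority-plus-greedy mapping is sketched below; the mesh coordinates, the Manhattan distance metric, and the toy community graph mirroring Figure 3.5 are illustrative assumptions.

def map_communities(depth, comm_cost, parent, mesh_dim):
    # Greedy community-to-vault mapping on a mesh_dim x mesh_dim mesh NoC.
    # depth / comm_cost / parent are dicts keyed by community id.
    # Priority: lower depth first; within a depth, higher communication cost first.
    order = sorted(depth, key=lambda c: (depth[c], -comm_cost[c]))
    center = (mesh_dim // 2, mesh_dim // 2)
    free = {(x, y) for x in range(mesh_dim) for y in range(mesh_dim)}
    placement = {}
    for c in order:
        # Place near the parent it depends on (or at the center for the root community).
        target = placement.get(parent.get(c), center)
        best = min(free, key=lambda p: abs(p[0] - target[0]) + abs(p[1] - target[1]))
        placement[c] = best
        free.remove(best)
    return placement

# Toy community graph: community 0 is the root at depth 0;
# communities 3, 4, and 5 sit at depth 1 with costs 102, 12, and 27.
depth = {0: 0, 3: 1, 4: 1, 5: 1}
cost = {0: 0, 3: 102, 4: 12, 5: 27}
parent = {3: 0, 4: 0, 5: 0}
print(map_communities(depth, cost, parent, mesh_dim=4))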
3.4 Evaluation
3.4.1 System Configuration
DDR3. We utilize 64 in-order cores to model a DDR3-based system. Each core has a 64KB private L1 cache and a 256KB distributed shared L2 cache, as shown in Table 1, with a memory controller connecting to the memory subsystem, i.e., DDR3. This system is the baseline for our evaluation.

HMC. Table ?? shows the configuration parameters of our evaluated scalable HMC-based system, which includes 64 vaults with eight memory layers and one logic layer. In the logic layer, each vault contains the same type of in-order core used in the DDR3-based system, and an NoC connects the vaults. To further evaluate different data partitioning schemes, we apply METIS [18], a multi-way graph partitioner, and our proposed community detection (CD) to partition the graph into clusters to be mapped onto the different vaults of our system.
3.4.2 Simulation Configuration
Figure 3.6: Simulation Flow
Table 3.1: Benchmarks and descriptions
Benchmark Description Source
BS Calculate European options PARSEC[12]
SC Solve the online clustering problem PARSEC[12]
BP Back-propagation Rodinia[19]
KM K-means Rodinia[19]
MD Simulate molecular dynamics OmpSCR[9]
FFT Compute Fast Fourier Transform OmpSCR[9]
MDB Calculate Mandelbrot Set OmpSCR[9]
We use Contech [8] as the frontend functional simulator to generate dynamic LLVM traces from C/C++ applications, write a compiler-based parser to construct the two-layered graph, and perform community detection to partition the graph into clusters. We model the 3D-stacked memory layers, which follow the HMC 2.1 specification [14], using ASMSim [20], and the NoC communication substrate using Booksim2 [21], as backend timing simulators. Both simulators are cycle-accurate and trace-based‡. The simulation flow is shown in Figure 3.6. Table 3.1 lists the 7 benchmarks we use to validate the system.
For our energy evaluation, we model the energy consumption of the caches in the cores using CACTI 6.0 [22] and compute the energy of a memory-layer access, which is 3.7 pJ/bit [17], assuming memory operations dominate. Next, following [23], we derive the total energy consumption of a transaction from node $i$ to node $j$ as

$$E_{ij} = N \left( n_{hops} E_{router} + (n_{hops} - 1) E_{flit} \right)$$

where $N$, $n_{hops}$, $E_{router}$, and $E_{flit}$ represent the number of bits to be transferred, the number of hops, and the energy consumption of routers and of flit transfers, respectively. We assume that the interconnect consumes 2 pJ/bit for flit transfer ($E_{flit}$) and 1.5 pJ/bit for routers to process flits ($E_{router}$) [24].
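For reference, the short sketch below evaluates this transaction-energy model with the per-bit constants quoted above; the packet size and hop count in the example are arbitrary illustrative values.

E_ROUTER = 1.5   # pJ/bit, router processing energy [24]
E_FLIT   = 2.0   # pJ/bit, flit (link) transfer energy [24]

def transaction_energy(num_bits, num_hops, e_router=E_ROUTER, e_flit=E_FLIT):
    # E_ij = N * (n_hops * E_router + (n_hops - 1) * E_flit), in picojoules.
    return num_bits * (num_hops * e_router + (num_hops - 1) * e_flit)

# Example: a 64-byte (512-bit) transaction traversing 3 hops on the mesh NoC.
print(transaction_energy(num_bits=512, num_hops=3), "pJ")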
Figure 3.7: Speedup comparison among DDR3, HMC+METIS, and Prometheus
3.4.3 System Performance
Figure 3.7 compares the speedup of the DDR3- and HMC-based systems. The embarrassingly parallel application (i.e., MDB) and applications such as MD and BS may not benefit much from PIM because their off-chip memory bandwidth usage is low compared to what PIM can provide. Therefore, their speedup is at most 4x over the DDR3-based system. However, for applications such as SC that require high off-chip memory bandwidth, our proposed HMC-based system can provide 1 TB/s, whereas DDR3-based systems can provide no more than 100 GB/s. The distinction may be even more pronounced if we increase the number of vaults per cube. As a result, the speedup for SC is 9.8x over the DDR3-based system.
In the HMC-based system, we adopt METIS and CD to partition the graph into interconnected clusters. However, for embarrassingly parallel programs, our graph representation cannot guarantee that the clusters produced by graph partitioning are independent of each other. Therefore, the performance improvement for applications such as MD and MDB is at most 1x over DDR3-based systems, where parallelizing with threads is easy. Nevertheless, our graph partitioning scheme outperforms METIS because it minimizes communication while balancing the load.
‡Booksim2 supports Netrace traces as simulation input.
3.4.4 NoC Traffic
Figure 3.8: NoC traffic with different parallelism approaches
Figure 3.8 illustrates the NoC traffic of our proposed graph partitioning normalized to that of threads/OpenMP. For applications such as MD and MDB, there are few data dependencies among threads, whereas the clusters in our graph representation are interconnected. Therefore, NoC traffic for those applications is somewhat worse than with threads running almost independently on separate cores. However, for most non-embarrassingly parallel applications, community detection confines data movement within each cluster as much as possible, leading to lower energy.
3.4.5 Energy Consumption
Figure 3.9: The comparison of normalized energy consumption among DDR3, HMC, and Prometheus
Figure 3.9 compares the normalized energy consumption of the DDR3, HMC, and Prometheus systems. The computation energy (green bar in Figure 3.9) remains the same for all systems, while HMC improves the energy spent on off-chip links compared to DDR3, since HMC has higher off-chip memory bandwidth and a shorter distance between cores and memory, leading to shorter execution times. Prometheus further improves energy consumption over HMC because we apply community detection to partition the graph into clusters that minimize data communication between vaults. In other words, NoC traffic is reduced for most applications except MD and MDB, so the NoC energy consumption (yellow bar in Figure 3.9) improves considerably.
3.5 Conclusions
In this section, we presented Prometheus, an optimization framework that finds the best data partitioning scheme for PIM systems in order to improve performance and reduce energy consumption. Prometheus exploits the high memory bandwidth (∼1 TB/s) of PIM systems by (1) representing each application as a two-layered graph in which, in the computation layer, nodes denote computation instructions and edges denote data dependencies, and, in the communication layer, nodes denote load/store instructions and edges are formed by alias analysis; (2) performing community detection to find interconnected clusters, ensuring that data movement is almost entirely confined within each cluster and that workloads among clusters are balanced; and (3) designing a scalable PIM system in which vaults are connected via an NoC rather than a crossbar, and mapping clusters to vaults in a greedy fashion. Our evaluation with 64 vaults and one in-order core per vault demonstrates a performance improvement of up to 9.8x over traditional DDR3-based systems and 0.38x over PIM systems with METIS graph partitioning. The energy consumption improvement is 2.3x over a PIM system without community detection, as Prometheus reduces NoC traffic between different vaults.
Chapter 4
End-to-end Programmable Computing Systems
Recent technological advances contributed to rapid increases in algorithmic complexity of applications,
ranging from signal processing to autonomous systems. To control this complexity and endow heterogeneous computing systems with autonomous programming and optimization capabilities, we propose a
unified, end-to-end, programmable graph representation learning (PGL) framework that mines the complexity of high-level programs down to low-level virtual machine intermediate representation, extracts specific
computational patterns, and predicts which code segments run best on a core in heterogeneous hardware.
PGL extracts multifractal features from code graphs and exploits graph representation learning strategies
for automatic parallelization and correct assignment to heterogeneous processors. The comprehensive
evaluation of PGL on existing and emerging complex software demonstrates a 6.42x and 2.02x speedup
compared to thread-based execution and state-of-the-art techniques, respectively. This end-to-end programmable computing system leads to higher processing efficiency, which is crucial for future AI and
high-performance computing applications such as autonomous vehicles and machine vision.
4.1 Introduction
Many real-world applications across science and engineering (e.g., self-driving cars [25], digital signal
processing [26], autonomous aerial [27], ground and underwater systems [28]) urgently need increasing
Figure 4.1: Autonomous heterogeneous computing system. Recent technological advances enable the fast progress of autonomous cars and unmanned aerial vehicles (a). However, with commonly used system components such as controllers and convolutional neural networks for image recognition (b), parallelization and communication overhead become inevitable concerns for programmers, as the complicated and ever-changing software needs to be parallelized and executed on a heterogeneous system (c). The proposed framework makes this manual process autonomous, without human intervention, by profiling applications (d), constructing dynamic execution graphs (e), and mapping kernels onto the platform via machine learning models (f).
computational performance to match the rapid increase in the complexity of algorithms. Heterogeneous
computing systems combine multiple types of hardware accelerators (e.g., GPUs, FPGAs) to achieve such
computational gains.
To manage the need for computational gains, heterogeneous systems require intelligent, flexible, and
efficient programming strategies that can match the requirements of real-world applications to the strengths
of the heterogeneous architecture. To optimize this matching in terms of performance and energy efficiency, we need to improve the mappings, compiler transformations [29], accelerator utilization [30], cache locality [31], and load balancing [32]. However, the existing monolithic programming models and task mappings to compute platforms do not fully exploit the recent heterogeneity and architectural innovations in current hardware systems. They also fail to efficiently use the heterogeneous processing elements, which can exacerbate load imbalance and communication inefficiencies [32–34]. For example,
the conventional CPU-only or GPU-only optimization techniques may not be suitable for a heterogeneous
system that combines both. This is due to the architectural and programming model differences of these
hardware accelerators. Therefore, novel optimization approaches are required to realize the potential of
heterogeneous systems and achieve the goals of exascale performance.
Traditional compilation techniques rely on cost models (of relatively simple hardware) based on expert
heuristics [35]. However, the growing need for heterogeneous hardware systems to improve performance, and the resulting complexity of the hardware, have led to increasingly complex compilation targets.
Thus, the traditional compilation techniques are insufficient to exploit the promising potential of heterogeneous hardware systems. For example, the search conducted with those techniques must be repeated for
each new program and might require several compilations and executions. That makes them impractical
for real-world applications [36]. Furthermore, due to workload imbalance, synchronization overhead, and
resource sharing contention [32], the overall performance of those techniques may be sub-optimal.
Machine learning, in particular, deep learning techniques [37], have been explored in compiler optimization to learn better cost models [37–40]. For example, a recent work [41] proposed an end-to-end
deep reinforcement learning (DRL) method for ML compiler graph optimizations, where the learned policies generalize to new graphs and transfer to different tasks. NeuroVectorizer [42, 43] proposed
an end-to-end deep reinforcement learning framework for the automatic vectorization of loops. In addition, ML-driven techniques are also used to optimize the execution time of tensor computation graphs [44]
as well as deep neural networks in TASO [45] and SOAP [46]. However, there is still a need for compiler
approaches that are capable of exploiting recent advances in machine learning to learn how to accurately
map computations (e.g., kernels) onto heterogeneous hardware systems for a single application. Such techniques should be capable of learning better cost models in dynamic and complex heterogeneous hardware systems under uncertain conditions that complicate the use of traditional compilation techniques. Moreover, such ML-driven techniques will help remove the burden of writing correct and efficient code from
human programmers (particularly programmers with expertise outside of computer science).
To address these issues, we propose a machine learning framework to predict the optimal hardware
device (e.g., CPU or GPU) to provide better performance given a software kernel, which is defined as the
device mapping problem [47], as shown in Figure 4.1. However, unlike the previous work [35, 48, 49]
that uses ML to solve the device mapping problem, our approach focuses on how to accurately map computations onto heterogeneous hardware systems for a single application. As applications become more
diverse and complex, it is inefficient to map them onto only one type of hardware accelerator. For example, in autonomous driving, visualization and recognition tasks, which consist of many for loops, can be efficiently distributed onto GPU cores to provide higher parallelization. On the other hand, sequential
decisions based on if-else statements require CPUs to provide fast execution on a single critical thread. In
this example, GPUs provide a higher number of compute engines for parallel computing whereas CPUs
have higher frequencies compared to GPUs, leading to faster execution of sequential threads. Therefore, a
CPU/GPU heterogeneous system where the best features of both hardware devices are efficiently combined
can achieve even further computational gains.
4.2 Methods
Setup. Given a software program, our goal is to identify the subgraphs (i.e., code segments) that are optimal to run on CPUs or GPUs. Note that performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Our end-to-end framework consists of two components: a GAE and a GNN. The unsupervised GAE model is used to partition a complex program into several clusters/kernels to be mapped onto the heterogeneous system. The supervised GNN model predicts the correct label for each kernel. In the implementation, we use kernels written in OpenCL [35] as training and testing data, with 5-fold cross validation for the GNN model. The ground-truth label of each kernel is either CPU or GPU. To evaluate the PGL framework, we first use the GAE model to partition the graphs and find kernels suitable for either CPU or GPU. Next, different GNN models are used to predict the correct label, i.e., the underlying hardware. The configuration parameters of the heterogeneous system are as follows. The hardware contains 32 CPUs and 32 GPUs connected with a mesh-based network-on-chip. Each CPU has a 4-way 64KB private L1 cache, a 256KB shared L2 cache, and 4GB of memory, clocked at 2.4 GHz. Each GPU has 768MB of memory with 86.4GB/s bandwidth, clocked at 575 MHz.

Applications for the power-law relationship. The power-law relationship between multifractal properties and system-level metrics can be characterized by analyzing 132 programs
in 17 applications, which are discussed as follows: (1) Algebraic multigrid solver (AMS): the parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids; (2) Fast sequence
alignment (FSA): an ultrafast and memory-efficient tool for aligning sequencing reads to long reference
sequences; (3) DNA sequence mapping (DSM): a software package for mapping DNA sequences against a
large reference genome, such as the human genome, which consists of three algorithms: BWA-backtrack,
BWA-SW and BWA-MEM; (4) neural network (NN): an open source neural network framework written in
C and CUDA; (5) Dijkstra (DA): Dijkstra shortest path; (6) Epidemic simulation (ES): a simulation of an
epidemic, inspired by the 2019-20 novel Coronavirus Disease (COVID-19) pandemic; (7) Molecular dynamics (MD): a proxy application and research vehicle for particle code, in particular molecular dynamics; (8)
Graph partitioning (GP): graph partitioning algorithms that include multi-way partitioning, Fiduccia-Mattheyses-Sanchis (FMS), partitioning by locked moves (PLM), and partitioning by free
moves (PFM); (9) Euler equation solver (EES): a miniapp that solves the time-dependent Euler equations
of compressible gas dynamics in a moving Lagrangian frame using unstructured high-order finite element
spatial discretization and explicit high-order time-stepping; (10) Evolutionary algorithm (EA): Lamarckian
evolutionary algorithm for molecular design and optimization; (11) IO proxy application (IPA): a multipurpose, application-centric, scalable I/O proxy application for IO performance testing and multi-physics,
HPC applications; (12) Mesh refinement application (MRA): an adaptive mesh refinement mini-app; (13)
CNN: a convolutional neural network; (14) Poisson equation solver (PES): a solver for a standard Poisson
equation using a conjugate gradient iteration with a simple or spectral element multigrid preconditioner on
a block or linear geometry; (15) Monte Carlo kernel (MCK): a mini-app representing a key computational
kernel of the Monte Carlo neutron transport algorithm; (16) HACC: a stand-alone version of hardware
accelerated cosmology code (HACC)’s distributed-memory, pencil-decomposed, parallel 3D FFT; (17) Radiative transfer solver (RTS): a solver for the equation of radiative transfer in the multi-group two-moment
approximation.
Datasets. We start by using the 256 heterogeneous device mapping OpenCL kernels [35] for the
training and validation of GNNs. These kernels are labeled with CPU vs. GPU. We then manually convert
these kernels to C code. Furthermore, we use standard application benchmarks to validate the overall PGL
framework. These benchmarks are (1) Dijkstra to find the shortest path with an input of 100 nodes, (2) Fast
Fourier transform with an input vector of size 4096, (3) K cluster partitioning with an input of 256 2D tuples,
(4) Mandel to calculate the Mandelbrot set with an input of 4092 points; (5) Molecular dynamics with an
input of 1024 particles, (6) Neural network with an input of 5 hidden fully connected layers, (7) Neurons
with an input of 1024 neurons with the ReLU activation function, (8) Convolutional neural network with
an input architecture of a convolutional layer connected with a max pooling layer and a fully connected
neural network.
Baseline Comparisons.
When comparing the accuracy of the prediction results from GNN models, we use the following GNN
models: (1) GCN; (2) GAT; and (3) GGNN. We compare our graph representation to the ProGraML graph
representation [48], NCC [50], and DeepTune [35], state-of-the-art techniques to represent programs as
graphs. To quantify the benefits of graph partitioning, we compare the PGL framework with the following baselines in terms of the application performance: (1) K-means clustering connected with GCNs
(KM+GCN); (2) hierarchical divisive clustering where all observations start in one cluster, and divisions
are performed recursively as one moves down the hierarchy, connected with GCNs (HDC+GCN); (3)
modularity-based community detection where an optimization model is proposed to measure the structure
of graphs [32, 51], connected with GCNs (MOD+GCN); (4) METIS graph partitioning [52] connected with
GCNs (METIS+GCN); (5) feed-forward neural network, connected with GCNs [53] (NN+GCN). In addition,
we compare the PGL framework in terms of the application performance with the following baselines: (1)
threads in parallel programming (PAR); (2) modularity-based community detection to partition the graph
into clusters and a heuristic mapping [32] (CommDet); (3) sliding window based neural network to locate
specialized structures with a reinforcement learning based mapping (NN+RL) [53]; (4) Aladdin, a pre-RTL,
power-performance simulator for fixed-function accelerators [54].
Feature Extraction. Each node in a GNN is associated with numerous features, which are further used for clustering or classification to make decisions at the node level or graph level. In the literature, code2vec [55] and inst2vec [50] are commonly used to extract features by encoding programs via AST paths. However, the trained representations can put larger weights on names rather than on code structure, which may lead to misclassification.
In order to exploit the structural information flow of program graphs, random walks reason about the number of adjacent nodes and the density of connections around a node [56]. A random walk of fixed length $l$ is defined as a sequence of nodes starting from $n_0$, where the $j$th node is drawn from the following distribution:

$$P(n_j = j \mid n_i = i) = \begin{cases} \dfrac{w_{ij}}{\sum_j w_{ij}} & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$
where $w_{ij}$ is the edge weight between node $i$ and node $j$. In addition, multifractal analysis mathematically studies the structural complexity and topological heterogeneity of graphs [57]. Multifractal properties such as the generalized fractal dimensions provide higher-order statistics of a graph, which can be quantified by a finite box-covering method. That is, to study the different fractal structures in a graph, the box-covering method covers the graph with boxes of the same size and then studies the relationship between the size of a box $l$ and the number of nodes $N_i(l)$ in the $i$th box of size $l$:

$$\sum_i N_i(l)^q \sim l^{\tau(q)} \tag{4.2}$$

where $q$ is the distortion factor that differentiates the topological difference of fractal structures, and $\tau(q)$ is the mass exponent. Next, we can obtain the generalized fractal dimensions $D(q)$ from $\tau(q)$, which characterize the different fractal structures of a graph:

$$D(q) = \frac{\tau(q)}{q - 1} \tag{4.3}$$
Therefore, to mine the local and scale-dependent topological properties of programs, we propose an algorithm in Supplementary Notes 1 that exploits random walks and multifractal concepts for encoding topological inter-dependencies (see the additional information for full details of the algorithm). Random walks explore the local topological density around node $i$ in a graph by finding random paths starting from node $i$ to node $j$. Once a random path is identified, we backtrack from the final destination node $j$ to find the subgraph $SG$ spanning from $i$ to $j$. Next, we perform a multifractal analysis on the subgraph $SG$ to estimate its generalized fractal dimension. The time complexity of the algorithm is bounded by the Dijkstra strategy for finding the shortest path from each node to every other node, which is $O(E \log V)$ for a single source, where $E$ and $V$ are the numbers of edges and nodes, respectively. Finding all shortest paths in the graph has a time complexity of $O(EV \log V)$.
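The following minimal sketch illustrates the weighted random-walk sampling step of this feature-extraction procedure on a networkx graph; the walk length, the toy random graph, and the subgraph convention are illustrative assumptions, and the multifractal box-covering step is sketched separately later in this chapter.

import random
import networkx as nx

def weighted_random_walk(g, start, length, rng=random.Random(0)):
    # Sample a walk of `length` steps; the next node is chosen with probability
    # w_ij / sum_j w_ij, following Eq. (4.1).
    walk = [start]
    node = start
    for _ in range(length):
        nbrs = list(g.successors(node))
        if not nbrs:
            break
        weights = [g[node][n].get("weight", 1.0) for n in nbrs]
        node = rng.choices(nbrs, weights=weights, k=1)[0]
        walk.append(node)
    return walk

def walk_subgraph(g, walk):
    # Subgraph induced by the nodes visited along the walk (the SG mentioned above).
    return g.subgraph(set(walk)).copy()

g = nx.gnp_random_graph(30, 0.15, seed=1, directed=True)
nx.set_edge_attributes(g, 1.0, "weight")
sg = walk_subgraph(g, weighted_random_walk(g, start=0, length=10))
print(sg.number_of_nodes(), sg.number_of_edges())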
4.3 Experimental Results
4.3.1 Problem formulation and framework overview
In order to combine the benefits of both CPUs and GPUs, as opposed to the traditional device mapping
problem, we formulate a new problem to be considered within the high-performance computing and machine learning contexts: Given a complex software application, the goal is to learn a mapping function that
predicts which code segments would run best on a specific hardware device in heterogeneous hardware
platforms.
The scheduling and mapping of dataflow graphs is a well-studied research area, including synchronous dataflow [58, 59] and dynamic dataflow [60]. [61, 62] extend job-shop scheduling techniques to account for inter-processor communication costs. Pino et al. [63] show how to construct schedules for heterogeneous multiprocessors. Falk et al. [64] give a parallel scheduling strategy based on clustering and demonstrate significant performance gains for multimedia applications. In recent work [38], most approaches
to deep learning in compiler optimization borrow ideas from deep learning in natural language processing. However, the compiler domain has identified data structures, such as abstract syntax trees and dataflow graphs, that capture aspects more important for compiler optimization than the token sequences used in natural language processing. Therefore, new graph representations of source code are being developed to be used with
the help of the recent advances in graph-based deep learning models such as graph neural networks. For
example, [65] proposed a new compiler-based graph representation for deep learning models of code. It
incorporates the abstract syntax tree and control-data-flow graphs to understand program properties and
enable deep learning such as graph neural networks on the graph properties. In addition, the concurrent
execution of varying mixes of different applications on the many-core systems enables state-of-the-art
research in predictable application execution in terms of runtime mapping. For example, [66] proposed a
hybrid mapping that achieves run-time predictability by combining the design-time analysis of application
mappings with run-time management. [67] provided a general, completely automated hybrid application mapping methodology for optimizing the mappings of multiple concurrently running soft real-time applications to a heterogeneous multiprocessor system-on-chip, minimizing latency and energy. However,
previous work on graph representation of code fails to expose some interesting graph motifs in programming languages that are recurring at different scales. The proposed dynamic execution graph illustrates
different self-repeating code structures that can be exploited in multifractal analysis to extract meaningful
features.
Therefore, to decipher the complex higher-order inter-dependencies of real-world software, we represent the computations in programs (code) as a graph in which each node represents a compute instruction and each edge represents an information flow from one instruction to another. While many prior works
have employed machine learning methods from natural language processing to represent programs as a
sequence of lexical tokens [35, 68, 69], recently there emerged a number of graph-based machine learning works that capture the structure of programs along with the syntactic and semantic information in
the graph representation [50, 55, 65]. It has been observed that the graph-based representation learning
strategies tend to have superior learning ability on the programs for many code analysis tasks, such as code
similarity learning [70], program classification [71], etc. For instance, [65] uses abstract syntax trees (ASTs)
and control-data flow graphs (CDFGs) independently to represent programs and apply GNNs for learning
predictive compiler tasks on these graphs, which outperforms the recurrent neural networks (RNNs) on
the token sequence representation of the programs. By modeling the program’s control, data, and call
dependencies as a graph, [48] applied a GNN to learn representations from the graph for both node-level and graph-level tasks, including compiler analysis, program classification, and device mapping. The
graph representation of programs enables us to model the dynamic dependency structures of software programs and helps analyze program characteristics and automatically compile programs in heterogeneous
platforms. The automation is achieved via graph learning models to predict the type of each program
from an initial feature matrix. In order to obtain the representative higher-order topological features from
a graph representation, we perform a comprehensive multi-fractal analysis [57] and quantitatively relate
the topological structures hidden in a software graph with computational performance on multiprocessor
systems while accounting for communication and synchronization overheads.
To solve this challenging optimization problem, we propose a unified, end-to-end, programmable graph
representation learning (PGL) framework capable of mining the complexity of high-level programs down
to the universal IR, extracting the specific computational patterns, and predicting which code segments
run best on a specific core in heterogeneous hardware platforms. The proposed PGL framework, shown
in Figure 4.2, is flexible and capable of working with various graph representations of software code (e.g., abstract syntax trees or control-data flow graphs). We also propose and evaluate a dynamic execution graph representation constructed from a partially executed trace of the code, where nodes represent LLVM intermediate representation (IR) instructions and edges represent control, data, and memory
Figure 4.2: Overview of the proposed Programmable Graph Learning framework (PGL). PGL constructs
a dynamic execution graph for each input software program via low-level virtual machine (LLVM) intermediate representation (IR). PGL then utilizes a novel feature extraction algorithm based on random
walks and multi-fractal analysis to construct node features that capture the topological dependencies and
structures in dynamic execution graphs. These features are further used by a graph autoencoder (GAE)
to partition the graph into clusters (i.e., software kernels) and a graph neural network (GNN) model such
as graph convolutional networks (GCN) and multilayer perceptrons (MLP) to predict the best hardware
device for each kernel.
dependencies, which can better identify the structural information flow and capture memory dependencies.
4.3.2 Dynamic dependency used in PGL is effective in representing code as graphs.
Recently, various graph representations were proposed for machine learning to represent and capture
the latent information flow in a program (e.g., abstract syntax tree (AST) [55], contextual flow graph (XFG)
[50], and control and data flow graph (CDFG) [65]). These graph representations allow the compiler to
analyze the effectiveness and correctness of programs, as well as enable parallel programming via graph
partitioning in high-performance computing [32]. However, these statically compiled graphs have several limitations. First, memory dependencies are difficult to identify; if not handled properly, this can exacerbate the data communication overhead and reduce application performance. Second, the number of iterations of for and while loops cannot be statically determined. This plays a significant role in predicting whether the code should run on a CPU or a GPU based on the workload. For example, if the number of iterations is small, it is ideal to run the code on a CPU because of its faster clock frequency. Otherwise, a GPU is preferred because it has many more cores per chip, providing higher parallelism. Therefore, in order to overcome these drawbacks, we use the information generated from static compiler analysis and dynamic compilation to model the information flow in high-level programs as a dynamic execution graph. Next, we propose the following representation.
Definition 1 (Dynamic Execution Graph). A dynamic execution graph is a weighted directed acyclic graph $G = (V, E, W)$, where each node $v$, associated with an attribute $v_a$ indicating the type of the node (e.g., add, sub, store, or load), $(v, v_a) \in V$, represents an LLVM IR instruction; each edge $e$, associated with an attribute $e_a$ indicating the type of dependency (e.g., control, data, or memory), $(e, e_a) \in E$, represents a dependency between two instructions; and a weight $w \in W$ on each edge $e$ represents the amount of data communicated between the two instructions and the time to execute the instruction. This allows us to quantify the communication overhead in the memory hierarchy with L1, L2, and L3 caches.
Note that the dataflow graphs in the literature are coarse-grained, as each node represents a function in a program and each edge represents a signal path. In contrast, each node in a dynamic execution graph introduced in this manuscript represents one LLVM IR instruction. This is coarse-grained enough to reduce the simulation time and memory space needed to keep track of all low-level assembly instructions and data structures. At the same time, it is fine-grained enough to express the inter-dependencies between each pair of dynamically collected instructions.

The motivation for adopting a finer-granularity analysis is three-fold. Firstly, high-level languages and high-level programs may be designed to optimize certain software engineering objectives (e.g., modularity), but they do not take advantage of or keep up with recent hardware innovations and developments (e.g., high parallelism in exascale computing). Secondly, the software development for certain applications may be done in a suboptimal way without considering the time complexity of algorithms, such as the recursive computation of Fibonacci numbers, which takes $O(2^N)$ time, where $N$ is the index of the computed Fibonacci number.
Figure 4.3: Dynamic execution graphs and multifractal properties. Panels (a), (b), and (c) show basic graph patterns for code that contains either loops or sequential statements. Panels (d), (e), and (f) show the constructed code graphs for sequence alignment, signal processing, and a convolutional neural network, respectively. These graphs are hybrids of the fundamental graph patterns in (a-c). Panel (g) shows the multifractal spectrum and some definitions such as α0 and the spectrum width w. Panel (h) shows the generalized fractal dimension of a graph. Panel (i) shows three multifractal spectra (green, red, and blue lines) for (d-f), demonstrating that the multifractal spectrum can identify the heterogeneous graph structures in different dynamic execution graphs.
Thirdly, to bridge the gap between the high performance offered by heterogeneous hardware platforms and the high flexibility offered by general-purpose computing, we need a model-of-computation representation that allows us to flexibly capture the best of both worlds, the software and the hardware. Towards this end, we adopt dynamic execution graphs with a finer-grain assembly-code representation to retain the above-mentioned flexibility and provide higher software-hardware flexibility when compared to the dataflow graphs used in the literature. However, this finer granularity does not necessarily mean higher communication overhead. Higher granularity means more nodes and more edges in our implementation, but the communication overhead refers to the amount of communication that takes place between clusters after partitioning. To prevent higher communication overhead, we introduce a partitioning algorithm that partitions the dynamic execution graphs into clusters. Indeed, we expect a cluster resulting from the partitioning operation to be similar to a node in the dataflow graphs of the literature. Each cluster is a sequence of instructions, and the partitioning is optimized to reduce data communication between clusters. Therefore, our graph representation with partitioning does not incur higher communication overhead; we optimize the inter-cluster communication to make sure the communication overhead between clusters is minimized.
To construct these dynamic execution graphs, we first collect a representative dynamic trace generated from executing a program. This trace contains the sequence of LLVM IR instructions to be executed. Then, for each instruction, we check whether one of the following dependencies exists and insert a directed edge to construct the graph (a minimal construction sketch in Python follows the dependency list):
• Data dependency: Source registers of the current instruction depend on the destination registers of
the previous instructions.
• Control dependency: Source registers of the function calls and branches depend on the destination
register of the previous instructions.
Table 4.1: Comparison with state-of-the-art techniques on the NVIDIA dataset (top) and the AMD dataset (bottom). The F1 score is the harmonic mean of precision and recall.

NVIDIA dataset
Framework Accuracy (%) Precision Recall F1
DeepTune 65.28 ± 5.32 0.68 0.68 0.68
DeepLLVM 88.64 ± 4.61 0.91 0.91 0.91
NCC 75.63 ± 4.85 0.80 0.80 0.80
ProGraML-GGNN 80.36 ± 4.19 0.83 0.83 0.83
PGL-GCN 87.66 ± 3.17 0.90 0.90 0.90
PGL-GAT 89.73 ± 3.88 0.92 0.92 0.92
PGL-GGNN 91.52 ± 3.14 0.94 0.94 0.94

AMD dataset
Framework Accuracy (%) Precision Recall F1
DeepTune 68.4 ± 4.52 0.70 0.68 0.69
DeepLLVM 90.9 ± 2.14 0.93 0.93 0.93
NCC 78.5 ± 3.74 0.79 0.79 0.79
ProGraML-GGNN 86.6 ± 3.28 0.89 0.87 0.88
PGL-GCN 92.97 ± 2.79 0.93 0.93 0.93
PGL-GAT 93.36 ± 2.45 0.94 0.94 0.94
PGL-GGNN 93.87 ± 2.27 0.94 0.94 0.94
• Memory dependency: The memory location accessed by the current store/load instruction is the same as that of a previous store/load instruction. We perform this memory alias analysis using "-basicaa -aa-eval -print-all-alias-modref-info" in the LLVM environment.
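The sketch below shows how such edges could be inserted while scanning a trace, tracking register definitions for data/control dependencies and the last access per address for memory dependencies; the trace record format is an illustrative assumption (in practice the memory edges come from LLVM's alias analysis, as noted above).

import networkx as nx

def build_dynamic_execution_graph(trace):
    # trace: list of dicts with keys 'op', 'dst', 'srcs', and optional 'addr'.
    # Returns a DAG whose edges are labeled data, control, or memory.
    g = nx.DiGraph()
    last_def = {}        # register -> node id of its most recent definition
    last_access = {}     # address  -> node id of the most recent load/store
    for idx, inst in enumerate(trace):
        g.add_node(idx, op=inst["op"])
        kind = "control" if inst["op"] in ("br", "call") else "data"
        for reg in inst.get("srcs", []):
            if reg in last_def:                       # def-use dependency
                g.add_edge(last_def[reg], idx, dep=kind)
        addr = inst.get("addr")
        if addr is not None:
            if addr in last_access:                   # same location as an earlier access
                g.add_edge(last_access[addr], idx, dep="memory")
            last_access[addr] = idx
        if inst.get("dst"):
            last_def[inst["dst"]] = idx
    return g

trace = [
    {"op": "load",  "dst": "%0", "srcs": ["%p"], "addr": 0x1000},
    {"op": "add",   "dst": "%1", "srcs": ["%0", "%0"]},
    {"op": "store", "dst": None, "srcs": ["%1", "%p"], "addr": 0x1000},
    {"op": "br",    "dst": None, "srcs": ["%1"]},
]
g = build_dynamic_execution_graph(trace)
print(sorted(g.edges(data="dep")))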
Figure 4.3(a-c) shows some common zoomed-in graph patterns among dynamic execution graphs together with the corresponding high-level C code. Loops, which execute a group of statements multiple times, are commonly used in any programming language. When arrays are used inside a loop statement, the corresponding dynamic execution graph has a star shape. The central node connected to the different branches is the "getelementptr" LLVM IR instruction, which is used to get the address of a sub-element of an aggregate data structure. Each
branch corresponds to different instances of a[i] = i. When none of the arrays are used inside a loop,
the corresponding dynamic execution graph has a mesh shape. When only sequential statements such
as if-else are used in code, the corresponding dynamic execution graph has a tree shape to represent the
information flow from the beginning to the end. Figure 4.3(d-f) shows the constructed code graphs for
sequence alignment, signal processing, and a convolutional neural network, respectively. Note that a node is an LLVM IR instruction, not an operand or a high-level language (e.g., C/C++, Java) statement. Unlike ASTs, XFGs, and CDFGs, the graph representation in Figure 4.3(d-f) makes explicit some
hidden program information flows from the execution trace generated at run-time and analyzed via data,
control, and memory dependencies. Each graph contains multiple fundamental graph patterns in (a), (b),
and (c). For example, (d) clearly shows the mesh topology (b), and (e) has a star-shaped subgraph (a) that
indicates the use of loops with arrays. In order to quantify the structural difference among the graphs,
we also analyze the multifractal spectra of the graphs in (i), which validates that multifractal analysis is
able to detect the topological structures in graphs. This helps us to design the feature extraction algorithm
based on multifractal analysis in PGL.
In order to validate the effectiveness of PGL, we compare it with state-of-the-art techniques in terms of the accuracy of the prediction results on the same dataset [35]. We compare PGL against DeepTune and DeepLLVM using the code released by their authors. We also compare our graph representation against the ProGraML graph representation by extracting ProGraML graphs from the C versions of the kernels and training a GGNN on the graphs. Each dataset contains a set of kernels written in OpenCL and the labels associated with them. Each label is either 0 (CPU) or 1 (GPU). We then manually convert the OpenCL kernels into C so they can be used in ProGraML and PGL. We use 5-fold cross validation to evaluate the machine learning models by partitioning each dataset into training, validation, and testing sets. Accuracy is measured as the number of kernels whose label a framework predicts correctly, divided by the total number of kernels. We repeat each experiment 100 times to report the mean and standard deviation. Precision is the number of true positives divided by the number of true positives plus false positives. Recall is the number of true positives divided by the number of true positives plus false negatives.
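As a reference for how such numbers can be produced, the sketch below runs a repeated stratified 5-fold evaluation and reports the mean and standard deviation of accuracy together with precision, recall, and F1; the logistic-regression classifier is only a stand-in placeholder (the GNN models themselves are outside the scope of this snippet), and the random features are purely illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression  # placeholder for a GNN classifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(features, labels, repeats=10, seed=0):
    # Repeated 5-fold cross validation; labels are 0 (CPU) or 1 (GPU).
    accs, precs, recs, f1s = [], [], [], []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(features, labels):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(features[train_idx], labels[train_idx])
            pred = clf.predict(features[test_idx])
            accs.append(accuracy_score(labels[test_idx], pred))
            p, rc, f1, _ = precision_recall_fscore_support(
                labels[test_idx], pred, average="binary", zero_division=0)
            precs.append(p); recs.append(rc); f1s.append(f1)
    return (np.mean(accs), np.std(accs), np.mean(precs), np.mean(recs), np.mean(f1s))

# Toy example with random 8-dimensional kernel features.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
y = rng.integers(0, 2, size=256)
print(evaluate(X, y, repeats=3))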
As we can see from Table 4.1, PGL outperforms the state-of-the-art token-based DeepLLVM [69] by 1.03x and the graph-based ProGraML [48] by 1.14x in terms of accuracy, because it provides a novel program structural representation and enables recent graph neural networks (GNNs) for the downstream tasks. In addition, we test different graph neural networks, including the graph convolutional network (GCN) [72], the graph attention network (GAT) [73], and the gated graph neural network (GGNN) [74], along with PGL, and the results demonstrate that GGNN provides better accuracy than the rest by 1.04x. We also test the impact of each framework on how fast the machine learning model converges in terms of accuracy for the NVIDIA (a) and AMD (b) datasets. More specifically, each machine learning model is trained for 500 epochs to achieve stable results. In this experiment, we gradually remove
Figure 4.4: Convergence of normalized accuracy with different percentages of training steps in the NVIDIA
(a) and AMD (b) datasets. Each color line indicates normalized accuracy for a given framework and each
color shading associated with a line shows the standard deviation for the framework.
10 percent of the training steps at a time to understand which framework offers fast convergence in terms of accuracy. As we can see from Figure 4.4, in general, PGL-GGNN offers the fastest convergence compared to the others, reaching approximately optimal results with 60% of the training steps, whereas DeepLLVM reaches its optimal results at around 90%.
4.3.3 The interdependence between advanced software code optimally executed on
heterogeneous hardware exhibits a complex multifractal and universal behavior.
To decipher the mathematical relationship between the network properties (e.g., multifractal spectrum,
generalized fractal dimension) and the system-level metrics such as the parallelization degree and communication overhead, we investigate different software kernels employed in high-performance computing and construct their corresponding code graphs. The code graphs exhibit a wide variety of self-similar
structures due to loops and conditional statements. To quantify the higher-order topological complexity, we perform the multifractal analysis of code graphs and quantify their self-similar properties through
the multifractal spectrum (Figure 4.3(g)) and generalized fractal dimension (Figure 4.3(h)). The width of
the multifractal spectrum f(α) with respect to the Lipschitz-Holder exponents α measures the structural
complexity and heterogeneity of a network [57]. Here, α quantifies the dimension of the fractal structure,
and f(α) reflects the proportion of fractal structures with a given Lipschitz-Holder exponent α, i.e., the
distribution of fractal structures in the network. The multifractal spectrum of a monofractal graph resembles a delta function, where a single physical rule governs the graph structure at any scale; at the system level this can be interpreted as the graph being mappable entirely to either CPUs or GPUs.
In contrast, the general multifractal spectrum exhibiting a non-zero width indicates that more than one
physical rule governs the graph topology, which means that the graph is heterogeneous and should be
carefully investigated in order to be mapped to both CPUs and GPUs.
For a dynamic execution graph constructed from a given software code implementation via compiler
analysis, we partition it into several interdependent clusters to identify the optimal parallelization degree
with respect to the characteristics of the heterogeneous computing system and minimize the inter-cluster
weights (data communication overhead) via the optimization framework [32, 75]. Each networked processing community represents a specific set of interdependent LLVM instructions, which is similar to a thread
or process in operating systems. The inter-cluster weights represent the amount of data communicated from one cluster to another, which the optimization framework minimizes. To characterize the computational requirements and properties of various software codes, we consider two system-level metrics: the parallelization degree and the communication overhead. The parallelization degree is defined as the number of processing communities (clusters) generated by the optimization framework. The communication overhead is defined as the sum of the inter-cluster weights over all pairs of clusters.
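In the notation assumed here (not drawn verbatim from [32, 75]), with clusters V_1, ..., V_P and w_uv the data volume on the edge between instructions u and v, the two metrics read
\[
\text{parallelization degree} = P, \qquad
\text{communication overhead} = \sum_{i \neq j} \; \sum_{u \in V_i,\, v \in V_j} w_{uv}.
\]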
Dynamic execution graphs can exhibit some self-repeating patterns on different scales that we can
exploit to capture and understand the intrinsic graph structures. There are two fundamental techniques
in programming languages: iteration and recursion. Iteration lets a program repeat a block of operations, usually via for-loops and while-loops, and corresponds to a mesh-like topology in the graph representation. Recursion is the other major approach, in which a program solves a problem whose solution depends on solutions to smaller instances of the same problem; it corresponds to a tree-like topology in the graph representation.
Figure 4.5: Example of code with its graph representation and the box-counting algorithm used to analyze the multifractal properties. (a) A loop kernel called example6 in red; (b) the dynamic execution graph, with initialization in a blue rectangle and wrap-up in a green rectangle, and a zoom-in view of one iteration of the loop; (c) the box-counting algorithm, which varies the box size r and counts the number of boxes N(B).
These two types of graphs can be analyzed at different scales
to identify recurring structures and extract hidden features. For example, Figure 4.5 shows an example of a for loop and its corresponding graph structure. In order to analyze its self-repeating patterns, which can be seen in (b), we use the box-counting algorithm from multifractal analysis to calculate the dominant fractal dimension. In other words, following the definition of the measure, we count the number of boxes N(B) for a given box size r, where N(B) is the minimum number of boxes of size r needed to cover the entire graph. For example, when r = 1, the number of boxes N(B) is the number of nodes in the graph,
which is 116 in this case. When r is the diameter of the graph, the number of boxes N(B) is 1.
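As an illustration only, the following Python sketch (assuming the NetworkX library and treating the dynamic execution graph as undirected) implements a greedy variant of this box counting. It is a hypothetical helper rather than the dissertation's implementation, and it uses the convention that a box of size r = 1 holds a single node:

```python
import networkx as nx

def box_count(G: nx.Graph, r: int) -> int:
    """Greedy upper bound on N(B): boxes of size r needed to cover all nodes."""
    G = nx.Graph(G)                   # treat the execution graph as undirected
    uncovered = set(G.nodes())
    n_boxes = 0
    while uncovered:
        seed = uncovered.pop()
        # A box of size r holds nodes within distance r - 1 of the seed,
        # so r = 1 yields one node per box and N(B) equals the node count.
        ball = nx.single_source_shortest_path_length(G, seed, cutoff=r - 1)
        uncovered -= set(ball)
        n_boxes += 1
    return n_boxes
```

Sweeping r and regressing ln N(B) against ln r then yields an estimate of the dominant fractal dimension discussed above.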
We analyze 132 programs corresponding to 17 applications ranging from state-of-the-art high-performance
solvers to the machine learning domain. Relying on dynamic and static compiler analysis, we transform
each program into a dynamic execution graph and measure its corresponding multifractal properties and system-level metrics. Each dot in Figure 4.6 represents one program. Supplementary Notes 2 discusses the general idea behind multifractal analysis.
Figure 4.6: Multifractal analysis characterizes the universal power-law relationship between multifractal properties and system-level metrics. Network multifractal properties are used as inputs to fit a power-law model ax^b relating network properties to system-level metrics. Panels (a-f) show the parallelization degree of code graphs in terms of the generalized fractal dimension (a-b), spectrum width (c), spectrum height (d), α0 (e), and complexity (f). Panels (g-i) show the communication overhead in terms of the spectrum width (g), α0 (h), and complexity (i).
To investigate the existence of a mathematical relationship between the network properties and system-level computing metrics, we measure the generalized fractal dimension (Figure 4.6(a-b)), the spectrum width (Figure 4.6(c, g)), the spectrum height (Figure
4.6(d)), the dominant Lipschitz-Holder exponent α0 (Figure 4.6(e, h)), and the network complexity (Figure
4.6(f, i)). We observe that the network and system-level computing metrics obey a power-law model (i.e., ax^b), indicating the existence of a universality phenomenon characterizing the efficient heterogeneous
software-to-hardware optimization. For example, Figure 4.6(a) shows the power-law trend between the generalized fractal dimension at q = −10, which characterizes the rare network motifs, and the parallelization degree. The higher this dimension, the more frequent the rare patterns in code graphs and the higher the parallelization degree. Going beyond rare network motifs, we investigate the width of the multifractal spectrum, which quantifies the richness of the generating rules characterizing a dynamically complex software.
Figures 4.6(c) and 4.6(g) show the power-law relationship between the multifractal spectrum width and the
parallelization degree, and the communication overhead, respectively, indicating a universality signature.
The larger the multifractal spectrum width, the more heterogeneous the code graph and the higher the
parallelization degree and communication overhead.
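As an illustration of this fitting step, a minimal sketch (with hypothetical placeholder data, not the measured values behind Figure 4.6) could use SciPy's curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * np.power(x, b)

# Hypothetical per-program values: x = spectrum width, y = parallelization degree.
x = np.array([0.8, 1.1, 1.5, 2.0, 2.4, 3.0])
y = np.array([4.0, 7.0, 12.0, 18.0, 24.0, 35.0])

(a, b), _ = curve_fit(power_law, x, y, p0=(1.0, 1.0))
y_hat = power_law(x, a, b)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"fit: y ~ {a:.2f} * x^{b:.2f}, R^2 = {r2:.3f}")
```

In practice, an equivalent fit can also be obtained by linear regression of ln y on ln x.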
Once we extract different graph properties from the multifractal analysis, such as the generalized fractal dimension and the spectrum height and width, we relate these graph-level properties to system-level metrics such as the communication overhead and the parallelization degree by fitting a power-law model to the data. As Figure 4.6 shows, such a model can approximately estimate the system-level metrics from the graph properties. This has two implications.
1. It provides a universal model that builds the relationship between the graph properties such as the
multifractal spectrum and the system-level metrics such as the parallelization degree and communication overhead. If such a model can accurately capture the relationship, the optimal degree of
parallelization can be calculated from the graph properties, without manual tuning by a programmer. For example, if a dynamic execution graph from a piece of code has a spectrum width equal to 2, then we would expect a communication overhead between 80 and 120 × 10^8 clock cycles and a parallelization degree between 10 and 24. On the other hand, if a future platform can support millions of cores, then we can use this model to determine how to write code that exploits the benefits of exascale computing.
2. It informs the design choices we can make in developing the feature extraction algorithm. PGL contains a feature extraction algorithm used by the graph neural network to predict a label. The results show that multifractal analysis can capture the graph topological structures, which can be further exploited in the feature extraction algorithm.
In addition, when the dataset is evaluated in the full-system simulation, we notice that PGL achieves a 1.89x speedup on average, which is consistently better than state-of-the-art graph partitioning algorithms such as METIS [52] and machine learning models, as shown in Table 4.2. It also provides a 4.73x speedup on average compared to state-of-the-art frameworks [53], as shown in Table 4.3.
Table 4.2: Comparison of different graph partitioning algorithms on the 17 applications
Application KM+GCN HDC+GCN MOD+GCN METIS+GCN NN+GCN PGL-GGNN
Algebraic multigrid solver 0.78 0.92 1.36 2.57 4.01 5.85
Fast sequence alignment 0.89 1.04 2.63 5.35 7.22 9.27
DNA sequence mapping 0.92 0.82 1.98 3.73 4.67 6.54
Neural network 0.86 1.12 4.52 8.42 9.64 11.85
Dijkstra 0.85 0.97 1.19 1.57 1.86 2.53
Epidemic simulation 0.98 1.21 2.52 4.22 6.53 8.34
Molecular dynamics 0.92 1.06 3.34 5.16 5.95 7.68
Graph partitioning 0.80 0.96 4.73 9.63 10.74 14.47
Euler equation solver 0.90 1.33 1.75 3.37 5.43 6.74
Evolutionary algorithm 0.94 0.89 1.42 2.56 4.12 6.2
IO proxy application 0.94 1.29 1.76 4.14 5.21 5.88
Mesh refinement application 0.93 1.05 2.78 3.75 4.52 6.32
CNN 0.88 1.27 2.56 5.12 6.43 7.69
Poisson equation solver 0.87 0.98 2.06 4.27 6.24 8.52
Monte Carlo kernel 0.79 0.87 1.89 3.64 4.88 6.03
HACC 0.92 0.86 2.21 4.83 5.75 7.84
Radiative transfer solver 0.97 1.26 2.44 5.12 6.70 8.93
Table 4.3: Comparison of different frameworks on the 17 applications
Application PAR CommDet Aladdin NN+RL PGL-GGNN
1. Algebraic multigrid solver 1 1.32 2.04 1.65 5.85
2. Fast sequence alignment 1 1.28 2.15 1.89 9.27
3. DNA sequence mapping 1 1.46 1.96 2.21 6.54
4. Neural network 1 1.88 3.21 2.67 11.85
5. Dijkstra 1 1.22 1.35 1.05 2.53
6. Epidemic simulation 1 1.09 1.77 2.04 8.34
7. Molecular dynamics 1 1.15 2.20 1.6 7.68
8. Graph partitioning 1 1.27 2.65 2.45 14.47
9. Euler equation solver 1 1.33 2.85 2.5 6.74
10. Evolutionary algorithm 1 1.54 2.54 2.2 6.2
11. IO proxy application 1 1.32 2.96 2.56 5.88
12. Mesh refinement application 1 1.65 2.33 2.75 6.32
13. CNN 1 1.13 1.91 2.24 7.69
14. Poisson equation solver 1 1.08 1.78 2.53 8.52
15. Monte Carlo kernel 1 1.24 2.12 2.64 6.03
16. HACC 1 1.35 2.52 2.31 7.84
17. Radiative transfer solver 1 1.42 2.22 1.75 8.93
4.3.4 Graph auto-encoders can exploit network universality properties for partitioning large software into small kernels and mapping them onto heterogeneous computing systems.
GAE-based partitioning of large software graphs into different kernels. Graph auto-encoders (GAEs) [76] are a category of GNNs that represent nodes as low-dimensional vectors learned in an unsupervised training fashion, unlike other GNNs, which are typically used for supervised or semi-supervised learning tasks. In our framework, the goal of the graph partitioning stage is to obtain a
good partition for each LLVM graph based on a learned representation that captures the intrinsic structural information of the graph, such that the subgraphs preserve the inherent characteristics of the data,
control and memory dependencies in the LLVM graph. To this end, we propose a graph partitioning strategy based on the GAE [77] and spectral clustering [78] for our task. Once the GAE partitions a dynamic
execution graph into kernels, we further refine the partitions to minimize the communication overhead.
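As a minimal sketch of this stage (assuming PyTorch Geometric and scikit-learn; the layer sizes, training schedule, and cluster count are illustrative assumptions rather than the exact configuration used in PGL), a GAE can be trained to reconstruct the adjacency structure of an LLVM graph, after which spectral clustering groups the learned node embeddings into kernels:

```python
import torch
from torch_geometric.nn import GAE, GCNConv
from sklearn.cluster import SpectralClustering

class Encoder(torch.nn.Module):
    def __init__(self, in_dim, hid_dim=64, out_dim=16):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, out_dim)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

def partition_graph(data, n_clusters, epochs=200):
    """data: a torch_geometric.data.Data object for one LLVM graph."""
    model = GAE(Encoder(data.num_features))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(epochs):
        optimizer.zero_grad()
        z = model.encode(data.x, data.edge_index)
        loss = model.recon_loss(z, data.edge_index)  # reconstruct adjacency
        loss.backward()
        optimizer.step()
    z = model.encode(data.x, data.edge_index).detach().numpy()
    # Cluster the unsupervised embeddings; each cluster becomes a candidate kernel.
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="nearest_neighbors").fit_predict(z)
```

The refinement step that minimizes inter-cluster communication then operates on the cluster labels returned by this routine.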
At the partitioning level, it is true that we cannot guarantee an ideal partition of a graph, meaning that a node i that should have been placed into cluster i may end up in cluster j. However, an imperfect partitioning does not necessarily lead to wrong results upon execution. Wrong results are usually caused by (1) missing instructions; (2) a wrong order of instruction execution; or (3) wrong data being fetched. None of these scenarios can happen in our construction of the dynamic execution graph because of the following safeguards: (1) Each graph contains all of the instructions and their direct dependencies, and the proposed partitioning does not remove the instructions and their dependencies. (2) The order of executing the instructions is preserved by our proposed topological sort strategy, which guarantees directed dependencies among clusters, i.e., cluster i is executed before cluster
j if there is a direct edge from cluster i to j. (3) There are two possible cases when an instruction needs
data: (i) whenever it loads data from memory and (ii) whenever it depends on another instruction. When
it needs data from memory, it can be in any cluster. For example, when an instruction from cluster i needs
data from another instruction that is in cluster j, the topological sort during the mapping stage resolves this situation by requiring cluster i to wait until cluster j completes, so that the data is available and sent to cluster i. To prevent livelock and deadlock situations, the optimization model used in partitioning
has a constraint that prevents cyclic dependencies in clusters.
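To make safeguard (2) concrete, the following sketch (a hypothetical helper using NetworkX, not the dissertation's code) builds the cluster-level DAG induced by the partition and returns a topological execution order; the acyclicity constraint mentioned above guarantees that such an order exists:

```python
import networkx as nx

def cluster_schedule(instr_graph: nx.DiGraph, assignment: dict) -> list:
    """instr_graph: dynamic execution graph over LLVM instructions.
    assignment: maps each instruction node to its cluster id."""
    cluster_dag = nx.DiGraph()
    cluster_dag.add_nodes_from(set(assignment.values()))
    for u, v in instr_graph.edges():
        cu, cv = assignment[u], assignment[v]
        if cu != cv:
            # A dependency crossing clusters forces cluster cu before cluster cv.
            cluster_dag.add_edge(cu, cv)
    # The partitioning constraint forbids cyclic cluster dependencies,
    # so a topological order always exists here.
    return list(nx.topological_sort(cluster_dag))
```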
We performed experiments on two different benchmark suites in terms of application performance and report the clock cycles spent on communication and on computation, as shown in Figures 4.7 and 4.8. For example, memory-intensive applications such as Dijkstra involve pointer address manipulation that requires frequent data fetches from memory. When running in PAR without any optimization, the communication overhead within the execution time is 820.52 × 10^7 clock cycles. However, PGL manages to reduce it to 109.38 × 10^7 via the optimization model, which partitions the graph into clusters while minimizing inter-cluster communication. For compute-intensive applications such as FFT, the communication overhead running in PAR is 63.18 × 10^7, whereas it is reduced to 31.10 × 10^7 in PGL. This indicates that even though PGL uses a finer-grained graph representation, the optimization-based partitioning approach reduces the communication overhead between clusters.
Figure 4.7: The breakdown of the execution time of each application in the standard dataset running on
different frameworks. The execution time, measured in clock cycles, is roughly divided into two parts:
communication and computation in (a). We also report communication overhead that is calculated by
clock cycles in communication divided by the total clock cycles in (b). As we can see, PGL has the smallest communication overhead among the frameworks. This is because PGL has an optimization model that partitions the graph into different clusters to minimize inter-cluster communication.
Figure 4.8: The breakdown of the execution time of each application in the real-life dataset running on
different frameworks. The execution time, measured in clock cycles, is roughly divided into two parts:
communication and computation in (a). We also report communication overhead that is calculated by
clock cycles in communication divided by the total clock cycles in (b).
Figure 4.9: Experimental results. Panel (a) shows a comparison of different partitioning algorithms: we compare the GAE-based graph partitioning with traditional algorithms. Panel (b) shows a comparison of different frameworks: we compare PGL with other frameworks in terms of application performance. We conclude that our approach achieves a 2.02x improvement over the state-of-the-art techniques.
GNN-based mapping prediction on heterogeneous computing systems. Once the kernels are refined, we use a GNN for each kernel to predict the correct platform on which to execute it, updating the node vectors iteratively in a message-passing fashion. Note that our proposed PGL is a general framework that can leverage various GNN models for the device mapping prediction stage; in this work, we adopt three variants of the GNN models: GCN [72], graph attention network (GAT) [73, 79], and gated graph neural network (GGNN) [74]. We also empirically investigate the comparative effectiveness of these GNN strategies for representation learning on the partitioned LLVM graphs in the graph classification task of heterogeneous device mapping.
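For concreteness, below is a minimal sketch of such a graph-level classifier (assuming PyTorch Geometric; the two 32-unit layers mirror the configuration described next, while the pooling choice and label head are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class KernelMapper(torch.nn.Module):
    def __init__(self, in_dim, num_devices=2):    # e.g., CPU vs. GPU
        super().__init__()
        self.conv1 = GCNConv(in_dim, 32)
        self.conv2 = GCNConv(32, 32)
        self.head = torch.nn.Linear(32, num_devices)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))      # message passing, layer 1
        h = F.relu(self.conv2(h, edge_index))      # message passing, layer 2
        h = global_mean_pool(h, batch)             # one vector per kernel graph
        return self.head(h)                        # logits over target devices
```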
We fix the graph neural network to a GCN with two hidden layers and 32 neurons per layer, which is used
to predict the correct label for each kernel. We compare the GAE with different partitioning algorithms
such as K-means (KM), hierarchical divisive clustering (HDC), modularity-based community detection
(MOD), METIS, and feed-forward neural network (NN) in terms of the total application execution speedup.
As shown in Figure 4.9(a), for the partitioning models without machine learning such as KM, HDC, MOD,
and METIS, the normalized execution speedup is smaller compared to the learning models such as NN and
GAE. It is mainly because the kernels after graph partitioning are not well recognized by the GCN model.
For the learning models, GAE outperforms NN by up to 32% because the GAE takes into account the graph
structures of code.
In order to validate the framework, including the GAE and GNN models, we use the trained models to make predictions for each application. As shown in Figure 4.9(b), we use traditional thread-based parallel programming running on CPUs as our baseline and compare the PGL framework with community detection, a neural network with reinforcement learning, and Aladdin [54]. We observe that the PGL framework can provide up to 6.42x speedup compared to the baseline and a 2.02x higher speedup compared to the state-of-the-art. Supplementary Notes 5-9 show further experimental results on framework comparison.
4.4 Conclusions
We proposed an end-to-end learnable PGL framework to predict which code segments run best on a
specific hardware device. We first develop a node feature extraction algorithm based on random walks
and multifractal analysis concepts to quantify the local structures of a program. We also measure different
multifractal properties and find the universal relationship between these properties and system-level metrics such as parallelization degree and communication overhead. Next, we build the GAE together with
a decoder and spectral clustering to find the cluster partition from the distance matrix. Then, we use graph
neural networks as the learning model to predict the type of each cluster. Our evaluation on 32 CPUs and
32 GPUs concludes that the PGL framework can provide up to 6.42x speedup compared to the baseline and
2.02x higher speedup compared to the state-of-the-art technique.
We believe the universal power-law model between the multifractal properties and system-level metrics could serve as guidance for designers who plan to explore the best mapping for their applications. For example, if a designer wishes to know the optimal parallelization degree, which is otherwise hard to find, the easiest approach could be to quickly compute a multifractal property such as the spectrum width or α0 and use the model to roughly determine the optimal degree. In the future, this learning framework can be generalized to any system, including those with FPGAs and emerging technologies.
In Figure 4.10, we illustrate some of the limitations of PGL.
Figure 4.10: PGL limitations. (a) It is not beneficial for small code to pay the overhead of PGL while mapping only a few instructions onto cores. (b) Random memory accesses from pointer manipulation are not beneficial in PGL because there can be thousands of false memory dependencies due to LLVM alias analysis, which may increase the communication overhead.
We summarize the limitations and potential future extensions of the proposed framework as follows:
• First, the runtime profiling in the PGL framework only supports C and C++ code that involves complicated computation. Simple code with only a few lines does not benefit from PGL (see Figure 4.10(a) for a simple illustrative example, where a kernel contains three operations and, after partitioning, has only two clusters, one containing one instruction and the other containing two). Such small code is not ideal for PGL because the overhead spent on profiling and partitioning is not mitigated by mapping only a few instructions onto a specific core. In the future, developers could build more runtime systems that support different languages.
• Second, PGL is not suitable for high-level programs that involve many random memory accesses, because the resulting memory dependencies are hard to identify (see Figure 4.10(b)). In such cases, the dynamic execution graph involves memory address manipulation and indexing, which can lead to many false memory dependencies between clusters. While the LLVM alias analysis reports MustAlias, MayAlias, and NoAlias results, we treat both MustAlias and MayAlias as memory dependencies and add an edge between the two instructions irrespective of whether they must or may alias. This may increase the communication overhead when there are too many MayAlias cases. On one hand, after we partition the dynamic execution graph into clusters to minimize inter-cluster communication, if most of the memory dependencies are confined within one cluster, communication does not increase. On the other hand, if many false memory dependencies span different clusters, the communication overhead gets worse. In the future, this can be accounted for in the optimization model that partitions the graph.
• Third, high-level programs have to be compiled and run successfully to collect the execution trace required to build the dynamic execution graph, which can be time- and space-consuming. In the future, developers could mitigate this issue by combining code runtime profiling and graph construction on the fly.
Chapter 5
Conclusions
With CPUs, GPUs, and specialized hardware accelerators co-existing on manycores, there is an increasing tension between programmability of CPUs and efficiency of accelerators. Therefore, we presented the
graph learning framework, which incorporates the necessary SDH and DSSoC capabilities to tackle issues such as the performance inefficiency of general-purpose machines, the inadequate programmability of domain-specific accelerators, and the energy inefficiency of the NoC, by relying on NNs to classify task attributes and intelligent schedulers to manage the set of domain-specific tasks and enable reconfiguration of hardware
components. First, we transform target applications from LLVM IR into IDGs where nodes represent LLVM
IR instructions; edges represent dependencies among instructions; and edge weights represent the amount
of data to be transferred between two nodes. Based on IDGs, we propose sliding window based NN classifiers to detect existing patterns (MM/SGD/ReLU/parallel for-loop structures) in IDGs. NN classifiers are
trained offline with representative IDGs for various special patterns. We then partition the rest of the
graph into interconnected communities (tasks) with minimal data communication overhead to reduce energy consumption. This forms the task pool consisting of heterogeneous tasks: tasks are more suitable
for either CPUs with sequential execution, or GPUs with parallel execution, or accelerators with special
patterns. Next, the task allocator distributes tasks to agents following a load balancing goal. Distributed
intelligent schedulers map tasks to PEs and reconfigure the hardware platform as required. The environment returns rewards indicating the quality of the mapping. Based on these values, agents
can learn the best mapping of hybrid tasks. We conducted experiments on NoC-based heterogeneous PEs
consisting of 32 CPUs, 32 GPUs, and hardware accelerators such as matrix multiplication and stochastic
gradient descent in the machine learning domain. Results indicate that SOSPCS, compared to state-of-the-art application scheduling algorithms, provides performance and energy efficiency improvements as high as 4.12x and 3.24x, respectively. Future work will focus on further optimizations, including moving most of this framework's analysis to hardware and minimizing the software requirements that lead to long delays.
Bibliography
[1] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, “Profiling
a warehouse-scale computer,” in Proceedings of the 42Nd Annual International Symposium on Computer
Architecture, ISCA ’15, (New York, NY, USA), pp. 158–169, ACM, 2015.
[2] D. Pandiyan and C. Wu, “Quantifying the energy cost of data movement for emerging smart phone
workloads on mobile platforms,” in 2014 IEEE International Symposium on Workload Characterization
(IISWC), pp. 171–180, Oct 2014.
[3] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint highthroughput accelerator for ubiquitous machine-learning,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, (New
York, NY, USA), pp. 269–284, ACM, 2014.
[4] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, “Dadiannao: A neural
network supercomputer,” IEEE Trans. Comput., vol. 66, pp. 73–88, Jan. 2017.
[5] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), pp. 367–379, June 2016.
[6] P. Bogdan, “Mathematical modeling and control of multifractal workloads for data-center-on-a-chip
optimization,” in NOCS, 2015.
[7] C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization, 2004. CGO 2004., pp. 75–86,
IEEE, 2004.
[8] B. P. Railing et al., “Contech: Efficiently generating dynamic task graphs for arbitrary parallel programs,” in TACO, 2015.
[9] A. J. Dorta et al., “The openmp source code repository,” in PDP, 2005.
[10] A. Danalis et al., The scalable heterogeneous computing (SHOC) benchmark suite. GPGPU, 2010.
[11] M. R. Guthaus et al., “Mibench: A free, commercially representative embedded benchmark suite,” in
Workload Characterization, 2001.
[12] C. Bienia et al., “The parsec benchmark suite: Characterization and architectural implications,” in
PACT, 2008.
[13] J. Draper et al., “The architecture of the diva processing-in-memory chip,” in ICS, 2002.
[14] H. M. C. Consortium, “Hybrid memory cube specification 2.1,” 2013.
[15] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for
parallel graph processing,” in Proceedings of the 42nd Annual International Symposium on Computer
Architecture, pp. 105–117, 2015.
[16] L. Nai et al., “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,”
in HPCA, 2017.
[17] S. H. Pugsley et al., “Ndc: Analyzing the impact of 3d-stacked memory+ logic devices on mapreduce
workloads,” in ISPASS, 2014.
[18] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,”
SISC, 1998.
[19] S. Che et al., “Rodinia: A benchmark suite for heterogeneous computing,” in IISWC, 2009.
[20] L. Subramanian et al., “The application slowdown model: Quantifying and controlling the impact of
inter-application interference at shared caches and main memory,” in MICRO, 2015.
[21] N. Jiang et al., “A detailed and flexible cycle-accurate network-on-chip simulator,” in ISPASS, 2013.
[22] N. Muralimanohar et al., “Cacti 6.0: A tool to model large caches,” in HP Laboratories, 2009.
[23] J. Hu and R. Marculescu, “Energy-and performance-aware mapping for regular noc architectures,” in
IEEE TCAD, 2005.
[24] G. Kim et al., “Memory-centric system interconnect design with hybrid memory cubes,” in PACT,
2013.
[25] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. Jesus, R. Berriel,
T. M. Paixao, F. Mutz, et al., “Self-driving cars: A survey,” Expert Systems with Applications, vol. 165,
p. 113816, 2021.
[26] K. Zhong, X. Zhou, J. Huo, C. Yu, C. Lu, and A. P. T. Lau, “Digital signal processing for short-reach
optical communications: A review of current technologies and future trends,” Journal of Lightwave
Technology, vol. 36, no. 2, pp. 377–400, 2018.
[27] L. P. Koh and S. A. Wich, “Dawn of drone ecology: low-cost autonomous aerial vehicles for conservation,” Tropical conservation science, vol. 5, no. 2, pp. 121–132, 2012.
[28] S. Krishnan, B. Borojerdian, W. Fu, A. Faust, and V. J. Reddi, “Air learning: An ai research platform
for algorithm-hardware benchmarking of autonomous aerial robots,” arXiv preprint arXiv:1906.00421,
2019.
[29] D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations for high-performance computing,” ACM Computing Surveys (CSUR), vol. 26, no. 4, pp. 345–420, 1994.
[30] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra, “Dague: A generic
distributed dag engine for high performance computing,” Parallel Computing, vol. 38, no. 1-2, pp. 37–
51, 2012.
[31] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in
2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 481–492,
IEEE, 2017.
[32] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan, “A load balancing inspired optimization framework for
exascale multicore systems: a complex networks approach,” in ICCAD, pp. 217–224, 2017.
[33] S. Mittal and J. S. Vetter, “A survey of cpu-gpu heterogeneous computing techniques,” ACM Computing
Surveys (CSUR), vol. 47, no. 4, pp. 1–35, 2015.
[34] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end
of multicore scaling,” in 2011 38th Annual international symposium on computer architecture (ISCA),
pp. 365–376, IEEE, 2011.
[35] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, “End-to-end deep learning of optimization
heuristics,” in 2017 26th International Conference on Parallel Architectures and Compilation Techniques
(PACT), pp. 219–232, IEEE, 2017.
[36] H. Leather and C. Cummins, “Machine learning in compilers: Past, present and future,” in 2020 Forum
for Specification and Design Languages (FDL), pp. 1–8, IEEE, 2020.
[37] M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, “The deep
learning compiler: A comprehensive survey,” IEEE Transactions on Parallel and Distributed Systems,
vol. 32, no. 3, pp. 708–727, 2020.
[38] Z. Wang and M. O’Boyle, “Machine learning in compiler optimization,” Proceedings of the IEEE,
vol. 106, no. 11, pp. 1879–1901, 2018.
[39] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano, “A survey on compiler autotuning
using machine learning,” ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–42, 2018.
[40] A. Haj-Ali, N. K. Ahmed, T. Willke, J. Gonzalez, K. Asanovic, and I. Stoica, “A view on deep reinforcement learning in system optimization,” arXiv preprint arXiv:1908.01275, 2019.
[41] Y. Zhou, S. Roy, A. Abdolrashidi, D. Wong, P. Ma, Q. Xu, H. Liu, M. P. Phothilimtha, S. Wang, A. Goldie,
et al., “Transferable graph optimizers for ml compilers,” arXiv preprint arXiv:2010.12438, 2020.
[42] A. Haj-Ali, N. K. Ahmed, T. Willke, Y. S. Shao, K. Asanovic, and I. Stoica, “Neurovectorizer: End-to-end
vectorization with deep reinforcement learning,” in Proceedings of the 18th ACM/IEEE International
Symposium on Code Generation and Optimization, pp. 242–255, 2020.
[43] A. Haj-Ali, N. K. Ahmed, T. Willke, S. Shao, K. Asanovic, and I. Stoica, “Learning to vectorize using
deep reinforcement learning,” in Neurips workshop on Machine Learning for Systems., 2019.
[44] Y. Jinnai, A. Mehrjou, K. Ciosek, A. Mitenkova, A. Lawrence, T. Ellis, R. Tomioka, S. P. Jones, and
A. Fitzgibbon, “Knossos: Compiling ai with ai,” 2019.
[45] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken, “Taso: optimizing deep learning computation with automatic generation of graph substitutions,” in Proceedings of the 27th ACM
Symposium on Operating Systems Principles, pp. 47–62, 2019.
[46] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” arXiv
preprint arXiv:1807.05358, 2018.
[47] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and
J. Dean, “Device placement optimization with reinforcement learning,” in International Conference on
Machine Learning, pp. 2430–2439, PMLR, 2017.
[48] C. Cummins, Z. Fisches, T. Ben-Nun, T. Hoefler, M. O’Boyle, and H. Leather, “ProGraML: A Graphbased Program Representation for Data Flow Analysis and Compiler Optimizations,” in International
Conference on Machine Learning (ICML), 2021.
[49] D. Grewe, Z. Wang, and M. F. O’Boyle, “Portable mapping of data parallel programs to opencl for
heterogeneous systems,” in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10, IEEE, 2013.
[50] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, “Neural code comprehension: A learnable representation
of code semantics,” arXiv preprint arXiv:1806.07336, 2018.
[51] S. Fortunato, “Community detection in graphs,” Physics reports, vol. 486, no. 3-5, pp. 75–174, 2010.
[52] D. LaSalle, M. M. A. Patwary, N. Satish, N. Sundaram, P. Dubey, and G. Karypis, “Improving graph
partitioning for modern graphs and architectures,” in Proceedings of the 5th Workshop on Irregular
Applications: Architectures and Algorithms, pp. 1–4, 2015.
[53] Y. Xiao, S. Nazarian, and P. Bogdan, “Self-optimizing and self-programming computing systems: A
combined compiler, complex networks, and machine learning approach,” IEEE transactions on very
large scale integration (VLSI) systems, vol. 27, no. 6, pp. 1416–1427, 2019.
[54] Y. S. Shao, B. Reagen, G. Wei, and D. Brooks, “Aladdin: A pre-rtl, power-performance accelerator
simulator enabling large design space exploration of customized architectures,” in SIGARCH, vol. 42,
pp. 97–108, 2014.
[55] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of
code,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019.
[56] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the
22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864,
2016.
[57] Y. Xue and P. Bogdan, “Reliable multi-fractal characterization of weighted complex networks: algorithms and implications,” Scientific reports, vol. 7, no. 1, p. 7487, 2017.
[58] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceedings of the IEEE, vol. 75, no. 9,
pp. 1235–1245, 1987.
[59] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital
signal processing,” IEEE Transactions on computers, vol. 100, no. 1, pp. 24–35, 1987.
[60] “Scheduling dynamic dataflow graphs with bounded memory using the token flow model,” in 1993
IEEE international conference on acoustics, speech, and signal processing, vol. 1, pp. 429–432, IEEE, 1993.
[61] G. C. Sih and E. A. Lee, “A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures,” IEEE transactions on Parallel and Distributed systems, vol. 4, no. 2,
pp. 175–187, 1993.
[62] G. C. Sih and E. A. Lee, “Declustering: A new multiprocessor scheduling technique,” IEEE Transactions
on Parallel and Distributed Systems, vol. 4, no. 6, pp. 625–637, 1993.
[63] J. L. Pino, T. M. Parks, and E. A. Lee, “Automatic code generation for heterogeneous multiprocessors,”
in Proceedings of ICASSP’94. IEEE International Conference on Acoustics, Speech and Signal Processing,
vol. 2, pp. II–445, IEEE, 1994.
[64] J. Falk, J. Keinert, C. Haubelt, J. Teich, and S. S. Bhattacharyya, “A generalized static data flow clustering algorithm for mpsoc scheduling of multimedia applications,” in Proceedings of the 8th ACM
international conference on Embedded software, pp. 189–198, 2008.
[65] A. Brauckmann, A. Goens, S. Ertel, and J. Castrillon, “Compiler-based graph representations for deep
learning models of code,” in Proceedings of the 29th International Conference on Compiler Construction,
pp. 201–211, 2020.
[66] A. Weichslgartner, D. Gangadharan, S. Wildermann, M. Glaß, and J. Teich, “Daarm: Design-time
application analysis and run-time mapping for predictable execution in many-core systems,” in Proceedings of the 2014 International Conference on Hardware/Software Codesign and System Synthesis,
pp. 1–10, 2014.
[67] J. Spieck, S. Wildermann, and J. Teich, “A learning-based methodology for scenario-aware mapping
of soft real-time applications onto heterogeneous mpsocs,” ACM Transactions on Design Automation
of Electronic Systems, vol. 28, no. 1, pp. 1–40, 2022.
[68] A. T. Nguyen, T. D. Nguyen, H. D. Phan, and T. N. Nguyen, “A deep neural network language model
with contexts for source code,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 323–334, IEEE, 2018.
[69] E. Parisi, F. Barchi, A. Bartolini, and A. Acquaviva, “Making the most of scarce input data in deep
learning-based source code classification for heterogeneous device mapping,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 6, pp. 1636–1648, 2021.
[70] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity
of graph structured objects,” in International Conference on Machine Learning, pp. 3835–3845, PMLR,
2019.
[71] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for
programming language processing,” in Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 30, 2016.
[72] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv
preprint arXiv:1609.02907, 2016.
[73] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,”
arXiv preprint arXiv:1710.10903, 2017.
[74] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv
preprint arXiv:1511.05493, 2015.
[75] Y. Xiao, S. Nazarian, and P. Bogdan, “Plasticity-on-chip design: Exploiting self-similarity for data
communications,” IEEE Transactions on Computers, vol. 70, no. 6, pp. 950–962, 2021.
[76] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A
review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
[77] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
[78] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416,
2007.
[79] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention models in graphs: A survey,” ACM
Transactions on Knowledge Discovery from Data (TKDD), vol. 13, no. 6, pp. 1–25, 2019.
Abstract
The recent technological advances have significantly contributed to a rapid increase in algorithmic complexity of various applications, from digital signal processing to autonomous aerial, ground and underwater systems. In order to control and manage this increased algorithmic complexity, heterogeneous computing systems require intelligent, flexible and highly efficient programming strategies to provide high performance while minimizing energy costs. However, the current monolithic programming models and task mapping to compute engines do not fully exploit the recent architectural innovations and can exacerbate the load imbalance and communication inefficiencies.
In order to fully utilize the capabilities of hardware platforms, the compilation of parallel programs requires expert heuristics to decide how many threads to spawn and how to schedule them onto heterogeneous computing systems. Due to workload imbalance, synchronization overhead, and resource-sharing contention, the overall execution may be sub-optimal. Therefore, it is crucial for programmers to decide which code segments should run on a specific processor (e.g., CPU or GPU).
In this dissertation, we develop a novel programming model for heterogeneous computing platforms. Specifically, we first collect the representative dynamic trace generated from executing a program. This trace contains a sequence of low level virtual machine (LLVM) intermediate representation (IR) instructions to be executed. Then, for each instruction, we check if data, control, and memory dependencies exist and insert a directed edge to construct the graph. By developing this framework, we are able to partition the graph into clusters which will be later mapped to heterogeneous platforms or processing-in-memory (PIM). Experimental results demonstrate the system performance improvement over some state-of-the-art techniques in the field.