ENHANCING COLLABORATION ON THE EDGE: COMMUNICATION, SCHEDULING
AND LEARNING
by
Diyi Hu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2023
Copyright 2023 Diyi Hu
Dedication
To my husband and parents.
Acknowledgments
I would like to express my genuine gratitude to my PhD advisor, Prof. Bhaskar Krishnamachari,
for his guidance and support throughout these years. His passion and vision for research, open-mindedness, and altruism have profoundly influenced both my research and my life. I feel very fortunate to have worked with Prof. Krishnamachari, my labmates including Quynh, Pranav, Xiangchen
and Pradipta, and postdoc Gowri. I also thank my other labmates Kwame, Jason, Martin, Elizabeth,
Lillian, Mehrdad, Arvin, Sampad, Sulyab, Tamoghna, Jared, Narjes and Yousef. The experience
in the group will continue to inspire and encourage me in the future.
I would like to thank my defense committee members, Prof. Feifei Qian and Prof. Aiichiro Nakano, for their valuable feedback on my thesis and defense. I would also like to extend my gratitude to my other qualifying exam committee members, Prof. Cauligi Raghavendra and Prof. Jyotirmoy Deshmukh.
I also thank DARPA for the research funding and the EE and CS Departments for the financial support. I have also received warm help from USC staff members such as Diane and Andy, and I want
to thank them too.
I would like to thank my mentors and colleagues during my internship at Amazon AWS, espe-
cially Chris, Lingjie and Eddie. Their expertise and generous help have given me a great first
impression of the industry.
I would like to thank my family. I thank my husband for his affection, empathy and support
throughout the years. We met each other during the PhD orientation at USC and shared all the
happiness and sorrows in this journey. I was fortunate enough to know such an intelligent, humble,
sincere and lovely person. His company and love were the light in my most difficult times, turning
the bitterness into sweet memories.
I thank my parents for their unconditional love, support and encouragement. I appreciate their
understanding and respect, which allowed me to pursue my dreams without hesitation. I thank my uncle for his guidance and encouragement at the most important stages of my life. His passion for and dedication to research in physics has always inspired me. I thank my aunt and grandma for their care and love. My aunt's gentleness, amiability and good balance of life and career have taught me to enjoy life. My grandma has always made me feel safe and loved. I would also like to thank my grandpa (a professor of physics) in heaven. His love and recognition of me, as well as his enthusiasm for physics, remain vivid in my memory.
I would also like to thank my friends for their company. I enjoyed the food, hikes and laughter with them. Finally, I need to say "thank you" to my cat, who is so peaceful, gentle and considerate. He never meows at night to interrupt my sleep. He quietly lies on my desk as I type these lines.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Edge Computing and Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 ML Applications on the Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Challenges and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.1 Streaming CNN Inference with Communication Compression . . . . . . . 5
1.5.2 A General Framework for MARL with Communication . . . . . . . . . . . 7
1.5.3 Wireless Communication-Enhanced Value Decomposition . . . . . . . . . 9
Chapter 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Compact CNN Models on the Edge . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Wireless Communication Environment . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Collaborative Inference and CNN Model Partition . . . . . . . . . . . . . . . . . . 19
3.2 CNN Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Dynamic Schedule of Compressed Models . . . . . . . . . . . . . . . . . . . . . . 20
3.4 MARL with Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Value Decomposition and QMIX . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 GNN in MARL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 4 Fast and Accurate Streaming CNN Inference via Communication Compression
on the Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Optimization Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.2 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.3 Categorization of CNN Models and Edge Devices . . . . . . . . . . . . . . 28
4.2 Optimized Pipeline Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 Load-Balance of the Computation Stages . . . . . . . . . . . . . . . . . . 30
4.2.2 Inter-device Communication Bottleneck . . . . . . . . . . . . . . . . . . . 31
4.2.3 Load-Reduction for the Communication Stages . . . . . . . . . . . . . . . 32
4.2.4 Adaptive Communication Compression . . . . . . . . . . . . . . . . . . . 37
4.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Inference Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Expected Effective Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.3 Performance Analysis for Adaptive Compression . . . . . . . . . . . . . . 42
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.2 Evaluation on Compressors . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.3 Evaluation on End-to-End Performance . . . . . . . . . . . . . . . . . . . 47
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5 Learning Practical Communication Strategies in Cooperative Multi-Agent Rein-
forcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 When: MDP Re-Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.2 What: Observation Augmentation . . . . . . . . . . . . . . . . . . . . . . 53
5.1.3 How: Message Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.4 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.2 Predator Prey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.3 Lumberjacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 6 Wireless Communication-Enhanced Value Decomposition for Multi-Agent Rein-
forcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Modeling: Augmented MDP on Aligned Environments . . . . . . . . . . . . . . . 71
6.2 Decentralized Agent Execution: Information Extraction under Transmission Stochas-
ticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1 Observation Fuser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.2 Message Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.3 Memory-aided Q Generator . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Centralized Value Decomposition: Communication-Enhanced Mixer . . . . . . . . 77
6.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3.2 Mathematical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5.2 Performance Comparison with State-of-the-Art . . . . . . . . . . . . . . . 87
6.5.3 Behavior Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 7 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2.1 Collaborative LLM inference on the edge . . . . . . . . . . . . . . . . . . 96
7.2.2 Deployment of LAUREL on UAVs for the resource coverage problem . . . . . . 97
7.2.3 Communication-Enhanced Value Decomposition in Heterogeneous Systems 98
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Tables
4.1 Notations related to CNN inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Architecture design of the compressor A . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Performance analysis: expected throughput and expected effective accuracy for origi-
nal MobileNetv2, with static compression ratio 2, and with adaptive compression ratio. 44
4.4 MobileNet-v2: Top-1 accuracy and computation overhead under various compression ratios . . . 46
4.5 VGG16: Top-1 accuracy and computation overhead under various compression ratios . . . . . . 46
5.1 Notations related to message encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Benefits of proposed message encoder . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1 Reward gain from positive listening . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
List of Figures
2.1 Inverted Residual Block proposed by MobileNet to reduce the number of operations in
regular convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1 Overview of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Categorization of CNNs and edge devices . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Computation and communication time per stage . . . . . . . . . . . . . . . . . . . . . 32
4.4 Using various compressors based on threshold . . . . . . . . . . . . . . . . . . . . . 38
4.5 Expected effective accuracy under varying µ, when B_i ∼ N(µ, 2) . . . . . . . . . . . 47
4.6 Expected effective accuracy under varying σ, when B_i ∼ N(5, σ) . . . . . . . . . . . 48
5.1 Two ways of agent-environment interaction: (a) The original way (non-differentiable
training). (b) The new way (after lagging and alignment). . . . . . . . . . . . . . . . . 51
5.2 Example architecture of on-policy LAUREL . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Example architecture of off-policy LAUREL . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Comparison with state-of-the-art methods . . . . . . . . . . . . . . . . . . . . . . . . 64
5.5 Average communication action in one trajectory . . . . . . . . . . . . . . . . . . . . . 65
5.6 Communication adapted to limited bandwidth . . . . . . . . . . . . . . . . . . . . . . 66
5.7 Comparison with state-of-the-art methods . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1 The architecture of N agents: Assume N = 3. (a) A decentralized Agent Network.
(b) The overall architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 The architecture of communication-enhanced Mixing Network: Assume N = 3. (a)
The overall architecture as an L-layer GNN model. (b) The weights of the GNN in each
layer generated by PEHypernets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Comparison with state-of-the-art: (a) PP with obstacles in a 7×7 grid, 3 agents. (b) PP
with obstacles in a 10×10 grid, 4 agents. (c) LJ with tree shadowing in a 7×7 grid, 4
agents, 3 trees. (d) LJ with tree shadowing in a 10×10 grid, 5 agents, 3 trees. . . . . . . 88
6.4 Sample trajectory showing positive listening: (a) All agents’ positions when Agent 1
first finds prey at step 9. (b) All agents’ positions at step 10. (c) Step 1 to 9. (d) Step
10 to 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.5 Cosine similarity of messages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Ablation study of the communication-enhanced value decomposition: (a) PP (b) LJ . . 92
Abstract
With increasing computation power and memory storage capacity of edge devices, application data
generated at the edge can be processed locally, without relying on the remote cloud. This compu-
tation paradigm is called edge computing. Wireless communication enables collaboration among
nodes in the distributed edge computing system. By transmitting the intermediate computation
results, edge devices can serve in two different roles: (1) computation resources, and (2) intel-
ligent agents. As computation resources, the devices can jointly execute a job decomposable as
data-dependent tasks. For example, in distributed CNN inference, we schedule stacks of layers to
optimize throughput of batch inference. As intelligent agents, the devices can cooperate towards a
common goal by learning application specific actions and communication strategies. For example,
in “Search-And-Rescue”, a classic multi-agent reinforcement learning application, multiple robots
jointly rescue victims in a complicated terrain. Communication helps agents explore the shared
environment and establish collaboration. For both roles, however, efficient collaboration is hard
to achieve because: (a) heterogeneity in wireless links, edge devices' resources and decomposed tasks' workloads makes it hard to improve parallelism among devices; (b) complexity in the environment due to network condition stochasticity, agents' mobility and resource scarcity significantly constrains the feasible communication strategies; and (c) dynamicity of the communication pattern in learning-based approaches complicates credit assignment during training, making it unclear how to assess the dynamic communication's impact on the coordinated goal.
This thesis advances the collaboration in distributed edge systems from both perspectives. First,
for edge devices as computation resources, we study CNN inference as a representative workload.
Under resource and network heterogeneity, we design an optimal scheduler to split the CNN model
for balancing each device's computation load. To address the inter-device communication bottleneck, we propose (Variational) Auto-Encoder architectures for compressing the intermediate layer outputs. Further, to address the variation of network bandwidth, we propose an adaptive
data compression scheme based on quantitative tradeoff analysis among data compression ratio,
throughput and model accuracy.
For edge devices as intelligent agents, we focus on applying Multi-Agent Reinforcement Learn-
ing (MARL) algorithms under realistic, complicated wireless environments. We design a general
MARL framework, LAUREL, that improves the communication strategy from three fundamental
aspects: (1) When: Agents learn the timing of communication based on not only message impor-
tance but also wireless channel conditions. (2) What: Agents augment message contents with
wireless network measurements to better act in the environment. (3) How: Agents use a novel neu-
ral message encoder to preserve all information from received messages, regardless of the number
and order of messages. We empirically demonstrate the effectiveness of the proposed framework
and its individual components, and validate its generality on different algorithms. We further ana-
lyze the math properties to losslessly encode the messages in the stochastic wireless channels.
We prove the proposed encoder satisfies such math properties. Finally, we investigate the partic-
ular case of enhancing the value decomposition of multi-agent reinforcement learning under the
challenging setting of wireless channels. We capture the dynamic communication with a graph
and design a Graph Neural Network (GNN) based value decomposition architecture. Our design
significantly improves the value decomposition and credit assignment by effectively evaluating
the impact of communication on the collective success. We prove the mathematical properties of
the proposed decomposition network. Experiments show more than 20% improvement in terms of the converged number of steps or the return in complex scenarios. Overall, our work significantly advances the collaboration quality of edge devices.
Chapter 1
Introduction
1.1 Edge Computing and Collaboration
Recent years have witnessed the rapid growth of the number of edge devices as well as the variety
of edge applications. By May 2022, there had been 14.4 billion actively connected IoT devices.
Nowadays, edge devices have been pervasively used in security systems (e.g., surveillance cam-
eras, drones), smart home systems (e.g., cameras, PCs, Apple Watch, etc.), and robot systems (e.g.,
robots integrated with radar, LiDAR, cameras, etc.). We focus on applications running on the edge
devices [1, 2, 3] as opposed to running on cloud servers. The advantages of executing applications directly on the edge are twofold. From the performance perspective, computing on the
edge can lead to low latency and low cost since the data are processed near the data source without
transmission to the remote cloud servers. From the security and privacy perspective, computing on
the edge means all sensitive information is only accessed by trusted local nodes.
On the other hand, challenges arise due to the limited computation and memory resources on
each device. A natural way to address such challenges is to leverage the aggregated computation
power from a cluster of interconnected devices – many realistic applications are executed on the
distributed edge. However, due to the heterogeneity of the edge devices and the variety of tasks, it is
non-trivial to fully exploit the available resources of all devices. This leads to the research problem
of collaboration, i.e., how to make the cluster of devices effectively work towards a common goal.
In this thesis, we develop various methods to improve the collaboration on the distributed edge.
Specifically, since communication is critical to establish collaboration, we consider edge devices
communicating with each other via wireless networks.
There are various motivations to promote collaboration among a cluster of edge devices:
(a) Some tasks can only be completed by joint efforts of multiple participants (e.g.,
Search-And-Rescue (SAR) robots deployed in hostile environments). (b) A task can be completed
faster or more efficiently by exploiting model parallelism or data parallelism within the edge clus-
ter. (c) Sharing knowledge with each other can guide the edge devices (e.g., robots, drones) to
better understand and explore a complicated environment.
1.2 Categorization
To study the various kinds of collaboration strategies, we first classify distributed edge clusters into
two broad categories according to the role of edge devices: (a) Edge devices serve as pure com-
putation resources executing pre-programmed jobs in parallel. In this category, the intermediate
computation results are transmitted for subsequent execution of downstream tasks. Thus, the com-
munication is mandatory and pre-programmed. (b) Edge devices serve as intelligent agents that
interact with each other in a shared environment. In this category, each agent learns from the feed-
back and updates its decision policy accordingly. Therefore, the communication strategy gradually
evolves. Here the communication behavior is determined by intelligent decisions that depend on the current state of the environment as perceived by the agent.
In the first category, edge devices are computation resources executing a pre-programmed job.
The job can be either completely offloaded to the edge/cloud server, or executed locally on the
edge device. Early works aim to make a binary decision on offloading or not, based on the cur-
rent upload bandwidth and available computation resources. Instead of offloading the whole job,
recent works [4] partition the job (e.g., inference of Convolutional Neural Networks (CNNs)), at
a “natural split point”. Subsequently, the initial part of the job, which is lightweight, is executed
on the edge device, while the remaining part, which is more computation intensive, is executed on
the edge/cloud server. In this case, the communicated data are the intermediate results, and scheduling is hierarchical (from edge to server). However, such "natural split points" may not exist in
many cases. Unlike these works adopting an “edge-cloud” computation paradigm, another line
of research [5, 6, 7] adopts a fully decentralized computation paradigm on the edge cluster and performs model partitioning at multiple points.
In the second category, the edge devices have learning abilities so that they serve as intelligent
agents exploring (e.g., by navigating) and evolving in a shared environment. This thesis considers
the cases where such edge devices collaborate towards a common mission goal. In other words,
we focus on collaborative Multi-Agent Reinforcement Learning (MARL) with communication.
While there have been many works in the literature to motivate collaboration in MARL, most of
them are built upon over-simplified models for the wireless network: (a) Ideal Wireless Networks:
In this simple wireless model, whatever messages are sent out will be successfully received by all agents. (b) Resource-constrained wireless network: This wireless model improves on the ideal one by considering limited bandwidth resources. When multiple agents simultaneously send k′ messages, the environment guarantees min{k, k′} messages to be successfully received by all agents (where k
is a manually configured constant characterizing the bandwidth limit). Note that these works still
assume perfect transmission (i.e., transmission by selected agents always succeeds), and ignore
many critical factors in a realistic wireless environment. Therefore, there is still a significant gap
between state-of-the-art MARL frameworks and practical deployments.
Notably, the categorization based on roles of edge devices also divides the corresponding com-
munication into two categories: the former is more elementary: it resolves data dependencies and is pre-scheduled and mandatory. The latter is more advanced: interestingly, it resembles human-level communication, in which intelligent agents learn to establish a meaningful common language. Such a "language" can be understood within the group and then guides behavior.
1.3 ML Applications on the Edge
Machine Learning (ML) has revolutionized most fields of science and technologies, and remark-
ably changed our daily life. It has demonstrated unprecedented performance in processing and
understanding data, and making sequential decisions. For example, it has beaten the best human
players in Atari Games and Go. State-of-the-art image classification models can achieve 90% accuracy on the ImageNet dataset, while human experts reach an error rate of roughly 5%. Nowadays, Large Language Models (LLMs) based on Transformers have extraordinary performance on tasks originally designed
for humans, e.g., SAT, bar exams, code generation, and question answering. Concurrently, the
development of embedded Systems-on-a-Chip (SoCs) opens up the opportunity for a greater num-
ber of edge devices to run machine learning applications. With the pervasive use of edge devices
and the prospect of Artificial General Intelligence (AGI), there is a burgeoning trend to bring ML
applications on the edge.
The major branches of ML applications include Computer Vision (CV), Natural Language
Processing (NLP), Reinforcement Learning (RL) and AI for science (e.g., bioinformatics). The
emerging LLMs (e.g., GPT-4 [8], LLaMA [9]) bear billions of parameters and are ill-suited to edge devices. Yet recent light-weight CNN models allow CV applications to be widely deployed
on the edge. With cameras capturing images from different geographical locations, vehicular data
analysis, video stream analytics and even virtual reality can be performed at the edge. Therefore,
in Chapter 4, we choose CNN inference as a typical use case for edge devices serving as pure computation resources. In Chapters 5 and 6, we focus on MARL applications with wireless
channels. Robots or drones make sequential decisions based on local data collected by the inte-
grated sensors, as well as received messages from the peers to expedite collaboration.
1.4 Challenges and Motivation
When leveraging edge devices as pure computation resources, most existing works focus on minimizing the latency of a single job instance (e.g., the CNN inference time of one input image frame). However, in many realistic applications, input frames continuously flow into the edge devices in a streaming fashion. For instance, a camera sensor generates input images at a steady rate. The optimization goal in such a case becomes inference throughput instead of single-frame latency. How to optimize throughput in the edge cluster under limited wireless bandwidth is under-explored. When deploying edge devices as intelligent agents, existing works make over-simplified assumptions about the wireless networks. In reality, a transmitted message is generally not guaranteed
to be received successfully. We summarize the challenges of multi-device collaboration under
realistic wireless communication: (a) heterogeneity in wireless links, edge devices' resources and decomposed tasks' workloads makes it hard to improve parallelism among devices; (b) complexity in the environment due to network condition stochasticity, agents' mobility and resource scarcity significantly constrains the feasible communication strategies; and (c) dynamicity of the communication pattern in learning-based approaches complicates credit assignment during training. It is unclear how to assess the dynamic communication's impact on the coordinated goal.
1.5 Contributions
In this dissertation, we leverage communication to enhance the collaboration among edge devices
executing ML applications in a decentralized way. For CNN inference, we employ model parallelism with load balancing. We design a communication schedule with data compression that adapts to changes in channel resources. For the MARL applications, the proposed framework is concerned with decentralized execution under partial observability. We resolve the unique challenges in the setting of realistic wireless channels and further capture the dynamic communication patterns in the value decomposition process.
1.5.1 Streaming CNN Inference with Communication Compression
Convolutional Neural Networks (CNNs) are fundamental models for Computer Vision. Conven-
tionally, CNN inference is performed on the cloud, while input data is collected on the edge.
Unfortunately, such a cloud-centric paradigm requires long-distance data transmission, resulting in
substantial upload bandwidth consumption, high latency and privacy concerns [10]. Thus, a recent
trend is inference on the edge.
To close the natural gap between complex CNN models and resource-constrained edge devices,
researchers have designed compact CNNs [11, 12, 13, 14, 15]. Without degrading the accuracy,
these models relieve the memory storage pressure of edge devices with their small model sizes,
and alleviate the burden on the edge processors with their light-weight layer operations.
In applications such as vehicle detection and video analytics, streaming input data are collected
by IoT sensors and continuously generated at a high rate. However, it is non-trivial to optimize
inference throughput for streaming data, due to:
Bandwidth scarcity To improve throughput, a widely used approach, model partitioning, splits
the CNN into multiple groups of layers. We then deploy each group to an edge device or edge
server. However, limited network bandwidth hinders the performance of such a deployment. The works of [16, 17] replace the original CNN with a smaller one using early-exit or distillation [18]. However, these techniques are not ideal for two reasons. First, emerging CNN models such as MobileNetV2 are already very compact and hard to compress. Second, both techniques aim at shrinking the CNN model; the reduction in the hidden-layer outputs that must be communicated is limited.
Bandwidth variation Due to interference and varying signal strength, the network channel bandwidth varies over time. The works in [16, 19] show that CNN inference performance is highly
sensitive to the change of bandwidth. It remains a question how to dynamically adjust the infer-
ence framework according to the bandwidth. For example, when bandwidth is low, it is desirable
to have less intermediate data transmission to avoid throughput degradation. On the other hand,
such restriction on data transmission may lower inference accuracy.
In Chapter 4, we design a CNN inference framework on a local edge device cluster. Following model partitioning, we assign grouped CNN layers to edge devices that form a pipeline. We address the above two challenges by compressing the intermediate activations that are communicated between edge devices. We use AutoEncoder (AE) and Variational AutoEncoder (VAE) architectures to achieve high compression ratios. End-to-end training is performed on the compressed CNN for accuracy
recovery. Our results include:
• We propose a composite metric effective accuracy, that jointly evaluates throughput and
accuracy quantitatively.
• We propose a fast CNN partitioning algorithm to achieve optimal computation load-balance
across edge devices.
• We propose data compressors based on the (Variational) Auto-Encoder architecture to
address the communication bottleneck. The compressor is flexible in terms of compression
rates, and preserves accuracy with negligible overhead (see the sketch after this list).
• We propose a runtime scheduler that dynamically selects pre-trained compressors according to the available network bandwidth to optimize inference throughput and effective accuracy.
• We evaluate MobileNet-v2 and VGG16 on processing pipelines consisting of Raspberry Pi 3B+ devices. Our framework consistently achieves significant accuracy improvements under a wide range of Wi-Fi network bandwidths.
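To make the compression idea concrete, the following is a minimal PyTorch-style sketch of an auto-encoder that shrinks an intermediate activation before transmission and restores it on the receiving device. The class name, layer sizes and the `ratio` parameterization are illustrative assumptions only; the actual compressor in Chapter 4 is built from inverted residual blocks and is trained end-to-end with the CNN.

import torch
import torch.nn as nn

class ActivationCompressor(nn.Module):
    """Illustrative AE that shrinks an intermediate activation's channel dimension.

    `c_in` is the channel count of the activation to transmit; `ratio` controls the
    compression along the channel dimension (a hypothetical parameterization).
    """
    def __init__(self, c_in: int, ratio: int = 2):
        super().__init__()
        c_code = max(1, c_in // ratio)
        # Encoder runs on the sending device: a 1x1 conv reduces the channel count.
        self.encoder = nn.Sequential(
            nn.Conv2d(c_in, c_code, kernel_size=1), nn.ReLU6(inplace=True))
        # Decoder runs on the receiving device: a 1x1 conv restores the channel count.
        self.decoder = nn.Sequential(
            nn.Conv2d(c_code, c_in, kernel_size=1), nn.ReLU6(inplace=True))

    def compress(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decompress(self, code: torch.Tensor) -> torch.Tensor:
        return self.decoder(code)

if __name__ == "__main__":
    # Example: compress a 32-channel, 56x56 activation by 2x along the channel dimension.
    comp = ActivationCompressor(c_in=32, ratio=2)
    act = torch.randn(1, 32, 56, 56)
    code = comp.compress(act)         # transmitted over the wireless link
    restored = comp.decompress(code)  # reconstructed on the next device
    print(code.shape, restored.shape)

In the full framework, several such compressors with different compression ratios would be pre-trained together with the partitioned CNN, and the runtime scheduler would pick one according to the measured bandwidth.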
1.5.2 A General Framework for MARL with Communication
In Multi-Agent Reinforcement Learning (MARL), communication plays a key role in facilitating
knowledge sharing and collaboration [20, 21]. The information in agents’ messages alleviates the
limitation of partial observation. Communication leads to better exploration of the huge state and action spaces, and thus improves training quality and convergence speed.
While there have been numerous works in the literature to improve communication in MARL,
most are based on unrealistic modeling of the wireless network environment (e.g., assuming perfect
transmission). In practical applications, however, transmission can fail due to many factors such
as limited bandwidth, signal path loss and fading, medium contention, interference, etc. Moreover,
in applications involving navigation (e.g., Search-And-Rescue (SAR) [22], fire fighting [23], bat-
tlefield defense [24]), link conditions are dynamic as they vary with agents’ mobility and relative
positions. It remains an open problem to learn a practical policy by addressing such complexity in
realistic wireless communications.
The following challenges arise when training communicating agents in a realistic wireless environ-
ment. The first challenge, environment coupling, is due to the mutual influence between the game
environment and the wireless environment. On the one hand, agents’ mobility in the game environ-
ment leads to dynamic network connectivity and agents’ communication actions have significant
impact on medium contention and signal interference. On the other hand, changes in the wireless
environment can affect agents’ actions in the game environment. For example, when an agent gen-
erating an important message is far away from other agents, it may intentionally approach others
to increase the receivers' signal strength. The second challenge, channel stochasticity, means that whether an agent can successfully receive a message from others is a non-deterministic process.
As described in the previous paragraph, many practical factors can result in failed transmission.
The stochasticity not only increases the complexity of the environment, but also makes it more
challenging for agents to extract useful information.
Our results include: We propose a framework, LAUREL, to Learn prActical commUnication
strategies in cooperative Multi-Agent REinforcement Learning. Our framework optimizes three
fundamental aspects of communication (i.e., when, what and how) without unrealistic simplifica-
tions on the wireless network. To address the environment coupling challenge, we propose to delay
the message sending time by one step so as to align agents’ interactions in the game and wireless
environments. The alignment solves the issue of training non-differentiability in [25]. More impor-
tantly, it leads to a “meta-environment” abstraction, where the Markov Decision Process (MDP)
controlling the agents’ behaviors can be reformulated, and both the action and observation spaces
can be expanded. Specifically, we augment the agents’ observation by important network measure-
ments, so that they can understand and predict the dynamics due to coupling. Then we end-to-end
train a practical communication strategy with neither pre-defined schedule nor prior knowledge on
the wireless condition (unlike [25, 26]). As a result, agents learn when to send messages based
on both the content relevance and the wireless channel conditions. To further improve LAUREL
under high channel stochasticity, we propose a novel neural architecture to encode the received
messages. The encoder takes as input any number of messages and only requires one round of
communication. Finally, we generalize LAUREL with popular backbone MARL algorithms, and
build both the on- and off-policy variants in a plug-and-play fashion. On standard benchmarks, we
achieve significantly better performance with respect to game performance, convergence rate and
communication efficiency, compared with state-of-the-art methods.
1.5.3 Wireless Communication-Enhanced Value Decomposition
Value decomposition based algorithms [27, 28] have demonstrated relatively high sample effi-
ciency and training quality, making them desirable for complicated environments. These algorithms adopt the Centralized Training with Decentralized Execution (CTDE) paradigm. Their centralized training process is essentially Q-learning, which optimizes the global Q value for all agents' joint actions. The global Q value is decomposed into individual utility values that are used by the agents to develop decentralized policies. VDN [27] designs the decomposition function as a sum function; QMIX [28] generalizes it to a monotonic function and achieves better performance.
However, these algorithms do not consider agents that can communicate with each other.
In Chapter 6, we extend the value decomposition based algorithm to enable agents to commu-
nicate in realistic wireless channels. There are several challenges in such a scenario.
One significant challenge is how to assess the influence of communication actions in the system, especially considering the dynamic communication decisions (across steps), which are aggravated by the aforementioned dynamic channel conditions. We propose to capture the communication structure in the value decomposition process. The reason is twofold: since communication plays an important role in the group's success, knowing the communication structure is necessary to decompose the value accurately; meanwhile, value decomposition can be viewed as a centralized implicit credit assignment process [29], into which we can feasibly introduce an inductive bias to assess the influence of communication actions. Concretely, we capture the dynamic communication structure as a graph, on which we further build a Graph Neural Network (GNN) based model. The GNN layers propagate individual Q values along the graph edges so that different communication strategies lead to different global Q values.
Another challenge is how to encode the messages in a provably lossless way under wireless stochasticity. Value decomposition is the rule-of-thumb algorithm for scaling to a large number of agents. Considering the potentially large population of agents, it is important to mathematically guarantee the encoding quality, beyond showing good empirical results. Other challenges include non-differentiability and joint influence, which have been addressed by our techniques of "communication lagging" and MDP reformulation. We summarize our results as follows:
• With respect to decentralized execution per agent, we design the agent neural network con-
sisting of observation fuser, message encoder and action selector. The observation fuser
embeds and integrates the wireless observation with the game observation. We prove that
the message encoder embeds received messages in an asymptotically lossless way.
• With respect to centralized training via value decomposition, we propose to capture the
dynamic communication structure during the value factorization process. Specifically, we
construct a graph with communication interactions as edges. Then we design a GNN based
decomposition neural network which introduces inductive bias into the model and leads to
faster convergence and better performance.
• We theoretically prove the proposed decomposition network is permutation invariant, monotonic and highly expressive. Permutation invariance ensures the global Q value is invariant under shuffled agent orderings. Monotonicity ensures the learned decentralized policies are consistent with the centralized one [28].
• Empirical results on standard benchmarks show our design consistently outperforms state-
of-the-art baselines in terms of convergence speed and quality. We conduct behavioral stud-
ies to investigate the learned policy as well as ablation studies to verify the effectiveness of
the proposed modules.
Chapter 2
Background
2.1 Convolutional Neural Networks
Recent years have witnessed the rapid development of Deep Learning (DL), which is a revolu-
tionary technique of Artificial Intelligence. Convolutional Neural Networks (CNNs) are a class of DL networks commonly used in the domain of computer vision (CV). The architecture A of a CNN consists of a stack of distinct layers that transform the input activation into an output activation through a differentiable function. Most layers in a CNN are convolutional layers with filters that
extend through the full depth of the input activation. The filters are convolved across the width and
height of the input activation, computing the dot product between the entries of the filter and the
input and producing a 2-dimensional activation map.
2.2 Compact CNN Models on the Edge
Input data of CV applications are usually collected by edge devices such as IoT devices with cameras. Many scenarios such as smart cities, smart grids and Virtual Reality (VR) require understanding data on resource-constrained edge devices. Therefore, compact CNNs have been designed for inference on the edge. SqueezeNet [11] downsamples the data using special 1×1 convolution filters. It achieves the same accuracy as AlexNet [30] with 50× fewer parameters and a model size of less than 0.5 MB. YOLO [12] uses a custom network based on the GoogLeNet architecture [31] to perform real-time object detection; it takes only around 1/4 as many operations as VGG16 [32].
(a) Regular conv2d (b) inverted residual block
Figure 2.1: Inverted Residual Block proposed by MobileNet to reduce the number of operations in
regular convolution
MobileNet [13, 14, 15] is a state-of-the-art CNN model specifically tailored for edge devices. The core building block of MobileNet is the "inverted residual block", which factorizes a regular convolution layer into three layers: a 1×1 pointwise 2D convolution, a 3×3 depthwise separable 2D convolution, and a 1×1 pointwise 2D convolution. The first layer expands the input channels by a ratio t, the depthwise convolution layer applies a single convolutional filter per input channel to filter features as a source of non-linearity, and the last layer projects the activation back to a low-dimensional representation. Consider a conventional k×k conv2d layer with c_in input channels and c_out output channels. It needs c_out filters of size 3×3×c_in, while the depthwise separable conv2d only needs c_in·t kernels of size 3×3×1. The number of operations is reduced by a factor of 1/k². Figure 2.1 illustrates this structure with k = 3.
The performance of MobileNet is impressive. For example, for an ImageNet image of size 224×224×3, MobileNetV2 achieves 71.8% classification accuracy with a 3.47M-parameter model and only 600M FLOPs, and its inference time on a Google Pixel 1 smartphone is only 73.8 ms. We will utilize the "inverted residual block" in our compressors in Chapter 4 due to its light weight and decent performance.
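As a concrete illustration, below is a minimal PyTorch sketch of the inverted residual block described above (pointwise expansion by ratio t, 3×3 depthwise convolution, linear pointwise projection). The class name and default values are ours; details such as batch normalization placement follow the public MobileNetV2 design, not code taken from this thesis.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> 1x1 linear projection (MobileNetV2-style)."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, t: int = 6):
        super().__init__()
        c_mid = c_in * t  # expanded channel count
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # Pointwise expansion by ratio t.
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            # Depthwise 3x3 convolution: one filter per channel (groups=c_mid).
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            # Linear pointwise projection back to a low-dimensional representation.
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out

# Example: a stride-1 block keeping 32 channels on a 56x56 feature map.
blk = InvertedResidual(32, 32, stride=1, t=6)
y = blk(torch.randn(1, 32, 56, 56))  # output shape stays (1, 32, 56, 56)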
2.3 Reinforcement Learning
Reinforcement learning (RL) is a machine learning technique in which an agent solves a sequential decision problem by "trial and error". In an RL problem, there are four major events: making an observation, taking an action, transitioning to a new state and receiving a reward. Denote o as the observation, and O as the observation space consisting of all possible o. Similarly, denote a as the action and A as the action space; s as the state and S as the state space. Denote the scalar r as the reward value. We further define the policy π as a strategy to select an action from a state, i.e., π : S → A. The policy can be either deterministic or stochastic; for a stochastic policy, the mapping to A follows a probability distribution. We can thus more formally describe the RL problem: at time step t, the agent is in state s_t ∈ S. It observes o_t ∈ O from the environment, takes an action a_t ∈ A according to policy π(· | s_t), receives reward r_t from the environment and transitions to state s_{t+1}. The goal is to find an optimal policy π* that maximizes the expected cumulative reward (i.e., return) J = E_π[Σ_{t=1}^∞ γ^t r_t], where γ ∈ [0, 1] is called the discount factor. Note that in standard fully observable RL, o_t = s_t.
In practice, the state space and action space are huge, and thus it may take the agent an enormous amount of exploration in the environment to identify a good policy. Unfortunately, interaction with the environment can often be very expensive and time-consuming, so various techniques are necessary to help speed up convergence to a good policy. The first way is to combine deep learning techniques with RL, taking advantage of DNNs' ability to act as universal function approximators [33]. Going back to the calculation of the return J, we note that J involves rewards r_{t'} of future steps (i.e., t' > t). So to estimate J, a common way is to treat J as a function R(s_t, a_t, s_{t+1}). Then we can train a DNN to approximate R, where only data of the current transition are required. Due to the generalization capability of DNNs, such a design can efficiently learn from a limited amount of training data. The second way is to carefully define (or re-define) the observation space O so that the agent can have better or richer information about the dynamics of the environment. An example is training agents to play the StarCraft game: in [34], if we allow the agent to observe its allies' past few actions, training learns a much better policy. In our work, we augment O with wireless measurements to improve agents' policies (see Chapters 5 and 6). The third way is to modify the training algorithm to allow reuse of data from previous environment interactions, which we discuss next.
RL algorithms can be classified into on-policy and off-policy algorithms. Denote θ as the model parameters of the policy, and thus denote the policy as π_θ. In the training process, θ keeps getting updated, and thus the agent's interactions with the environment are collected under constantly evolving policies π_θ. For on-policy algorithms, every policy update is based only on data collected from the current (i.e., most recent) π_θ. So when θ is updated to θ′, the data collected from π_θ are discarded. For off-policy algorithms, the data collected from previous policies can be reused to update the current policy. Such historical data may be stored in an experience replay buffer [35] to facilitate reuse. Therefore, there are trade-offs between the two types: on-policy algorithms may be more stable and easier to tune, while off-policy ones are more sample efficient [36]. In general, both on-policy and off-policy algorithms can be useful, depending on the application. Among the off-policy algorithms, the most renowned is Deep Q-Learning (DQN), which uses a DNN to approximate the optimal action-value Q*(s_t, a_t) = max_π Q^π(s_t, a_t). The action-value Q^π(s_t, a_t) is defined as the expected return starting from state s_t, taking action a_t, and then following policy π, i.e.,

Q^π(s_t, a_t) = r(s_t, a_t) + E_{s_{t'}, a_{t'}} [ Σ_{t'=t+1}^∞ γ^{t'−t} r_{t'} ].    (2.1)
The parameters θ of the DNN approximating Q* are updated by minimizing the TD-error on a mini-batch of (s_t, a_t, r_t, s_{t+1}) tuples uniformly sampled at random from a replay buffer D:

L(θ) = E_{(s_t, a_t, r_t, s_{t+1}) ∼ U(D)} [ ( r_t + γ max_a Q(s_{t+1}, a; θ̂) − Q(s_t, a_t; θ) )^2 ],    (2.2)

where the target value is computed using a target network with parameters θ̂; θ̂ is a delayed version of θ and is updated less frequently. Using a DNN as the action-value approximator reduces variance since it fits more data. The use of a target network stabilizes the training process. The data sampling process mitigates the correlation between sequential data. At step t, the agent either greedily selects the action with the largest Q value, or acts randomly with probability ε (i.e., ε-greedy exploration).
2.4 Multi-Agent Reinforcement Learning
In MARL, agents solve a problem formulated as a Decentralized Partially-Observable Markov Decision Process (Dec-POMDP), which can be represented by the tuple (S, A, T, Ω, O, R, γ). S, A and Ω are the state space, joint action space, and joint observation space for the N agents, respectively. At step t, each agent i ∈ N takes action a_i by following policy π_i. The joint action a leads to a state transition captured by the function T(s′ | s, a) : S × A × S → [0, 1]. Each agent also draws a local observation o_i ∈ Ω_i, with joint observation probability P(o | s′, a) = O(o, s′, a), where o is the joint observation. Meanwhile, agents receive a joint reward r_t = R(s, a) (assuming collaborative agents). The goal is to find an optimal policy π* that maximizes the expected return, with discount factor γ ∈ [0, 1]: J = E_π[Σ_{t=1}^∞ γ^t r_t].
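The following minimal Python sketch illustrates the Dec-POMDP interface implied by this tuple: all N agents submit a joint action, and the environment returns per-agent local observations together with a single shared reward. The toy dynamics and the class name are purely illustrative assumptions, not one of the benchmarks used later in the thesis.

import random
from typing import List, Tuple

class ToyDecPOMDP:
    """Minimal Dec-POMDP-style interface: joint action in, local observations and a
    shared (joint) reward out. The dynamics here are a placeholder, not a benchmark."""

    def __init__(self, n_agents: int = 3, grid: int = 5):
        self.n, self.grid = n_agents, grid
        self.state = [0] * n_agents  # hidden global state s (agent positions on a line)

    def observe(self) -> List[int]:
        # Joint observation o: the i-th entry is agent i's local observation
        # (its own position only), reflecting partial observability.
        return list(self.state)

    def step(self, joint_action: List[int]) -> Tuple[List[int], float]:
        # Transition T(s' | s, a): each agent moves by -1, 0 or +1, clipped to the grid.
        self.state = [min(self.grid - 1, max(0, s + a))
                      for s, a in zip(self.state, joint_action)]
        # Joint reward R(s, a): cooperative agents are rewarded for reaching the goal cell.
        reward = float(sum(1 for s in self.state if s == self.grid - 1))
        return self.observe(), reward

env = ToyDecPOMDP(n_agents=3)
obs = env.observe()
for _ in range(10):
    joint_action = [random.choice([-1, 0, 1]) for _ in range(env.n)]  # each agent's a_i
    obs, r = env.step(joint_action)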
When extending RL algorithms to multiple agents, the following challenges emerge. Firstly,
the action and state spaces become significantly larger – exponential in the number of agents.
Secondly, the environment becomes non-stationary [37] since all agents are updating their policies
simultaneously. Thirdly, there can be three types of agent relationships: cooperative, competitive
and mixed [38, 39]. Depending on the relationship, the algorithm can learn significantly different
policies. In this thesis, we focus on cooperative agents, where all agents share a common goal.
Finally, the MARL environment is partially observable, i.e., each agent does not see the full status of the environment or the states and actions of the other agents.
To address these challenges, communication among agents is important. The extra information conveyed by agents' messages guides agents to better explore the huge state and action spaces, and thus may improve convergence speed and quality. The information sharing may also promote
collaboration and alleviate the limitation of partial observability. In the search-and-rescue (SAR)
example, each robot can only sense a small region around it, without seeing the full terrain. Sharing
the local terrain observations may lead to better route planning and thus shorter time to rescue the
target.
2.5 Wireless Communication Environment
In applications such as Search-And-Rescue (SAR) [22], fire fighting [23] and battlefield defense [24], agents usually form a mobile ad-hoc network (MANET) [40] without pre-existing infrastructure. In such cases, link connectivity, data rate and packet loss rate vary with agents' mobility and relative positions. In a large and complicated terrain, there may be obstacles of different materials blocking the signal propagation, causing the wireless signal strength to vary with agents' mobility. Under popular wireless protocols (e.g., p-CSMA [41]), before agents send, they contend to access the medium with a certain probability to avoid collisions. The signal strength of a transmitted data packet may be affected by path loss, attenuation, interference with other packet signals, etc. In particular, for path loss, the log-normal fading model is widely used to capture the Radio Signal Strength (RSS) variation over the spatial domain.
P_r = P_t − K_ref − 10 η log10(d / d_0) + ψ    (2.3)

where P_r is the received power in dBm, P_t is the transmission power in dBm, K_ref is the loss at the reference distance d_0, η is the path loss exponent depending on the propagation medium, and ψ is a log-normal random variable for multi-path fading. This equation describes the change of signal strength during propagation.
Interference happens when a receiver is in range of multiple senders at the same time. When the SINR (Signal-to-Interference-plus-Noise Ratio, in terms of signal power level) is below a given threshold, the packet cannot be decoded correctly. In such a case, a broadcast message cannot be received by all agents.
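A small Python sketch of how Eq. (2.3) and the SINR-threshold decoding rule could be simulated is given below. All constants (reference loss, path loss exponent, shadowing standard deviation, noise floor, SINR threshold) are placeholder assumptions, not values used in the experiments of later chapters.

import math
import random

K_REF = 40.0             # loss at reference distance d0 = 1 m, in dB (assumed)
ETA = 3.0                # path loss exponent (assumed value)
SHADOW_STD = 4.0         # std-dev of the log-normal fading term psi, in dB (assumed)
SINR_THRESHOLD_DB = 6.0  # packets below this SINR cannot be decoded (assumed)

def received_power_dbm(p_tx_dbm: float, d: float, d0: float = 1.0) -> float:
    """Eq. (2.3): P_r = P_t - K_ref - 10 * eta * log10(d / d0) + psi."""
    psi = random.gauss(0.0, SHADOW_STD)  # log-normal fading, sampled in the dB domain
    return p_tx_dbm - K_REF - 10.0 * ETA * math.log10(d / d0) + psi

def can_decode(signal_dbm: float, interferers_dbm: list, noise_dbm: float = -95.0) -> bool:
    """SINR check: signal power vs. summed interference-plus-noise power (in mW)."""
    to_mw = lambda dbm: 10.0 ** (dbm / 10.0)
    sinr = to_mw(signal_dbm) / (sum(map(to_mw, interferers_dbm)) + to_mw(noise_dbm))
    return 10.0 * math.log10(sinr) >= SINR_THRESHOLD_DB

# Example: a 10 dBm transmission received at 20 m, with one interferer at 35 m.
sig = received_power_dbm(10.0, d=20.0)
intf = received_power_dbm(10.0, d=35.0)
print(sig, can_decode(sig, [intf]))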
Chapter 3
Related Works
3.1 Collaborative Inference and CNN Model Partition
Involving multiple edge devices/servers to collaborate on inference can increase the frame rate. To
realize inference collaboration, some research works partition the model along the input dimension.
MoDNN [7] exploits task parallelism among heterogeneous mobile devices in a local computing
system. The authors designed an input partition scheme for convolutional layers and a weight partition scheme for sparse fully-connected layers. This approach of partitioning within layers benefits extremely light-weight edge devices due to its potential for very fine-grained partitioning. Nevertheless, it increases data dependencies since, for each layer, every edge device requires input data residing on other edge devices. Instead of partitioning within a layer, most research works [42, 43] partition the CNN model by layer, i.e., they split the CNN model into several groups of layers that are assigned to several edge devices or servers. We adopt such a partitioning approach and boost throughput with pipelined execution in a local distributed computing system of edge devices.
3.2 CNN Compression
Motivated by the critical role of edge computing in 5G, as well as the popularity of CNNs, companies such as NVIDIA and Xilinx race to design edge devices and software kits powerful for CNN inference. But such devices are costly when deployed at large scale. As a complementary approach, compression reduces CNN model size and computation workload. The compressed model better fits
edge devices with limited processing power and memory storage. Methods of compressing a powerful and large CNN model include network pruning, weight quantization and knowledge distillation [18]. Network pruning refers to trimming neurons or connections that do not have a significant influence on the final inference accuracy. Weight quantization refers to reducing the number of digits of the CNN parameters. Knowledge distillation trains a smaller network ("student model") to mimic the behavior of the original large and complex network ("teacher model"). Many
existing works [44][45][17][46] have explored accelerating edge inference by using one or multiple
of the above compression methods. For example, Han et al. [46] employ techniques including network connection pruning, weight quantization and Huffman coding to compress the network. The authors then propose a custom application-specific integrated circuit (ASIC) called EIE [45]
to perform inference on the compressed model. EIE accelerates sparse matrix-vector multiplica-
tion in fully-connected (FC) layers with weight sharing. Matsubara et al. [17] partition the CNN
model into “head" and “tail". Then the authors use knowledge distillation to compress the “head"
portion into a smaller “student model". Within the “student model", they select one intermediate
layer and cut down its channels to reduce communication. The work in [42] leverages a reinforcement learning algorithm to generate a framework that can automatically select the best initial CNN and compression techniques for a given input image. The framework can optimize the inference performance while satisfying resource constraints. The above approaches have their limitations. First, emerging models such as MobileNetV2 are already very compact and hard to compress significantly. More importantly,
model compression does not directly address the communication bottleneck since the size of the
hidden layer outputs is not necessarily reduced.
3.3 Dynamic Schedule of Compressed Models
Another direction is to dynamically schedule the inference computation given varying resources.
DeepThings [5] proposes a framework for adaptive distributed execution of CNN inference. With its FTP (Fused Tile Partitioning) method, it partitions a convolutional layer into overlapping tiles that are distributed among edge devices. It then uses "work stealing" for runtime load balancing to adjust to the dynamics in task workload and computation availability. Edgent [16] considers CNN model partitioning between an edge device and a server. The authors train models with multiple exit points. Based on the observed bandwidth, they greedily search for the best model partition point and exit point. Our work in Chapter 4 differs from them. Compared with [5], in addition to load balancing, we also consider the dynamics in bandwidth, and adaptively compress the intermediate layer activations transmitted among edge devices. Compared with [16], our approach compresses the communication data, and works well for models with large activation sizes in early stages, which is the general case [17].
3.4 MARL with Communication
CommNet [26] and BiCNet [47] propose to learn the communication contents in the MARL sys-
tem to boost performance. However, both works perform all-to-all communication at each step, under the assumption of ideal wireless network conditions (i.e., whatever messages are sent out will be successfully received by all agents). For message aggregation under this assumption, a
simple mean or sum function is used. TarMAC [48] and MAGIC [49] later introduce attention-
based mechanisms to improve message aggregation.
Recent works consider resource constrained wireless networks, which improves the ideal
model by considering limited bandwidth. DIAL [20] considers limited bandwidth only in the
execution phase but assumes perfect channel while training. The following works consider band-
width limits both in training and execution. When multiple agents simultaneously send k' messages, the environment ensures that min{k, k'} messages are successfully received by all agents (where k is a manually configured constant characterizing the bandwidth limit). VBC [50] limits the variance of the message, and only messages with large variance are exchanged during execution. SchedNet [25] generates an importance weight for each message, and exactly the k most important messages are exchanged in each step. RMADDPG [51] adopts a recurrent multi-agent actor-critic framework
which limits the number of communication actions during training. Note that these works still
assume perfect transmission, and ignore many critical factors in a realistic wireless environment.
All works above assume at least one of the following: 1. Dedicated and perfect wireless channels
exist between any pair of agents. Any transmitted message can be delivered to all other agents
successfully. 2. Message importance is the only factor when deciding “when to send a message”.
3. Channel condition is fixed and known in advance to allow scheduling agents’ communication
with a fixed budget. In comparison, our framework does not have any of the above assumptions
and considers realistic wireless channels.
3.5 Value Decomposition and QMIX
There are a series of Value Decomposition based algorithms [27, 28, 52, 53] that adopt Central-
ized Training with Decentralized Execution (CTDE) [54] paradigm. During training, a centralized
value decomposer (i.e., a "mixer" from the bottom-up perspective) decomposes the global action value Q_tot into individual utility values Q_i with the aid of extra information. During execution, neither global information nor the value decomposer is involved. Instead, agents operate in a decentralized way. Specifically, VDN [27] proposes to decompose the global action value as the sum of individual action values. QMIX [28] relaxes the decomposition function into a monotonic one, as a practical realization that guarantees the IGM (Individual-Global-Max) condition. The IGM condition [28, 53] states that the actions yielded by the argmax of the global Q_tot should be the same as the set of actions yielded by argmax operations performed on each Q_i. Consequently, the collection of actions selected by decentralized policies is consistent with the optimal joint actions. There are some variants of QMIX. For instance, WQMIX [52] further introduces a weighting function on the individual Q values. QTRAN [53] satisfies IGM by transforming the global Q value function into a new one that is easier to factorize; it imposes computationally intractable constraints which often lead to poor empirical performance when relaxed. To date, QMIX is still one of the state-of-the-art methods [55, 56]. We inherit the "monotonicity" idea of QMIX. However, unlike QMIX, we focus on a
system where agents can communicate over realistic wireless channels. In Chapter 6, we address
unique challenges in such scenarios to achieve performance gain from intelligent communication.
3.6 GNN in MARL
GNN is a powerful neural network model for graph structured data. It computes the representation
of each node by iteratively exchanging and aggregating latent embeddings across neighboring
nodes. Such exchange of information is called “message passing”. Weight parameters to aggregate
the exchanged latent embeddings are updated via back-propagation. The key property of GNN is
permutation invariance: when applying a (certain kind of) global pooling to the computed node representations, a GNN further generates an overall graph representation which is invariant to the order of input nodes. Recent works on MARL leverage GNNs to improve performance. In GPG [57], a centralized controller constructs an input feature vector as the sequential concatenation of each agent's position. The authors then leverage a GNN to handle permutations within the feature vector input. This setting is not a POMDP but a centralized problem. MADDPG [51] is a well-known actor-critic policy-gradient-based algorithm. PIC [58] applies a GNN in the critic of MADDPG to achieve permutation invariance. Liu et al. [59] propose to use object detection to construct a "semantic tracklet" for each agent; the authors then apply a GNN on the complete graph of semantic tracklets to generate the policy and value in the MAPPO algorithm. The above works are not based on value decomposition and do not consider agent communication. Our work is further distinguished in that we design a new GNN for value decomposition, instead of applying an existing GNN model.
Another line of research applies GNN during the communication process. Most works require
either multiple rounds of or all-to-all communication, which is impractical in realistic wireless
channels with limited resources. DGN [60], based on independent Q-learning, proposes to use a GNN with convolutional layers to embed the multi-round communication among agents within the same region. G2ANet [61] combines hard attention and soft attention to learn interaction weights among agents. It then uses a one-layer GNN (specifically, a simple weighted sum) to combine the embedded messages based on the attention. MAGIC [49] also uses a GNN to embed the exchanged messages. It additionally proposes a scheduler that generates the adjacency matrix for the GNN in each round of message passing. The scheduler itself is an attention-based GNN (GAT [62]) which takes all embedded messages as input and is thus centralized.
Chapter 4
Fast and Accurate Streaming CNN
Inference via Communication Compression
on the Edge
Recently, compact CNN models have been developed to enable computer vision on the edge. While
the small model size reduces the storage overhead and the light-weight layer operations alleviate
the burden on the edge processors, it is still challenging to sustain high inference performance
due to limited and varying inter-device bandwidth. We propose a streaming inference framework
to simultaneously improve throughput and accuracy based on novel communication compression
techniques.
Given an edge device processing pipeline and a CNN model, we first partition the CNN layer-wise to achieve computation load balance across the devices. Then we identify inter-
device communication bottlenecks and insert (Variational) Auto-Encoders into the original CNN
to compress the data traffic. Further, to address the large variation of bandwidth under various net-
work conditions, we propose an adaptive data compression scheme based on quantitative tradeoff
analysis among compression ratio, throughput and accuracy. The above three optimization steps
significantly improve inference throughput due to improved communication performance. More
importantly, accuracy also increases since (a) fewer frames are dropped when input images are
streamed in at a high rate, and (b) the frames successfully entering the pipeline are processed accu-
rately by the compressed CNN. We evaluate MobileNet-v2 and VGG16 on processing pipelines consisting of Raspberry Pi 3B+ and NVIDIA Jetson TX2 devices. Our data compression techniques consistently lead to significant accuracy improvement when the average bandwidth of the Wi-Fi network varies from 3 Mbps to 9 Mbps.

The work described in this chapter has been published in part as: Diyi Hu and Bhaskar Krishnamachari. 2020. Fast and Accurate Streaming CNN Inference via Communication Compression on the Edge. In Proceedings of the 2020 IEEE/ACM Fifth International Conference on Internet-of-Things Design and Implementation. IEEE, 157–163.
4.1 Problem Definition
4.1.1 Optimization Goal
Given an edge device pipeline executing a CNN model, we aim at improving the inference through-
put and effective accuracy when input images are generated at a steady rate.
To define the two optimization goals, suppose the CNN is used for the task of classification, where the image generation rate T_gen is a constant and the CNN pre-trained accuracy is α. Then throughput, T, is defined as the number of classified images per unit time, where T < T_gen. Effective accuracy is defined as the ratio of the number of correctly classified images over the total number of generated images in a time period, i.e.,

$$\alpha_{\mathrm{eff}} := \alpha \,\frac{T}{T_{\mathrm{gen}}}. \qquad (4.1)$$

Note that since the inference needs to be real-time, an image frame that is dropped due to slow pipeline processing is equivalent to the image being classified incorrectly. Therefore, by the definition of α_eff, we observe a tradeoff between throughput and pre-trained accuracy: to improve α_eff, it may be worthwhile to compress a CNN model such that the decreased α is compensated by the increased processing throughput T. Such intuition motivates the adaptive compression scheme proposed in Section 4.2, the performance of which is analyzed in detail in Section 4.3.
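To make the tradeoff in Equation 4.1 concrete, the sketch below compares two hypothetical operating points. The function name and all numeric values are illustrative assumptions, not measurements from this thesis.

```python
def effective_accuracy(pretrained_acc: float, throughput: float, gen_rate: float) -> float:
    """Effective accuracy per Equation 4.1: frames processed faster than they arrive
    cannot help, so throughput is effectively capped at the generation rate."""
    return pretrained_acc * min(throughput, gen_rate) / gen_rate

# Illustrative numbers only: a compressed model loses a little pre-trained accuracy
# but relieves the communication bottleneck, so its effective accuracy is higher.
gen_rate = 8.0                                    # frames/s arriving at the pipeline
print(effective_accuracy(0.90, 5.0, gen_rate))    # uncompressed, comm-bound  -> 0.5625
print(effective_accuracy(0.88, 7.5, gen_rate))    # compressed, faster        -> 0.8250
```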
Table 4.1: Notations related to CNN inference

  Notation      Meaning
  n             Number of pipeline stages
  m             Number of convolutional layers
  T             Throughput
  B_i           Random variable representing bandwidth at the i-th split point
  D_i           Data amount transmitted at the i-th split point
  C_i           Computation speed on the i-th partition
  W_i           Computation load on the i-th partition
  ops_A(i)      Function returning the operation count of the i-th layer of A
  data_A(i)     Function returning the word count of the i-th layer of A
Figure 4.1: Overview of the framework. (The figure shows a MobileNet, drawn as groups of residual blocks "Res1" to "Res4", with an Auto-Encoder (AE) inserted between two groups; the AE encoder is a Conv2d with k=3, s=2, followed by a matching decoder. "Res" denotes a group of residual blocks in MobileNet.)
4.1.2 System Model
In Figure 4.1, the system performing inference consists of a linear pipeline formed by edge devices. The devices can be heterogeneous. Denote C_i as the computation speed of the i-th device, and B_i as the bandwidth to transfer data from the i-th to the (i+1)-th device. Since each device is mostly executing the same type of convolution operation, we assume C_i remains fixed during inference. However, we may have C_i ≠ C_j for i ≠ j. Regarding bandwidth, we assume the devices communicate over wireless channels and the coherence time of the environment is large. In other words, B_i changes slowly over time, and its variance may be significant. Mathematically, we model the B_i as independent random variables, each following its own distribution B_i. Lastly, for ease of notation, define C = {C_i | 1 ≤ i ≤ n} and B = {B_j | 1 ≤ j ≤ n - 1}.
4.1.3 Categorization of CNN Models and Edge Devices
In recent years, to fulfill the requirement of various edge applications, numerous CNN models
and many types of edge devices have been developed. In this section, we categorize these models
and devices based on their computation and communication characteristics to enable appropriate
evaluation on our proposed edge inference framework.
CNN models Depending on the types of convolution performed, we can categorize the CNN
models based on a metric named computation-communication ratio. The computation cost of a
convolutional layer can be measured based on the type of sliding window operation performed.
The communication cost of a layer is simply proportional to the size of the activation tensor. For
classic models such as AlexNet [30] and VGG16 [32], the computation-communication ratio is high due to the expensive operation of sliding 3×3 or 5×5 windows over both the input and output channel dimensions. For more recent CNNs such as the MobileNet family [13, 14, 15], the computation-communication ratio is reduced by an order of magnitude due to much more efficient operations such as depthwise separable convolution.
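The back-of-the-envelope sketch below illustrates how this metric can be estimated for a single convolutional layer. The function name, the multiply-accumulate counting convention (stride 1, unchanged spatial size) and the example layer sizes are assumptions for illustration; the exact op-counting behind Figure 4.2 may differ.

```python
def conv_stats(h, w, c_in, c_out, k, depthwise_separable=False, bytes_per_elem=4):
    """Rough multiply-accumulate count and output-activation size for one conv layer."""
    if depthwise_separable:
        macs = h * w * c_in * k * k + h * w * c_in * c_out      # depthwise + 1x1 pointwise
    else:
        macs = h * w * c_in * c_out * k * k                     # standard convolution
    act_bytes = h * w * c_out * bytes_per_elem                  # activation to transmit
    return macs, act_bytes

std = conv_stats(32, 32, 64, 64, 3)
sep = conv_stats(32, 32, 64, 64, 3, depthwise_separable=True)
print("standard : %.1fM MACs, comp/comm = %.1f" % (std[0] / 1e6, std[0] / std[1]))
print("separable: %.1fM MACs, comp/comm = %.1f" % (sep[0] / 1e6, sep[0] / sep[1]))
```

For the same activation size, the depthwise separable layer has roughly an order of magnitude fewer operations, which is exactly why the MobileNet family sits lower on the computation axis.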
Edge devices We categorize various edge devices only by their computation speed, since the
communication speed largely depends on the network environment rather than the devices them-
selves. Low-end edge devices such as Raspberry Pi feature energy efficiency at the cost of slow
computation speed. On the other hand, embedded GPUs such as NVIDIA Jetson are now available
in the market that can provide server-grade computation performance on the edge.
Figure 4.2 shows a visual categorization of the CNN models and edge devices. For a CNN, the horizontal axis denotes the total number of operations required to infer a single 32×32 RGB color image, and the vertical axis denotes the total number of bytes to transfer if each
layer is executed on a separate edge device. For a device, the horizontal axis denotes the number
of operations that can be performed in one second, assuming 100% hardware utilization. We do
not plot the vertical coordinate for a device since the communication performance also depends
Figure 4.2: Categorization of CNNs and edge devices. (The plot places MobileNet-v2, VGG16, Raspberry Pi and Jetson along a horizontal axis of number of floating-point operations (M) and a vertical axis of number of bytes (M).)
on the network condition. Clearly, CNNs with low computation-communication ratio are suit-
able to be executed on the edge pipeline consisting of less powerful devices. CNNs with high
computation-communication ratio can be executed on embedded GPUs. Following such obser-
vation, in the experiments (Section 6.5), we evaluate MobileNet-v2 on Raspberry Pi 3B+, and
VGG16 on NVIDIA Jetson.
Remark on notation In Table 4.1, we summarize the notations used throughout this chapter. If a parameter is a random variable, we use a lower-case letter (e.g., b) to denote its value and the sans-serif font (e.g., B) to denote the corresponding probability distribution. For parameters related to the execution pipeline, subscript i denotes parameters of device i, or parameters between devices i and i + 1.
4.2 Optimized Pipeline Execution
A natural and efficient way to pipeline CNN inference is to split the layers onto the edge devices.
Let the total number of edge devices be n and the total number of CNN layers be m. Suppose we split the m layers into n parts and the layer indices at the split points are S = {s_1, ..., s_{n-1}}. For ease of notation, we also set s_0 = 0 and s_n = m. Therefore, edge device 1 executes layer s_0 + 1 = 1 to layer s_1. Edge device 2 executes layer s_1 + 1 to layer s_2. The last edge device n executes layer s_{n-1} + 1 to layer m = s_n.
The pipeline under the above configuration consists of n computation stages corresponding to the n edge devices, and n - 1 communication stages to transfer layer activations between devices. To improve the overall inference throughput, we have to reduce the execution time of the bottleneck pipeline stage. Clearly, irrespective of the communication stage performance, the overall inference throughput is upper bounded by:

$$T \;\le\; \min_{1\le i\le n}\left\{ \frac{C_i}{\sum_{s_{i-1}+1 \le j \le s_i} \mathrm{ops}_A(j)} \right\} \;\le\; \frac{\sum_{1\le i\le n} C_i}{\sum_{1\le j\le m} \mathrm{ops}_A(j)} \qquad (4.2)$$

where the first inequality is achieved if the n - 1 communication stages are not the bottleneck, and the second inequality is achieved if $\frac{C_i}{\sum_{s_{i-1}+1 \le j \le s_i} \mathrm{ops}_A(j)} = \frac{C_k}{\sum_{s_{k-1}+1 \le \ell \le s_k} \mathrm{ops}_A(\ell)}$ for all i and k. For a given CNN and edge devices, the bound on T is a constant. Thus, to maximize throughput, we need to 1. balance the load of the computation stages (Section 4.2.1), and 2. reduce the load of the communication stages (Section 4.2.3).
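The left-hand bound of Equation 4.2 is easy to evaluate for a candidate split. The sketch below does so; the function name and the per-layer operation counts and device speeds in the example are placeholder values for illustration only.

```python
def throughput_bound(ops, speeds, split):
    """Upper bound on pipeline throughput from Equation 4.2.

    ops    : per-layer operation counts (0-indexed list of length m)
    speeds : C_i for each of the n devices (operations per second)
    split  : split indices [s_1, ..., s_{n-1}]; device i runs layers s_{i-1}+1 .. s_i
    """
    bounds = []
    s = [0] + list(split) + [len(ops)]
    for i, c in enumerate(speeds):
        work = sum(ops[s[i]:s[i + 1]])       # computation load of device i
        bounds.append(c / work)
    return min(bounds)                       # slowest computation stage limits T

# Placeholder numbers: 6 layers on 3 devices, split after layers 2 and 4.
ops = [40e6, 30e6, 20e6, 20e6, 10e6, 10e6]
speeds = [1e9, 1e9, 1e9]
print(throughput_bound(ops, speeds, split=[2, 4]))   # frames/s if communication is free
```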
4.2.1 Load-Balance of the Computation Stages
In this subsection, we temporarily ignore the communication stages as they will be separately opti-
mized later on. The question of identifying the optimal split pointsS can be solved by formulating
a dynamic programming problem. Define split_{A,C}(p, q) as a function that optimally splits the last p layers of the CNN (i.e., layer m - p + 1 to layer m) onto the last q devices of the pipeline (i.e., device n - q + 1 to device n). Let split_{A,C}(p, q) return the computation throughput of the length-q pipeline after the splitting. Clearly, we want to solve split_{A,C}(m, n), and we also notice the following recursion:
$$\mathrm{split}_{A,C}(p, q) \;=\; \max_{0 \le p' < p}\; \min\!\left\{ \frac{C_{\,n-q+1}}{\sum_{j=m-p+1}^{\,m-p'} \mathrm{ops}_A(j)},\;\; \mathrm{split}_{A,C}(p',\, q-1) \right\} \qquad (4.3)$$
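A memoized implementation of a recursion of this form is sketched below. It is a minimal sketch under the assumption that device n-q+1 takes the earliest block of the remaining layers and p' layers are handed on; the exact boundary conditions used in the thesis may differ, and the example numbers are placeholders.

```python
from functools import lru_cache

def best_split(ops, speeds):
    """Split m layers over n devices to maximize the minimum per-device throughput.

    split(p, q): best achievable computation throughput when the last p layers
    are assigned to the last q devices (memoized form of the recursion above).
    """
    m, n = len(ops), len(speeds)

    @lru_cache(maxsize=None)
    def split(p, q):
        if p == 0:
            return float("inf")             # no layers left: remaining devices stay idle
        device = n - q                      # 0-indexed device taking the earliest block
        if q == 1:
            return speeds[device] / sum(ops[m - p:])
        best = 0.0
        for p_rest in range(0, p):          # layers handed to the remaining q-1 devices
            here = speeds[device] / sum(ops[m - p:m - p_rest])
            best = max(best, min(here, split(p_rest, q - 1)))
        return best

    return split(m, n)

print(best_split([40e6, 30e6, 20e6, 20e6, 10e6, 10e6], [1e9, 1e9, 1e9]))
```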
Consider two compressed models A_left and A_right, where A_left uses the larger compression ratio, so that γ_left > γ_right and α_left < α_right.
For the sake of discussion, we further assume γ_left = 2 γ_right and 0.5 α_right < α_left < α_right. It is easy to show that the optimal b* should be such that the effective accuracy of A_left at b = b* equals the effective accuracy of A_right at b = b*. Now, to compute α_eff, we visualize in Figure 4.4.B the change of pipeline stage throughput with respect to bandwidth b under various scenarios. The horizontal dashed line (A) is the throughput bound T_p of the computation stages. The two red dashed lines (B, C) represent the communication throughput under the two compression ratios. Due to the assumption on the relative values of γ and α presented above, b* must fall between the intersection of A, B and the intersection of B, C. In other words,

$$T_p\, \alpha_{\mathrm{left}} \;=\; \alpha_{\mathrm{right}}\, \frac{b^*}{D_{\mathrm{right}}} \qquad (4.5)$$

where D is the amount of data traffic at the communication stage of interest. With the b* defined by Equation 4.5, the solid line in Figure 4.4.B shows the throughput of the overall pipeline with respect to b. At the transition point b*, the drop in throughput due to a less compressed CNN
is compensated by the increased pre-trained accuracy. Therefore, the net effect is that the effective accuracy remains unaffected.
4.3 Performance Analysis
4.3.1 Inference Throughput
The pipeline can be potentially bottlenecked by either a computation stage at an edge device or a communication stage between two edge devices. We study the two cases respectively and derive the expectation of the inference throughput. We further derive the expected throughput for a system with two communication stages where B_1 and B_2 are i.i.d. random variables with Gaussian distribution. For notation, p_i(x) denotes the probability density function (pdf) of B_i, and P_i(x) denotes the cumulative distribution function (cdf) of B_i.
Computation bottleneck Let the throughput of the slowest computation stage be T_p. According to our setup, T_p is fixed since the C_i ∈ C are fixed. We use P_1 to denote the probability of T_p being the bottleneck of the inference pipeline (i.e., all communication stages have larger throughput than T_p). The randomness comes from the bandwidth distribution of each communication stage. Hence, P_1 can be calculated by

$$P_1 \;=\; \prod_i P\!\left(\frac{B_i}{D_i} > T_p\right) \;=\; \prod_i \bigl(1 - P_i(D_i\, T_p)\bigr) \qquad (4.6)$$
Communication bottleneck The communication stage at partition point j with throughput τ becomes the bottleneck if and only if the following hold: 1. T_p > τ; 2. the communication stage at partition point j has the smallest throughput among all communication stages. The probability that the communication stage at partition j with throughput τ becomes the system bottleneck is $\prod_{i\ne j} P\!\left(\frac{B_i}{D_i} > \tau\right) P\!\left(\frac{B_j}{D_j} = \tau\right)$. Therefore, the expected throughput when communication stage j becomes the system bottleneck is

$$E(T; j) \;=\; \int_{-\infty}^{T_p} \tau\, \prod_{i\ne j} \bigl(1 - P_i(\tau D_i)\bigr)\, p_j(\tau D_j)\, D_j \; d\tau \qquad (4.7)$$
Expectation of inference throughput The expectation of the inference throughput can be calculated by combining the two cases above:

$$E[T] \;=\; P_1\, T_p \;+\; \sum_j E(T; j) \qquad (4.8)$$
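Equation 4.8 can also be checked numerically by simulation. The sketch below is a Monte Carlo estimate of E[T] under the system model of Section 4.1.2; the function name and the bandwidth and data-size numbers are placeholder assumptions, not values from this thesis.

```python
import random

def simulate_expected_throughput(T_p, D, mu, sigma, trials=100_000, seed=0):
    """Monte Carlo estimate of E[T]: the pipeline runs at the slower of the
    computation bound T_p and the slowest communication stage B_i / D_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each communication stage draws an independent Gaussian bandwidth.
        comm = min(max(rng.gauss(mu, sigma), 0.0) / d for d in D)
        total += min(T_p, comm)
    return total / trials

# Placeholder configuration: two communication stages, bandwidth in Mbps,
# data amounts in Mbits per frame, so throughput is in frames per second.
print(simulate_expected_throughput(T_p=8.0, D=[0.8, 0.3], mu=5.0, sigma=1.0))
```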
Expected inference throughput for bandwidth with Gaussian distribution According to the empirical measurements in [66, 67], a Gaussian distribution is a reasonable assumption on B_i. In the following, we present the analysis for a system with 2 split points (n = 3), where B_1 and B_2 are i.i.d. random variables and B_i ~ N(μ, σ) for i = 1, 2. The more general case with any number of split points can be analyzed by following the same roadmap. To calculate the expected throughput of the entire pipeline, by Equations 4.6, 4.7 and 4.8, we have:

$$E(T) \;=\; \bigl(1 - \Phi(a_1)\bigr)\bigl(1 - \Phi(a_2)\bigr) T_p \;+\; \frac{D_1}{\sigma} \int_{-\infty}^{T_p} \tau \left(1 - \Phi\!\left(\frac{\tau D_2 - \mu}{\sigma}\right)\right) \phi\!\left(\frac{\tau D_1 - \mu}{\sigma}\right) d\tau \;+\; \frac{D_2}{\sigma} \int_{-\infty}^{T_p} \tau \left(1 - \Phi\!\left(\frac{\tau D_1 - \mu}{\sigma}\right)\right) \phi\!\left(\frac{\tau D_2 - \mu}{\sigma}\right) d\tau \qquad (4.9)$$

where $a_i = \frac{D_i T_p - \mu}{\sigma}$, i = 1, 2, for simplified notation, and Φ and φ denote the standard normal cdf and pdf. Due to limited space, we skip the intermediate steps for the second term in Equation 4.9.
Its closed form can be expressed in terms of the standard normal pdf φ, the cdf Φ, and the bivariate normal cumulative BvN[·, ·; ρ] with the corresponding upper bounds and correlation ρ = D_1 / √c, where c = D_1² + D_2². The result is based on the assumption that T_p > μ / D_j. The last term in Equation 4.9 can be calculated in a similar way.
4.3.2 Expected Effective Accuracy
Recall that effective accuracy is defined as the ratio of the number of correctly classified images over the total number of generated images in a time duration, i.e., α_eff = α T / T_gen. With fixed T_gen, we have E[α_eff] ∝ E[T].
4.3.3 Performance Analysis for Adaptive Compression
With a pre-measured bandwidth distribution, E[T] is a function of D. Since D and α are both functions of the compression ratio γ, we can evaluate the performance in terms of E[T] and E[α_eff] for a given scheme of γ.

To analyze the performance of the proposed adaptive compression scheme, we use inference of CIFAR10 images with MobileNetV2 on 4 Raspberry Pi 3B+ devices as an example. Again, we assume B_1, B_2 and B_3 are i.i.d. random variables and B_i ~ N(μ, σ) for i = 1, 2, 3, where μ = 5 Mbps and σ = 1 Mbps. According to the 3-σ rule, P(2 Mbps ≤ B_i ≤ 8 Mbps) = 99.7%, so the probability of B_i taking other values is negligible.

Following Equation 4.3, we split MobileNetV2 into 4 partitions with D = [24576, 4096, 6144] × 32 bits. We can also compute T_p = 7.87 frames/sec. Even with the minimum possible bandwidth of 2 Mbps, the latter two communication stages have larger throughput than
T_p, so they cannot throttle the pipeline throughput. With only one potential communication stage as the system bottleneck, we can follow Equations 4.6, 4.7 and 4.8 to get

$$E(T) \;=\; \left(1 - \Phi\!\left(\frac{D_1 T_p - \mu}{\sigma}\right)\right) T_p \;+\; \frac{D_1}{\sigma} \int_{-\infty}^{T_p} \tau\, \phi\!\left(\frac{\tau D_1 - \mu}{\sigma}\right) d\tau \qquad (4.11)$$
If we use a static compression scheme, i.e., the same compression ratio regardless of the measurement of B_i, we can compute E(T) with Equation 4.11 straightforwardly, and E[α_eff] ∝ E[T]. To improve the performance, we can use the adaptive compression scheme proposed in Section 4.2.4. The transition point of bandwidth for the compression ratio can be identified as b* = 6.12 Mbps, and the adaptive compression scheme is: γ_1 = 2 for B_1 < b* and γ_2 = 1 for B_1 > b*.
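The runtime side of this scheme is a simple threshold rule, sketched below. The function name is hypothetical; the 6.12 Mbps transition point and the two compression ratios follow the scheme just described, while the sample bandwidths in the usage example are arbitrary.

```python
def select_compression_ratio(measured_bw_mbps, transition_mbps=6.12,
                             gamma_low_bw=2, gamma_high_bw=1):
    """Adaptive scheme: compress harder when the link is slow, and fall back to the
    uncompressed path once the measured bandwidth exceeds the transition point b*."""
    return gamma_low_bw if measured_bw_mbps < transition_mbps else gamma_high_bw

for bw in (3.0, 5.5, 7.0, 9.0):        # illustrative measured bandwidths (Mbps)
    print(bw, "Mbps ->", select_compression_ratio(bw))
```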
We can calculate E(T) for our adaptive compression scheme by doing a piecewise computation in Equation 4.11: decompose each term with respect to D_1. The reason is that D_1 depends on the compression ratio. Thus, for example, the first term in Equation 4.11 can be computed as

$$P\!\left(\frac{B_1}{D_1} > T_p\right) T_p \;=\; \left[\, P\!\left(\frac{B_1}{\hat{D}_1 / \gamma_1} > T_p,\; B_1 < b^*\right) \;+\; P\!\left(\frac{B_1}{\hat{D}_1 / \gamma_2} > T_p,\; B_1 > b^*\right) \right] T_p \qquad (4.12)$$

where $\hat{D}_1$ is the data amount in the first communication stage without compression.
Similarly, we can conduct a piecewise computation to calculate E[α_eff].

The performance of the original model, of static compression with γ = 2, and of the adaptive compression is shown in Table 4.3. Note that E[α_eff] is normalized by multiplying by T_gen. As we can see, by inserting autoencoders to compress the communication, we can significantly boost the expected throughput and expected accuracy simultaneously. Compared to the static compressor, the adaptive compression scheme uses a smaller compression ratio when the bandwidth is large, yet it reaches the same E[T] and an even better E[α_eff].
Table 4.3: Performance analysis: expected throughput and expected effective accuracy for the original MobileNet-v2, with static compression ratio 2, and with the adaptive compression ratio.

                                             E[T]    E[α_eff]
  Original                                   6.04    5.45
  γ = 2                                      7.84    6.99
  γ_1 = 2, γ_2 = 1, b* = 6.12 Mbps           7.84    7.01
The adaptive compression improves E[T] and E[α_eff] of the original network by 29.8% and 28.6%, respectively. The adaptive compression scheme has superior performance because its design considers the tradeoff between bandwidth, throughput and accuracy. When the bandwidth is large, the system is bottlenecked by the computation stage most of the time, so heavier compression has negligible benefit in throughput but can instead decrease the accuracy. The results demonstrate the effectiveness of adaptive scheduling.
4.4 Experiments
4.4.1 Setup
We evaluate the effectiveness of our optimizations using two CNNs (MobileNet-v2, VGG16), two
image classification datasets (CIFAR10, CIFAR100) and two types of edge devices (Raspberry Pi
3B+, NVIDIA Jetson TX2). MobileNet-v2 consists of 19 convolutional layers, where layers 1 and 19 are regular 3×3 convolutional layers, and layers 2 to 18 are inverted-residual blocks constructed with depthwise separable convolution. VGG16 consists of 13 convolutional layers, each performing regular 3×3 convolution. For the datasets, both CIFAR10 and CIFAR100 contain 32×32 RGB color images. CIFAR10 has a total of 60000 images belonging to 10 categories. CIFAR100 has a total of 60000 images belonging to 100 categories. The Raspberry Pi contains a Broadcom BCM2837B0
quad-core A53 (ARMv8) CPU operating at 1.4GHz, and a Broadcom Videocore-IV GPU. The
RAM is 1 GB LPDDR2. Raspberry Pi supports 2.4GHz and 5GHz 802.11b/g/n/ac Wi-Fi. We
insert a 32 GB Micro-SD card for storage. For NVIDIA Jetson TX2, its CPU consists of an ARM
Cortex-A57 @2GHz and a NVIDIA Denver2 @2GHz. Its GPU is a 256-core Pascal operating at
1300MHz. The memory is 8 GB 128-bit LPDDR4 and the storage is 32 GB eMMC 5.1. Jetson
supports 802.11a/b/g/n/ac Wi-Fi communication.
To perform evaluation, we build two edge device pipelines: one consisting of four Raspberry Pi
3B+, and the other consisting of four NVIDIA Jetson TX2. Both pipelines communicate via Wi-Fi.
As discussed in the previous sections, MobileNet-v2 is executed on the Raspberry Pi pipeline and
VGG16 is executed on the Jetson pipeline.
We implement our code using Python 3.7 and TensorFlow (v1.14 for Jetson and TensorFlow Lite for Raspberry Pi).
4.4.2 Evaluation on Compressors
Based on our splitting algorithm (Section 4.2.1), the splitting points are S = {4, 11, 15} for MobileNet-v2, and S = {3, 5, 8} for VGG16. The communication bottleneck only happens at the first splitting point for both CNNs (the size of the activation tensor shrinks dramatically at deep layers). We thus insert compressors of γ = 2, 4, 8 between devices 1 and 2 of the pipeline. The compressor architectures are defined by Table 4.2. We use an AE as the compressor.
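A TensorFlow 2–style sketch of such an inserted compressor is shown below. It is only a minimal channel-reducing auto-encoder under assumed layer shapes; the actual compressor architectures used in the experiments are those specified in Table 4.2, and the function name and sizes here are illustrative.

```python
import tensorflow as tf

def make_compressor(channels_in, ratio):
    """Channel-reducing auto-encoder for one split point (illustrative sketch only)."""
    bottleneck = max(1, channels_in // ratio)
    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(bottleneck, 1, padding="same", activation="relu"),
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(channels_in, 1, padding="same", activation="relu"),
    ])
    return encoder, decoder

# The encoder runs on the sending device and the decoder on the receiving device;
# only the bottleneck activation crosses the wireless link.
enc, dec = make_compressor(channels_in=24, ratio=4)
x = tf.random.normal([1, 32, 32, 24])
print(enc(x).shape, dec(enc(x)).shape)
```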
Tables 4.4 and 4.5 summarize the pre-trained accuracy (i.e., α) achieved and the computation overhead introduced by adding the additional compressor layers. Clearly, 1. the proposed end-to-end training methodology ensures high accuracy even when the compression ratio is very high; 2. the additional computation load due to the inserted compressor is at most 2% of the computation load of the original model. Therefore, after inserting the compressor, there is no need to re-split the layers to re-balance the computation load.
Table 4.4: MobileNet-v2: Top-1 accuracy and computation overhead under various compression ratios

               CIFAR-10                CIFAR-100
               Accuracy   Overhead     Accuracy   Overhead
  Original     90.2       0.00         69.1       0.00
  γ = 2        89.2       0.02         67.4       0.02
  γ = 4        88.5       0.02         66.7       0.02
  γ = 8        87.7       0.02         65.7       0.02
Table 4.5: VGG16: Top-1 accuracy and computation overhead under various compression ratios

               CIFAR-10                CIFAR-100
               Accuracy   Overhead     Accuracy   Overhead
  Original     90.3       0.00         70.3       0.00
  γ = 2        90.0       0.01         69.6       0.01
  γ = 4        89.0       0.01         68.8       0.01
  γ = 8        87.9       0.01         67.5       0.01
Figure 4.5: Expected effective accuracy E[α_eff] under various μ values, when B_i = N(μ, 2). (Four panels: MobileNet-v2 and VGG16 on CIFAR10 and CIFAR100; curves for the uncompressed model, γ = 2, γ = 4, γ = 8, and the adaptive scheme γ_1, γ_2.)
4.4.3 Evaluation on End-to-End Performance
We evaluate the effectiveness of data compression on the overall expected effective accuracy E[α_eff] under various network conditions. Figure 4.5 is measured by fixing σ = 2 Mbps and changing the mean value from μ = 3 Mbps to 9 Mbps. Figure 4.6 is measured by fixing the mean μ = 5 Mbps and changing σ from 0.5 Mbps to 3 Mbps. Note that in all experiments, we set the input image generation rate equal to the throughput of the bottleneck computation stage (i.e., T_gen = T_p). In other words, α_eff = α if and only if the three communication stages never become the bottleneck of the system. In our setup, when B_i is Gaussian, we always have α_eff < α, regardless of the optimization performed. We make the following observations.
Effectiveness of compression For any CNN, dataset and type of edge device, data compression significantly improves the overall effective accuracy. The effective accuracy of the uncompressed model catches up with that of the compressed model only when μ is very large.
Figure 4.6: Expected effective accuracy E[α_eff] under various σ values, when B_i = N(5, σ). (Four panels: MobileNet-v2 and VGG16 on CIFAR10 and CIFAR100; curves for the uncompressed model, γ = 2, γ = 4, γ = 8, and the adaptive scheme γ_1, γ_2.)
For the two plots of MobileNet-v2 in Figure 4.5, when μ = 9, we have μ/D_1 ≈ 1.7 T_p. In addition, we observe that a higher compression ratio is more useful when the network condition is bad. For Figure 4.5, a larger γ leads to significant accuracy improvement compared with a smaller γ when μ is small. The effective accuracy of CNNs with small γ eventually becomes better as μ keeps increasing. This is because when μ is very large, the communication stage is very unlikely to become the bottleneck, and α then becomes the dominant term in the calculation of α_eff. For Figure 4.6, larger compression ratios are more beneficial when the variance of the bandwidth is larger.
Effectiveness of adaptive compression In the experiments, when using the adaptive compression technique (red line with legend "γ_1, γ_2"), we use a single parameter to support two compression ratios between devices 1 and 2. Inference with adaptive data compression almost always achieves the best performance, regardless of the mean and variance of the bandwidth. For example, when inferencing MobileNet-v2 on CIFAR100 in Figure 4.6, we clearly observe that the
adaptive compression scheme leads to higher accuracy than any single-compression-ratio scheme. Naturally, when the variance of the bandwidth is large, we can hardly identify a single compression ratio suitable for all scenarios. If we allowed the communication stage to choose among more compressors, we would expect even higher accuracy than with the current simple adaptive compression scheme.

Performance comparison between MobileNet-v2 and VGG16 It is worth noticing that the shapes of the MobileNet-v2 curves are significantly different from the shapes of the VGG16 curves, and that the effective accuracy of VGG16 is much lower than that of MobileNet-v2, especially in Figure 4.6. This is mainly due to the fact that the computation speed of Jetson is much faster than that of Raspberry Pi: since the computation stage throughput is much higher on the Jetson pipeline and T_gen = T_p, the denominator in the effective accuracy definition is very large, leading to a lower α_eff.
4.5 Summary
We have proposed a framework for improving the throughput and accuracy of pipelined CNN inference on the edge. Our framework is based on data compression techniques that improve the communication performance between edge devices. We have proposed VAE-based compressors to significantly reduce the activation size without significant accuracy loss. We have also proposed an adaptive data compression scheme that considers the distribution of the varying bandwidth, and then optimizes expected accuracy by properly changing the desired compression ratio at runtime.
Chapter 5
Learning Practical Communication
Strategies in Cooperative Multi-Agent
Reinforcement Learning
In Multi-Agent Reinforcement Learning, communication is critical to encourage cooperation
among agents. Communication in realistic wireless networks can be highly unreliable due to
network conditions varying with agents’ mobility, and stochasticity in the transmission process.
We propose a framework
to learn practical communication strategies by addressing three funda-
mental questions: (1) When: Agents learn the timing of communication based on not only message
importance but also wireless channel conditions. (2) What: Agents augment message contents with
wireless network measurements to better select the game and communication actions. (3) How:
Agents use a novel neural message encoder to preserve all information from received messages,
regardless of the number and order of messages. Simulating standard benchmarks under realistic
wireless network settings, we show significant improvements in game performance, convergence
speed and communication efficiency compared with state-of-the-art.
The work described in this chapter has been published in part in: D. Hu, C. Zhang, V. Prasanna, B. Krishnamachari, "Intelligent Communication over Realistic Wireless Networks in Multi-Agent Cooperative Games", 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 9–13, 2022; and D. Hu, C. Zhang, V. Prasanna, B. Krishnamachari, "Learning Practical Communication Strategies in Cooperative Multi-Agent Reinforcement Learning", Asian Conference on Machine Learning (ACML), PMLR, 2022.
Figure 5.1: Two ways of agent-environment interaction: (a) The original way (non-differentiable training). (b) The new way (after lagging and alignment). (Panel (a) shows the per-step flow from the game observation o_i^t through a neural network to the message m_i^t and weight w_i^t, through the wireless environment to the received messages c_i^t, and through a second neural network to the game action a_i^t, reward r_i^t and next observation o_i^{t+1}. Panel (b) wraps the game and wireless environments into a meta-environment, where the observation o_i'^t includes c_i^{t-1}.)
5.1 Method
In this section, we describe the proposed framework in detail. The framework reformulates the
MDP to align agents’ interactions in the game and wireless environments, and augments the obser-
vation space with network measurements such as RSS and ACK. The framework also uses a mes-
sage encoder to embed the received messages and their wireless conditions. The framework aims
to optimize three aspects of communication: when, what, and how.
5.1.1 When: MDP Re-Formulation
Use subscript "-i" to denote variables from all agents other than i. Communication can facilitate the exchange of current observations. A natural design, followed by most existing works [20, 26, 38], is to send and receive a message within the same step. In Figure 5.1(a), at the beginning of step t, agent i makes a local observation o_i^t of the game environment. From o_i^t, a neural network generates a message m_i^t with a weight w_i^t measuring message importance / relevance. Then, based on w_i^t and w_{-i}^t, the messages m_i^t and m_{-i}^t contend to go through the wireless channel. Since some transmissions may fail, we use c_i^t to denote the messages successfully received by i. Next, agent i's own message and all its received ones go through another neural network to generate the next action a_i^t. Finally, all agents interact with the game environment via a_i^t and a_{-i}^t. The game environment returns the current reward r_i^t and the next observation o_i^{t+1}.
A major drawback of the Figure 5.1(a) setting is training non-differentiability. When optimizing the policy π via back-propagation, the gradients of the parameters need to flow from a_i^t back to o_i^t. However, the wireless environment lies in the middle of step t. The mapping implemented by the realistic wireless channel, (m^t, w^t) → c_i^t, is non-differentiable [25] (due to the stochasticity in transmission).

Communication Lagging We propose "communication lagging" to make training differentiable. Denote a' as communication actions and o' as wireless observations (see Section 5.1.2). As shown in Figure 5.1(b), an agent observes at the beginning of step t, yet it does not generate the message m_i^t until the end of step t. After lagging, the "message-wireless environment" and "agent-game environment" interactions happen simultaneously (enclosed by the dotted circle). Thus, the two environments are aligned. Lagging leads to the following tradeoffs: (1) Ease of training: the aligned environments enable end-to-end policy training, as the gradients can flow backward from the end of step t (i.e., a_i^t and a_i'^t) to the beginning of step t (i.e., o_i^t and o_i'^t). The training process is differentiable since no communication channel lies in the middle of one step to block the back-propagation of gradients. We thus develop a general framework (Section 6.4) supporting any existing MARL algorithm on top of a realistic wireless environment. (2) Staleness of messages: action a_i^t is generated with stale information c_i^{t-1}, since the step-(t-1) messages are received at step t. Such staleness may only have a slightly negative impact when the step duration is short in practice [50].
Meta-Environment We abstract a "meta-environment" from the aligned game and wireless environments. The meta-environment is the new game environment: any algorithm for the original game environment can be directly applied to the meta-environment. To define agents' interaction with the meta-environment, we reformulate the Markov Decision Process (MDP) with expanded state and action spaces: (S, A, T, Ω, O, R, γ). Denote S as the state of the meta-environment. The augmented action space A consists of the game actions A^T and communication actions A^C, i.e., A = A^T × A^C. The augmented observation space Ω consists of the game observations Ω^T and wireless observations Ω^C, i.e., Ω = Ω^T × Ω^C. T and O are defined on the meta-state and the augmented observation and action spaces.

Learning Communication Actions Under the reformulated MDP, agents can learn a policy over communication actions to decide "when to communicate". The simplest is the binary action for "send / no-send": a_i^C ∈ A_i^C = {0, 1}. Other examples of learnable a_i^C include: (1) Transmission power P_t: a larger P_t improves the received signal strength at the cost of a higher probability of interference. (2) Medium contention probability p: a larger p increases the probability of securing the medium by suppressing other agents' chances.
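A minimal sketch of how such a meta-environment can be wrapped around separate game and wireless simulators is given below. The class and method names, and the assumed interfaces of game_env and wireless_env, are illustrative assumptions and not the implementation used in this thesis; the key point shown is that messages sent at step t are delivered as part of step t+1 (communication lagging).

```python
class MetaEnv:
    """Sketch: game + wireless environments stepped together, with messages
    sent at step t delivered in the observation at step t+1."""

    def __init__(self, game_env, wireless_env, n_agents):
        self.game, self.wireless, self.n = game_env, wireless_env, n_agents
        self.inbox = [[] for _ in range(n_agents)]     # c_i^{t-1}, initially empty

    def reset(self):
        self.inbox = [[] for _ in range(self.n)]
        return self.game.reset(), self.inbox

    def step(self, game_actions, comm_actions, messages):
        # Game transition and wireless transmission happen within the same step.
        obs, rewards, done, info = self.game.step(game_actions)
        sent = [(i, m) for i, (m, send) in enumerate(zip(messages, comm_actions)) if send]
        self.inbox = self.wireless.deliver(sent)       # lossy, per-receiver delivery
        return (obs, self.inbox), rewards, done, info  # inbox is read at the next step
```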
5.1.2 What: Observation Augmentation
Efficiently exploring the state space S and action space A relies on suitable observations, especially considering the complexity of the wireless environment. However, since agents may be far away from each other, it is infeasible for them to directly observe the full network condition (i.e., the wireless environment is partially observable). To address the issue, we first augment the observation space with an observation / measurement o_i^{C,t} of the wireless environment, i.e., for each agent we concatenate the wireless and game observations: o_i^t = o_i^{T,t} || o_i^{C,t}. Then we let the message sent by i contain i's hidden state h_i^t, where h_i^t embeds the information of o_i^t. When j receives from i, agent j predicts the current network condition. If that condition is believed to be poor, j may suppress its sending action.

Network Measurements Common factors resulting in failed transmissions include limited bandwidth, noise, signal attenuation due to obstacles, and packet collisions. The network condition is complicated: some of the above factors are related to the wireless environment (e.g., limited bandwidth), some are related to the game environment (e.g., obstacles), and others are even related to
agents' policies (e.g., collisions due to simultaneous sending). Agents can understand the network condition from various types of network measurements, such as Radio Signal Strength (RSS) and Acknowledgement packets (ACK). The sender agent can use ACKs to infer the packet receiving rate and the data rate. For brevity, we now only illustrate in detail the benefits of augmenting the observation via the RSS P_s. When agent i receives a message from j, it measures P_s and also infers j's position (thus the distance d between i and j). When agent i receives more messages from other agents across multiple steps, it essentially collects many (P_s, d) data points. The agent can then learn the function P_s = g(d) corresponding to the path loss model of the particular wireless environment. With g, when an agent knows (or predicts) the position of others, it immediately knows the corresponding RSS if it sends a message. Ignoring collisions, the agent knows whether its message can be successfully received, even before it sends out the message. Such knowledge clearly helps with a better policy on communication actions. In addition, RSS information can also reveal terrain information about the game environment. If there is an obstacle between i and j, the received RSS will decrease due to attenuation. If the measured P_s is significantly lower than the estimated g(d), the agent detects an obstacle. Communication then serves as a form of long-range sensing of the terrain, and helps agents better plan their future movements.

In addition to the RSS measurements, we further discuss the use of ACKs. The benefits of ACKs are twofold. First, ACKs can be used to estimate the available bandwidth. When an agent broadcasts, it can request ACKs from other agents and then record the number of ACKs received back. The sending agent can thus infer the packet receiving rate (PRR), and correspondingly the available bandwidth. Second, we can put the information of other network measurements in the payload of an ACK. For example, after agent i receives j's message, it can measure the RSS and put that RSS in the ACK replying to j. This way, both i and j know the RSS at agent i's position.
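The path-loss estimation described above amounts to a simple least-squares fit of a log-distance model, as sketched below. The function names, the fitted model form P_s(d) = P_0 - 10·η·log10(d), the obstacle-detection margin, and all numbers are illustrative assumptions.

```python
import numpy as np

def fit_log_distance_path_loss(distances, rss_dbm):
    """Fit RSS(d) = P0 - 10*eta*log10(d) by least squares; return (P0, eta)."""
    x = -10.0 * np.log10(np.asarray(distances, dtype=float))
    A = np.column_stack([np.ones_like(x), x])
    (p0, eta), *_ = np.linalg.lstsq(A, np.asarray(rss_dbm, dtype=float), rcond=None)
    return p0, eta

def obstacle_suspected(p0, eta, d, measured_rss_dbm, margin_db=6.0):
    """Flag an obstacle if the measured RSS falls far below the fitted model g(d)."""
    predicted = p0 - 10.0 * eta * np.log10(d)
    return measured_rss_dbm < predicted - margin_db

# Synthetic (P_s, d) samples, illustrative only: about 6 dB drop per doubled distance.
d = [2, 4, 8, 16]
rss = [-50, -56, -62, -68]
p0, eta = fit_log_distance_path_loss(d, rss)
print(round(p0, 1), round(eta, 2), obstacle_suspected(p0, eta, 8, -75))
```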
5.1.3 How: Message Encoder
Table 5.1: Notations related to message encoding

  Notation                       Meaning
  m_ij ∈ R^l                     Message from agent j to i
  N_i                            The set of agents successfully sending to i
  x ∈ [0, n-1]                   Cardinality of N_i
  c_i = {m_ij : j ∈ N_i}         The set of messages received by i
  Ψ(c_i)                         Encoding function mapping the set of messages to a vector in R^{d'}
Message Encoder Factors such as signal path loss, signal attenuation and medium contention can lead to the failed transmission of messages. We use the term channel stochasticity to refer to the non-deterministic process of whether an agent can successfully receive the messages from others. In the following, we consider a single step, and thus omit the superscript (t). Denote the vector m_ij ∈ R^l as the message from agent j to i. Denote N_i as the set of agents successfully sending to i. The cardinality of N_i ranges from 0 (when i receives no message) to n-1 (when i receives from all other agents). Define c_i = {m_ij : j ∈ N_i} as the set of messages received by i. The encoder performs a function Ψ(c_i) to map the set of messages to a vector in R^{d'}. Common neural architectures used to encode the received messages in MARL systems have limitations when deployed over realistic wireless channels. For example, DGN [60] and G2ANet [61] require multiple rounds of message exchange within one step, which is not desired or supported under realistic wireless channels since it incurs significant bandwidth consumption and delay; it is not energy efficient for the edge device either. Some works [68, 69] assume a fixed number of messages is received at each step, and thus a fixed input length of concatenated messages; this does not work under wireless network stochasticity. MAGIC [49] requires a centralized scheduler to generate attention weights in the message exchange process, which is not feasible here. In the literature [26, 38, 25], another common way to implement Ψ is to first aggregate the messages into a single vector and then encode the aggregated vector with a vector function f_0, e.g., sum aggregation

$$\Psi(c_i) \;=\; f_0\!\left( \sum_{j \in N_i} m_{ij} \right). \qquad (5.1)$$
We further propose an architecture that overcomes the above limitations and empirically outperforms the encoder model of Equation 5.1. We implement the message encoder as follows:

$$\Psi(c_i) \;=\; \sum_{j \in N_i} \mathrm{MLP}(m_{ij}) \qquad (5.2)$$

where MLP(·) is a Multi-Layer Perceptron with trainable parameters. All received messages go through a shared MLP.
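A PyTorch-style sketch of Equation 5.2 is given below. The class name and dimensions are assumptions for illustration (the hidden size 128 mirrors the 2-layer MLP mentioned in Section 5.2.1, but the exact implementation in the thesis may differ); the essential property shown is that the output is invariant to how many messages arrive and in which order.

```python
import torch
import torch.nn as nn

class MessageEncoder(nn.Module):
    """Permutation-invariant encoder of Equation 5.2: a shared MLP applied to each
    received message, followed by a sum over however many messages arrived."""

    def __init__(self, msg_dim, hidden_dim=128, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(msg_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, messages):               # messages: list of (msg_dim,) tensors
        if len(messages) == 0:                 # nothing received this step
            return torch.zeros(self.mlp[-1].out_features)
        return self.mlp(torch.stack(messages)).sum(dim=0)

enc = MessageEncoder(msg_dim=16)
out = enc([torch.randn(16), torch.randn(16)])  # count and order of messages do not matter
print(out.shape)
```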
5.1.4 General Framework
We propose a general framework to learn practical communication strategies, since our techniques can be applied to various MARL training algorithms in a plug-and-play fashion.

For illustration, we describe an example implementation based on Policy Gradient (PG) with the Generalized Advantage Estimator (GAE) [70], which can be trained in the on-policy manner. We observe that IC3Net and CommNet cannot converge in the upgraded environment with the original REINFORCE algorithm. Therefore, we modify the baseline algorithm by further integrating GAE. The policy gradient using REINFORCE is
$$\nabla_\theta J(\theta) \;=\; \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(\hat{a}_i^t \,\middle|\, o_i^t\right) \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_i^{t'} \;-\; b(s_i^t) \right), \qquad (5.3)$$
where b(s_i^t) is the state-dependent baseline, approximated by V_φ(o_i^t), which is updated by minimizing
$$L(\phi) \;=\; \sum_{i=1}^{N} \sum_{t=1}^{T} \left( V_\phi(o_i^t) \;-\; \sum_{t'=t}^{T} \gamma^{t'-t} r_i^{t'} \right)^{2}. \qquad (5.4)$$
In Equation 5.3, the target value is estimated using the Monte-Carlo estimator, which estimates the return with a single trajectory. Although it is an unbiased estimator of the true expected return, it leads to large variance due to the propagation of variance through the sum, and it requires large amounts of data to reduce the variance. We integrate GAE to overcome these disadvantages. The key idea of GAE is to estimate the advantage as a combination of Monte-Carlo returns and value function approximation so as to balance bias and variance.
GAE is based on a series of k-step return estimators. A k-step return estimator estimates the first k steps with Monte-Carlo returns and the remaining steps with the value function approximation, i.e.,

$$\hat{A}_k \;=\; \sum_{t'=t}^{t+k} \gamma^{t'-t} r_{t'} \;+\; \gamma^{k} V_\phi(s_{t+k}). \qquad (5.5)$$
By ranging k from 1 to T, GAE takes an exponentially-weighted average of all possible k-step estimations, i.e.,

$$A^{GAE} \;=\; \sum_{k=1}^{T} w_k \hat{A}_k. \qquad (5.6)$$
It can be rewritten as:
$$A^{GAE}(s_t, a_t) \;=\; \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t}\, \delta_{t'} \qquad (5.7)$$

where

$$\delta_t \;=\; r(s_t, a_t) \;+\; \gamma V_\phi(s_{t+1}) \;-\; V_\phi(s_t) \qquad (5.8)$$
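The sketch below computes the TD residuals of Equation 5.8 and the GAE advantage of Equation 5.7 for one agent's trajectory. The function name is hypothetical and the assumption that the value after the terminal state is zero is the sketch's, not necessarily the thesis implementation's.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Equations 5.7-5.8) for one trajectory.

    rewards: r_1..r_T, values: V(s_1)..V(s_T); the value after the terminal
    state is taken to be 0 here (an assumption of this sketch).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]       # Equation 5.8
        running = delta + gamma * lam * running                   # Equation 5.7, unrolled
        advantages[t] = running
    return advantages

print(gae_advantages([0.0, 0.0, 1.0], [0.1, 0.3, 0.8]))
```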
Algorithm 3 Training algorithm of LAUREL
 1: Initialize all neural networks and replay buffer D
 2: for episode = 1 to M do
 3:   Initialize observation o; initialize messages c to all 0s
 4:   for t = 1 to end of episode do
 5:     Receive messages c_i^{t-1}; get network measurement o_i^{C,t}
 6:     ξ_i^t ← encode messages by Ψ_i(c_i^{t-1})
 7:     e_i^t ← fuse o_i^{C,t} with game observation o_i^{T,t}
 8:     h_i^t, C_i^t ← LSTM(h_i^{t-1}, C_i^{t-1}, ξ_i^t, e_i^t)
 9:     a_i^t, m_i^t ← Headers(h_i^t)
10:     V_i^t ← V_φ(h_i^t)
11:     Take action a^{T,t} in the game environment
12:     Send m^t by a^{C,t} in the wireless environment
13:     Observe reward r^t; transit to next state s^{t+1}
14:     Add transition to replay buffer D
15:   end for
16:   T ← episode length
17:   A_i^t ← GAE({r_i^t | t ∈ [1, T]}, {V_i^t | t ∈ [1, T]}) for all i
18:   Compute gradients with Equations 5.9 and 5.4
19:   Update θ, φ by gradient ascent / descent
20: end for
Here λ ∈ [0, 1] is the decay parameter controlling the compromise between bias and variance. Therefore, our policy gradient is
$$\nabla_\theta J(\theta) \;=\; \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_i^t \,\middle|\, o_i^t\right) \left( \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t}\, \delta_{t'} \right). \qquad (5.9)$$
As shown in Figure 6.1a, five neural networks interact with each other: the Message Encoder, the Observation Embedder, the LSTM, the Headers and V_φ. The Observation Embedder consists of fully connected layers embedding the input augmented observation into e_i^t. The Message Encoder follows the Section 6.2 design and embeds the received messages into ξ_i^t. The LSTM generates the hidden embedding h_i^t by combining the current state information (e_i^t and ξ_i^t) and the memories (h_i^{t-1} and C_i^{t-1}). The Headers consist of an action header and a message header.
Figure 5.2: Example architecture of on-policy LAUREL
These generate the augmented action a_i^t and the message m_i^t, respectively. V_φ generates the approximated value V_i^t, which is later used to compute the advantage per Equation 5.7. V_φ and the computation of A^{GAE} are required only in training, not in execution. Algorithm 3 shows the training algorithm. Lines 5 - 11 are executed by each agent i. We perform the gradient update at the end of each episode instead of at every step, as we need the Monte-Carlo return when fitting V_φ and computing A^{GAE}.
We can similarly integrate off-policy algorithms into our framework. An example implementation follows the architecture of Figure 5.3. The Mixing Network follows the design in QMIX [28] (in Section 6.5, we evaluate QMIX [71]-based algorithms).
We perform gradient descent on the loss

$$L(\theta) \;=\; \sum_{t} \bigl( y_t^{tot} - Q_{tot}(o, c, a; \theta) \bigr)^{2} \qquad (5.10)$$

where $y_t^{tot} = r + \gamma \max_{a'} Q_{tot}(o', c', a'; \theta')$ and c is the received message. Q_tot is the full architecture in Figure 5.3 with parameters θ, and θ' denotes the target network parameters. (o, c, a, r, o', c') is transition t in the sampled trajectory. We omit the details for brevity. Other variants (e.g., based on Actor-Critic) can also be derived similarly.
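A PyTorch-style sketch of the TD loss in Equation 5.10 is shown below. The mixer outputs are treated as black-box tensors, the function name is hypothetical, and the batch values in the usage line are placeholders rather than results from this thesis.

```python
import torch

def qmix_style_loss(q_tot, q_tot_target_next, rewards, gamma=0.99):
    """TD loss of Equation 5.10 for a batch of transitions.

    q_tot             : Q_tot(o, c, a; theta) for the actions actually taken
    q_tot_target_next : max_a' Q_tot(o', c', a'; theta') from the target network
    """
    targets = rewards + gamma * q_tot_target_next
    return ((targets.detach() - q_tot) ** 2).mean()

# Illustrative tensors standing in for mixer outputs on a batch of 4 transitions.
loss = qmix_style_loss(torch.randn(4), torch.randn(4), torch.zeros(4))
print(loss.item())
```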
Figure 5.3: Example architecture of off-policy LAUREL. (For each agent i, the observation o_i^t and the encoded previous messages m_i^{t-1} pass through MLPs into a Q network that produces Q_i^t, the action a_i^t and the message m_i^t; the per-agent values Q_i^t are combined by the Mixing Network into Q_total^t.)
5.2 Experiments
In this section, we evaluate our proposed framework with two categories of algorithms: on-policy and off-policy. We choose the most popular algorithms in each category and evaluate them on two standard game benchmarks. We also implement a MANET with parallelized interaction as the communication environment.
5.2.1 Setup
Wireless Environment Following most existing works [20, 26, 25], we let agents perform single-hop broadcast. We implement a 1-hop mobile network without Access Points (AP), as in [72]. For the communication model, we follow the "log distance path loss" model. We model interference as the receiver hearing multiple signals in range. We also consider background noise and attenuation due to obstacles. For the communication protocol, we implement slotted p-CSMA [41]. Agents following the protocol perform fully distributed execution without the need for a centralized controller.

Communication Protocols: Slotted p-CSMA
When an agent intends to transmit a packet, it first randomly chooses a counter in the range of the "counter window size". When the counter counts down to 0, it senses the medium. If the medium is free (i.e., the sensed signal strength is lower than the threshold η_f), the agent transmits with probability p. Otherwise, it chooses a random counter again and repeats the process.
Arbitration of successful receiving
We use the SINR (signal-to-interference-plus-noise ratio, the power ratio of the received signal strength over noise and interference). When a packet arrives at an agent, the agent senses the packet's signal strength as ρ. Denote ρ' as the interfering signal strength at the agent (if present), N as the background noise, and η_r as the SINR threshold. If ρ / (ρ' + N) < η_r, the packet cannot be decoded correctly.

In other words, there are three cases for the received signal strength: 1. If ρ > η_r (ρ' + N), then the agent successfully receives the packet; 2. If η_f ≤ ρ ≤ η_r (ρ' + N), then the agent knows there is a packet coming, but it cannot successfully receive the contents; 3. If ρ < η_f, then the agent does not even know a packet arrived.
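The three-way outcome above can be implemented as a small arbitration routine, sketched below. The function and threshold names follow the notation reconstructed in this section; converting dBm values to milliwatts before summing interference and noise is an assumption of this sketch, and the default numbers mirror the parameter list that follows.

```python
import math

def dbm_to_mw(x_dbm):
    return 10.0 ** (x_dbm / 10.0)

def reception_outcome(rss_dbm, interference_dbm=-200.0, noise_dbm=-95.0,
                      sense_threshold_dbm=-78.0, sinr_threshold_db=20.0):
    """Return 'received', 'sensed_only' (packet heard but undecodable), or 'missed'."""
    if rss_dbm < sense_threshold_dbm:
        return "missed"                                    # below carrier-sense threshold
    denom_mw = dbm_to_mw(interference_dbm) + dbm_to_mw(noise_dbm)
    sinr_db = rss_dbm - 10.0 * math.log10(denom_mw)        # SINR in dB
    return "received" if sinr_db >= sinr_threshold_db else "sensed_only"

print(reception_outcome(rss_dbm=-60.0))                          # clear channel
print(reception_outcome(rss_dbm=-72.0, interference_dbm=-60.0))  # strong interference
```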
Parameters of Wireless Environment
Default settings: obstacle attenuation factor 4.5; noise N = -95 dBm; RSS threshold η_f = -78 dBm; SINR threshold η_r = 15 dB for PP^Obs_n and 20 dB for the others; contention probability in the p-CSMA protocol p = 0.3; counter window size in the p-CSMA protocol W = 15 time slots.
MARL Algorithms & Hyperparameters We evaluate both on-policy and off-policy designs.

Off-policy We compare three designs. The first is QMIX [28], the state-of-the-art value-decomposition-based MARL training algorithm. The second enhances the communication part of the first design by integrating the TarMAC message aggregation function [48] into the original QMIX training. The third is the LAUREL version of QMIX, whose implementation follows Figure 5.3 in Section 6.4. Note that TarMAC is the state-of-the-art design performing attention-based message aggregation.

On-policy We again compare three designs. Among them, CommNet [26] and IC3Net [38] are two state-of-the-art MARL algorithms with learned communication schemes. The LAUREL variant of IC3Net follows Figure 6.1a in Section 6.4. Note that due to the complexity of the realistic environment, vanilla REINFORCE does not converge; all three designs are trained with REINFORCE+GAE for a fair comparison.

We conduct most experiments with the off-policy models due to their significantly higher sample efficiency and faster convergence.
Training Hyperparameters and Neural Network Architecture
We repeat all experiments 5 times with different random seeds and report the average performance and variance of the metrics. For the on-policy algorithms, we use an LSTM cell with hidden dimension 128 in the policy net. For the off-policy algorithms, we use a GRU cell with a 2-layer MLP of hidden dimension 128 as the Q-value network, and a 3-layer MLP with a Hypernetwork [73] as the mixing network. For both versions of LAUREL, we use a 2-layer MLP with hidden dimension 128 to implement the message encoder according to Equation 6.3. All algorithms are trained with the Adam optimizer [74] with learning rate 0.0005.
5.2.2 Predator Prey
"Predator-Prey" is a standard MARL benchmark [25, 38, 50]. In this game, m predators and 1 prey are initially randomly placed in an n×n grid world (denoted as PP_n). In one step, each predator can take one of five possible game actions: moving to an adjacent grid cell or staying still. A reward is given when a predator catches the prey by moving to the prey's position. The game terminates when all predators catch the prey. For observation, each predator has vision v.

We further propose variants of the PP environment to better simulate realistic applications (e.g., SAR with complex field terrain). In the vanilla setting used by [25, 26, 38], the grid world is flat and does not contain any obstacles. In the advanced setting, we introduce k obstacles. An obstacle is specified by a size ℓ and can be either horizontal or vertical. Obstacles affect both the game and wireless environments, since they block the movement of agents and also attenuate the wireless signals passing through them. To initialize each episode, we randomly generate the obstacle orientations and positions. We denote such an environment as PP^Obs_n. For all experiments, we set the grid size n = 10. There are 3 agents, each with vision v = 0 (i.e., similar to [38], an agent only sees the grid cell it is currently in), number of obstacles k = 1 and obstacle size ℓ = 9.
Figure 5.4: Comparison with state-of-the-art methods. (Two panels plot steps to catch the prey versus the number of environment steps on PP^Obs_10: the on-policy methods CommNet, IC3Net and LAUREL_on, and the off-policy methods QMIX, TarMAC and LAUREL_off.)
Comparison with State-of-the-Art LAUREL uses RSS as the wireless observation. From Figure 5.4, we observe that for both the on-policy and off-policy algorithms, LAUREL significantly shortens the number of steps needed to catch the prey. In addition, the variances of the LAUREL curves are very small. For CommNet, the model hard-codes an all-to-all communication scheme: at each step, all agents broadcast messages. In a realistic wireless network environment, sending more messages can reduce the number of successfully received messages due to the increased chance of collision and interference. Therefore, it can be hard for an algorithm that always broadcasts to stably learn a good policy, as reflected by the large variance of the CommNet curve. IC3Net performs better than CommNet since IC3Net has gated communication which mutes unimportant messages. LAUREL further improves upon IC3Net due to its intelligence in all of the "when", "what" and "how" aspects. For the off-policy comparisons, QMIX converges to a policy with significantly more steps, which shows the importance of communication. LAUREL and TarMAC both converge to a similar number of steps; however, LAUREL converges faster and with smaller variance. Note that in the Lumberjacks environment (Section 5.2.3), TarMAC fails to learn a good policy while LAUREL can.
Figure 5.5: Average communication action in one trajectory. (Curves: the average communication action per agent versus the step in one trajectory, for Agents 1-3 on PP^Obs_10 with off-policy LAUREL. Bars: the probability of the first prey-catching step.)
Learned Communication Strategy We analyze the communication scheme learned by our framework. We first answer "at which steps of the trajectory are the agents more likely to send messages". In Figure 5.5, we use the same configuration of LAUREL as in Figure 5.4. Once training converges, we freeze the model and evaluate it on 2000 trajectories. We record two metrics: (1) the binary communication action $a_i^t$ of each agent $i$ at each step $t$, and (2) the first step $t_0$ at which at least one predator catches the prey. The left vertical axis (corresponding to the curves) of Figure 5.5 records the average $a_i^t$ over the 2000 trajectories. The right vertical axis (corresponding to the bars) records the probability of $t_0$. We have the following observations: (1) The three curves almost overlap with each other since the agents are homogeneous. (2) In the initial few steps, the agents are more likely to communicate. This allows agents to learn the others' initial positions, as well as to probe the wireless and game environments. The sooner they know such information, the better they collaborate. (3) The agents are also more likely to communicate when they catch the prey (as can be seen from the similar shape of the curves and the bars after step 10). After the prey-catching agent informs the others of the prey's position, the other agents can approach the prey directly along the shortest path.
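The bookkeeping behind Figure 5.5 can be sketched as follows. This is only an illustrative evaluation loop: env, policy and their interfaces are hypothetical placeholders, not the actual simulator API.

```python
import numpy as np

def evaluate_comm_schedule(env, policy, n_agents=3, n_episodes=2000, max_steps=40):
    """Record per-step communication actions a_i^t and the first prey-catching step t0."""
    comm = np.zeros((n_episodes, max_steps, n_agents))   # binary "send a message" decisions
    first_catch = np.full(n_episodes, max_steps)          # default: prey not caught within max_steps
    for ep in range(n_episodes):
        obs = env.reset()
        for t in range(max_steps):
            game_actions, comm_actions = policy.act(obs)   # frozen policy, no exploration
            comm[ep, t] = comm_actions
            obs, _, caught, done = env.step(game_actions)
            if caught and first_catch[ep] == max_steps:
                first_catch[ep] = t                        # only the first catching step is recorded
            if done:
                break                                      # steps after termination count as "no message"
    avg_comm = comm.mean(axis=0)                           # curves in Figure 5.5: (max_steps, n_agents)
    catch_prob = np.bincount(first_catch.astype(int), minlength=max_steps + 1) / n_episodes
    return avg_comm, catch_prob                            # bars in Figure 5.5
```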
Figure 5.6: Communication adapted to limited bandwidth (PP$^{Obs}_{10}$, off-policy). Left: steps to catch prey; right: average communication action; x axis: number of environment steps. Curves: TarMAC, TarMAC-BW, LAUREL$_{off}$, LAUREL$_{off}$-BW.
Adapting to Complicated Wireless Environment We further study the effect of a bandwidth limit on the agents' communication behaviors. In Figure 5.6, we consider two wireless environments: one with more bandwidth resources and the other with fewer bandwidth resources (achieved by reducing the number of time slots in slotted p-CSMA [41]). All other parameters remain at their defaults. The curves marked by "-BW" correspond to evaluation in the bandwidth-reduced environment. We observe different behaviors of TarMAC and LAUREL. The right plot shows the average communication action; e.g., a value of 0.5 means that each agent has a 50% probability of sending a message at each step. Since TarMAC agents always send a message in each step (value 1), we do not plot their communication curves. We observe that: (1) The performance of TarMAC is very sensitive to the wireless environment. When the wireless environment becomes more complicated, the total number of steps to catch the prey increases significantly. (2) For LAUREL, making the wireless environment more complicated only results in slightly slower convergence. From the right plot, we observe that LAUREL successfully adapts its policy to the changed environment by reducing the number of communications.

Table 5.2: Benefits of proposed message encoder

Encoder          MLP             Avg             Eq. 6.3
PP^{Obs}_{10}    24.03 ± 5.12    21.24 ± 0.98    18.71 ± 0.58
LJ^{5,3}_{10}    29.00 ± 3.60    38.15 ± 1.27    22.58 ± 1.25
Ablation Study We show how our encoder improves the quality of the learned policy. In Table 5.2, we compare three different encoding architectures, with all other configurations kept equivalent. For the MLP encoder, suppose each message is a length-$d$ vector. With 3 agents, the input $e$ is a length-$3d$ vector. The subvector $[e]_{id:(i+1)d}$ is filled with agent $i$'s message if the message from agent $i$ is received, and with 0s otherwise. Then $e$ is fed into a 2-layer MLP to generate the encoded vector. For the "Avg" encoder, we first aggregate the received raw messages by taking the vector mean, and then feed the aggregated vector into a 2-layer MLP. Clearly, the proposed encoder based on Equation 6.3 leads to significantly better agent performance.
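For concreteness, the two baseline encoders can be sketched as follows (a minimal PyTorch sketch with illustrative layer sizes; the proposed encoder of Equation 6.3 instead applies a shared MLP to each received message and sums the results).

```python
import torch
import torch.nn as nn

d, n_agents, hidden = 16, 3, 64   # illustrative message length and layer sizes

# "MLP" baseline: zero-padded concatenation of per-agent message slots, then a 2-layer MLP.
mlp_enc = nn.Sequential(nn.Linear(n_agents * d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

def encode_mlp(msgs_by_sender):            # dict: sender index -> length-d message tensor
    e = torch.zeros(n_agents * d)          # slot i*d:(i+1)*d stays 0 if agent i's message is lost
    for i, m in msgs_by_sender.items():
        e[i * d:(i + 1) * d] = m
    return mlp_enc(e)

# "Avg" baseline: mean of the received raw messages, then a 2-layer MLP.
avg_enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

def encode_avg(msgs_by_sender):
    return avg_enc(torch.stack(list(msgs_by_sender.values())).mean(dim=0))
```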
5.2.3 Lumberjacks
Lumberjacks [75] is a multi-resource spatial coverage problem [76]. In a grid world, $p$ lumberjacks cooperate to chop $q$ trees. Each tree has a "strength" $k$, meaning the tree will be chopped when at least $k$ agents are adjacent to it. We set $k = 2$ for all trees. An agent obtains rewards $r_1 = 0.05$ and $r_2 = 0.5$ for observing and chopping a tree, respectively. A step penalty $r_3 = -0.1$ is used to encourage agents to chop the trees within a minimum number of steps. Similar to PP, each lumberjack agent can move to an adjacent grid cell or stay still in a step. We modify the original setting [77] by assigning trees a signal attenuation of 4.5. The game terminates once all trees are chopped or the maximum number of steps is reached. We denote by LJ$^{p,q}_n$ a Lumberjacks game within an $n \times n$ grid.
Figure 5.7: Comparison with state-of-the-art methods. Left: LJ$^{5,2}_{10}$, off-policy; right: LJ$^{5,3}_{10}$, off-policy. Curves: QMIX, TarMAC, LAUREL$_{off}$; x axis: number of environment steps; y axis: mean episode rewards.
Figure 5.7 compares the convergence of reward between LAUREL and state-of-the-art methods. Both environments contain 5 agents. In LJ$^{5,2}_{10}$, with 5 agents and 2 trees, QMIX achieves similar reward to LAUREL. In LJ$^{5,3}_{10}$, LAUREL achieves significantly higher reward than QMIX. In both environments, TarMAC fails to learn a good policy. We conclude the following. When there are many agents and few trees, collaboration is less critical: even if each agent searches for a tree on its own and then waits there, there is still a high chance that two agents land adjacent to the same tree by coincidence. This is why vanilla QMIX performs well in LJ$^{5,2}_{10}$. However, when we increase the number of trees in the environment, collaboration becomes more critical, and the communication strategy becomes important. So we see a performance drop for QMIX in contrast to a performance boost for LAUREL. Finally, the poor performance of TarMAC shows that it is important to account for the complicated wireless environment in the algorithm design. Otherwise, communication can be even more harmful than no communication at all, as shown by the TarMAC-QMIX comparison.
5.3 Summary
We have proposed a general framework to learn practical multi-agent communication strategies.
Our techniques comprehensively address the fundamental aspects of communication, “when”,
“what” and “how”, with theoretical and empirical justifications. The two implementations of our
framework (on- / off-policy) significantly improve the agents’ performance, especially in compli-
cated environments.
Chapter 6
Wireless Communication-Enhanced Value
Decomposition for Multi-Agent
Reinforcement Learning
Communication has demonstrated great potential in encouraging cooperation in multi-agent rein-
forcement learning systems. However, when deploying the system in a realistic wireless environ-
ment, the following challenges arise: (1) dynamic channel condition due to agent mobility, (2) joint
influence of wireless channel condition and message content on the policy and (3) stochasticity in
the wireless transmission process causing loss and out-of-order reception of messages. To over-
come the challenges, we design a framework for agents to learn an intelligent and flexible policy.
The framework is based on an augmented Markov Decision Process and allows end-to-end training
even in the presence of non-differentiable communication channels. We follow the paradigm of
Centralized Training with Decentralized Execution (CTDE). With respect to decentralized execu-
tion by each agent, we develop a novel neural message encoder to preserve all information from
received messages, thus addressing the channel stochasticity. With respect to centralized training
via value decomposition, we capture the learned inter-agent communication pattern with a graph,
and further design a GNN based decomposition network. This leads to high expressive power
and permutation invariance. It also guarantees the consistency between local and joint action
selections. Simulating standard benchmarks under realistic wireless network settings, we show
significant improvements in game performance, convergence speed and communication efficiency compared with state-of-the-art methods.¹

¹ This work is under submission to IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
6.1 Modeling: Augmented MDP on Aligned Environments
In the presence of realistic wireless communication channels, traditional modeling of MARL results in a non-differentiable training process. We have analyzed this in Chapter 5. As shown in Figure 5.1a, when optimizing the policy via back-propagation, the gradients of the parameters need to flow from $a_i^t$ back to $o_i^t$. However, the wireless environment lies in the middle of step $t$. The mapping implemented by the realistic wireless channel, $(m^t, w^t) \rightarrow c_i^t$, is non-differentiable [25].
We have proposed "communication lagging" to make training differentiable. In retrospect, as shown in Figure 5.1b, an agent observes at the beginning of step $t$, yet it does not generate its message $m_i^t$ until the end of step $t$. After lagging, the "message-wireless environment" and "agent-game environment" interactions are aligned, which we abstract as a "meta-environment". We have reformulated the Markov Decision Process (MDP) with expanded state and action spaces. We augment the observation space $\mathcal{O}$ with an observation / measurement $o_i^{C,t}$ of the wireless environment, i.e., for each agent, we concatenate the game and wireless observations: $o_i^t = o_i^{T,t} \,\|\, o_i^{C,t}$. Then we let the message sent by agent $i$ contain $i$'s hidden state $h_i^t$, where $h_i^t$ embeds the information of $o_i^t$. When agent $j$ receives a message from $i$, agent $j$ predicts the current network condition. Refer to Chapter 5 for more details.
Figure 6.1: The architecture of $N$ agents (assume $N = 3$): (a) a decentralized Agent Network, consisting of an Observation Fuser, a Message Encoder and a Q Generator (with a GRU memory unit, a Q head and an M head); (b) the overall architecture, in which the agents' individual $Q_i^t$ values are combined by a Mixing Network into $Q^t_{total}$.
6.2 Decentralized Agent Execution: Information Extraction
under Transmission Stochasticity
Following the CTDE paradigm, we design a value decomposition based framework as shown in Figure 6.1. Note that the decentralized agent networks are involved in both the training and execution processes. Figure 6.1(a) shows a decentralized Agent Network consisting of an Observation Fuser, a Message Encoder and a Q Generator. For brevity, we depict agent 1's network only. Figure 6.1(b) shows the overall architecture consisting of decentralized Agent Networks and a centralized Mixing Network (also called the value decomposition network from a top-down view). The Mixing Network combines all agents' $Q$ values to calculate a global $Q$ and is only used in the training phase.
In this section, we introduce the neural architecture for the decentralized agents, which consists of the observation fuser, the message encoder, and the $Q$ value generator. As shown in Figure 6.1(a), at the beginning of a step, the observation fuser combines the local game observation $o_i^t$ and the local wireless observation $o'^t_i$. Meanwhile, the message encoder extracts information from the received messages. The dashed line represents transmission stochasticity. Then the fused observation $e_i^t$, as well as the encoded message, are handled by the $Q$ value generator. The $Q$ generator has internal memory units and generates a hidden state $h_i^t$, from which it generates the message $m_i^t$, the individual action-value $Q_i^t$, and the greedy action $a_i^t$ with the largest individual action-value.
6.2.1 Observation Fuser
Wireless Observation Based on the augmented MDP modeling, a properly selected wireless observation can endow the agent with an understanding of the current state, including the wireless network condition. Following our previous work, we choose RSS as the default wireless observation due to the rich information it carries about the network status and the terrain.
Neural Architecture Design As in Figure 6.1(a), we use fully connected layers to embed the game observation $o^t$ and the wireless observation $o'^t$ respectively. Then the hidden embeddings are concatenated and fed into another fully connected layer to be further embedded into $e^t$.
6.2.2 Message Encoder
In this subsection, we formalize the mathematical properties required for the stochastic communication process to encode received messages in a lossless way.
Stochasticity in Communication An agent does not know (1) who will send a message at what time, (2) whether the message will have a large enough signal strength to be successfully received, (3) in what order messages will arrive, and (4) what the messages will contain. For point 1, our agents learn an intelligent communication policy based on the current condition, instead of following a fixed schedule. For point 2, many factors (e.g., path loss, fading, noise, interference) can cause packet loss. For point 3, due to medium contention and varying network latency, the multiple messages received by a single agent in one step can arrive in an arbitrary order. For point 4, for the hidden state embedding contained in a message, the vector elements can take arbitrary numerical values representable by a floating point number.
Properties of the Message Encoder Given the stochasticity in communication, the message encoder has to satisfy particular mathematical properties to extract information properly. We again consider a single step and omit the superscript $t$. Refer to Table 5.1 for notations. The encoder implements a function $\phi(c_i)$ that maps the set of received messages to a vector in $\mathbb{R}^{d'}$. $\phi$ should satisfy the following two properties.
(1) Permutation invariance. This property addresses stochasticity 3 above. Since we define the input to $\phi$ as an unordered set $c_i$, we need to preserve a well-known property of such a set function, namely permutation invariance [78]. By definition, for any $c_i = \{m_{i1}, \ldots, m_{ik}\}$ and any permutation $\pi$ on the indices $\{1, \ldots, k\}$, a permutation invariant function satisfies
\[ \phi(\{m_{i1}, \ldots, m_{ik}\}) = \phi(\{m_{i\pi(1)}, \ldots, m_{i\pi(k)}\}) \qquad (6.1) \]
We show an example in our scenario. Suppose agent 1 receives messages from agents 2, 3 and 4. Imagine two sequences of message arrival, $(2, 3, 4)$ and $(3, 4, 2)$. The encoder should not care about the sequence and should generate the same output for both cases. Thus, $\phi(\{m_{12}, m_{13}, m_{14}\}) = \phi(\{m_{13}, m_{14}, m_{12}\})$.
(2) Injectiveness. This property addresses stochasticities 1, 2, and 4. To preserve all information in the encoding process, we should have
\[ c_i = c_i' \iff \phi(c_i) = \phi(c_i') \qquad (6.2) \]
This way, agent $i$ generates a unique encoding for $c_i$, so that it knows exactly the status of all the agents who successfully sent a message to it. For a function taking vector inputs, injectiveness is easy to satisfy. For example, a linear mapping $\phi'(m') = W m'$ is injective when $W$ has full rank. However, it is non-trivial to ensure injectiveness of $\phi$ whose input is a set of non-fixed size. In the literature [26, 38, 25], a common way to implement $\phi$ is to first aggregate the messages into a single vector and then encode the aggregated vector by a vector function, e.g., sum aggregation $\phi(c_i) = \phi'\!\left(\sum_{j \in \mathcal{N}_i} m_{ij}\right)$. Unfortunately, such a $\phi$ is not injective, since regardless of $\phi'$ we can find cases where $c_i \neq c_i'$ yet $\sum_{j \in \mathcal{N}_i} m_{ij} = \sum_{j \in \mathcal{N}_i'} m'_{ij}$.
Neural Architecture Design The ultimate goal of the mathematical analysis above is to inform the design of a neural network function approximator. We design the encoder with the same architecture as proposed in Chapter 5. Next we show that our encoder theoretically satisfies the above two properties and thus preserves all information regardless of $c_i$.
\[ \text{Message encoder:} \quad \phi(c_i) = \sum_{j \in \mathcal{N}_i} \mathrm{MLP}(m_{ij}) \qquad (6.3) \]
Theorem 1. There exists a set of parameters for the encoder architecture defined by Equation 6.3, such that $\phi$ is both permutation invariant and injective.
Proof of Theorem 1. To prove permutation invariance, note that vector addition is commutative. So for any permutation $\pi$, the sum of the sequence $\mathrm{MLP}(m_{i\pi(1)}), \ldots, \mathrm{MLP}(m_{i\pi(k)})$ is always the same.
For injectiveness, we recall the conclusion from Lemma 5 of [79].
Lemma 1. Assume the space of messages, $\mathcal{M}$, is countable. Then there exists $f: \mathcal{M} \rightarrow \mathbb{R}^{d'}$ such that $\sum_{j \in \mathcal{N}_i} f(m_{ij})$ is unique for each $c_i \subseteq \mathcal{M}$.
Then we only need to show that an MLP can express the function $f$ specified by the above lemma. This is a direct consequence of the universal approximation theorem [80]. In other words, there exists a set of MLP parameters such that $\phi(c_i) = \sum_{j \in \mathcal{N}_i} \mathrm{MLP}(m_{ij}) = \sum_{j \in \mathcal{N}_i} f(m_{ij})$, where $c_i = c_i' \iff \phi(c_i) = \phi(c_i')$. So $\phi$ is injective.
Note that the requirement that the input space $\mathcal{M}$ be countable is automatically satisfied in realistic applications, where the message $m$ is quantized to fixed bit widths.
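To make the construction concrete, the following is a minimal PyTorch sketch of the encoder in Equation 6.3; the layer sizes are illustrative assumptions, not the exact configuration used in our experiments. Here an empty receive set is mapped to the zero vector, which is consistent with initializing messages to all 0s.

```python
import torch
import torch.nn as nn

class SumMLPEncoder(nn.Module):
    """Permutation-invariant message encoder: phi(c_i) = sum_j MLP(m_ij) (Eq. 6.3)."""
    def __init__(self, msg_dim: int = 16, hidden_dim: int = 64, out_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(msg_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, received: torch.Tensor) -> torch.Tensor:
        # received: (num_received_msgs, msg_dim); an empty set encodes to the zero vector
        if received.numel() == 0:
            return torch.zeros(self.mlp[-1].out_features)
        return self.mlp(received).sum(dim=0)

# Permutation invariance: shuffling the arrival order leaves the encoding unchanged.
enc = SumMLPEncoder()
msgs = torch.randn(3, 16)                      # messages from 3 neighbors
perm = msgs[torch.randperm(3)]                 # a different arrival order
assert torch.allclose(enc(msgs), enc(perm), atol=1e-5)
```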
Relation with GNNs Our encoder can be seen as a single-layer Graph Neural Network tailored to real-life communication stochasticity. Correspondingly, the "sum of MLPs" in CHALET's Equation 6.3 is the GNN aggregation function, which preserves the theoretical properties while being simple to implement in practice. The communication designs of many related works can also be related to GNNs. For example, TarMAC [48], which performs a "weighted mean" of messages, can be seen as a single-layer Graph Attention Network [62]; IC3Net [38], which performs an "unweighted average", is equivalent to GraphSAGE [81]. Other related works (e.g., MAGIC [49]) deploy multi-layer GNNs that require multi-hop communication in each step. Such communication can be too expensive under the constraints of realistic wireless networks, and thus we restrict CHALET to the single-layer design. Importantly, for all the above related works, their GNN encoders satisfy permutation invariance but not injectiveness. As a result, there can be a significant performance drop due to the aforementioned stochasticity.
6.2.3 Memory-aided Q Generator
Neural Architecture Design The $Q$ generator takes as input the encoded message and the fused observation $e^t$, and feeds their sum into a memory unit (GRU). The GRU stores the hidden state $h^{t-1}$, which contains information from the past few steps, and generates the next hidden state $h^t$. Based on the hidden state, two heads (MLP layers) generate the action-values and a message $m_i^t$ to be potentially transmitted. The augmented action $a_i^t$ with value $Q_i^t$ is also output by taking the argmax of the action-values.
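Putting the three components together, a minimal sketch of one forward step of the Agent Network is shown below. The layer sizes are illustrative assumptions; SumMLPEncoder refers to the encoder sketched in the previous subsection, and a batch dimension of 1 is used for a single agent.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Observation fuser + message encoder + memory-aided Q generator (one agent)."""
    def __init__(self, game_dim=8, wireless_dim=4, msg_dim=16, hidden=64, n_actions=10):
        super().__init__()
        self.game_fc = nn.Linear(game_dim, hidden)          # embed game observation o_i^t
        self.wireless_fc = nn.Linear(wireless_dim, hidden)  # embed wireless observation (RSS)
        self.fuse_fc = nn.Linear(2 * hidden, hidden)        # fused observation e_i^t
        self.encoder = SumMLPEncoder(msg_dim, hidden, hidden)
        self.gru = nn.GRUCell(hidden, hidden)               # memory unit storing h_i^{t-1}
        self.q_head = nn.Linear(hidden, n_actions)          # individual action-values
        self.m_head = nn.Linear(hidden, msg_dim)            # message m_i^t to transmit

    def forward(self, game_obs, wireless_obs, received_msgs, h_prev):
        # game_obs: (1, game_dim); wireless_obs: (1, wireless_dim); h_prev: (1, hidden)
        e = torch.relu(self.fuse_fc(torch.cat(
            [torch.relu(self.game_fc(game_obs)),
             torch.relu(self.wireless_fc(wireless_obs))], dim=-1)))
        z = self.encoder(received_msgs).unsqueeze(0)        # encoded received messages, (1, hidden)
        h = self.gru(e + z, h_prev)                         # new hidden state h_i^t
        return self.q_head(h), self.m_head(h), h            # action-values, m_i^t, h_i^t
```

The greedy action is then the argmax of the returned action-values, and the hidden state is carried over to the next step.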
6.3 Centralized Value Decomposition: Communication-
Enhanced Mixer
In this section, we discuss how to integrate the communication information into the centralized value decomposition process. Note that we call the neural architecture of the decomposition process the "mixer" (in the bottom-up view). Similar to other studies (e.g., VDN, QMIX), the mixer consists of two steps. In step 1, the individual action-values are aggregated into a hidden vector, i.e.,
\[ v = f((Q_1, Q_2, \ldots, Q_N)). \qquad (6.4) \]
In step 2, an MLP further maps the hidden vector $v$ into a scalar value $Q_{tot}$. The properties of the mixer are mainly decided by the first step $f(\cdot)$, where our design differs from the others. Therefore, we only discuss the first step and abuse terminology by not differentiating between $f(\cdot)$ and the "mixer" in this section.
Communication affects collaboration. Different ways of collaborating not only lead to different $Q_i$ values, but should also determine different ways of mixing. Therefore, we need to capture the inductive bias from the dynamic communication structure and design a communication-enhanced mixer.
6.3.1 Design
Graphs are a powerful way to capture versatile kinds of agent interactions, and Graph Neural Networks (GNNs) are optimized for extracting information from graphs. This motivates our design. First, we treat each agent as one node and construct a directed graph with communication actions as the edge connections (see Figure 6.2 below).
Figure 6.2: The architecture of the communication-enhanced Mixing Network (assume $N = 3$): (a) the overall architecture as an $L$-layer GNN model, in which each node $i$ embeds $Q_i$ with per-node weights $W_i^l$ and aggregation parameters $\theta_i^l$, followed by a pooling layer producing $v$; (b) the weights of the GNN in each layer are generated by PEHypernet_A and PEHypernet_B from $\hat{s}_1, \hat{s}_2, \hat{s}_3$.
Node $i$'s feature is the individual action-value $Q_i$. Then, we apply the GNN-based computational model illustrated in Figure 6.2 to the graph. For simplicity, the superscript $t$ is omitted in the figure. The model consists of $L$ layers; one layer corresponds to one round of neighbor aggregation.
We develop the Permutation Equivariant Hypernet (PEHypernet) to generate embedding weights for each agent. "PE" denotes its mathematical property, which is essential for the mixer to satisfy the permutation invariance property (see Section 6.3.2). "Hypernet" means that it generates the weights of another neural network. The input to the PEHypernet for agent $i$ is $\hat{s}_i$ (global information concatenated with agent $i$'s local observation).
In layer $l$, there are two PEHypernets shared across the nodes. For node $i$, they generate the self-embedding weight $W_i^l$ and the neighbor aggregation parameters $\theta_i^l$, respectively. To preserve monotonicity (see Section 6.3.2), the function $\mathrm{absolute}(\cdot)$ is applied on the generated weights and parameters. Let the functions approximated by the two PEHypernets be $\psi_A^l(\cdot)$ and $\psi_B^l(\cdot)$. Then the hidden embedding of $Q_i$ after $l$ layers, $h_i^l$, can be represented as:
\[ h_i^l = \sigma\!\left( W_i^l\, h_i^{l-1} + g\!\left(\{h_j^{l-1} \mid j \in \mathcal{N}_i\};\, \theta_i^l\right) \right) \qquad (6.5) \]
where $W_i^l = \mathrm{abs}(\psi_A^l(\hat{s}_i))$, $\theta_i^l = \mathrm{abs}(\psi_B^l(\hat{s}_i))$, $\sigma(\cdot)$ is an element-wise activation function, $\mathcal{N}_i$ is the set of nodes with incoming edges to node $i$ (also called the neighbors of $i$), and $g(\cdot)$ is the neighbor feature aggregation function parameterized by $\theta_i^l$. In the first layer, $h_i^0 = Q_i$. Note that there is no index $i$ on the functions $\psi_A^l(\cdot)$ or $\psi_B^l(\cdot)$, since all nodes share the same functions while using different inputs $\hat{s}_i$. When $L > 1$, there are multiple rounds of neighbor aggregation, which enables integrating information from multi-hop neighbors.
Finally, a pooling function is applied after the final layer. For example, we can use the sum function so that
\[ v = \sum_i h_i^L \qquad (6.6) \]
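For concreteness, one such GNN mixing layer can be sketched as follows: a minimal single-layer PyTorch sketch with scalar node features $h_i^0 = Q_i$, a simple nonnegative linear neighbor aggregation standing in for $g(\cdot)$, and illustrative layer sizes.

```python
import torch
import torch.nn as nn

class PEMixerLayer(nn.Module):
    """One communication-enhanced mixing layer (Eq. 6.5) for scalar inputs h_i^0 = Q_i."""
    def __init__(self, state_dim=12, hidden_dim=32):
        super().__init__()
        # PEHypernets: shared across agents, generate per-agent weights from s_hat_i.
        self.hyper_A = nn.Linear(state_dim, hidden_dim)   # -> self-embedding weight W_i
        self.hyper_B = nn.Linear(state_dim, hidden_dim)   # -> neighbor-aggregation params theta_i

    def forward(self, q, adj, s_hat):
        # q: (N,) individual action-values; adj: (N, N) 0/1 matrix, adj[i, j] = 1 if j sends to i;
        # s_hat: (N, state_dim) per-agent input to the hypernets.
        W = torch.abs(self.hyper_A(s_hat))                # (N, d), nonnegative for monotonicity
        theta = torch.abs(self.hyper_B(s_hat))            # (N, d), nonnegative for monotonicity
        self_term = W * q.unsqueeze(-1)                   # W_i * Q_i
        neigh_sum = adj @ q                               # permutation-invariant linear g(.)
        h = torch.relu(self_term + theta * neigh_sum.unsqueeze(-1))  # (N, d)
        return h.sum(dim=0)                               # sum pooling over agents -> v (Eq. 6.6)
```

An MLP mapping $v$ to the scalar $Q_{tot}$ completes the mixer, and stacking $L$ such layers enables multi-hop aggregation.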
6.3.2 Mathematical Properties
In this subsection, we show the two mathematical properties of the proposed mixer: permutation invariance and monotonicity.
Theorem 2. The proposed communication-enhanced mixer is permutation invariant w.r.t. $\{Q_i \mid i = 1, 2, \ldots, N\}$, if $g(\cdot)$ is a permutation invariant function.
The proof of Theorem 2 uses the lemma below.
Lemma 2. The weights and parameters generated by the proposed PEHypernet are permutation equivariant w.r.t. $\{\hat{s}_i \mid i = 1, 2, \ldots, N\}$.
Recall that the PEHypernet takes the respective input $\hat{s}_i$ for node $i$ while sharing the network parameters among all nodes. Define $\Psi_A(\cdot)$, the mapping realized by PEHypernet_A, as:
\[ (W_1, \ldots, W_n) = \Psi_A((\hat{s}_1, \ldots, \hat{s}_n)) := (\psi_A(\hat{s}_1), \ldots, \psi_A(\hat{s}_n)). \qquad (6.7) \]
Let a permutation $\pi(\cdot)$ be applied on $(\hat{s}_1, \ldots, \hat{s}_n)$. It is easy to show that
\[ \Psi_A(\pi((\hat{s}_1, \hat{s}_2, \ldots))) = \pi(\Psi_A((\hat{s}_1, \hat{s}_2, \ldots))), \qquad (6.8) \]
i.e., it results in the same permutation on $(W_1, \ldots, W_n)$. Similarly, $(\theta_1, \ldots, \theta_n) = \Psi_B((\hat{s}_1, \ldots, \hat{s}_n))$ is permutation equivariant.
Proof (of Theorem 2). By Equation 6.5, when $l = 1$,
\[ h_i^1 = \sigma\!\left( W_i^1 Q_i + g\!\left(\{Q_j \mid j \in \mathcal{N}_i\};\, \theta_i^1\right) \right) \qquad (6.9) \]
Let a permutation $\pi(\cdot)$ be applied on the indices of the agents. Consequently, $(Q_1, Q_2, \ldots, Q_n)$ and $(\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_n)$ are both permuted with $\pi$. We use a prime to denote the corresponding variables after permutation. By Lemma 2, $W_i'^{\,l} = W_{\pi^{-1}(i)}^l$ and $\theta_i'^{\,l} = \theta_{\pi^{-1}(i)}^l$. Thus,
\[ h_i'^{\,1} = \sigma\!\left( W_i'^{\,1} Q_i' + g\!\left(\{Q_j' \mid j \in \mathcal{N}_i'\};\, \theta_i'^{\,1}\right) \right) = \sigma\!\left( W_{\pi^{-1}(i)}^1 Q_{\pi^{-1}(i)} + g\!\left(\{Q_j' \mid j \in \mathcal{N}_i'\};\, \theta_{\pi^{-1}(i)}^1\right) \right) \qquad (6.10) \]
Let the original adjacency matrix be $A$, and the permutation matrix corresponding to $\pi$ be $P$. Then the adjacency matrix after permutation is $A' = P^{T} A P$ [82], where the multiplications with $P^{T}$ and $P$ represent the row permutation and the column permutation, respectively. By definition, $\mathcal{N}_i'$ is the set of column indices of the nonzero elements in the $i$th row of $A'$.
Therefore,
\[ \{Q_j' \mid j \in \mathcal{N}_i'\} = \{Q_j' \mid j = \pi(k),\ k \in \mathcal{N}_{\pi^{-1}(i)}\} = \{Q_{\pi^{-1}(j)} \mid j = \pi(k),\ k \in \mathcal{N}_{\pi^{-1}(i)}\} = \{Q_k \mid k \in \mathcal{N}_{\pi^{-1}(i)}\} \qquad (6.11) \]
Combining Equation 6.10 and Equation 6.11, we have
\[ h_i'^{\,1} = \sigma\!\left( W_{\pi^{-1}(i)}^1 Q_{\pi^{-1}(i)} + g\!\left(\{Q_k \mid k \in \mathcal{N}_{\pi^{-1}(i)}\};\, \theta_{\pi^{-1}(i)}^1\right) \right) \qquad (6.12) \]
By comparing Equation 6.9 and Equation 6.12, we conclude that $h_i'^{\,1} = h_{\pi^{-1}(i)}^1$ if and only if $g(\cdot)$ is a permutation invariant function. So far we have proved that $(h_1^1, h_2^1, \ldots, h_N^1)$ is permutation equivariant with $(Q_1, Q_2, \ldots, Q_N)$. Iterating from $l = 2$ to $l = L$, we can prove that $(h_1^L, h_2^L, \ldots, h_N^L)$ is permutation equivariant with the input $Q$ values. Further applying a permutation invariant pooling function on the hidden embeddings after $L$ layers then leads to the same output as for the original ordering.
Note that the proposed GNN architecture differs from traditional ones, in which the weights within each layer are shared among all nodes. Instead, we generate distinct weights from the parameter-shared PEHypernet. As we have proved, this architecture maintains the permutation invariance property of traditional GNNs. Hence, the proposed mixer, combined with the innovative message encoder in Section 6.2, consistently guarantees permutation invariance in the whole framework. This yields benefits in sample efficiency: the number of possible permutations increases with the number of agents, and without this property, the model would need to learn from samples that all possible permutations correspond to the same $Q_{tot}$.
Next, we show the other property of the proposed mixer: monotonicity. Recall that restricting the mixer to be monotonic with regard to the individual action-values is a common way to satisfy the Individual-Global-Max (IGM) condition.
Theorem 3. The proposed mixer is monotonic with regard to $\{Q_i \mid i = 1, 2, \ldots, N\}$, if the Jacobian matrix $\frac{\partial g}{\partial h_j^{l-1}}$ is non-negative.
Proof Sketch. In Equation 6.5, applying $\mathrm{absolute}(\cdot)$ to all generated weights and parameters is a feasible realization of monotonicity. Further assume that the activation function $\sigma(\cdot)$ always has nonnegative derivatives (e.g., ReLU, ELU). When the Jacobian matrix $\frac{\partial g}{\partial h_j^{l-1}} \geq 0$, we can apply the chain rule to show that the partial derivatives $\frac{\partial h_i^l}{\partial h_j^{l-1}} \geq 0$ for all $i, j, l$. Therefore $\frac{\partial f}{\partial Q_i} \geq 0$ for all $i$. A simple example of $g(\cdot)$ that satisfies the requirement is a linear aggregation, i.e., a nonnegatively weighted sum of the neighbor embeddings.
6.3.3 Theoretical Analysis
Expressive Power We now analyze the expressive power of our proposed communication-enhanced mixer and compare it with the QMIX mixer.
Theorem 4. The proposed mixer has higher expressive power than QMIX.
Definition 1 (Expressive Power). If the class of functions that a model A can compute is strictly wider than the class of functions that a model B can compute, then we say A has higher expressive power than B.
Proof (of Theorem 4). Without loss of generality, we use $\mathrm{sum}(\cdot)$ as the pooling function and assume the number of layers $L = 1$. We also ignore bias terms in the analysis for simplicity. Thus, the mapping $f(\cdot)$ of our proposed mixer can be written as:
\[ f((Q_1, Q_2, \ldots)) = \sigma\!\left( \sum_i w_i Q_i + \sum_i g\!\left(\{Q_j \mid j \in \mathcal{N}_i\};\, \theta_i\right) \right). \qquad (6.13) \]
In contrast, the aggregation process of QMIX can be represented as:
\[ \hat{f}((Q_1, Q_2, \ldots)) = \sigma\!\left( \sum_i \hat{w}_i Q_i \right). \qquad (6.14) \]
In Equation 6.13, let $w_i = \hat{w}_i$ and $g(\cdot;\, \theta_i) = 0$ for all $i$; then $f((Q_1, Q_2, \ldots)) = \hat{f}((Q_1, Q_2, \ldots))$. This shows that the proposed mixer has at least the same expressive power as QMIX.
On the other hand, let $g(\cdot)$ in Equation 6.13 be a non-linear function (e.g., [83]). It is easy to show that $\hat{f}((Q_1, Q_2, \ldots))$ cannot represent $f((Q_1, Q_2, \ldots))$ in this case, since there is only a linear combination inside the activation function of $\hat{f}(\cdot)$.
Therefore, by Definition 1, our proposed mixer has higher expressive power than QMIX.
In other words, the neighbor aggregation, i.e., the communication enhancement, gives rise to a more general mixing function.
Another difference from the QMIX mixer is that our mixer is proved to be permutation invariant (Theorem 2), which is not the case for QMIX.
Comparison with GNNs The designed mixer $f$ achieves (1) permutation invariance via parameter sharing and (2) feature exchange via neighbor aggregation. It is true that these two principles correspond to the defining features of GNNs. Yet our mixer is not a straightforward application of any existing GNN architecture. The proposed mixer is innovative in the following ways:
(1) We train a neural network (the PEHypernet) to generate the embedding weights and parameters, while a classical GNN directly learns these values. Besides, traditional GNN weights are fixed and independent of the agents' states during inference, which is not the case in our mixer.
(2) The level of parameter sharing is different. Traditionally, within a GNN layer, the weight matrix is shared by all nodes. Taking GraphSAGE as an example, one layer can be represented as
\[ h_i^l = \sigma\!\left( \mathrm{CONCAT}\!\left( h_i^{l-1} W^l,\ \mathrm{MEAN}(\{h_j^{l-1} \mid j \in \mathcal{N}_i\}) \right) \right) \qquad (6.15) \]
Instead, our mixer only shares the architecture (i.e., the PEHypernet) and generates respective weights $W_1^l, W_2^l, \ldots$ and parameters for agents $1, 2, \ldots$. This level of parameter sharing is essential for leveraging state information while maintaining permutation invariance.
(3) Our GNN layers enforce nonnegative weights to approximate a monotonic function.
Therefore, our mixer is a novel architecture customized for the MARL setting with realistic communication. It combines the idea of a Hypernet (i.e., generating neural network weights with a neural network) with a GNN and benefits from both.
6.4 Training Algorithm
As shown in Figure 6.1, our design follows the CTDE paradigm. During centralized training, the Mixing Network optimizes the policy by maximizing the global Q value. During decentralized execution, each agent outputs its next action with its own Agent Network alone, without the Mixing Network. The training algorithm is illustrated in Algorithm 4. Lines 5-8 are executed by every agent: each agent fuses the observations, encodes the received messages and then generates its individual Q value, the message to be transmitted and the $\epsilon$-greedy augmented action. The neural architecture of the Agent Networks follows the designs in Section 6.2. Corresponding to lines 12, 14 and 15, with trajectories sampled from the replay buffer $\mathcal{D}$, the Mixing Network combines each agent's $Q$ value to calculate a global $Q$ for the current step, following Section 6.3.
Algorithm 4 Training algorithm of CHALET
1: Initialize all neural networks and the replay buffer $\mathcal{D}$
2: for episode $= 1$ to $M$ do
3:   Initialize observation $o$; initialize messages $c$ to all 0s
4:   for $t = 1$ to end of episode do
5:     Receive messages $c_i^{t-1}$; get network measurement $o_i^{C,t}$
6:     $z_i^t \leftarrow$ encode the received messages $c_i^{t-1}$
7:     $e_i^t \leftarrow$ fuse $o_i^{C,t}$ with the game observation $o_i^{T,t}$
8:     $a_i^t, m_i^t, Q_i^t \leftarrow$ $\epsilon$-greedy $\arg\max_{a_i} Q_i(e_i^t, z_i^t, a_i)$
9:     Take action $a^{T,t}$ in the game environment
10:    Send $m^t$ by $a^{C,t}$ in the wireless environment
11:    Observe reward $r^t$; transit to the next state $s^{t+1}$
12:    Add the transition to the replay buffer $\mathcal{D}$
13:   end for
14:   Randomly sample a trajectory from $\mathcal{D}$
15:   Update the fuser, encoder, $Q$-generator and mixer parameters $\theta_F, \theta_{enc}, \theta_Q, \theta_{mix}$ and the target parameters $\theta'_Q, \theta'_{mix}$
16: end for
Note that since we use a GRU [84] in the $Q$ Generator, we perform gradient descent on the rolled-out episode instead of on individual steps (as in [51]). Specifically, in line 15, we perform gradient descent on the loss
\[ \mathcal{L}(\theta) = \sum_t \left( y_t^{tot} - Q_{tot}(o, c, a; \theta) \right)^2 \qquad (6.16) \]
where $y_t^{tot} = r + \gamma \max_{a'} Q_{tot}(o', c', a'; \theta')$ and $c$ is the received message. $Q_{tot}$ is the full architecture in Figure 6.1b with parameters $\theta$, and $(o, c, a, r, o', c')$ is transition $t$ in the sampled episode.
6.5 Experiments
In this section, we evaluate our proposed framework using two standard game benchmarks: Predator Prey and Lumberjacks. We also conduct behavioral and ablation studies to understand the learned policies and the effectiveness of the decomposition architecture.
6.5.1 Setup
Wireless Environment We use the same wireless environment setup as in our previous work. Agents perform single-hop broadcasts in a mobile network without Access Points (APs), as in [72]. For the communication model, we adopt the "log-distance path loss" model and further model interference in the wireless network as the receiver hearing multiple signals in range. We also consider background noise and attenuation due to obstacles. For the communication protocol, we adopt slotted p-CSMA [41]. More details can be found in Section 5.2.
MARL Algorithms and Hyperparameters We compare five designs. The first two are VDN [27] and QMIX [28], which are well-known, state-of-the-art value decomposition based algorithms. Note that these two baselines do not consider communication. We then create another two baselines by integrating a communication component into them, i.e., TarMAC+VDN and TarMAC+QMIX. TarMAC is the state-of-the-art design performing attention-based message aggregation. Note that the original TarMAC cannot be used directly as a baseline since it is based on actor-critic rather than value decomposition. The last design is CHALET, the framework we propose, whose implementation follows Section 6.4. We use the same hyperparameters (and search approach) during training as in our previous work.
Game Environment Similarly, we evaluate on two game environments: Predator Prey (PP) and Lumberjacks (LJ). In PP, $n$ predators collaborate to catch 1 prey in a $g \times g$ grid world (denoted as PP$_g$).
To better simulate realistic applications (e.g., SAR with complex field terrain), we further add $k$ obstacles to the field, which attenuate passing signals and block the agents' movement. We denote such an environment as PP$^{Obs}_g$. More details on the PP benchmark, including the observation space, action space and reward, can be found in Section 5.2.
Lumberjacks [75] is a multi-resource spatial coverage problem [76]. In a grid world, $p$ lumberjacks cooperate to chop $q$ trees. Each tree has a "strength" $k$, meaning the tree will be chopped when at least $k$ agents are adjacent to it. We set $k = 2$ for all trees. We modify the original setting [77] by assigning trees a signal attenuation of 4.5. The game terminates once all trees are chopped or the maximum number of steps is reached. We denote by LJ$^{p,q}_g$ a Lumberjacks game within a $g \times g$ grid. More details on the benchmark can be found in Section 5.2.
6.5.2 Performance Comparison with State-of-the-Art
Figure 6.3 shows the learning curves on the standard benchmarks PP and LJ. The first two subplots show the convergence of the number of steps to catch the prey in Predator Prey, and the last two subplots show the convergence of the episode return in Lumberjacks. We set the maximum number of training steps to $4 \times 10^6$ and clip plots (a) and (c) along the x axis once converged, for better visualization.
PP In Figure 6.3a, the environment is relatively simple and all algorithms converge within $1.5 \times 10^6$ training steps. The proposed CHALET, TarMAC-aided QMIX and TarMAC-aided VDN all converge to a similar number of steps to catch the prey. Communication clearly boosts performance compared to the VDN and QMIX baselines. However, CHALET enjoys the fastest convergence speed.
In Figure 6.3b, the environment is more complex due to the increased field and obstacle size, as well as the larger number of agents. TarMAC-aided VDN and TarMAC-aided QMIX can no longer boost performance through communication. However, CHALET still converges to the smallest number of steps, with an even larger performance gap. It also enjoys the fastest convergence speed and small variance. This shows that CHALET is capable of enhancing performance especially in complicated scenarios.
LJ
Figure 6.3: Comparison with state-of-the-art (CHALET, QMIX+TarMAC, QMIX, VDN+TarMAC, VDN): (a) PP with obstacles in a 7 x 7 grid, 3 agents; (b) PP with obstacles in a 10 x 10 grid, 4 agents; (c) LJ with tree shadowing in a 7 x 7 grid, 4 agents, 3 trees; (d) LJ with tree shadowing in a 10 x 10 grid, 5 agents, 3 trees. In (a) and (b) the y axis is the number of steps to catch the prey; in (c) and (d) it is the episode reward; x axis: number of environment steps.
In Figure 6.3c, CHALET attains the largest episode reward, TarMAC-aided VDN cannot converge to a good optimum, and the other three baselines have similar performance in between.
In Figure 6.3d, the environment is more complex, and we fix the number of training steps at $4 \times 10^6$. Again, CHALET achieves the largest episode return compared to the baseline algorithms. We also observe that until 2k steps, CHALET has a similar convergence speed to the QMIX and VDN baselines. But the baselines converge to a local optimum, while CHALET continues to learn and finally outperforms them. In fact, its return is still increasing at the final training steps.
Overall Performance In general, CHALET consistently outperforms the baselines and is especially capable in more complex scenarios. It also enjoys quick convergence and small variance. We observe that for the original value decomposition based algorithms, QMIX is slightly better than VDN, which verifies the importance of monotonicity. But when extending these algorithms to a realistic wireless environment, existing algorithms such as VDN-based TarMAC or QMIX-based TarMAC cannot converge to a good local optimum. Their performance is sometimes even worse than that without communication. One possible reason is that communication failures (caused by limited bandwidth, signal path loss, etc.), stochasticity and dynamicity hinder the learning process. However, our CHALET achieves significantly fewer steps and higher rewards. This demonstrates its success in extending the value decomposition algorithm to the realistic wireless environment, thanks to the comprehensive design across the modeling, decentralized agent network and centralized value decomposition aspects. CHALET also enjoys faster convergence compared to the baseline algorithms (including the ones without communication). This phenomenon is interesting since it contradicts the intuitive hypothesis that adding the communication module and the expanded observation space should complicate the learning process. We believe that the augmented observation actually alleviates the complexity of the realistic environment by providing more information. With a sophisticated design, such information can be effectively aggregated, extracted and understood. Further capturing the communication patterns in the value decomposition process eases the learning of a good communication policy. The permutation invariance in the value decomposition process can also improve sample efficiency, which may explain the fast convergence. In addition, the variances of the CHALET curves are very small, showing CHALET's stability.
6.5.3 Behavior Analysis
Positive signaling and positive listening Positive signaling and positive listening are common metrics to verify the effectiveness of communication in MARL [85].
Figure 6.4: Sample trajectory showing positive listening: (a) all agents' positions when Agent 1 first finds the prey at step 9; (b) all agents' positions at step 10; (c) Agent 2's trajectory for steps 1 to 9; (d) Agent 2's trajectory for steps 10 to 16.
Table 6.1: Reward gain from positive listening

                  r (no-communication variant)   r (CHALET)      r↑      r_ref (step penalty)
PP^{Obs}_{10}     -2.55 ± 0.22                   -1.11 ± 0.23    1.44    -0.1
LJ^{5,3}_{10}     -9.45 ± 0.59                   -8.45 ± 0.64    1.00    -0.1
Positive signaling means the message includes meaningful information about the sender's current observation, and positive listening means the message influences the receiver's behavior. In Figure 6.4, we load a trained model of CHALET in the PP benchmark and visualize consecutive steps to show positive listening. Fig. 6.4a shows the state at step 9. Solid dots represent agents who send messages at this step, while hollow dots represent silent agents. The cross mark represents the prey. We notice that Agent 1 (in blue) finds the prey and broadcasts a message. By checking the recorded receiving
mask matrix, we confirm that Agent 2 (in green) receives the message successfully. In Fig. 6.4b, Agent 2 moves to the right. We take Agent 2 as an example to visualize positive listening. We plot Agent 2's trajectory before and after receiving this message in Fig. 6.4c and Fig. 6.4d, respectively. We can observe that Agent 2 stops exploring and immediately reverses its direction to approach the prey. Since Agent 2 is blind by design, it can only obtain information about the prey from this message. Therefore, the change of behavior results from communication. This visualizes positive listening. A formal way to measure positive listening is by the improvement of the average reward enabled by the communication channel, as suggested by [86]. We tested on both the PP and LJ environments and show the results in Table 6.1. The average reward gain from the communication channel is 1.44 and 1.00, respectively. The gain is significant compared with the reference reward (step penalty). For positive signaling, we check "speaker consistency" [87], i.e., whether the agents are able to generate consistent messages for a specific concept (e.g., finding the prey). We load a trained model of CHALET in the PP benchmark and roll out 2000 episodes. Then we randomly sample 10 messages from 10 episodes, generated when the agent first observes the prey. Similarly, we randomly sample another 10 messages from 10 runs when the prey is not
found. Note that since we use parameter sharing in the training process, the message encoders have the same architecture and weights across the agents. We embed each message with the first layer of the message encoder and compute the cosine similarity of these messages in the latent space.

Figure 6.5: Cosine similarity of messages.

In Fig. 6.5, messages 0-9 correspond to messages sent when the agent observes the prey, while messages 10-19 correspond to the others. It is clear that messages 0-9 have large cosine similarity, while the remaining messages are dissimilar (which is expected due to the random sampling). This shows that the agents are able to learn a meaningful and consistent language. We can conclude that CHALET agents demonstrate both positive signaling and positive listening.
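The speaker-consistency check above can be sketched as follows; this is a minimal sketch in which embed is assumed to be the first layer of the shared message encoder and messages is a stacked batch of sampled messages.

```python
import torch
import torch.nn.functional as F

def message_similarity(messages, embed):
    """Pairwise cosine similarity of messages in the encoder's latent space.

    messages: (M, msg_dim) sampled messages; values near 1 indicate a consistent "concept".
    """
    z = F.normalize(embed(messages), dim=-1)   # unit-normalize each latent embedding
    return z @ z.t()                           # (M, M) cosine similarity matrix
```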
6.5.4 Ablation Study
Figure 6.6: Ablation study of the communication-enhanced value decomposition (CHALET vs. QMIX mixer): (a) PP$^{Obs}_{10}$, 4 agents, steps to catch prey; (b) LJ$^{5,3}_{10}$, episode reward; x axis: number of environment steps.
Communication-Enhanced Mixer We compare the communication-enhanced mixer with the QMIX mixer (Hypernet based). The metric is the number of steps to catch the prey in PP and the number of steps to cut down all trees in LJ. As shown in Figure 6.6, we find that the proposed mixer consistently outperforms the baseline by a significant gap. This verifies our idea that capturing the dynamic communication in the value decomposition process is essential, and it proves the effectiveness of the proposed GNN based design in doing so. We believe the benefit also comes from the nice mathematical properties (permutation invariance leading to improved sample efficiency and thus faster convergence) that are a direct result of our design. There is another way to understand the performance gain from capturing the communication structure in the value decomposition: besides using the current state to generate weight matrices that aggregate the individual $Q_i$'s, we also need to scale each $Q_i$ according to the other agents' $Q$ values. Such information can only be obtained by message passing. Scaling the $Q_i$'s in this way is analogous to a softmax function, since both consider the mutual influence among the original values. But instead of making the outputs sum to 1.0 to approximate a distribution, our message passing is more general and the scaling can be learned through gradient descent.
6.6 Summary
We have proposed a framework that extends value decomposition algorithms for MARL to a realistic communication scenario. We provide a comprehensive design spanning the modeling, the decentralized agent network design, and the centralized value decomposition design. The framework achieves outstanding performance compared to state-of-the-art algorithms in terms of convergence speed and quality. We also provide mathematical analysis of the proposed mixer network and message encoder. Behavioral studies show the effectiveness of the communication policy.
Chapter 7
Conclusions and Future Works
In this chapter, we summarize our contributions and give insights into potential extensions of the thesis work.
7.1 Conclusions
In this dissertation, we have presented our research on enhancing the collaboration among edge
devices running AI applications. For edge devices that perform the pre-configured CV inference
task, we have optimized the scheduling of sub-tasks as well as the communication of intermediate
output data. For the edge devices that act as intelligent reinforcement learning agents, we have pro-
posed a general framework, LAUREL, to learn the application-specific actions as well as communica-
tion strategies under a meta-environment. In addition, we have enhanced the value-decomposition
process to better assess the communication’s impact on the collective success.
For the first setting, where edge devices serve as computation resources, our contributions can be summarized as follows:
• An innovative optimization metric that jointly evaluates the inference throughput and accu-
racy;
• A model partitioning algorithm based on dynamic programming that optimizes the through-
put with respect to computation stages;
• The design of communication load compressors based on (Variational) Auto-Encoders,
which are lightweight and can be seamlessly inserted into the original architectures;
• A bandwidth-adaptive compressor selection algorithm, which is derived from a rigorous analysis of the tradeoff between compression ratio and inference accuracy;
• Mathematical performance analysis with an example Gaussian distributed bandwidth;
• Experimental evaluation on various image datasets, showing consistent performance improvement in the expected effective throughput.
For the second setting, where edge devices serve as intelligent agents, our contributions can be summarized as follows:
• A technique that aligns agents’ interactions in the game and wireless environments, solving
the issue of training nondifferentiability in realistic wireless channels;
• A modeling scheme where we formulate a new POMDP capturing the joint influence;
• Selection of wireless measurements as agents’ observation by leveraging domain knowledge;
• A message encoding architecture that addresses the communication stochasticity and embeds
received messages in a provable lossless way;
• A general framework based on the POMDP, with which the policy is end-to-end learnable;
• A communication-enhanced value decomposition architecture based on an innovative GNN
model;
• Proof of various math properties of the proposed mixer, including permutation invariance,
monotonicity and high expressive power;
• Theoretical comparison with the original mixer as well as traditional GNNs;
• Ablation studies that show the effectiveness of the proposed mixer from integrating commu-
nication action in the credit assignment process;
• Behavioral studies that exhibit effectiveness of the communication by showing “positive
signaling” and “positive listening”;
• Comprehensive empirical evaluations on standard benchmarks showing our framework con-
sistently outperforms state-of-the-art baselines in terms of convergence speed and quality.
7.2 Future Works
While this thesis has significantly improved the collaboration quality on the edge from two perspectives, there are some interesting directions left to explore. In Chapter 4, the target application is CNN inference. However, the Transformer [88] has inspired a series of language models, and bringing NLP inference to the edge is increasingly important. For our work in Chapter 5, we only experimented on game benchmarks in a simulator. New challenges would arise if we were to implement the framework in a robotic system such as a group of UAVs. In Chapter 6, we evaluated our algorithm with homogeneous agents. When it comes to different kinds of agents, we may need to modify the algorithm to better leverage the communication structure. Therefore, we propose the following extensions.
7.2.1 Collaborative LLM inference on the edge
We would like to extend our application from CV to NLP, specifically to inference of Large Language Models (LLMs). Current LLMs with billions of parameters are too large to fit on edge devices. For example, GPT-3 [89] has 175 billion parameters and requires 800 GB to store. A new line of research focuses on reducing the inference budget so that, ultimately, inference need no longer rely on server-level hardware. LLaMA [9] trains a series of smaller models with the number of parameters ranging from 7 billion to 65 billion, and achieves competitive performance compared to the best LLMs. Alpaca [90] fine-tunes the 7-billion-parameter LLaMA model and demonstrates behaviors similar to GPT-3.5. However, even the relatively smaller LLMs are too large for most edge devices. The latest work, MLC-LLM [91], proposes Machine Learning Compilation (MLC), which employs quantization, dynamic shaping, etc., to optimize the computation and memory usage of LLMs so as to support inference on mobile devices. However, extra work is still required to obtain a compact model whose accuracy and inference speed are comparable to the large models when deployed on edge devices.
In addition, the architecture of LLMs puts new constraints on collaborative inference. Most LLMs are based on the Transformer decoder, which consists of N stacks of the same decoding block. Within each decoding block, the multi-head attention has h heads running in parallel, which leads to one-to-multiple connections. There are also skip connections within the decoding block. Therefore, performing model decomposition within the decoding block is likely to incur an extra communication burden. In the text generation task, the LLM repeatedly predicts one token at a time until the sentence is complete. Due to this auto-regressiveness, the current output token is fed back as the next input to the model, which means that if multiple devices collaborate on inference, the device that outputs the final result must have a reliable connection to the device that handles the input; alternatively, the output device must be the same as the input device. Designing an efficient collaborative strategy customized to the LLMs' characteristics would be interesting to explore.
7.2.2 Deployment of LAUREL on UAVs for the resource coverage problem
Our framework LAUREL has been evaluated in a 2D rectangular Grid World where the game observation consists of coordinates. It takes additional work to deploy LAUREL on autonomous robots such as a group of UAVs. First, to guarantee real-world performance, we need to train the agents in a more comprehensive simulator. For example, using Unity [92], we can better imitate the real 3D environment by modeling the physics of the UAVs, the PID control loops, as well as complex terrains. We can use the open-source Unity-based drone simulations (e.g., [93, 94]) as a starting point, combined with our batch wireless communication environment simulator. Second, agents will have visual inputs instead of coordinates as their game observation. This requires efficient CV processing on the drones. Third, the domain transfer of multi-agent reinforcement learning to real-world applications is underexplored in the literature. Significant tuning is needed to ensure the trained model can maintain similar performance after deployment. One widely adopted technique is domain randomization [95]. Although there are some works that deploy MARL algorithms on physical robots [96, 97, 98], either the number of agents or the size of the area is very limited. How to reduce the sim-to-real performance gap for a large population of agents remains an open problem. Moreover, additional wireless network measurements need to be considered based on an understanding of different applications and scenarios. For example, with a large population of robots and a long episode length, queue congestion may hinder the overall performance. Thus, for instance, adding the queue size to the message header may improve communication decisions and message freshness.
We assume that the agents' actions are synchronized and that all message delays are within one step. One interesting direction is to relax these assumptions. We can extend the framework to accommodate asynchronous execution. We can also study how to accommodate heterogeneous message delays due to, e.g., queuing.
7.2.3 Communication-Enhanced Value Decomposition in Heterogeneous
Systems
We assumed homogeneous agents in our experiments on MARL with communication. However, there may be several kinds of agents. For example, in the Search and Rescue application, the different roles of firetruck agents and ambulance agents can lead to different behaviors. In some settings [99], the role can be emergent, i.e., changing with the environment dynamics. Besides, even agents of the same role can still have different navigation/visualization capabilities, as well as different observation and action spaces. It is worth the extra effort to extend our communication-enhanced value decomposition framework to the heterogeneous multi-agent system. First, we can no longer train a universal policy via parameter sharing, so how to improve the training efficiency remains an open problem. Local parameter sharing may be leveraged to facilitate learning within each kind of agent. In message aggregation, we may want to differentiate the messages from different kinds of agents. For example, the message "water depleted" from a firetruck agent may not be important to an ambulance agent. For the communication-enhanced value decomposition process, we may want to capture the (dynamic) roles of the agents, which further differentiates the communications. Currently, we use a GNN based architecture that is permutation invariant. Inspired by existing works such as Graph Attention Networks (GAT) [62] and Graph Transformer Networks (GTN) [100], we can add the attention mechanism to our decomposition architecture.
References
[1] Junchen Jiang, Yuhao Zhou, Ganesh Ananthanarayanan, Yuanchao Shu, and Andrew A
Chien. Networked cameras are the new big data clusters. In Proceedings of the 2019
Workshop on Hot Topics in Video Analytics and Intelligent Edges, pages 1–7, 2019.
[2] Rustem Dautov, Salvatore Distefano, Dario Bruneo, Francesco Longo, Giovanni Merlino,
and Antonio Puliafito. Data processing in cyber-physical-social systems through edge com-
puting. IEEE Access, 6:29822–29835, 2018.
[3] Cecilio Angulo and Ricardo Tellez. Distributed intelligence for smart home appliances. Ten-
dencias de la minería de datos en España. Red Española de Minería de Datos. Barcelona,
España, 2004.
[4] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Distributed deep neu-
ral networks over the cloud, the edge and end devices. In 2017 IEEE 37th international
conference on distributed computing systems (ICDCS), pages 328–339. IEEE, 2017.
[5] Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer. Deepthings: Dis-
tributed adaptive deep learning inference on resource-constrained iot edge clusters. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):2348–
2359, 2018.
[6] Jiachen Mao, Zhongda Yang, Wei Wen, Chunpeng Wu, Linghao Song, Kent W Nixon,
Xiang Chen, Hai Li, and Yiran Chen. Mednn: A distributed mobile system with enhanced
partition and deployment for large-scale dnns. In 2017 IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), pages 751–756. IEEE, 2017.
[7] Jiachen Mao, Xiang Chen, Kent W Nixon, Christopher Krieger, and Yiran Chen. Modnn:
Local distributed mobile computing system for deep neural network. In Design, Automation
& Test in Europe Conference & Exhibition (DATE), 2017, pages 1396–1401. IEEE, 2017.
[8] OpenAI. Gpt-4 technical report, 2023.
[9] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien
Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and
efficient foundation language models, 2023.
100
[10] Xiufeng Xie and Kyu-Han Kim. Source compression with bounded dnn perception loss
for iot edge computer vision. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA, 2019. Association for Computing Machinery.
[11] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and
Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb
model size. arXiv preprint arXiv:1602.07360, 2016.
[12] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[14] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh
Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[15] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing
Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for
mobilenetv3. arXiv preprint arXiv:1905.02244, 2019.
[16] En Li, Zhi Zhou, and Xu Chen. Edge intelligence: On-demand deep learning model co-
inference with device-edge synergy. In Proceedings of the 2018 Workshop on Mobile Edge
Communications, pages 31–36. ACM, 2018.
[17] Yoshitomo Matsubara, Sabur Baidya, Davide Callegaro, Marco Levorato, and Sameer
Singh. Distilled split deep neural networks for edge-assisted real-time systems. In Pro-
ceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges,
pages 21–26. ACM, 2019.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.
[19] Jiasi Chen and Xukan Ran. Deep learning with edge computing: A review. Proceedings of
the IEEE, 107(8):1655–1674, 2019.
[20] Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to
communicate with deep multi-agent reinforcement learning. In NeurIPS, 2016.
[21] Yali Du, Bo Liu, Vincent Moens, Ziqi Liu, Zhicheng Ren, Jun Wang, Xu Chen, and Haifeng
Zhang. Learning correlated communication topology in multi-agent reinforcement learning.
In IFAAMAS, 2021.
[22] J. P. Queralta, J. Taipalmaa, B. C. Pullinen, V. K. Sarker, T. N. Gia, H. Tenhunen, M. Gab-
bouj, J. Raitoharju, and T. Westerlund. Collaborative multi-robot systems for search and
rescue: Coordination and perception. arXiv:2008.12610, 2020.
[23] Ravi Haksar and Mac Schwager. Distributed deep reinforcement learning for fighting forest
fires with a network of aerial robots. In IROS, 2018.
[24] Thanh Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Multi-agent deep reinforcement
learning with human strategies. In IEEE ICIT, 2019.
[25] Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyungh-
wan Son, and Yung Yi. Learning to schedule communication in multi-agent reinforcement
learning. In ICLR, 2019.
[26] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication
with backpropagation. In NeurIPS, 2016.
[27] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot,
N. Sonnerat, J. Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative multi-
agent learning. IFAAMAS, 2018.
[28] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N.
Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep
multi-agent reinforcement learning. In ICML, 2018.
[29] Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, and Yuk Ying Chung. Learning implicit
credit assignment for cooperative multi-agent reinforcement learning. arXiv preprint
arXiv:2007.02529, 2020.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[31] Marian Verhelst and Bert Moons. Embedded deep neural network processing: Algorithmic
and processor techniques bring deep learning to iot and edge devices. IEEE Solid-State
Circuits Magazine, 9(4):55–65, 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural Networks, 1989.
[34] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas
Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon
Whiteson. The starcraft multi-agent challenge. arXiv:1902.04043, 2019.
[35] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and
teaching. Machine learning, 1992.
[36] Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3o: Policy-on policy-off policy
optimization. In Uncertainty in Artificial Intelligence. PMLR, 2020.
[37] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A
survey of learning in multiagent environments: Dealing with non-stationarity. CoRR,
abs/1707.09183, 2017.
[38] Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Individualized controlled con-
tinuous communication model for multiagent cooperative and competitive tasks. In ICLR,
2019.
[39] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent Reinforcement Learn-
ing: An Overview. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
[40] Datuk Prof Ir Ishak Ismail and Mohd Hairil Fitri Ja’afar. Mobile ad hoc network overview.
In IEEE APACE, 2007.
[41] Yi Gai, Shankar Ganesan, and Bhaskar Krishnamachari. The saturation throughput region
of p-persistent csma. In Information Theory and Applications. IEEE, 2011.
[42] Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. On-demand
deep model compression for mobile devices: A usage-driven model selection framework. In
Proceedings of the 16th Annual International Conference on Mobile Systems, Applications,
and Services, pages 389–400. ACM, 2018.
[43] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars,
and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile
edge. In ACM SIGARCH Computer Architecture News, volume 45, pages 615–629. ACM,
2017.
[44] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. Deepmon: Mobile gpu-based
deep learning framework for continuous vision applications. In Proceedings of the 15th
Annual International Conference on Mobile Systems, Applications, and Services, pages 82–
95. ACM, 2017.
[45] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and
William J Dally. Eie: Efficient inference engine on compressed deep neural network. In
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA),
pages 243–254. IEEE, 2016.
[46] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
[47] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun
Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat
games. arXiv:1703.10069, 2017.
[48] Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Mike Rabbat,
and Joelle Pineau. Tarmac: Targeted multi-agent communication. In ICML, 2019.
[49] Yaru Niu, Rohan Paleja, and Matthew Gombolay. Multi-agent graph-attention communica-
tion and teaming. In IFAAMAS, 2021.
[50] Sai Qian Zhang, Qi Zhang, and Jieyu Lin. Efficient communication in multi-agent rein-
forcement learning via variance based control. NeurIPS, 2019.
[51] Rose E Wang, Michael Everett, and Jonathan P How. R-maddpg for partially observable
environments and limited communication. ICML, 2019.
[52] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted qmix:
Expanding monotonic value function factorisation for deep multi-agent reinforcement learn-
ing. Advances in neural information processing systems, 33:10199–10210, 2020.
[53] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran:
Learning to factorize with transformation for cooperative multi-agent reinforcement learn-
ing. In International conference on machine learning, pages 5887–5896. PMLR, 2019.
[54] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal
for decentralized planning. Neurocomputing, 2016.
[55] Thomy Phan, Thomas Gabor, Andreas Sedlmeier, Fabian Ritz, Bernhard Kempter, Cornel
Klein, Horst Sauer, Reiner Schmid, Jan Wieghardt, Marc Zeller, et al. Learning and testing
resilience in cooperative multi-agent systems. In Proceedings of the 19th International
Conference on Autonomous Agents and MultiAgent Systems, pages 1055–1063, 2020.
[56] Ling Pan, Tabish Rashid, Bei Peng, Longbo Huang, and Shimon Whiteson. Regularized
softmax deep multi-agent q-learning. Advances in Neural Information Processing Systems,
34:1365–1377, 2021.
[57] Arbaaz Khan, Ekaterina Tolstaya, Alejandro Ribeiro, and Vijay Kumar. Graph policy gradi-
ents for large scale robot control. In Conference on robot learning, pages 823–834. PMLR,
2020.
[58] Iou-Jen Liu, Raymond A Yeh, and Alexander G Schwing. Pic: permutation invariant critic
for multi-agent deep reinforcement learning. In Conference on Robot Learning, pages 590–
602. PMLR, 2020.
[59] Iou-Jen Liu, Zhongzheng Ren, Raymond A Yeh, and Alexander G Schwing. Semantic
tracklets: An object-centric representation for visual multi-agent reinforcement learning. In
2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages
5603–5610. IEEE, 2021.
[60] Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolutional reinforce-
ment learning. arXiv preprint arXiv:1810.09202, 2018.
[61] Yong Liu, Weixun Wang, Yujing Hu, Jianye Hao, Xingguo Chen, and Yang Gao. Multi-
agent game abstraction via graph attention neural network. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 7211–7218, 2020.
[62] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and
Yoshua Bengio. Graph Attention Networks. International Conference on Learning Repre-
sentations, 2018.
[63] Gregory K Wallace. The jpeg still picture compression standard. IEEE transactions on
consumer electronics, 38(1):xviii–xxxiv, 1992.
[64] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In
Conference on learning theory, pages 907–940, 2016.
[65] Carl Doersch. Tutorial on variational autoencoders, 2016.
[66] Sandeep Chinchali, Apoorva Sharma, James Harrison, Amine Elhafsi, Daniel Kang,
Evgenya Pergament, Eyal Cidon, Sachin Katti, and Marco Pavone. Network offloading
policies for cloud robotics: a learning-based approach. arXiv preprint arXiv:1902.05703,
2019.
[67] Ionut Cardei, Ankur Agarwal, Bassem Alhalabi, Timur Tavtilov, Taghi Khoshgoftaar, and
Pierre-Philippe Beaujean. Software and communications architecture for prognosis and
health monitoring of ocean-based power generator. In 2011 IEEE International Systems
Conference, pages 353–360. IEEE, 2011.
[68] Tom Eccles, Yoram Bachrach, Guy Lever, Angeliki Lazaridou, and Thore Graepel. Biases
for emergent communication in multi-agent reinforcement learning. Advances in neural
information processing systems, 32, 2019.
[69] Yanchao Sun, Ruijie Zheng, Parisa Hassanzadeh, Yongyuan Liang, Soheil Feizi, Sumitra
Ganesh, and Furong Huang. Certifiably robust multi-agent reinforcement learning against
adversarial communication.
[70] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438, 2015.
[71] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein-
forcement learning. Mach. Learn., 1992.
[72] Eduardo Feo Flushing, Michal Kudelski, Luca M Gambardella, and Gianni A Di Caro.
Spatial prediction of wireless links and its application to the path control of mobile robots.
In IEEE SIES, 2014.
[73] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR, 2016.
[74] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[75] Stefano V Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-
response learning method for ad hoc coordination in multiagent systems. arXiv:1506.01170,
2015.
[76] Nitin Kamra and Yan Liu. Differentiable approximations for multi-resource spatial coverage
problems. 2020.
[77] Anurag Koul. ma-gym: Collection of multi-agent environments based on openai gym.
https://github.com/koulanurag/ma-gym, 2019.
[78] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdi-
nov, and Alexander J Smola. Deep sets. In NeurIPS, 2017.
[79] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph
neural networks? ICLR, 2018.
[80] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural Networks, 1989.
[81] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large
graphs. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017.
[82] Computer solution of large linear systems. In G. Meurant, editor, Studies in Mathematics
and Its Applications, volume 28, pages 397–540. Elsevier, 1999.
[83] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? a
dissection on graph classification. arXiv preprint arXiv:1905.04579, 2019.
[84] Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical eval-
uation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555,
2014.
[85] Toru Lin, Jacob Huh, Christopher Stauffer, Ser Nam Lim, and Phillip Isola. Learning to
ground multi-agent communication with autoencoders. Advances in Neural Information
Processing Systems, 34:15230–15242, 2021.
[86] Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. On the
pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168, 2019.
[87] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega,
DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for
multi-agent deep reinforcement learning. In International conference on machine learning,
pages 3040–3049. PMLR, 2019.
[88] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural
information processing systems, 30, 2017.
[89] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
guage models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
[90] Alpaca: A strong, replicable instruction-following model. https://crfm.stanford.edu/2023/03/13/alpaca.html. Accessed: 2023-05-31.
[91] Mlc-llm. https://mlc.ai/mlc-llm/. Accessed: 2023-05-31.
[92] Unity. https://unity.com/. Accessed: 2023-05-31.
[93] Unity drone simulator. https://github.com/UAVs-at-Berkeley/UnityDroneSim. Accessed: 2023-05-31.
[94] D.r.o.n.e. https://unity.com/madewith/drone. Accessed: 2023-05-31.
[95] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel.
Domain randomization for transferring deep neural networks from simulation to the real
world, 2017.
[96] Yuan Gao, Junfeng Chen, Xi Chen, Chongyang Wang, Junjie Hu, Fuqin Deng, and Tin Lun
Lam. Asymmetric self-play-enabled intelligent heterogeneous multirobot catching system
using deep multiagent reinforcement learning. IEEE Transactions on Robotics, 2023.
[97] Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, TK Satish Kumar, Sven
Koenig, and Howie Choset. Primal: Pathfinding via reinforcement and imitation multi-
agent learning. IEEE Robotics and Automation Letters, 2019.
[98] Huy Xuan Pham, Hung Manh La, David Feil-Seifer, and Aria Nefian. Cooperative
and distributed reinforcement learning of drones for field coverage. arXiv preprint
arXiv:1803.07250, 2018.
[99] Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. Roma: multi-agent rein-
forcement learning with emergent roles. In Proceedings of the 37th International Confer-
ence on Machine Learning, pages 9876–9886, 2020.
[100] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph
transformer networks. Advances in neural information processing systems, 32, 2019.