Learning, Adaptation and Control to Enhance Wireless Network Performance
by
Shangxing Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2018
Copyright 2018 Shangxing Wang
Dedication
To my beloved family.
Acknowledgments
This thesis would not have been possible without the help of many people. I would
like to take this opportunity to express my sincere gratitude and appreciation for
their contribution.
First, I would like to thank my advisor, Prof. Bhaskar Krishnamachari. He
introduced me to a cool and fun research field and opened my mind to a world of
opportunities. He gave me a lot of freedom to explore interesting topics and showed
me how to do high-quality work. I have benefited greatly from many illuminating
discussions with him. His dedicated advice, encouragement and continuous support
helped me thrive throughout my PhD study. On a personal level, he also inspires
me with his hardworking and passionate attitude. I could not have imagined having
a better advisor for my PhD study.
Besides my advisor, I would like to thank Prof. Andrea Gasparri and Prof.
Nora Ayanian for the stimulating discussions and insightful advice on my research
projects. I would also like to thank the rest of my thesis committee, Prof. John
Silvester and Prof. Leana Golubchik, for their valuable feedback and contribution
to my thesis. I would also like to thank Prof. Edmond Jonckheere for serving on my
qualifying exam committee.
Special thanks go to my other research collaborators. I want to thank Dr.
Arman (MHR) Khouzani and Dr. Fan Bai for their constructive feedback and
suggestions. I also thank my fellow labmates Pedro Henrique Gomes and Hanpeng
Liu for the stimulating discussions, and for the sleepless nights when we scrambled
to make deadlines. This thesis benefits enormously from all of them.
At the University of Southern California, I received tremendous support from
various faculty, staff, colleagues, friends and the graduate school. My heartfelt
thanks go to Prof. Ali A Zahid, Prof. Armand Rene Tanguay and Prof. Yan Liu
for offering me the opportunity to work with them as a teaching assistant. I would
also like to thank our EE staff Diane Demetras, Tim Boston and Shane Goodoff
for offering prompt help. I thank my fellow labmates in ANRG for their continued
help and support in research and life. They made my experience in graduate school
exciting and fun, and I feel very lucky to be part of such a wonderful group. I also
thank all my friends for always staying with me during the ups and downs. I also
extend my appreciation to the Annenberg Fellowship Program and the EE department
for supporting tuition and offering stipends during my study.
Last but not least, I owe my deep gratitude to my beloved family: my mom,
grandparents, aunts, uncles and cousins for their unconditional love and support
throughout my life.
Table of Contents
Dedication
Acknowledgments
List Of Figures
List Of Tables
Abstract
Chapter 1: Introduction
1.1 Autonomous Robots
1.2 Online Learning
1.3 Contributions
1.3.1 Robotic Message Ferrying
1.3.2 Robotic Network Deployment
1.3.3 Dynamic Spectrum Access
Chapter 2: Background
2.1 Backpressure Scheduling
2.2 Decision Making with Prior Knowledge
2.2.1 Markov Decision Process
2.2.2 Partially Observable Markov Decision Process
2.3 Decision Making without Prior Knowledge
2.3.1 Q-Learning
2.3.2 Deep Reinforcement Learning
Chapter 3: Related Work
3.1 Robotic Message Ferrying
3.2 Robotic Network Deployment
3.3 Dynamic Multichannel Access
Chapter 4: Robotic Message Ferrying for Wireless Networks using Coarse-Grained Backpressure Control
4.1 Problem Formulation
4.2 Capacity Analysis
4.3 Coarse-Grained Backpressure Control
4.3.1 Capacity Region under finite velocity and epoch duration
4.3.2 Coarse-grained Backpressure-based Message Ferrying
4.4 Epoch Adaptive CBMF
4.5 Structural Properties and Delay Performance of CBMF Algorithm in a Homogeneous Network
4.5.1 Structural Properties
4.5.2 Delay Analysis
4.6 Simulation And Evaluation
4.7 Conclusion
Chapter 5: The Optimism Principle: A Unified Framework for Optimal Robotic Network Deployment in An Unknown Obstructed Environment
5.1 Problem Formulation
5.1.1 Link Quality Metric
5.1.2 Mobility, Sensing and Environment Assumptions
5.1.3 Objective Function
5.2 OnLinE RObotic Network FormAtion (LEONA)
5.3 Case Study I: Finding Minimized ETX Path
5.3.1 Analysis of the Sufficient Searched Area
5.3.2 Simulation Results
5.4 Case Study II: Finding Maximized Transmission-Rate Path
5.4.1 Analysis of the Sufficient Searched Area
5.4.2 Simulation Results
5.5 Conclusion
Chapter 6: Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks
6.1 Problem Formulation
6.2 Myopic Policy and Whittle Index
6.2.1 Myopic Policy
6.2.2 Whittle Index Based Heuristic Policy
6.3 Deep Reinforcement Learning Approach
6.4 Optimal Policy for Known Fixed-Pattern Channel Switching
6.5 Experiment and Evaluation of Learning for Unknown Fixed-Pattern Channel Switching
6.5.1 DQN Architecture
6.5.2 Single Good Channel, Round Robin Switching Situation
6.5.3 Single Good Channel, Arbitrary Switching Situation
6.5.4 Multiple Good Channels Situation
6.6 Experiment and Evaluation of DQN for More Complex Situations
6.6.1 Perfectly correlated scenario
6.6.2 Real data trace
6.6.3 Multi-User Scenario
6.6.4 Practical Issues
6.7 Adaptive DQN for Unknown, Time-Varying Environments
6.8 Conclusion
Chapter 7: Conclusion and Open Questions
7.1 Extensions on Robotic Message Ferrying
7.2 Extensions on Robotic Network Deployment
7.3 Extensions on Dynamic Multichannel Access
References
List Of Figures
2.1 MDP illustration
2.2 Deep neural network
4.1 A network containing 2 pairs of source and sink nodes and 4 robots
4.2 Capacity region for a problem with 3 robots and 2 flows
4.3 Delay as we vary v for T = 100 (left) and delay as we vary T for v = 8√2 (right) for 20-Flows-30-Robots network
4.4 Delay (left) and Epoch Duration (right) comparison of the Epoch Adaptive CBMF Algorithm with a non-adaptive scheme for 20-Flows-30-Robots network
4.5 Delay of as we vary for homogeneous 20-Flows-10-Robots network (left) and 20-Flows-30-Robots network (right)
4.6 Delay of as we vary T for a homogeneous 20-Flows-10-Robots network (left) and 20-Flows-30-Robots network (right)
5.1 Comparisons of (·)_{i,j} derived from sigmoidal, Q and exponential functions
5.2 ETX (left) and moving steps (right)
5.3 Illustration of robot configurations: strong wall attenuation (left) and weak wall attenuation (right)
5.4 Transmission rate (left) and moving steps (right)
5.5 Illustration of robot configurations: strong wall attenuation (left) and weak wall attenuation (right)
6.1 Running time (seconds) of the POMDP solver as we vary the number of channels in the system
6.2 Gilbert-Elliot channel model
6.3 A capture of a single good channel, round robin switching situation over 50 time slots
6.4 Average discounted reward as we vary the switching probability p in the single good channel, round robin switching
6.5 A capture of a single good channel, arbitrary switching situation over 50 time slots
6.6 Average discounted reward as we vary the switching order in the single good channel, arbitrary switching
6.7 A capture of a multiple good channels situation over 50 time slots
6.8 Average discounted reward as we increase the number of good channels in the multiple good channels situation
6.9 Average discounted reward for 6 different cases. Each case considers a different set of correlated channels
6.10 Average maximum Q-value of a set of randomly selected states in 6 different simulation cases
6.11 Channel utilization of 8 channels in the testbed
6.12 Average discounted reward as we vary the number of users in the multiple-user situation
6.13 Average discounted reward as we vary the channel switching pattern situations
6.14 Average discounted reward in real time during training in unknown fixed-pattern channel switching
List Of Tables
4.1 List of main notations in the problem formulation
6.1 List of DQN hyperparameters
6.2 Performance on real data trace
Abstract
With the arrival of Internet of Things (IoT), today's wireless networks have de-
veloped into a heterogeneous complex system with a massive collection of various
devices such as smart phones, wearables, UAVs and autonomous vehicles. Often
operated in complex, dynamic and even unknown environments, wireless networks
require a new design of algorithms and protocols to overcome the challenges of
heterogeneity, complexity and uncertainty. Inspired by the rapid development and
huge success in the fields of robotics and online learning, we believe an AI-assisted
approach, by incorporating autonomous robots and online learning into wireless
network operation, is the key to realizing the promise of high data rate, low latency,
ultra reliability and low energy consumption in such a complicated system.
In this thesis, we apply autonomous robots and online learning, in particular
deep reinforcement learning, to different networking problems, and see how learning,
adaptation and control can help enhance wireless network performance. On
one hand, robots can serve as intelligent mobile relays in wireless networks, and
their controllable mobility provides a new design dimension to improve network
performance including throughput and delay. Moreover, because of their ability
to explore and learn, these smart robotic relays can rapidly form a network with
desired performance, and adapt themselves in dynamic and even unknown envi-
ronments. On the other hand, deep reinforcement learning provides an end-to-end
approach for many networking problems that require sequential decision making in
dynamic, unknown environments. As a combination of deep learning and reinforcement
learning, deep reinforcement learning allows one to quickly make decisions
on the fly and continuously improve decision-making through interactions with the
environment, even in complicated, high-dimensional problems with large state spaces.
We first consider using autonomous robots as message ferries to help transmit
data between statically-placed source and sink pairs. Guided by the capacity region
for this problem under ideal conditions, we show how robots could be scheduled to
satisfy any arrival rate in the capacity region, given prior knowledge about arrival
rate. We then consider the more practical setting where the arrival rate is unknown
and present a coarse-grained backpressure message ferrying algorithm (CBMF) for
it. In CBMF, robots are matched to sources and sinks once every epoch to maximize
a queue-differential-based weight. The matching controls both motion and
transmission for each robot. We show through analysis and simulations the condi-
tions under which CBMF can stabilize the network. We also propose a heuristic
approach so that CBMF can adapt the epoch duration to improve end-to-end delay
while guaranteeing network stability at the same time. This adaptive approach can
also detect changes and adjust itself in dynamic, time-varying environments.
Second, we study the use of robots for autonomously building a multi-hop communication
network in an unknown, obstructed environment. Such a capability is
important for search and rescue and remote exploration applications. We propose a
unified framework, onLinE rObotic Network formAtion (LEONA), that is general
enough to permit optimizing the communication network for different utility functions,
even in non-convex settings. We demonstrate and evaluate this framework
in two specific scenarios and show that LEONA can significantly reduce the resources
that would otherwise be spent exploring and mapping the entire region prior to
network optimization.
Third, we study a dynamic multichannel access problem, where multiple
correlated channels follow an unknown joint Markov model and a user selects
channels to transmit data over time. The objective is to find a policy that maximizes
the expected long-term number of successful transmissions. We formulate the
problem as a partially observable Markov decision process (POMDP) with unknown
system dynamics. Inspired by the idea of deep reinforcement learning, we implement
a Deep Q-Network (DQN) to overcome the challenges of unknown dynamics
and prohibitive computation. We analytically study the optimal policy for
fixed-pattern channel switching and show through simulations that DQN can achieve the
same optimal performance without any prior knowledge. We also show that DQN can
achieve near-optimal performance in more complex situations, using both simulations
and a real data trace. Finally, we propose an adaptive DQN approach with
the capability to adapt its learning in time-varying scenarios.
Chapter 1
Introduction
Over the past two decades, wireless networks have undergone a dramatic revolution.
"The smartphone-centric networks of yesteryears are gradually morphing
into a massive Internet of Things (IoT) ecosystem [14, 52, 64, 91] that integrates
a heterogeneous mix of wireless-enabled devices ranging from smart-phones, to
drones, connected vehicles, wearables, sensors, and virtual reality apparatus." [53]
Such a huge corpus of devices makes the wireless network extremely heterogeneous,
complex and dynamic. The design and implementation of a next-generation
wireless network that meets the needs and promises of high data rate, low latency,
ultra reliability and low energy consumption has become unforeseeably challenging.
The key is to integrate Artificial Intelligence (AI), i.e., to leverage cutting-edge
theories and technologies from autonomous robots and online learning, across the
entire wireless ecosystem, so that the wireless network has the capability to learn and
adapt, cope with the ongoing and rapid evolution of wireless services, and meet
unprecedentedly diverse requirements in such highly dynamic environments.
1.1 Autonomous Robots
Today, we are entering an era of developing robots that can coordinate autonomously
(that is, with no human input) to complete team objectives. The main motivation
for using such multi-agent systems is that, for many applications, a robot team can
accomplish tasks much more quickly and effectively than a single unit. Applications
flourish in all walks of life: a team of autonomous robots can be deployed
to do search and rescue to mitigate disasters [22, 33]; networked robots are ideal
and critical ingredients in environmental observation and ecological monitoring
[7, 16]; domestic and personal robots are becoming more common in homes and
workplaces, and it is natural to anticipate these robots working together in the
future. The fast developments in the theory and practical realization of robotics open
up the possibility of applying a multi-agent system to wireless networks to enhance
network performance, especially in dynamically changing environments.
The seminal work by Grossglauser and Tse [30] demonstrated that the use of mobility
can dramatically increase the performance of wireless networks. Various types
of mobility have been extensively studied to improve network performance. These
mobility models can be broadly classified as: uncontrollable and unpredictable
mobility, predictable but uncontrollable mobility, and controllable (and thus
predictable) mobility. The uncontrollable and unpredictable mobility model has
been widely studied, mostly because it resembles the way humans and
animals move with mobile devices. Based on the idea of "store-and-forward", messages
are carried by mobile elements that move randomly in the network, and are
transmitted to others during opportunistic encounters [30, 38, 77]. The predictable
but uncontrollable mobility model is gaining more attention with the growing popularity
of vehicular networks, where the movements of vehicles often follow some
predefined schedules. Sensor nodes mounted on vehicles can learn and predict the
encounter probability with other mobile nodes, which can be used to improve data
transmission [10, 11, 89, 96]. However, schemes designed based on these two mobility
models are restricted by the limitation that mobility is uncontrollable, so they
may not fully realize the wireless network's potential and thus cannot meet the
demanding requirements of the future.
With the rapid development of technology in robotics, controllable mobility
has become readily available for mobile devices. One can envision that controlled
mobility may provide dramatic improvements in network performance. One such
scenario is the autonomous deployment of mobile sensors in a dynamic environment
for surveillance. Since the environment is changing, the controlled mobility ability
of sensors can allow them to move and change locations to form a better network
according to the environment conditions. Controlled mobility can also make a
network more robust. As mobile sensors are battery-constrained, some of them
may stop working. Instead of losing connectivity of the entire network because of
dead sensors, the remaining mobile sensors can still maintain connectivity through
dynamic re-configuration utilizing their mobility.
Recent rapid developments in robotics provide us with the full potential of
controllable mobility of agents, and the underlying techniques belong to the field
of "Robotic Wireless Network". While there are many successful embodiments of
networked robots with applications to the manufacturing industry, the defense
industry, space exploration, domestic assistance, and civilian infrastructure, there are
significant challenges that have to be overcome [46]. The problem of coordinating
multiple autonomous units in wireless networks lies at the intersection of control
and communication. On one hand, autonomous robots are often deployed to form
a wireless network in a harsh environment without any available communication
infrastructure. Intelligently controllable movement of these robotic sensors is the
key to providing satisfactory network performance. On the other hand, while the
control of robots is fundamental for the task of building a network, communication
is critical for the cooperation among robots. Robots, when working on some task
together, need to share information with each other as well as to relay information
to and from outside operators. One of the main goals of our work is to fully
utilize the interplay between control and communication (or networking), and
harmoniously co-optimize these two factors to improve network performance.
1.2 Online Learning
Online learning is a promising approach to achieving AI. Unlike traditional offline
approaches that try to model a problem based on theoretical assumptions
and analysis, online learning provides an end-to-end approach to discover the potential
underlying relationships between the input and the output merely from
observations and keep improving over time. These days, benefiting from the superior
computation power of GPUs and the huge amounts of data generated by the
Internet, researchers have made tremendous progress in the field of online learning
and machine learning. Applications in daily life range from personalized
recommendations to machine translation, healthcare diagnostics, voice assistants
and autonomous driving.
Along with the fast developments in the fields of online learning and wireless
networks, researchers have seen a convergence of these two fields. On one hand, the
massive number of users and devices in a wireless network generates a huge amount
of data. With the help of online learning and machine learning, we can collect
meaningful and important information from raw data and utilize it to create new
user-centric businesses and applications as well as to improve network optimization
across all layers of wireless networks.
On the other hand, pervasive resource allocation and management is essential in
today's communication and networking world. Examples include video streaming
on mobile phones, which requires deliberate data rate adjustment due to limited
bandwidth, and dispersed computing, which needs careful task allocation and scheduling
over wirelessly connected servers, especially when dealing with big data and machine
learning. The problem of how to make efficient use of limited resources
is open and challenging. Traditionally, researchers often propose heuristic methods
based on simple assumptions and painstakingly test and adjust them manually
in practice. This conventional approach does not work well in today's world, as
systems become complicated and are often deployed in time-varying and even unknown
environments. Not only does it become impossible to create useful simple
models for these systems, it is also expensive to repeat the entire process again if
the environment changes. To overcome these challenges, it makes sense to deploy
systems that have the ability to learn by themselves and make adjustments when
detecting changes. Online learning, especially when combined with deep learning,
provides an end-to-end approach for systems to learn and optimize by themselves
through interactions with their environment in an online fashion. It has shown
success in many complicated real-world problems, such as robotics control, playing
the game of Go and data center cooling. Inspired by the fast development in
online learning, we apply it to the field of wireless networks and aim to build intelligent
systems that can become aware of changes and continuously learn and adapt
in real time to improve network performance.
1.3 Contributions
In this thesis, we explore the potential of using autonomous robots as well as online
learning to see how Learning, Adaptation and Control can enhance wireless
network performance. This AI-assisted approach provides networks with the capa-
bility to learn and adapt in complicated dynamic and even unknown environments.
1.3.1 Robotic Message Ferrying
Grossglauser and Tse [30] showed that the use of delay tolerant mobile communications
can dramatically increase the capacity of wireless networks by providing ideal
constant throughput scaling with network size at the expense of delay. However,
though the idea of message ferrying using controllable mobility nodes dates back
to the work by Zhao and Ammar [99], nearly all the work to date has focused on
message ferrying in intermittently connected mobile networks where the mobility
is either unpredictable, or predictable but uncontrollable. With the rapidly grow-
ing interest in multi-robot systems, we are entering an era where the position of
network elements can be explicitly controlled in order to improve communication
performance.
In Chapter 4, we explore the fundamental limits of robotically controlled message
ferrying in a wireless network. We consider an environment in which a set
of K pairs of static source and sink wireless nodes communicate, not directly
with each other (possibly because they are located far from each other and hence
cannot communicate with each other at sufficiently high rates), but through a set
of N controllable robots. We assume there is a centralized control plane responsible
for scheduling robots. Because it collects only queue state information about all
network entities, this centralized plane can be created relatively inexpensively, either
using infrastructure such as cellular / WiFi, or through a low-rate multi-hop
mesh overlay.
We mathematically characterize the capacity region of this system, considering
ideal (arbitrarily large) settings with respect to robot mobility and scheduling durations.
Our analysis shows that with N = 2K robots the system can be made to
operate at full capacity (arbitrarily close to the throughput that could be achieved if
all sources and sinks were adjacent to each other). We show how any desired traffic
that is within the capacity region of this network can be served stably if the data
arrival rate is known to the scheduler. We then consider how to schedule the robots
when the arrival rate is not known a priori. For this case, we propose and evaluate
a queue-backpressure based algorithm, the Coarse-grained Backpressure-based
Message Ferrying (CBMF) algorithm, which is coarse-grained
in the sense that robot motion and relaying decisions are made once every
fixed-duration epoch. We show that as the epoch duration or robot velocity increases,
the throughput performance of this algorithm rapidly approaches that of the ideal
case. In addition, to improve the performance of the CBMF algorithm in practice,
we design a heuristic scheme that adapts network settings
according to network conditions to reduce delay while guaranteeing throughput
performance. Finally, to gain insights on a practical implementation, we study the
structural properties and corresponding performance of the CBMF algorithm in a
homogeneous network where all source-sink pairs have the same data arrival rate
and delivery distance.
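As a rough illustration of the per-epoch decision described above, the following Python sketch assigns each robot one role for the coming epoch by maximizing a sum of queue-differential weights. It is only a sketch under stated assumptions, not the CBMF algorithm as specified in Chapter 4: the function name `cbmf_epoch_assignment`, the role encoding, the specific weight formulas, the one-robot-per-role restriction and the brute-force search are all illustrative choices that do not come from the thesis.

```python
from itertools import permutations


def cbmf_epoch_assignment(source_q, robot_q):
    """Choose one role per robot for the next epoch (illustrative sketch).

    source_q[i]   : backlog at source i (packets of flow i)
    robot_q[j][i] : packets of flow i currently carried by robot j
    Roles: ("src", i) = collect from source i, ("sink", i) = deliver to sink i,
           ("idle", None) = do nothing. Each non-idle role goes to at most one
           robot (an assumption of this sketch).
    """
    num_flows = len(source_q)
    num_robots = len(robot_q)
    roles = [("src", i) for i in range(num_flows)]
    roles += [("sink", i) for i in range(num_flows)]
    roles += [("idle", None)] * num_robots  # padding so every robot gets a role

    def weight(j, role):
        kind, i = role
        if kind == "src":
            # Assumed backpressure weight: source backlog minus robot backlog.
            return max(source_q[i] - robot_q[j][i], 0)
        if kind == "sink":
            # Assumed: the sink holds no backlog, so the differential is the
            # robot's own backlog for that flow.
            return robot_q[j][i]
        return 0

    best_value, best_assignment = -1, None
    # Brute force over all ways to give distinct roles to the robots.
    for perm in permutations(range(len(roles)), num_robots):
        value = sum(weight(j, roles[r]) for j, r in enumerate(perm))
        if value > best_value:
            best_value = value
            best_assignment = [roles[r] for r in perm]
    return best_value, best_assignment
```

The exhaustive search is only practical for a handful of robots; a polynomial-time bipartite matching routine would be the natural replacement at larger scale.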
1.3.2 Robotic Network Deployment
In Chapter 5, we consider the problem of deploying a team of robots in an unknown,
obstructed environment to form a multi-hop communication network. In obstructed
environments, such as inside buildings or outdoors in forested areas, not only is there
a concern with moving efficiently through the environment while avoiding obstacles
and walls, but the communication channels are also cluttered and highly varying due
to signal attenuation (shadowing) and multi-path scattering (fading).
With the exception of some recent work (e.g., [90]), most research to date on
robot network deployments has assumed idealized communication models such as
the unit disk model. Further, the problem of network formation has typically been
treated assuming convex utility functions that can be optimized through localized
potentials [61] and greedy distributed gradient descent algorithms [93]. While
radio propagation models such as the simple path loss model yield convex optimization
problems in unobstructed environments, the presence of walls introduces
non-convexities. We argue that a tractable perspective for a more realistic communication
environment is to consider a graph theoretic formulation in which vertices
correspond to the set of all possible (discretized) locations for the robots in the given
environment, and there are labels on the edges between the vertices that indicate
the RF path loss (or a monotonic function thereof) between the corresponding positions.
The network formation problem becomes one of finding subgraphs of this
graph that satisfy the constraint that the number of nodes in the selected subgraph
must be equal to or less than the number of available robots, and maximize a desired
utility function. Such a problem can then be solved using a suitable centralized or
decentralized graph algorithm (e.g., the Bellman-Ford algorithm to compute the
minimum cost path) to yield the optimal configuration of the robotic network.
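To make the graph formulation concrete, the sketch below runs a hop-limited variant of Bellman-Ford on a small location graph. It assumes that edge labels are additive link costs, that links are symmetric, and that a path with at most `num_robots` intermediate relay locations (hence at most `num_robots + 1` links) is wanted. The function name, the edge-list representation and these modeling choices are illustrative assumptions, not details taken from the thesis.

```python
import math


def min_cost_relay_path(num_locations, edges, src, dst, num_robots):
    """Cheapest src->dst path using at most num_robots relay locations.

    edges is a list of (u, v, cost) tuples for undirected links.
    Returns (cost, path); path is [] if dst is unreachable within the limit.
    """
    max_edges = num_robots + 1
    # cost[h][v]: cheapest cost of reaching v with at most h links.
    cost = [[math.inf] * num_locations for _ in range(max_edges + 1)]
    # parent[h][v]: predecessor of v in layer h, or None if the value was
    # carried over unchanged from layer h - 1.
    parent = [[None] * num_locations for _ in range(max_edges + 1)]
    cost[0][src] = 0.0

    for h in range(1, max_edges + 1):
        cost[h] = cost[h - 1][:]
        for u, v, w in edges:
            for a, b in ((u, v), (v, u)):  # relax both directions
                if cost[h - 1][a] + w < cost[h][b]:
                    cost[h][b] = cost[h - 1][a] + w
                    parent[h][b] = a

    if math.isinf(cost[max_edges][dst]):
        return math.inf, []

    # Walk back through the layers to recover the path.
    path, v = [dst], dst
    for h in range(max_edges, 0, -1):
        u = parent[h][v]
        if u is not None:
            path.append(u)
            v = u
    path.reverse()
    return cost[max_edges][dst], path
```

Keeping one cost layer per hop count is what enforces the robot budget; plain Bellman-Ford without the layers would return the unconstrained minimum cost path.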
While general enough to handle many non-convex network optimization problems
such as minimum cost path formation in environments with arbitrary link
qualities (which is not possible using a purely distributed potential-based approach
in obstructed environments), the implementation of this graph-theoretic approach
in practice faces one significant hurdle: it requires prior mapping of the area to
determine the link qualities for every pair of locations. This could be prohibitively
time-consuming. We address this challenge with an innovative online, iterative
approach based on the principle of "Optimism in the Face of Uncertainty", inspired
by similar ideas in the domain of online learning and multi-armed bandits [75].
We propose a unified framework, called LEONA (for onLinE rObotic Network formAtion),
that is general enough to allow optimization for different utility functions in
non-convex environments. The crux is the following. At each iteration, an optimistic
prediction of the graph edge weights (link qualities) is maintained, i.e., it is ensured
that the predicted link quality is no worse than the true link quality. The robots then
move through the environment to the network configuration computed to be optimal based
on the predicted graph. As they move through the environment, the robots collaborate to
take additional measurements of the link qualities. These measurements, and potentially
additional inferences derived from these measurements (see footnote 1), are used to
update the predicted graph to a new set of values that are still ensured to be optimistic
(though now a bit "closer" to the true graph because of the updates). The iterations
continue until the robots are at a configuration whose measured utility is as good as the
best possible configuration in the current predicted graph, which can then be shown to be
provably optimal because of its optimistic bias.
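
The iteration described above can be illustrated with a minimal sketch for the minimum cost path case. This is not the LEONA implementation: the function names, the toy graph and the way a "measurement" is simulated (by looking up a hidden table of true costs) are all assumptions made for illustration. The only property taken from the text is that every predicted link cost is optimistic, i.e., never larger than the true cost.

```python
import math


def bellman_ford_path(nodes, costs, src, dst):
    """Minimum-cost path from src to dst; costs maps a directed edge (u, v) to a cost >= 0."""
    dist = {n: math.inf for n in nodes}
    prev = {}
    dist[src] = 0.0
    for _ in range(len(nodes) - 1):
        changed = False
        for (u, v), c in costs.items():
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                prev[v] = u
                changed = True
        if not changed:
            break
    path = [dst]
    while path[-1] != src:  # assumes dst is reachable from src
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[dst]


def optimistic_min_cost_path(nodes, true_costs, optimistic_costs, src, dst):
    """Plan on optimistic costs, measure the links used, repeat until nothing better is predicted."""
    predicted = dict(optimistic_costs)  # invariant: predicted[e] <= true_costs[e]
    measured = set()
    iterations = 0
    while True:
        iterations += 1
        path, predicted_cost = bellman_ford_path(nodes, predicted, src, dst)
        edges = list(zip(path, path[1:]))
        for e in edges:  # hypothetical measurement step: reveal the true link cost
            predicted[e] = true_costs[e]
            measured.add(e)
        measured_cost = sum(true_costs[e] for e in edges)
        # The optimistic optimum lower-bounds every true path cost, so matching it is optimal.
        if measured_cost <= predicted_cost + 1e-9:
            return path, measured_cost, iterations, len(measured)


# Toy example (all numbers hypothetical): a wall makes A-D and C-D much worse than predicted.
nodes = ["A", "B", "C", "D", "E"]
true_costs = {("A", "B"): 1.0, ("B", "D"): 1.0, ("A", "C"): 1.0, ("C", "D"): 5.0,
              ("A", "D"): 10.0, ("A", "E"): 4.0, ("E", "D"): 4.0}
optimistic = {("A", "B"): 1.0, ("B", "D"): 1.0, ("A", "C"): 1.0, ("C", "D"): 0.5,
              ("A", "D"): 1.2, ("A", "E"): 3.0, ("E", "D"): 3.0}
print(optimistic_min_cost_path(nodes, true_costs, optimistic, "A", "D"))
```

In this toy graph the loop stops after three iterations with the path A, B, D, having measured five of the seven links; the two links through E are never visited because even their optimistic cost is worse than the measured optimum.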
Second, we demonstrate and evaluate how this general framework works in unknown
environments for two specific scenarios concerning the formation of a multi-hop robotic
relay path between two fixed end-points in an obstructed environment, which differ in
their path utility functions. In one, we seek to minimize the total path cost, and in the
other, we seek to maximize the bottleneck rate (i.e., the end-to-end data rate). A
simulation-based evaluation shows that the use of the optimism principle can
significantly reduce the time spent, relative to exploring and mapping the entire region
a priori before the optimal network configuration is constructed. We also present a
mathematical model of how the searched area scales with various relevant parameters for
each case.

Footnote 1: For example, it may be reasonable to assume that if a particular pair of
locations has a certain path loss indicative of significant attenuation due to a wall,
then links corresponding to all locations that fall on or even near the same line as
those locations must experience at least that much loss due to attenuation as well.
1.3.3 Dynamic Spectrum Access
In Chapter 6, we study dynamic spectrum access, which is one of the keys to improving
spectrum utilization in wireless networks and helping to meet the need for more capacity.
This is particularly important in the presence of other networks operating in the same
spectrum. In the context of cognitive radio research, a standard assumption is that
secondary users may search for and use idle channels that are not being used by their
primary users (PUs). Although many existing works focus on algorithm design and
implementation in this field, nearly all of them assume a simple independent-channel (or
PU activity) model that may not hold in practice. For example, consider a low-power
wireless sensor network (WSN) based on IEEE 802.15.4 radios, which use the globally
available 2.4 GHz and 868/900 MHz bands. These bands are shared by various wireless
technologies (e.g., Wi-Fi, Bluetooth, RFID), as well as industrial/scientific equipment
and appliances (e.g., microwave ovens) whose activities can affect multiple IEEE 802.15.4
channels. This external interference can cause the channels in WSNs to be highly
correlated, and new algorithms and schemes for dynamic multichannel access are required
to tackle this challenge.
Motivated by such practical considerations, we consider a multichannel access problem
with $N$ correlated channels. Each channel has two possible states, good or bad, and
their joint distribution follows a $2^N$-state Markovian model. There is a single user
(wireless node) that selects one channel at each time slot to transmit a packet. If the
selected channel is in the good state, the transmission is successful; otherwise, there
is a transmission failure. The goal is to obtain as many successful transmissions as
possible over time. As the user is only able to sense the selected channel at each time
slot, no full observation of the system is available. In general, the problem can be
formulated as a partially observable Markov decision process (POMDP), which is
PSPACE-hard, and finding the exact solution requires exponential computational
complexity [63]. Even worse, the parameters of the joint Markovian model might not be
known a priori, which makes it more difficult to find a solution.
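
To make the problem setting concrete, the following sketch simulates a user accessing jointly Markovian channels. It is an illustration only: the function name `run_channel_access`, the dictionary encoding of the joint transition model and the two-channel example are assumptions, not the simulator used in Chapter 6. The joint state is a tuple with one entry per channel (1 for good, 0 for bad), and the user only learns the state of the channel it selected.

```python
import random


def run_channel_access(transition, start, choose, n_slots, seed=0):
    """Count successful transmissions over n_slots.

    transition: maps a joint state to a dict {next_joint_state: probability}.
    choose:     policy callable (slot_index, last_observation) -> channel index, where
                last_observation is None or (channel, was_good) from the previous slot.
    """
    rng = random.Random(seed)
    state, successes, last_obs = start, 0, None
    for t in range(n_slots):
        channel = choose(t, last_obs)
        good = state[channel] == 1  # only the selected channel is sensed
        successes += int(good)
        last_obs = (channel, good)
        next_states, probs = zip(*transition[state].items())
        state = rng.choices(next_states, weights=probs)[0]
    return successes


# Hypothetical N = 2 example with perfectly (negatively) correlated channels:
# exactly one channel is good in each slot and the good channel alternates.
T = {(1, 0): {(0, 1): 1.0}, (0, 1): {(1, 0): 1.0}}
print(run_channel_access(T, (1, 0), lambda t, obs: t % 2, 100))  # follows the pattern
print(run_channel_access(T, (1, 0), lambda t, obs: 0, 100))      # always uses channel 0
```

A policy that exploits the correlation succeeds in every slot here, while staying on one channel succeeds in only half of them, which is the kind of gap a learned access policy aims to close when the transition model is not known.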
We investigate the use of Deep Reinforcement Learning, in particular Deep Q learning
[54], from the field of machine learning as a way to enable learning in an unknown
environment, as well as to overcome the prohibitive computational requirements. By
integrating deep learning with Q learning, Deep Q learning or Deep Q Network (DQN) uses
a deep neural network with states as input and estimated Q values as output to
efficiently learn policies for high-dimensional, large state-space problems. We implement
a DQN that finds a channel access policy through online learning. This DQN approach is
able to deal with large systems, and finds a good or even optimal policy directly from
historical observations without any requirement to know the system dynamics a priori. We
study the optimal policy for a known fixed-pattern channel-switching situation and
conduct various experiments showing that DQN can achieve the same (optimal) performance.
We then study the performance of DQN in more complex scenarios and show, through both
simulation and real data traces, that DQN is able to find superior, near-optimal
policies. In addition, we design an adaptive DQN framework that is able to adapt to
non-stationary, time-varying, dynamic environments, and validate, through simulation,
that the proposed approach can detect a change in the environment and re-learn the
optimal policy for the new environment.
Chapter 2
Background
Our main focus is to design and implement novel AI-assisted algorithms to enhance
wireless network performance. The necessary background for this includes backpressure
scheduling to utilize the controllable autonomous robots, and online learning and deep
reinforcement learning algorithms for decision-making problems.
2.1 Backpressure Scheduling
The novel idea of backpressure scheduling for multi-hop wireless networks was first
proposed in the seminal work by Tassiulas and Ephremides [73]. Consider a network that
contains multiple nodes and links, with multiple data flows being transmitted in the
network. The goal is to schedule links at each time slot to help transmit data.
Assume each node maintains a separate queue to store undelivered data for each data
flow. Let $Q_i^f(t)$ denote the queue length at node $i$ for flow $f$ at time slot $t$,
and let $r_{ij}(t)$ be the transmission rate of link $(i, j)$. Define a weight
$W_{ij}(t)$ for each link $(i, j)$ as
\[
W_{ij}(t) = r_{ij}(t) \max_f \Delta Q_{ij}^f(t)
          = r_{ij}(t) \max_f \left( Q_i^f(t) - Q_j^f(t) \right) \tag{2.1}
\]
Backpressure scheduling selects, at each time slot $t$, a non-interfering set of links
for data transmission that maximizes the sum of $W_{ij}(t)$. Its essence is to
prioritize transmissions over links that have the highest queue differentials. This
mechanism is, to some extent, similar to a gradient descent approach where the gradient
refers to the queue differential, which is also known as the congestion gradient.
It has been shown that backpressure scheduling is throughput optimal, in the sense that
it can stabilize a network under any feasible data rate. Backpressure scheduling can be
implemented efficiently in practice, as the non-interfering set of links that maximizes
the weights $W_{ij}(t)$ at each time slot $t$ can be found by applying a maximum weight
matching algorithm. Moreover, backpressure scheduling is very flexible and practical in
the sense that it makes dynamic scheduling decisions based only on queueing information
and link rates, and does not require knowledge of the input data rates. Later research
combines the original results with utility optimization and shows that backpressure
techniques promise simple, throughput-optimal, cross-layer network protocols that can
integrate medium access, routing, rate and power control for all kinds of networks
[18, 26, 56, 57, 58, 59].
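
As a small illustration of Eq. (2.1) and of the scheduling step, the sketch below computes the link weights from per-flow queue lengths and then searches for the best non-interfering link set. Two assumptions are made for the example and are not taken from [73]: interference is modeled as node-exclusive (two links conflict if they share a node), and the search is a brute-force enumeration that is only practical for tiny networks, standing in for a proper maximum weight matching algorithm. The function names are hypothetical.

```python
from itertools import combinations


def link_weights(queues, rates, flows):
    """Eq. (2.1): W_ij = r_ij * max_f (Q_i^f - Q_j^f) for every link (i, j)."""
    return {
        (i, j): r * max(queues[i][f] - queues[j][f] for f in flows)
        for (i, j), r in rates.items()
    }


def max_weight_schedule(weights):
    """Brute-force search for the set of links, no two sharing a node, with maximum total weight."""
    links = list(weights)
    best, best_weight = (), 0.0  # the empty schedule is always feasible
    for k in range(1, len(links) + 1):
        for subset in combinations(links, k):
            endpoints = [node for link in subset for node in link]
            if len(endpoints) != len(set(endpoints)):
                continue  # two links share a node: treated as interfering (assumption)
            total = sum(weights[link] for link in subset)
            if total > best_weight:
                best, best_weight = subset, total
    return set(best), best_weight


# Hypothetical 4-node line network with two flows "a" and "b".
flows = ("a", "b")
queues = {0: {"a": 5, "b": 1}, 1: {"a": 2, "b": 4}, 2: {"a": 0, "b": 1}, 3: {"a": 0, "b": 0}}
rates = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 1.0}

weights = link_weights(queues, rates, flows)
print(weights)
print(max_weight_schedule(weights))
```

Here link (1, 2) alone has weight 6.0, which exceeds the combined weight 4.0 of links (0, 1) and (2, 3), so it is the only link scheduled in this slot.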
2.2 Decision Making with Prior Knowledge
Many problems in the field of wireless networks, such as dynamic channel access [43],
computational offloading [45] and power allocation [76], can be modelled as sequential
decision-making problems. The Markov Decision Process and the Partially Observable
Markov Decision Process are two commonly used mathematical frameworks for studying such
dynamic systems.
2.2.1 Markov Decision Process
Figure 2.1: MDP illustration (source: https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf)
A Markov Decision Process (MDP) provides a mathematical framework for studying
sequential decision-making problems in discrete-time stochastic environments [9]. An MDP
can be described by a tuple $(S, A, T, r)$. $S$ is the set of all possible states of the
system, and $A$ is the set of all actions. $T$ is the state transition probability
function that describes how the system evolves stochastically. Suppose the current state
at time slot $t$ is $s$; if a user takes an action $a$, then the probability that the
system is in state $s'$ at time slot $t + 1$ is
\[
T(s, a, s') = p(S_{t+1} = s' \mid S_t = s, A_t = a) \tag{2.2}
\]
At the same time, the user may receive an immediate reward after taking an action, and
$r$ is the reward function that describes how the reward is related to the state and
action:
\[
r(s, a) = E[R_{t+1} \mid S_t = s, A_t = a] \tag{2.3}
\]
where $R_{t+1}$ represents the immediate reward after taking action $a$ in state $s$.
A policy $\pi : s(t) \to a(t)$ is a function that maps each state $s$ to an action $a$
at each time slot $t$, which can guide the user to make decisions and take actions
sequentially. The objective is to find an optimal policy $\pi^*$ that maximizes the
expected long-term discounted reward:
\[
E\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_{\pi(s(t))}(t) \right] \tag{2.4}
\]
where $0 \le \gamma < 1$ is the discount factor.
Let $V^\pi(s)$ be the value function that represents the expected accumulated discounted
reward achieved by policy $\pi$ with the system starting from initial state $s$. Then we
have the following equation:
\[
V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s') V^\pi(s') \tag{2.5}
\]
Therefore, letting $V^*(s)$ represent the value function for the optimal policy $\pi^*$
with initial state $s$, the optimal value function satisfies:
\[
V^*(s) = \max_{a \in A} \left[ r(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V^*(s') \right] \tag{2.6}
\]
Equations (2.5) and (2.6) are called Bellman equations, which relate the value function
to itself via the system dynamics. The optimal policy $\pi^*$ can be found via a value
iteration approach based on the Bellman equations.
Algorithm 1 Value Iteration
Initialize $t = 0$ and $V_0(s) = 0$ for all $s \in S$
repeat
    for all $s \in S$ do
        $V_{t+1}(s) = \max_{a \in A} \left[ r(s, a) + \gamma \sum_{s' \in S} T(s, a, s') V_t(s') \right]$
    end for
    $t = t + 1$
until $\max_{s \in S} |V_t(s) - V_{t-1}(s)| \le \epsilon$
It has been shown that the policy found by this value iteration approach is within
$\frac{2\epsilon\gamma}{1-\gamma}$ of optimal [6].
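
A minimal Python sketch of value iteration follows. The nested-dictionary encoding of $T$ and $r$, the function name and the two-state example are assumptions chosen for illustration; the update itself is the Bellman optimality backup of Eq. (2.6), and the loop stops once successive value functions differ by at most `eps`.

```python
def value_iteration(states, actions, T, r, gamma=0.9, eps=1e-6):
    """T[s][a][s2] is a transition probability, r[s][a] an expected immediate reward."""

    def backup(V, s, a):
        # One-step lookahead: immediate reward plus discounted expected next value.
        return r[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(backup(V, s, a) for a in actions) for s in states}
        gap = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if gap <= eps:
            break
    # Greedy policy with respect to the final value function.
    policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
    return V, policy


# Hypothetical two-state example: "stay" keeps the state, "switch" flips it,
# and only staying in the good state pays a reward of 1.
states = ("good", "bad")
actions = ("stay", "switch")
T = {
    "good": {"stay": {"good": 1.0, "bad": 0.0}, "switch": {"good": 0.0, "bad": 1.0}},
    "bad": {"stay": {"good": 0.0, "bad": 1.0}, "switch": {"good": 1.0, "bad": 0.0}},
}
r = {"good": {"stay": 1.0, "switch": 0.0}, "bad": {"stay": 0.0, "switch": 0.0}}

V, policy = value_iteration(states, actions, T, r)
print(V, policy)
```

For this example the fixed point can be checked by hand: with $\gamma = 0.9$, $V^*(\text{good}) = 1/(1 - 0.9) = 10$ and $V^*(\text{bad}) = 0.9 \cdot 10 = 9$, with the policy staying in the good state and switching out of the bad one.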
2.2.2 Partially Observable Markov Decision Process
A Partially Observable Markov Decision Process (POMDP) is very similar to an MDP; the
difference lies in whether the state of the system is observable. If the state of the
system is fully observable, then the problem is a fully observable MDP, or MDP for
short; otherwise, if the state of the system is not completely revealed to the user and
only some observation is available, then it is a POMDP. Let $O$ be the set of all
observations. An observation function $Z$ describes the relationship between the system
state (and action) and the observation:
\[
Z(s', a, o') = p(O_{t+1} = o' \mid S_{t+1} = s', A_t = a) \tag{2.7}
\]
Since there is no direct access to the current state, decision-making needs to maintain
the entire history, which makes the problem complex and non-Markovian. However, one can
convert the problem to an augmented MDP by maintaining a belief function over the states
and then solve the problem using the previous MDP approach. A belief $b(\cdot)$ is a
probability distribution over possible states given all previous history. If we consider
the belief as the state, a POMDP can be converted to an MDP represented by a tuple
$(B, A, O, T_b, r_b)$, where $B$ is the belief space, $A$ is the action space and $O$ is
the observation space. $T_b$ is the transition function that describes how the belief
state changes:
\[
T_b(b, a, b') = p(b' \mid b, a) = \sum_{o \in O} p(b' \mid a, b, o)\, p(o \mid a, b) \tag{2.8}
\]
where
\[
p(o \mid a, b) = \sum_{s' \in S} Z(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s) \tag{2.9}
\]
\[
p(b' \mid a, b, o) =
\begin{cases}
1, & \text{if } b_o^a = b' \\
0, & \text{otherwise}
\end{cases} \tag{2.10}
\]
and $b_o^a$ is the updated belief after taking action $a$ and receiving observation $o$:
\[
b_o^a(s') = \frac{Z(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{p(o \mid a, b)} \tag{2.11}
\]
In addition, $r_b$ represents the average reward:
\[
r_b(b, a) = \sum_{s \in S} b(s)\, r(s, a) \tag{2.12}
\]
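
The belief update of Eqs. (2.9) and (2.11) can be written in a few lines. The sketch below is illustrative: the function name, the nested-dictionary encoding of $T$ and $Z$, and the noisy single-channel example are assumptions. It first predicts the next-state distribution through $T$, then reweights it by the observation likelihood $Z$ and normalizes by $p(o \mid a, b)$.

```python
def belief_update(b, a, o, states, T, Z):
    """Return (b_o^a, p(o | a, b)) with T[s][a][s2] and Z[s2][a][o] given as nested dicts."""
    unnormalized = {
        s2: Z[s2][a][o] * sum(T[s][a][s2] * b[s] for s in states)
        for s2 in states
    }
    p_o = sum(unnormalized.values())  # Eq. (2.9)
    if p_o == 0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s2: v / p_o for s2, v in unnormalized.items()}, p_o  # Eq. (2.11)


# Hypothetical single channel with a noisy sensor and a single action "sense".
states = ("good", "bad")
T = {"good": {"sense": {"good": 0.8, "bad": 0.2}},
     "bad": {"sense": {"good": 0.3, "bad": 0.7}}}
Z = {"good": {"sense": {"ok": 0.9, "fail": 0.1}},
     "bad": {"sense": {"ok": 0.2, "fail": 0.8}}}

b0 = {"good": 0.5, "bad": 0.5}
b1, p_ok = belief_update(b0, "sense", "ok", states, T, Z)
print(b1, p_ok)
```

Starting from a uniform belief, the predicted probability of the good state is $0.8 \cdot 0.5 + 0.3 \cdot 0.5 = 0.55$; observing "ok" has probability $0.9 \cdot 0.55 + 0.2 \cdot 0.45 = 0.585$ and raises the belief in the good state to $0.495 / 0.585 \approx 0.846$.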
Since the belief is continuous-valued, the augmented MDP has a continuous state space.
Fortunately, the optimal policy can still be found by solving the Bellman optimality
equation (2.13), and a value iteration approach similar to that of the previous discrete
MDP case exists.
\[
V^*(b) = \max_{a \in A} \left[ r_b(b, a) + \gamma \sum_{o \in O} p(o \mid a, b)\, V^*(b_o^a) \right] \tag{2.13}
\]
2.3 Decision Making without Prior Knowledge
With the arrival of the IoT, wireless networks have become heterogeneous and complex.
One of the main challenges in many sequential decision-making and optimization problems
is that no prior knowledge is available, so fully worked-out solutions cannot be
computed in advance. In these situations, online learning and reinforcement learning
allow decisions and actions to be taken on the fly, while learning and gaining new
knowledge about the environment at the same time. The more one knows about the
environment, the better the decisions one can make in the future. There is a fundamental
dilemma in every online decision-making problem: one has to choose between exploitation
and exploration. Exploitation means making the best decision based on current knowledge,
while exploration means trying other options to gather more information about the
system. Exploitation gives the best immediate benefit, but it may not be the best in
terms of long-term reward. In contrast, exploration allows one to learn more about the
environment at the expense of short-term sacrifices, but may enable one to make better
decisions in the future. Therefore, when designing and applying online learning and
reinforcement learning algorithms, the tradeoff between exploitation and exploration
should be well addressed so that one can find the best strategy overall.
2.3.1 Q-Learning
From the previous discussion, we know that an MDP or POMDP can be solved using the
Bellman equations and dynamic programming. However, this requires knowing the system
dynamics a priori, which is often impossible in practical settings. Therefore, online
learning and reinforcement learning, which take actions while learning the environment,
are the key to sequential decision-making problems in unknown environments. Q-learning
[86] is one of the well-known reinforcement learning methods for the MDP and POMDP
settings; it can address the tradeoff between exploration and exploitation and find
optimal policies in unknown environments.
The goal of Q-learning is to find an optimal policy, i.e., a sequence of actions that
maximizes the long-term expected accumulated discounted reward. Q-learning is a value
iteration approach whose essence is to find the Q-value of each state-action pair, where
the state $x$ could simply be the state of an MDP or be derived from observations (and
rewards), and the action $a$ is an action that a user can take given the state $x$. The
Q-value of a state-action pair $(x, a)$ under policy $\pi$, denoted $Q^\pi(x, a)$, is
defined as the expected sum of the discounted rewards received when taking action $a$ in
the initial state $x$ and then following the policy $\pi$ thereafter. $Q^*(x, a)$ is the
Q-value with initial state $x$ and initial action $a$, and then following the optimal
policy $\pi^*$. Thus, the optimal policy $\pi^*$ can be derived as
\[
\pi^*(x) = \arg\max_a Q^*(x, a), \quad \forall x \tag{2.14}
\]
One can use an online learning method to find $Q^*(x, a)$ without any knowledge of the
system dynamics. Assume that at the beginning of each time slot, the agent takes an
action $a_t$ that maximizes the Q-value of the state-action pair $(x_t, a_t)$ given that
the state is $x_t$, and gains a reward $r_{t+1}$. Then the online update rule of
Q-values with learning rate $0 < \alpha < 1$ is given as follows:
\[
Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) \right] \tag{2.15}
\]
It has been shown that in the MDP case, if each action is executed in each state an
infinite number of times on an infinite run (which often requires a carefully designed
exploration algorithm) and the learning rate decays appropriately, the Q-value of each
state-action pair will converge with probability 1 to the optimal $Q^*$, and thus the
optimal policy can be found [87].
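
The update rule (2.15) can be sketched as tabular Q-learning in a few lines. Several choices below are assumptions made for illustration and are not prescribed by the text: exploration uses an epsilon-greedy rule, the learning rate is kept constant rather than decayed, and the environment is a hypothetical two-state toy problem exposed through a `step` function.

```python
import random


def q_learning(states, actions, step, start, gamma=0.9, alpha=0.1,
               epsilon=0.2, n_steps=5000, seed=0):
    """Tabular Q-learning; step(x, a) returns (next_state, reward)."""
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in states for a in actions}
    x = start
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = rng.choice(actions)  # explore
        else:
            a = max(actions, key=lambda act: Q[(x, act)])  # exploit
        x_next, reward = step(x, a)
        # Eq. (2.15): move Q(x, a) toward the one-step bootstrapped target.
        target = reward + gamma * max(Q[(x_next, a2)] for a2 in actions)
        Q[(x, a)] += alpha * (target - Q[(x, a)])
        x = x_next
    policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in states}
    return Q, policy


def toy_step(x, a):
    """Hypothetical environment: 'stay' keeps the state, 'switch' flips it;
    only staying in the good state yields a reward of 1."""
    x_next = x if a == "stay" else ("bad" if x == "good" else "good")
    reward = 1.0 if (x == "good" and a == "stay") else 0.0
    return x_next, reward


Q, policy = q_learning(("good", "bad"), ("stay", "switch"), toy_step, "good")
print(policy)
```

The agent never sees the transition or reward model; it only calls `toy_step` and still recovers the policy that stays in the good state and switches out of the bad one. With a constant learning rate the convergence condition quoted above is not met in general, which is acceptable here only because the toy environment is deterministic.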
2.3.2 Deep Reinforcement Learning
Q-learning works well for problems with a small state-action space, as a look-up
table can be used to execute the update rule in Eq. (2.15). But this is impossible
when the state-action space becomes very large. Even worse, since many states
are rarely visited, their corresponding Q-values are seldom updated. This causes Q
learning to take a very long time to converge.
Researchers have proposed both linear and non-linear Q-value approximations
to overcome the space size limit. In 2013, DeepMind developed a Deep Q-Network
(DQN), which makes use of a deep neural network to approximate the Q-values,
and it achieves human-level control in the challenging domain of classic Atari 2600
games [54].
Figure 2.2: Deep neural network
A neural network is a biologically-inspired programming paradigm organized in
layers, as illustrated in Figure 2.2. Each layer is made up of a number of nodes
known as neurons, each of which executes an `activation function'. Each neuron
takes the weighted linear combination of the outputs from neurons in the previous
layer as input and outputs the result from its nonlinear activation function to the
next layer. The networked-neuron architecture enables the neural network to be
capable of approximating nonlinear functions of the observational data. A deep
neural network is a neural network that can be considered as a deep graph with
many processing layers. A deep neural network is able to learn from low-level
observed multi-dimensional data and has found its success in areas such as computer
vision and natural language processing [65, 71].
DQN combines Q-learning with deep learning: the Q-function is approximated by a deep
neural network, called the Q-network, that takes the state-action pair as input and
outputs the corresponding Q-value. The Q-network updates its weights $\theta_i$ at each
iteration $i$ to minimize the loss function
$L_i(\theta_i) = E[(y_i - Q(x, a; \theta_i))^2]$, where
$y_i = E[r + \gamma \max_{a'} Q(x', a'; \theta_{i-1})]$ is derived from the same
Q-network with the old weights $\theta_{i-1}$ and the new state $x'$ reached after
taking action $a$ from state $x$. In this way, DQN is able to learn Q values efficiently
even for high-dimensional, large state-space, complicated problems. A technique called
Experience Replay is introduced in [54] to store past observations and actions in a
replay memory. In each training iteration, a collection of data samples is randomly
drawn from the replay memory and used to train the deep neural network to update its
weights. This technique allows a data sample to be used in many weight updates, which
improves data efficiency. In addition, it breaks correlations among data samples, which
helps make the training stable and convergent.
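
The two ingredients described above, the replay memory and the target $y_i$ computed from the old weights, can be sketched without committing to any particular deep learning library. In the sketch below the Q-network is replaced by plain callables `q_old(x, a)` and `q_new(x, a)`; these stand-ins, the class and function names, and the sample-mean form of the loss are assumptions for illustration, not the DQN implementation of [54] or of Chapter 6.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (x, a, r, x_next) transitions; the oldest are dropped first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, x, a, r, x_next):
        self.buffer.append((x, a, r, x_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal order of the samples.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, q_old, actions, gamma):
    """y = r + gamma * max_a' Q(x', a'; theta_old), evaluated with the old network q_old."""
    return [r + gamma * max(q_old(x_next, a2) for a2 in actions)
            for (_, _, r, x_next) in batch]


def squared_loss(batch, targets, q_new):
    """Sample mean of (y - Q(x, a; theta))^2 over the minibatch."""
    errors = [(y - q_new(x, a)) ** 2 for (x, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Usage with placeholder Q-functions (hypothetical values).
memory = ReplayMemory(capacity=1000)
memory.push(0, "a", 1.0, 1)
memory.push(1, "b", 0.0, 0)
batch = memory.sample(2)
y = td_targets(batch, q_old=lambda x, a: 2.0, actions=("a", "b"), gamma=0.5)
print(squared_loss(batch, y, q_new=lambda x, a: 0.0))
```

In a full DQN, `q_new` would be the network being trained and the loss would be minimized by gradient descent on its weights, while `q_old` would be a periodically refreshed copy of it.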
Chapter 3
Related Work
In this chapter, we discuss the state-of-the-art research conducted on each of the
topics studied in this thesis.
3.1 Robotic Message Ferrying
In the mobile networking community, controlled mobility has become a new design
dimension to improve the network performance. In [99], Zhao et al. introduce the
concept of message ferries, which are mobile devices that can proactively move
around by following pre-designed routes to help deliver data. Based on the idea of
non-random movement, they propose a proactive routing scheme for a single ferry
to provide regular connectivity for disconnected static wireless ad hoc networks.
Later works have considered more general cases, such as multi-ferry control [36]
and delay tolerant networks [98]. In addition, in [28], Goldenberg et al. consider
mobility as a network control primitive, and present the first distributed and
self-adaptive mobility control scheme for improving communication performance.
On the other hand, controlled mobility has long been an active research area in the
robotics community. In a system of coordinated robots, mobility controllers of agents
can be designed through local interactions to allow robots to perform useful collective
behaviors such as flocking [60], formation control [20] and swarming [25]. Though
communication plays a critical role in these coordinated systems, none of these works
consider practical communication models; rather, they rely on a simple disc model (where
two robots can communicate when within some distance), which does not match a realistic
environment and thus degrades the performance.
With the recent integration of mobile robots and wireless networks, realistic
communication factors together with routing issues have become an emerging research
topic. Researchers in [90] studied an integrity problem where controllers of robots are
designed to conduct some task while maintaining certain desired end-to-end transmission
rates at the same time. A robot router formulation problem has been studied in [94],
where an optimal configuration of robots is formulated to maintain a maximized
successful reception rate in realistic communication environments that naturally
experience path loss, shadowing, and multipath fading. The expected number of
transmissions per successfully delivered packet (ETX) was also taken into consideration
in [88], where researchers design a hybrid architecture to allow robots to be optimally
configured so that each flow has a minimized ETX in a multi-flow network. Unlike these
works, which ignore the impact of queueing on the joint robotic control and transmission
scheduling, in Chapter 4 we design schemes from a more practical queue-aware
perspective.
Inspiration for our work in Chapter 4 comes from backpressure scheduling and routing in
wireless networks. In [73], Tassiulas and Ephremides propose the original idea of a
backpressure-based queue weight maximization scheduling scheme, and show that it can
guarantee the stability of a general network for any arrival rate within an optimal
capacity region. Later, researchers combine the original results with utility
optimization and show that backpressure scheduling is a simple, throughput-optimal
network protocol that can integrate medium access, routing, rate and power control for
all kinds of networks [18, 26, 56, 57, 58, 59]. However, large queue sizes have to be
maintained when applying backpressure scheduling, which can cause long delays in the
network. Techniques and methods to improve delay performance have been considered in
both theory and practice [4, 34, 37, 55, 67, 70, 85]. In Chapter 4, we combine the idea
of backpressure scheduling with robotic control to jointly control the movement and
routing in the robotic wireless network. Not only can a certain capacity region be
guaranteed, but the delay performance can also be controlled via a tunable parameter.
Closely related to the work in Chapter 4 are our two recent conference publications
[24, 80]. In [80], we propose the idea of queue-aware joint robotic control and
transmission scheduling, and design the CBMF algorithm. Building on our initial work
[80], in Chapter 4 we conduct a complete capacity and delay performance analysis of the
CBMF algorithm and find a heuristic approach that adapts network settings according to
network conditions to reduce network delay while maintaining network stability. In [24],
we propose a fine-grained backpressure message ferrying algorithm (FBMF) in which the
robots' allocation decision is made in a much finer manner by also taking current
transmission rates into consideration. We show that the FBMF algorithm is throughput
optimal for the simplest setting, which is single-flow, single-robot, with deterministic
arrivals. However, its performance in the general case remains an open question. In
contrast to [24], we study the robotic message ferrying problem with multiple flows and
multiple robots in a general scenario, and propose the CBMF algorithm, whose throughput
performance approaches the ideal capacity region in the general case. Furthermore, delay
performance is also studied to gain more understanding.
3.2 Robotic Network Deployment
Networked robots have been well investigated in recent years, especially in relation to
flocking [60], formation control [20] and swarming [25]. The key idea is to allow a team
of robots to cooperate and coordinate in a networked autonomous system to perform a
specific task. Therefore, communication among robots plays a significant role in
enabling cooperation. While the disk communication model has been used in many of these
previous works, such a simple model does not fit a realistic RF environment.
Recently, there has been a growing body of work considering more realistic communication
performance for networked robots, referred to as Communication-Aware Robotics.
Researchers in [94] studied integrity problems where controllers of robots are designed
while maintaining a desired transmission rate at the same time. A robotic router
formulation problem has been studied in [90], where an optimal configuration of robots
is formulated that maintains a maximized successful reception rate in realistic
communication environments. ETX has been taken into consideration in [88], where
researchers design a hybrid architecture that allows robots to be optimally configured
so that each flow has a minimized ETX in a multi-flow network. Because the utility
functions that these papers use are convex, potential functions or gradient descent
algorithms are used, which are implicitly based on the assumption that there is only one
extremum. This assumption no longer holds when noise sources or obstacles introduce
non-convexities.
Controlling a team of robots in a network becomes more challenging when there are
obstacles in the environment. Most works design the controllers of robots under the
assumption that the obstructed environment is known a priori. This allows researchers to
explicitly add obstacle avoidance by utilizing either linear constraints [69] or
artificial potentials [39]. However, when communication-oriented performance is taken
into consideration, obstacles not only block robots' movements but can also cause signal
attenuation. Few works [23, 79] consider wall attenuation when studying robot
coordination. Additional difficulties arise when the environment is unknown, which
further requires robots to take measurements and explore the environment.
In [27], a measurement-based mapping is computed for each spatial direction
between a robot's current position and the received signal strength regardless of
the environment. This is used to obtain a quadratic optimization yielding the best
locations for a set of robotic access points to serve a set of (possibly mobile) clients.
However, this work does not consider the problem of forming a general utility-
optimized multi-hop communication network among the robotic nodes. Another
problem similar to the one we address is studied in [23], which proposes an algorithm
for maintaining end-to-end network connectivity for a team of robots. The authors
jointly find robot configurations and wireless network routing. However, in order to
build the configuration space, the environment has to be known a priori.
To the best of our knowledge, our work in Chapter 5 is the first to present a mechanism
for rapid optimal multi-hop network configuration by a team of robots in an unknown
realistic RF environment with obstructions, where the problem is non-convex and not
amenable to solution using standard potential-based approaches.
3.3 Dynamic Multichannel Access
Many decision-making problems, such as vertical handoff in heterogeneous networks [68]
and power allocation in energy harvesting communication systems [76], are modeled as
MDPs that are fully observable. However, in dynamic multichannel access, a user only has
partial observation of channels. Thus, a POMDP framework is often used for modeling a
dynamic multichannel access problem, and finding an optimal channel access policy has
exponential time and space complexity. To overcome the prohibitive computational
complexity, a Myopic policy and its performance were first studied in [97] for channels
that are independent and identically distributed (i.i.d.). The Myopic policy is shown to
have a simple and robust round-robin structure that does not require knowledge of the
system transition probabilities beyond whether the channel state transitions are
positively or negatively correlated. In [97], it was proved that the Myopic policy is
optimal when there are only two positively correlated channels in the system. In the
subsequent work [2], the optimality result was extended to any number of positively
correlated channels and two or three negatively correlated channels. However, the Myopic
policy does not have any performance guarantee when channels are correlated with each
other or follow different distributions, which is the situation considered in our work
in Chapter 6.
When channels are independent but may follow different Markov chains, the dynamic
multichannel access problem can also be modeled as a Restless Multi-Armed Bandit (RMAB)
problem. Each channel can be considered as an arm, and its state evolves following a
Markov chain. At each time slot, a user chooses an arm with a state-dependent reward.
The goal is to maximize the total expected reward over time. A Whittle Index policy is
introduced in [51] and shares the same simple semi-universal structure and optimality
result as the Myopic policy when channels are stochastically identical. Numerical
results are also provided showing that the Whittle Index policy can achieve near-optimal
performance when channels are non-identical. But the Whittle Index approach cannot be
applied when channels are correlated, which is the focus of our work in Chapter 6.
Both the Myopic policy and the Whittle Index policy are derived under the assumption
that the system transition matrix is known. When the underlying system statistics are
unknown, the user must apply an online learning policy with time spent on exploration to
learn the system dynamics (either explicitly or implicitly). When channels are
independent, the RMAB approach can be applied and the corresponding asymptotic
performance is compared with the performance achieved by a genie that has full knowledge
of the system statistics. The commonly used performance metric is called regret, which
is defined as the expected reward difference between a genie and a given policy. A
sublinear regret is desirable as it indicates that the policy asymptotically achieves
the same optimal performance as the genie. A logarithmic regret bound that grows as a
logarithmic function of time $t$ is achieved in [12, 50, 74] when a weak regret (see
footnote 1) is considered, and an $O(\sqrt{t})$ regret bound and an $O(\log t)$ regret
bound with respect to strict regret (see footnote 2) are achieved in [62] and [13],
respectively. However, all these prior RMAB works are based on the independent channel
assumption, and do not consider correlated channels.

Footnote 1: As stated in [13], "The genie being compared with is weaker in the sense
that it is aware only of the steady-state distribution for each channel, and not the
full transition matrices".
In recent years, some works begin to focus on the more practical and complex
problems where both the system statistics are unknown and the channels are cor-
related. Q-learning, one of the most popular reinforcement learning approaches, is
widely used as it is a model-free method that can learn the policy directly. The
authors in [78] apply Q-learning to design channel sensing sequences, while in [95]
it is shown that Q-learning can also take care of imperfect sensing. Additionally,
in [72], the authors use universal software radio peripheral (USRP) and GNU radio
units to implement and evaluate Q-learning in a multi-hop cognitive radio network
testbed. However, all these works assume that the system state is fully observable
and formulate the problem as an MDP, which significantly reduces the state space
so that Q-learning can be easily implemented by using a look-up table to store
and update Q-values. Since a user is only able to observe the state of the chosen
channel at each time slot, the current state of the system is not fully observable
and our problem falls into the framework of POMDP. When updating Q-values,
the original state space cannot be directly used because of its partial observability.
Instead, one could consider using either the belief or a number of historical
observations. This can lead to a very large state space, which makes it impossible to
maintain a look-up Q table. New methods able to find approximations of Q-values
are required to solve the large state space challenge.
Footnote 2: As stated in [13], "Comparing the performance of a policy to the genie that knows the
probability transition matrices for each channel and can thus perform optimally."
In recent years, reinforcement learning, including Q-learning, has been
integrated with advanced machine learning techniques, particularly deep learning, to
tackle difficult high-dimensional problems [3, 5, 48]. In 2013, DeepMind used a
deep neural network, called DQN, to approximate the Q-values in Q-learning, which
overcomes the state space limitation of the traditional look-up table
approach [54]. In addition, this deep neural network approach also provides an
end-to-end approach in which an agent can learn a policy directly from observations. In
Chapter 6, we formulate the dynamic multi-channel access problem as a POMDP
and employ DQN to solve this problem. To the best of our knowledge, our work
in Chapter 6 is the first study and implementation of DQN in the field of dynamic
multi-channel access.
Chapter 4
Robotic Message Ferrying for Wireless Networks
using Coarse-Grained Backpressure Control
In this chapter, we consider the problem of robots ferrying messages between
statically-placed source and sink pairs that they can communicate with wirelessly.
We analyze the capacity region for this problem under ideal conditions, and indicate
how robots could be scheduled optimally to satisfy any arrival rate in the capacity
region, given prior knowledge about arrival rate. For the more practical setting
where the arrival rate is unknown, we present a coarse-grained backpressure mes-
sage ferrying algorithm (CBMF), which schedules both motion and transmission
for each robot only based on its queueing information. We show through analysis
and simulations the conditions under which CBMF can stabilize the network, and
its corresponding delay performance. We also provide an adaptive CBMF approach
to improve delay performance, and study the structural properties and the explicit
delay performance of CBMF in a homogeneous network.
The work in this chapter is based on [80, 81].
4.1 Problem Formulation
We consider a network where there are K pairs of static source and destination
nodes located at arbitrary locations in a two- or three-dimensional Euclidean
space. Let the source for the i-th flow be denoted as src(i), and the destination or
sink for that flow be denoted as sink(i). Packets arrive at a source following either
a deterministic process or an i.i.d. stochastic process, and the arrival rate at source
i is a constant $\lambda_i$ (footnote 1). A list of notations is presented in Table 4.1.
There are N mobile robotic nodes in the same space that act as message ferries,
i.e. when they talk to a source node, they can collect packets from it, and when they
talk to a sink node, they can transmit packets to it. These robotic message ferries
are special helper nodes whose mobility can be controlled to assist communication
and enhance the connectivity of a wireless network. For simplicity, we assume
that the static nodes do not communicate directly with each other, but rather
only through the mobile robots. Also, following most previous works ([36, 98])
and considering the fact that current hardware cannot support one node
simultaneously talking to multiple other nodes, we assume that a static source or sink
node can talk to at most one robot at any time, which indicates that at most 2K robots
are needed. Thus, in the following, we assume $N \le 2K$.
Footnote 1: If the data arrival process at source i is stochastic, $\lambda_i$ is its expectation.
Further, we assume the second moment of the stochastic process is finite.

Table 4.1: List of main notations in the problem formulation
Parameter                        Definition
K                                number of source and destination pairs
N                                number of mobile robots
src(i)                           source of the i-th flow
sink(i)                          sink of the i-th flow
$x_{\mathrm{src}(i)}$            location of the source of the i-th flow
$x_{\mathrm{sink}(i)}$           location of the sink of the i-th flow
$x_j(t)$                         location of robot j at time t
$R_{\mathrm{src}(i),j}(t)$       transmission rate between the source for flow i and robot j at time t
$R_{j,\mathrm{sink}(i)}(t)$      transmission rate between the sink for flow i and robot j at time t
$Q_{\mathrm{src}(i)}(t)$         queue size at the source for flow i at time t
$Q^i_j(t)$                       queue size at robot j for flow i at time t
$\lambda_i$                      packet arrival rate at the source of flow i
T                                epoch length
v                                robot's velocity
A                                allocation matrix
Time is divided into discrete time steps of unit duration, and every T time steps
there is a new epoch. At the start of each epoch, a centralized scheduler can collect
useful information from source and sink nodes as well as mobile robots, and use
this information to allocate each robot to either a source or a sink. The matching
is represented by an allocation matrix A such that A(i, j) is 0 if robot j is not
allocated to either the source or the sink for flow i, 1 if it is allocated to src(i), and -1
if it is allocated to sink(i). When a robot is allocated to a given source (or sink),
for the rest of that epoch it moves with a uniform velocity v along the straight
line directly to the assigned node until it reaches its position. At all time steps of
that epoch a robot will communicate continuously and exclusively with its assigned
source (or sink) to collect (or deliver, in case of the sink) any available packets at a
rate depending on its current distance to that node. Moreover, orthogonal channels
are assigned to communication pairs to avoid interference.
The locations of the source and sink for flow i are denoted by $x_{\mathrm{src}(i)}$ and
$x_{\mathrm{sink}(i)}$, respectively, and the location of robot j at time t is denoted as $x_j(t)$.
All locations are in the space $\mathbb{R}^n$, where $n \in \{2, 3\}$. Let the distance between the
source for flow i and a robot j be denoted as $d(x_{\mathrm{src}(i)}, x_j(t))$ (similarly for the
sink), which is a metric in $\mathbb{R}^n$ (for instance, Euclidean distance). So if robot j is
moving towards the source for flow i (similarly for the sink), its position $x_j(t)$ is
updated so that it moves along the vector between its previous position and the source
location to be at the following distance:
\[ d(x_{\mathrm{src}(i)}, x_j(t+1)) = \max\{ d(x_{\mathrm{src}(i)}, x_j(t)) - v,\ 0 \} \tag{4.1} \]
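As a rough illustration of the position update in (4.1), the following Python sketch moves a robot one time step toward its assigned node. The function name `move_toward` and the tuple representation of positions are assumptions made for this example, and Euclidean distance is assumed as the metric.

```python
import math

def move_toward(pos, target, v):
    """Move one time step from pos toward target at speed v (Euclidean metric assumed).

    The new distance to the target is max(d - v, 0), as in (4.1).
    """
    d = math.dist(pos, target)
    if d <= v:
        # The robot reaches the node within this step and stays there.
        return tuple(target)
    scale = v / d
    # Advance along the straight line between the current position and the target.
    return tuple(p + scale * (t - p) for p, t in zip(pos, target))
```

For instance, a robot at the origin heading to (3, 4) with v = 1 ends the step at distance 4 from the node.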
We assume that the rate at which the source for flow i can transmit to a robot
j, denoted by $R_{\mathrm{src}(i),j}(t)$, is always strictly positive and decreases monotonically
with the distance between them, and similarly for the rate at which a robot j can
transmit to the sink for flow i, denoted by $R_{j,\mathrm{sink}(i)}(t)$. We assume that when the
robot is at the location of a particular source or sink (i.e., the distance between them
is 0), the corresponding throughput between the mobile robot and that source or
sink is maximized as $R_{\max}$.
Figure 4.1: A network containing 2 pairs of source and sink nodes and 4 robots
In the network, static nodes (sources and sinks) and mobile robots have buffers
with infinite size, and any undelivered packet can be stored in the corresponding
buffer (footnote 2). The queue at the source for flow i is denoted as $Q_{\mathrm{src}(i)}$. It is
assumed that there is no queue at the sinks as they directly consume all packets intended
for them. Each robot j maintains a separate queue for each flow i, labelled $Q^i_j$.
Figure 4.1 shows an illustration of this system with K = 2 flows and N = 4 robots.
Footnote 2: In theory, no limit on the number of messages a robot can carry is the ideal case, and
this assumption enables us to have a clear mathematical treatment of the capacity and delay
performance study in later sections. In practice, the maximum buffer occupancy will show a
concentration around some nominal value which depends upon how close the arrival rate gets to
the boundary of the capacity region; this can be used to determine how to size the buffers to be
a finite value while ensuring a negligible packet drop probability.
Therefore, if a robot j is communicating with src(i) at time t, the update
equations for the corresponding queue of the robot and the source queue will be
\[
\begin{aligned}
n_p(t) &= \min\{ R_{\mathrm{src}(i),j}(t),\ Q_{\mathrm{src}(i)}(t) \} \\
Q^i_j(t+1) &= Q^i_j(t) + n_p(t) \\
Q_{\mathrm{src}(i)}(t+1) &= Q_{\mathrm{src}(i)}(t) - n_p(t) + \lambda_i
\end{aligned}
\tag{4.2}
\]
Similarly, if the robot j is communicating with sink(i) at time t, the queue
update equation for the robot's corresponding queue will be:
\[
\begin{aligned}
n_q(t) &= \min\{ R_{j,\mathrm{sink}(i)}(t),\ Q^i_j(t) \} \\
Q^i_j(t+1) &= Q^i_j(t) - n_q(t)
\end{aligned}
\tag{4.3}
\]
The system model formulated above is used in all following sections. The
goal is to study and design scheduling and allocation algorithms that determine how
the centralized scheduler allocates mobile robots to improve the communication
performance of the robotic message ferrying network.
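The queue dynamics in (4.2) and (4.3) can be sketched in a few lines of Python. The function names `source_step` and `sink_step` are hypothetical, and a deterministic arrival of `arrival` packets per time step is assumed for simplicity.

```python
def source_step(q_src, q_robot, rate, arrival):
    """One time step of (4.2): a robot collects packets from its assigned source.

    Returns the updated (source queue, robot queue) for the flow being served.
    """
    n_p = min(rate, q_src)  # packets actually transferred in this step
    return q_src - n_p + arrival, q_robot + n_p

def sink_step(q_robot, rate):
    """One time step of (4.3): a robot delivers packets to its assigned sink."""
    n_q = min(rate, q_robot)  # packets actually delivered in this step
    return q_robot - n_q
```

Note that the transfer is capped by the queue content, so neither queue can become negative.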
4.2 Capacity Analysis
In this section, we aim at finding the capacity region of the robotic
message ferrying network under the following assumptions: a) the message arrival
rates at sources are known; b) the epoch length T and the robots' velocity v can be
set arbitrarily large. In other words, we want to find the largest rate region such
that any arrival rate vector (footnote 3) within this region can be stably served (i.e., the
average size of each queue can be maintained to be bounded) under ideal conditions.
This capacity region serves as a performance upper bound that will guide us to
design robotic scheduling algorithms and evaluate corresponding performances under
practical conditions in later sections.
Definition 1. (Capacity Region) The capacity region is the set of all arrival rate
vectors that are stably supportable by the network, considering all possible scheduling
policies.
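We define an open region $\Lambda$ of arrival rates as follows:
\[
\Lambda = \left\{ \lambda \;\middle|\; 0 \le \lambda_i < R_{\max},\ \forall i;\ \sum_{i=1}^{K} \lambda_i < \frac{R_{\max} N}{2} \right\}
\tag{4.4}
\]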
We shall show that this arrival rate region can be served by a convex
combination of basis configurations in which robots are allocated to serve distinct flows.
Let $\tilde{\Lambda}$ be a finite set of vectors defined as follows:
\[
\tilde{\Lambda} = \left\{ \lambda \;\middle|\; \lambda_i = \frac{a_i R_{\max}}{2},\ \forall i;\ a_i \in \{0, 1, 2\};\ \sum_{i=1}^{K} a_i \le N \right\}
\tag{4.5}
\]
Each element of this set $\tilde{\Lambda}$ is an arrival rate vector of one basis configuration,
and the corresponding integer vector a corresponds to a basis allocation of robots
to distinct sources and sinks that can serve each flow at rate $\lambda_i$. Specifically, the
i-th item $a_i$ refers to the number of robots allocated to serve flow i. Each basis
allocation corresponding to an element of $\tilde{\Lambda}$ can actually be expressed as two
distinct but symmetric allocations of robots to sources/sinks over two successive
epochs. For the i-th flow, if $a_i = 0$, there is no robot allocated to either the source
or the sink in either of these two epochs, yielding a service rate of $\lambda_i = 0$; if $a_i = 1$, a
particular robot is assigned to be at the source in the first epoch and at the sink
in the second epoch, yielding a service rate of $\lambda_i = R_{\max}/2$; if $a_i = 2$, two robots are
assigned (call them $R_1$ and $R_2$) such that $R_1$ is at the source in the first epoch and
at the sink in the second epoch while $R_2$ is at the sink in the first epoch and at the
source in the second epoch, yielding a service rate of $\lambda_i = R_{\max}$. The constraints on
$a_i$ ensure that the total number of robots allocated does not exceed the available
number N.
Footnote 3: An arrival rate vector is a vector of arrival rates of flows.
Let us refer to the convex hull of $\tilde{\Lambda}$ as $\mathcal{H}(\tilde{\Lambda})$ or, for readability, simply
$\mathcal{H}$. Then we have the following lemma:
Lemma 1. $\Lambda \subset \mathcal{H}$
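Proof. First, note that the convex hull of $\tilde{\Lambda}$ can be written as follows:
\[
\mathcal{H} = \left\{ \lambda \;\middle|\; \lambda_i = \frac{a_i R_{\max}}{2},\ a_i \in [0, 2]\ \forall i;\ \sum_{i=1}^{K} a_i \le N \right\}
\tag{4.6}
\]
In other words, the convex hull of the set $\tilde{\Lambda}$ is obtained by allowing $a_i$ to vary
continuously. Now using the relationship $a_i = 2\lambda_i / R_{\max}$, we can re-express $\mathcal{H}$ as follows:
\[
\begin{aligned}
\mathcal{H} &= \left\{ \lambda \;\middle|\; \frac{2\lambda_i}{R_{\max}} \in [0, 2]\ \forall i;\ \sum_{i=1}^{K} \frac{2\lambda_i}{R_{\max}} \le N \right\} \\
&= \left\{ \lambda \;\middle|\; 0 \le \lambda_i \le R_{\max}\ \forall i;\ \sum_{i=1}^{K} \lambda_i \le \frac{R_{\max} N}{2} \right\}
\end{aligned}
\]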
The set $\mathcal{H}$ describes all possible robot service rates that can be obtained by
a convex combination of these basis allocations. Consider a rate vector $\lambda \in \mathcal{H}$.
Since it lies in the convex hull of the set $\tilde{\Lambda}$, it can be expressed through a vector $\alpha$ of
convex coefficients, each of whose elements corresponds to a basis allocation of
robots. For each item $\alpha_l$ in $\alpha$, we can identify $n_l$ such that
$n_l / \sum_l n_l = \alpha_l$ (footnote 4). Then we
schedule robots by allocating $n_l$ epochs for each of the two distinct
but symmetric allocations of the l-th basis allocation. After a total of $\sum_l 2 n_l$
epochs, the whole schedule can be repeated. This schedule can provide the desired
service rate vector $\lambda$.
Thus far the schedules have been derived under the assumption of instantaneous
robot movements. Now we consider the effect of transit time. It is possible to choose
T or v to be sufficiently large to bound the fraction of time spent in transit by $\epsilon$,
i.e. $\frac{d_{\max}}{vT} < \epsilon$, where $d_{\max}$ is the maximum distance between static nodes. Thus even
while taking into account time wasted in transit, we can scale either the time period
of the epochs T or the velocity v so as to provide a service rate vector $\lambda'$ that is
arbitrarily close to any ideal service rate $\lambda$ in the sense that $\lambda_i - \lambda'_i < \epsilon,\ \forall i$.
Footnote 4: Here, for ease of exposition, we assume that $\alpha_l$ ($\forall l$) is rational; otherwise it can be
approximated by an arbitrarily close rational number, which will not affect the overall result.
We now state one of our main results:
Theorem 1. $\Lambda$ is the achievable capacity region of the robotic message ferrying
problem.
Proof. By construction, $\mathcal{H}$ represents the boundary of all feasible robot service
rates, and as we have discussed, time spent in transit can be accounted for by
increasing T or v so that any arrival rate that is in the interior of $\mathcal{H}$ can be served.
According to Lemma 1, $\Lambda \subset \mathcal{H}$. Thus, any arrival rate in $\Lambda$ can be stably served.
Furthermore, $\mathcal{H}$ represents the closure of the open set $\Lambda$. Thus any arrival rate
vector that is a bounded distance outside of $\Lambda$ cannot be served stably (as it would
also be outside of $\mathcal{H}$).
Together, these imply that $\Lambda$ is the achievable capacity region of the network.
Figure 4.2 shows an example of the capacity region when the robotic message
ferrying network has two source-sink flows and three mobile robots, i.e., K = 2 and
N = 3. The labels such as (x, y) are given to the basis allocations on the Pareto
boundary to denote that they can be achieved by allocating an integer number of
robots x to flow 1 and y to flow 2. Note in particular that the point $(R_{\max}, R_{\max})$ is
outside the region in this case because the only way to serve that rate is to allocate
two robots full time to each of the two flows, and we have only 3 robots. The
vertices on the boundary of the region, which represent basis allocations, are all in
the set $\tilde{\Lambda}$; and the convex hull $\mathcal{H}$ completely describes the region.
Figure 4.2: Capacity region for a problem with 3 robots and 2 flows
4.3 Coarse-Grained Backpressure Control
From the previous discussion, we know that if the arrival rate is known, and within
the ideal capacity region of the system, a service schedule for the robots can be
designed in such a way that the arrival rate is served in a stable manner. The
analysis thus far assumes that either the velocity of the robot or the epoch duration
can be chosen to be arbitrarily large, which may not be true in practice. In the
following, motivated by practical considerations, we consider the case when T and
v are finite and fixed, and study the corresponding capacity region.
4.3.1 Capacity Region under finite velocity and epoch duration
Assume the epoch length T and the velocity v are finite and fixed. In particular,
the restriction of T to be finite is useful for two reasons: a) it fixes the overhead of
scheduling and b) it can be used to enforce an upper bound on delay (time taken
for a packet generated at the source to reach the sink). As may be expected, these
constraints reduce the capacity region.
The fraction of time spent in transit is bounded by $\frac{d_{\max}}{vT}$. We assume that
$\frac{d_{\max}}{vT} < 1$, which implies that a robot can always reach its assigned node (source or
sink) within an epoch. Then the average transmission rate $R_{\mathrm{avg}}$ that a robot can
achieve during an epoch in the worst case can be derived as follows. Consider the worst
case where at the beginning of an epoch a robot is allocated to collect data from a
source that is at the longest distance of $d_{\max}$ away (footnote 5). The data transmission rate of
a robot at time t is $R(d_{\max} - vt)$, and it will take the robot $\frac{d_{\max}}{v}$ time to reach the
assigned node. When the robot reaches the node, it stays there and maintains the
maximum transmission rate $R_{\max}$ for the rest of the epoch.
Therefore, the total amount of data that can be collected by the robot during an
epoch consists of two parts: the first part is the maximum amount of data that can
be collected during the robot's movement, which equals $\int_0^{d_{\max}/v} R(d_{\max} - vt)\,dt$, and the
second part is the maximum amount of data that can be collected when the robot
stays at its assigned node, which is $R_{\max}\left(T - \frac{d_{\max}}{v}\right)$. Thus, the average transmission
rate in an epoch is:
\[
R_{\mathrm{avg}} = \frac{1}{T} \left[ \int_0^{d_{\max}/v} R(d_{\max} - vt)\,dt + R_{\max}\left(T - \frac{d_{\max}}{v}\right) \right]
\tag{4.7}
\]
Footnote 5: Though $R_{\mathrm{avg}}$ is analyzed in the context of a robot collecting data from a source, the
same result can be derived if we focus on an epoch where a robot delivers data to a sink.
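A minimal numerical sketch of (4.7) is given below. The function name `average_rate`, the midpoint-rule integration and the `steps` parameter are choices made for this illustration; `rate` stands for any distance-to-rate function R(d) supplied by the caller, with `rate(0.0)` playing the role of $R_{\max}$.

```python
def average_rate(rate, d_max, v, T, steps=1000):
    """Approximate the worst-case average rate R_avg of (4.7).

    rate  : function mapping distance d to transmission rate R(d)
    d_max : maximum distance between static nodes
    v, T  : robot velocity and epoch length, with d_max / (v * T) < 1
    """
    if d_max >= v * T:
        raise ValueError("requires d_max / (v * T) < 1")
    travel = d_max / v           # time spent moving toward the node
    dt = travel / steps
    # Midpoint rule for the integral of R(d_max - v t) over [0, d_max / v].
    moving = sum(rate(d_max - v * (k + 0.5) * dt) for k in range(steps)) * dt
    parked = rate(0.0) * (T - travel)   # data collected while staying at the node
    return (moving + parked) / T
```

With the assumed rate model R(d) = 1 / (1 + d), d_max = 1, v = 1 and T = 2, the exact value of (4.7) is (ln 2 + 1) / 2, which the sketch reproduces to within the integration error.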
This directly provides an inner bound on the capacity region for finite v and
T expressed in terms of $R_{\mathrm{avg}}$, which can still be achieved by scheduling robots
in the same way as in Section 4.2, i.e., by a convex combination of configurations
in which robots are allocated to serve distinct flows:
\[
\Lambda_{IB}(v, T) = \left\{ \lambda \;\middle|\; 0 \le \lambda_i < R_{\mathrm{avg}},\ \forall i;\ \sum_{i=1}^{K} \lambda_i < \frac{R_{\mathrm{avg}} N}{2} \right\}
\tag{4.8}
\]
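Checking whether a given arrival rate vector lies in the inner bound (4.8) is a direct translation of its two conditions. The helper name `in_inner_bound` below is hypothetical.

```python
def in_inner_bound(rates, r_avg, n_robots):
    """Return True if the arrival rate vector lies in the inner bound (4.8)."""
    per_flow_ok = all(0 <= lam < r_avg for lam in rates)  # 0 <= lambda_i < R_avg
    total_ok = sum(rates) < r_avg * n_robots / 2          # sum_i lambda_i < R_avg * N / 2
    return per_flow_ok and total_ok
```

For example, with $R_{\mathrm{avg}} = 1$ and N = 3, the vector (0.5, 0.5) is inside the region, while (0.9, 0.9) violates the sum constraint.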
4.3.2 Coarse-grained Backpressure-based Message Ferrying
In the previous discussion, any packet arrival rate in the capacity region can be
stably served as long as it is known a priori. We now consider a more
practical case when the arrival rate is not known to the centralized scheduler, while
the epoch duration T and the robots' velocity v are still kept finite and fixed. Is
it still possible to schedule the movements and communications of robots in such a
way that all queues remain stable? And what is the corresponding rate region that
can be achieved?
The answer to the first question turns out to be yes, using the notion of backpressure
scheduling first proposed by Tassiulas and Ephremides [73]. We propose an
algorithm for scheduling message ferrying robots based only on queue information for
finite v and T parameters, which we refer to as coarse-grained backpressure-based
message ferrying (CBMF), presented in Algorithm 2.
Algorithm 2 CBMF Algorithm
1: for n = 1, 2, ... do
2:   At the beginning of epoch n
3:   for i = 1, ..., K do
4:     for j = 1, ..., N do
5:       $w_{\mathrm{src}(i),j} = Q_{\mathrm{src}(i)}(nT) - Q^i_j(nT)$
6:       $w_{\mathrm{sink}(i),j} = Q^i_j(nT)$
7:     end for
8:   end for
9:   Find an allocation matrix A that solves optimization problem (P1)
10: end for
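The optimization problem (P1) is formulated as follows:
\[
\begin{aligned}
\underset{A}{\text{maximize}} \quad & \sum_{i,j} w(A(i, j)) \\
\text{subject to} \quad & \sum_{i} |A(i, j)| = 1,\ \forall j & (a) \\
& \sum_{j} I\{A(i, j) = 1\} \le 1,\ \forall i & (b) \\
& \sum_{j} I\{A(i, j) = -1\} \le 1,\ \forall i & (c)
\end{aligned}
\tag{P1}
\]
where $w : X = \{-1, 0, 1\} \to \mathbb{R}$ is a function defined as follows:
\[
w(x) =
\begin{cases}
w_{\mathrm{src}(i),j} & \text{if } x = 1 \\
w_{\mathrm{sink}(i),j} & \text{if } x = -1 \\
0 & \text{if } x = 0
\end{cases}
\tag{4.9}
\]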
The constraint (a) in (P1) ensures that each robot is allocated to exactly one
source or sink. The constraint (b) in (P1) ($I\{\cdot\}$ represents the indicator function)
ensures that no source is allocated more than one robot, while the constraint (c) in
(P1) ensures that no sink is allocated more than one robot.
The corresponding capacity and delay performance of the CBMF algorithm is
shown in Theorem 2.
Theorem 2. For any arrival rate that is within $\Lambda_{IB}(v, T)$, the CBMF algorithm
ensures that all source and robot queues are stable (always bounded by a finite value).
The proof of this theorem follows from bounding the drift of a quadratic
Lyapunov function and deriving a control policy that minimizes this bound, following
closely the approach pioneered by Tassiulas and Ephremides [73]. The technical
complication in this setting compared to traditional backpressure as applied to
static wireless networks is that the average rate obtained over the course of an
epoch for each matching can be slightly different depending on the starting
position of the robot with respect to the node it is being matched to. CBMF treats the
rate for each matching to be the same as $R_{\mathrm{avg}}$ in its weight calculation, and as a
result it is not provably stable for all arrival rate vectors in the ideal capacity region
(which as discussed before is in fact achievable using scheduling with prior
knowledge of the arrival rates); this theoretical guarantee can only be provided for all
arrival rates up to the inner bound. However, as v and T increase, the inner bound
approaches the ideal capacity region.
Proof. The main idea of the proof is to show that the time average total queue
in the system can be upper bounded.
From an average point of view, we can alternatively consider this mobile
network as a static network where robots are static and have a constant transmission
rate of $R_{\mathrm{avg}}$. As $R_{\mathrm{avg}}$ is the worst-case average transmission rate, some robots may
have higher average transmission rates if their travelling distances are smaller than
$d_{\max}$ in some epochs. In that case, we can assume those robots pause
communicating with their assigned nodes for some time during the epochs, and thus their
average transmission rates can still be the same as $R_{\mathrm{avg}}$. This assumption also
indicates why the capacity region $\Lambda_{IB}(v, T)$ that we are going to prove in the following is
only an inner bound, since some robots are underutilized under the assumption.
The CBMF algorithm can actually achieve a better capacity region in practice.
Let $b_{ij}(t) \in \{0, 1\}$ represent whether robot j is allocated to source src(i): $b_{ij}(t) = 1$
indicates that robot j is allocated to src(i) and $b_{ij}(t) = 0$ indicates that robot j is not
allocated to src(i). Similarly, $c_{ij}(t) \in \{0, 1\}$ represents whether robot j is allocated to
sink sink(i). Since at any time t a robot can be allocated to exactly one source or sink,
we have $\sum_{i=1}^{K} (b_{ij}(t) + c_{ij}(t)) = 1$. The transmission rates from src(i) to robot j
and from robot j to sink(i) are $R_{\mathrm{src}(i),j}(t) = b_{ij}(t) R_{\mathrm{avg}}$ and
$R_{j,\mathrm{sink}(i)}(t) = c_{ij}(t) R_{\mathrm{avg}}$, respectively.
At the beginning of epoch n + 1 (before making a new allocation), the queue
backlog at source i, $\forall i \in \{1, \ldots, K\}$, is updated as follows:
\[
Q_{\mathrm{src}(i)}((n+1)T) = \max\left\{ Q_{\mathrm{src}(i)}(nT) + \lambda_i (T - 1) - \sum_{j=1}^{N} b_{ij}(nT) R_{\mathrm{avg}} T,\ 0 \right\} + \lambda_i
\tag{4.10}
\]
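The queue backlog at robot j for flow i at the beginning of epoch n + 1,
$\forall i \in \{1, \ldots, K\}$ and $j \in \{1, \ldots, N\}$, is given by
\[
Q^i_j((n+1)T) = \max\left\{ Q^i_j(nT) - c_{ij}(nT) R_{\mathrm{avg}} T,\ 0 \right\}
+ \min\left\{ Q_{\mathrm{src}(i)}(nT) + \lambda_i (T - 1),\ b_{ij}(nT) R_{\mathrm{avg}} T \right\}
\tag{4.11}
\]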
Define the queue backlog vector of this system at the beginning of epoch n as
\[
\Theta(nT) = \left( Q_{\mathrm{src}(1)}(nT), \ldots, Q_{\mathrm{src}(K)}(nT),\ Q^1_1(nT), \ldots, Q^K_1(nT), \ldots, Q^1_N(nT), \ldots, Q^K_N(nT) \right)
\tag{4.12}
\]
and the Lyapunov function as
\[
L(\Theta(nT)) = \frac{1}{2} \left[ \sum_{i=1}^{K} Q_{\mathrm{src}(i)}(nT)^2 + \sum_{i=1}^{K} \sum_{j=1}^{N} Q^i_j(nT)^2 \right]
\tag{4.13}
\]
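Then we have,
\[
\begin{aligned}
L(\Theta((n+1)T)) - L(\Theta(nT)) \le{}
& \sum_{i=1}^{K} \frac{\left[ \sum_{j=1}^{N} b_{ij}(nT) R_{\mathrm{avg}} T - \lambda_i (T - 1) \right]^2 + \lambda_i^2}{2} \\
& + \sum_{i=1}^{K} \sum_{j=1}^{N} \frac{\left( c_{ij}(nT) R_{\mathrm{avg}} T \right)^2 + \left( b_{ij}(nT) R_{\mathrm{avg}} T \right)^2}{2} \\
& + \sum_{i=1}^{K} Q_{\mathrm{src}(i)}(nT) \left[ \lambda_i - \sum_{j=1}^{N} b_{ij}(nT) R_{\mathrm{avg}} \right] T \\
& + \sum_{i=1}^{K} \sum_{j=1}^{N} Q^i_j(nT) \left[ b_{ij}(nT) - c_{ij}(nT) \right] R_{\mathrm{avg}} T
\end{aligned}
\tag{4.14}
\]
where the inequality comes from equations (4.10) and (4.11), and
\[ (\max\{Q - b, 0\} + a)^2 \le Q^2 + a^2 + b^2 + 2Q(a - b) \tag{4.15} \]
\[ (\max\{Q_1 - c, 0\} + \min\{Q_2, b\})^2 \le (\max\{Q_1 - c, 0\} + b)^2 \tag{4.16} \]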
Define the conditional Lyapunov drift as
\[ \Delta(\Theta(nT)) = E\{ L(\Theta((n+1)T)) - L(\Theta(nT)) \mid \Theta(nT) \} \tag{4.17} \]
Based on the assumption that at any time at most one robot can be
allocated to serve a source i ($\forall i \in \{1, \ldots, K\}$), among all binary variables $b_{ij}$
($\forall j \in \{1, \ldots, N\}$) at most one variable can be 1 and all the others are 0. Thus,
$\left[ \sum_{j=1}^{N} b_{ij}(nT) R_{\mathrm{avg}} T \right]^2 \le (R_{\mathrm{avg}} T)^2$ and
$\sum_{j=1}^{N} (b_{ij}(nT) R_{\mathrm{avg}} T)^2 \le (R_{\mathrm{avg}} T)^2$. Similarly,
since at any time at most one robot can be allocated to serve a sink i ($\forall i \in \{1, \ldots, K\}$),
among all binary variables $c_{ij}$ ($\forall j \in \{1, \ldots, N\}$) at most one variable can be 1 and
all the others are 0, and we have $\sum_{j=1}^{N} (c_{ij}(nT) R_{\mathrm{avg}} T)^2 \le (R_{\mathrm{avg}} T)^2$. Since N, T
and $R_{\mathrm{avg}}$ are all finite and the first and second moments of the data arrival process are
finite, we can define a finite constant B as
\[
B = \sum_{i=1}^{K} \frac{(R_{\max} T)^2 + \lambda_i^2}{2} + \sum_{i=1}^{K} \frac{(R_{\max} T)^2 + (R_{\max} T)^2}{2}
\tag{4.18}
\]
which provides an upper bound for the first two terms on the right hand side (RHS)
of inequality (4.14).
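Thus we have,
\[
\begin{aligned}
\Delta(\Theta(nT)) \le{} & B + \sum_{i=1}^{K} Q_{\mathrm{src}(i)}(nT) \lambda_i T \\
& - \sum_{i=1}^{K} \sum_{j=1}^{N} E\left\{ \left[ \left( Q_{\mathrm{src}(i)}(nT) - Q^i_j(nT) \right) b_{ij}(nT) + Q^i_j(nT) c_{ij}(nT) \right] R_{\mathrm{avg}} T \;\middle|\; \Theta(nT) \right\}
\end{aligned}
\tag{4.19}
\]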
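Applying the CBMF algorithm to allocate robots, the last term on the RHS
of (4.19) can be maximized, thus the conditional drift can be minimized. Let
$\tilde{b}_{ij}(t)$ and $\tilde{c}_{ij}(t)$ represent any other robot allocation; then equation (4.19) can be
re-written as
\[
\begin{aligned}
\Delta(\Theta(nT)) \le{} & B - \sum_{i=1}^{K} Q_{\mathrm{src}(i)}(nT) \left( E\left\{ \sum_{j=1}^{N} \tilde{b}_{ij}(nT) R_{\mathrm{avg}} \;\middle|\; \Theta(nT) \right\} - \lambda_i \right) T \\
& - \sum_{i=1}^{K} \sum_{j=1}^{N} Q^i_j(nT)\, E\left\{ \left[ \tilde{c}_{ij}(nT) - \tilde{b}_{ij}(nT) \right] R_{\mathrm{avg}} T \;\middle|\; \Theta(nT) \right\}
\end{aligned}
\tag{4.20}
\]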
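In order to upper bound (4.20), let us first consider the following problem:
given an arrival rate vector $\lambda = (\lambda_1, \ldots, \lambda_K) \in \Lambda_{IB}(v, T)$, we want to design an S-only
(depends only on the channel states) algorithm such that
\[
\begin{aligned}
\text{find} \quad & \epsilon > 0 \\
\text{subject to} \quad & \lambda_i + \epsilon \le E\left\{ \sum_{j=1}^{N} b_{ij}(t) R_{\mathrm{avg}} \right\},\ \forall i & (a) \\
& E\left\{ b_{ij}(t) R_{\mathrm{avg}} \right\} + \epsilon \le E\left\{ c_{ij}(t) R_{\mathrm{avg}} \right\},\ \forall i, j & (b)
\end{aligned}
\tag{P2}
\]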
Similar to the robot allocation policy of Section 4.2, where prior knowledge of the
arrival rate vector is given, define the set of all possible robot service rates as
$\mathcal{H}' = \left\{ \lambda \;\middle|\; \lambda_i = \frac{a_i R_{\mathrm{avg}}}{2},\ a_i \in [0, 2]\ \forall i;\ \sum_{i=1}^{K} a_i \le N \right\}$.
Then an S-only algorithm to achieve any given arrival rate vector strictly interior to
$\mathcal{H}'$ can be designed as follows.
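Since $\lambda = (\lambda_1, \ldots, \lambda_K) \in \mathcal{H}' \setminus \partial\mathcal{H}'$, we can find a vector
$\epsilon = (\epsilon_1, \ldots, \epsilon_K)$ such that
$\lambda' = (\lambda_1 + \epsilon_1, \ldots, \lambda_K + \epsilon_K) \in \partial\mathcal{H}'$. Let
$\epsilon_{\max}(\lambda) = \min\{\epsilon_1, \ldots, \epsilon_K\}$, and since $\lambda$ is strictly
interior in $\mathcal{H}'$, we have $\epsilon_{\max}(\lambda) > 0$.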
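Let $\lambda'' = (\lambda_1 + \epsilon_{\max}(\lambda), \ldots, \lambda_K + \epsilon_{\max}(\lambda)) \in \mathcal{H}'$, and it can be represented as
a convex combination of basis allocations in $\mathcal{H}'$. To be specific, in a network
containing K flows and N robots, there are M (depending on K and N) basis
allocations in total. Let $(\mu_{l1}, \ldots, \mu_{lK})$, $\forall l \in \{1, \ldots, M\}$, denote the capacity the l-th
allocation can provide. Let $\alpha = (\alpha_1, \ldots, \alpha_M)$ be the allocation vector of the convex
coefficients such that $\forall l \in \{1, \ldots, M\}$, $\alpha_l \ge 0$ and $\sum_{l=1}^{M} \alpha_l = 1$. Then we have
\[
\alpha_1 (\mu_{11}, \ldots, \mu_{1K}) + \ldots + \alpha_M (\mu_{M1}, \ldots, \mu_{MK})
= (\lambda_1 + \epsilon_{\max}(\lambda), \ldots, \lambda_K + \epsilon_{\max}(\lambda))
\tag{4.21}
\]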
After finding integers $n_l$ such that $n_l / \sum_{l=1}^{M} n_l = \alpha_l$, $\forall l \in \{1, \ldots, M\}$, the arrival
rate vector $\lambda''$ can be served by first allocating $n_l$ epochs for the l-th basis allocation, and
allocating the next $n_l$ epochs for the same l-th basis allocation but exchanging the
robots' locations, $\forall l \in \{1, \ldots, M\}$. After every $2 \sum_{l=1}^{M} n_l$ epochs, the whole
process is repeated.
The above algorithm is guaranteed to find an $\epsilon_{\max}(\lambda) > 0$ in (P2) with constraint
(a) satisfied. But constraint (b) in (P2) cannot be met, since the above algorithm
makes $E\{b_{ij}(t) R_{\mathrm{avg}}\} = E\{c_{ij}(t) R_{\mathrm{avg}}\},\ \forall i, j$. To satisfy constraint (b) in (P2), we can
change the above algorithm by adding a few more epochs to each period of $2 \sum_l n_l$ epochs,
during which we only have robots at sinks to help deliver data. In this way
we can find an $\epsilon'(\lambda) > 0$ that solves (P2), and this allows us to express equation (4.20)
as
\[
\Delta(\Theta(nT)) \le B - \epsilon'(\lambda) T \left[ \sum_{i=1}^{K} Q_{\mathrm{src}(i)}(nT) + \sum_{i=1}^{K} \sum_{j=1}^{N} Q^i_j(nT) \right]
\tag{4.22}
\]
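Taking iterated expectations, summing the telescoping series, and rearranging
terms yields
\[
\sum_{k=0}^{n-1} \left[ \sum_{i=1}^{K} E\left\{ Q_{\mathrm{src}(i)}(kT) \right\} + \sum_{i=1}^{K} \sum_{j=1}^{N} E\left\{ Q^i_j(kT) \right\} \right]
\le \frac{nB}{\epsilon'(\lambda) T} + \frac{E\{ L(\Theta(0)) \}}{\epsilon'(\lambda) T}
\tag{4.23}
\]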
Consider a time slot t in some epoch interval $[kT, (k+1)T]$. For every flow i with
arrival rate $\lambda_i$, the total queue length of its packets satisfies
\[
Q_{\mathrm{src}(i)}(t) + \sum_{j=1}^{N} Q^i_j(t) \le Q_{\mathrm{src}(i)}(kT) + \sum_{j=1}^{N} Q^i_j(kT) + \lambda_i (t - kT)
\tag{4.24}
\]
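Thus, the total accumulation of queues of all flows during the time interval $[0, nT - 1]$
satisfies
\[
\sum_{\tau=0}^{nT-1} \left[ \sum_{i=1}^{K} E\left\{ Q_{\mathrm{src}(i)}(\tau) \right\} + \sum_{i=1}^{K} \sum_{j=1}^{N} E\left\{ Q^i_j(\tau) \right\} \right]
\le \frac{nB(T - 1)}{\epsilon'(\lambda) T} + \frac{E\{ L(\Theta(0)) (T - 1) \}}{\epsilon'(\lambda) T} + \sum_{i=1}^{K} \frac{T (T - 1) n \lambda_i}{2}
\tag{4.25}
\]
where the inequality comes from Eq. (4.23).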
Therefore the time average total queue in the system satisfies
\[
\bar{Q} = \lim_{n \to \infty} \frac{1}{nT} \sum_{\tau=0}^{nT-1} \left[ \sum_{i=1}^{K} E\left\{ Q_{\mathrm{src}(i)}(\tau) \right\} + \sum_{i=1}^{K} \sum_{j=1}^{N} E\left\{ Q^i_j(\tau) \right\} \right]
\le \frac{B (T - 1)}{\epsilon'(\lambda) T^2} + \sum_{i=1}^{K} \frac{(T - 1) \lambda_i}{2}
\tag{4.26}
\]
which indicates that the time average total queue is bounded and the system is strongly
stable, as B, T, and $\epsilon'(\lambda)$ are positive constants and K and N are fixed.
Further, according to Eq. (4.18), $B = O(T^2)$. Then for any given $\lambda \in \Lambda_{IB}(v, T)$,
the time average total queue satisfies $\bar{Q} = O(T)$ so long as the system is stable. As
per Little's Theorem [29], the end-to-end delay is obtained by dividing the average
total queue size by the total arrival rate, which gives
$\bar{D} = \bar{Q} / \sum_{i=1}^{K} \lambda_i = O(T)$.
Remark 1. In the proof, we also show that for a fixed arrival rate, the end-to-end
delay scales as O(T) so long as T is large enough to ensure the system remains
stable.
We would like to point out that the optimization problem (P1) in Algorithm 2 is
actually a Max-Weighted Bipartite Matching problem, where the set of robots and
the set of static nodes (i.e., sources and sinks) form a bipartite graph with edges
connecting each robot and each static node, with edge weights $w_{\mathrm{src}(i),j}$ and
$w_{\mathrm{sink}(i),j}$ accordingly. The optimal allocation can be computed in polynomial time by using
well-known algorithms, such as the Hungarian algorithm [44]. This makes the
CBMF algorithm work efficiently in practice.
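To illustrate the matching step, the sketch below solves (P1) for small instances by exhaustive enumeration rather than by the Hungarian algorithm; it is meant only to show the weights of Algorithm 2 and the constraints (a)-(c) in code. The function name `cbmf_allocate` and the list-based queue representation are assumptions for this example, and the enumeration is exponential in N, so a polynomial-time matching solver would be substituted in practice.

```python
from itertools import permutations

def cbmf_allocate(q_src, q_robot):
    """Brute-force solution of (P1) for small K and N.

    q_src[i]      : source queue of flow i at the start of the epoch
    q_robot[j][i] : queue of robot j for flow i at the start of the epoch
    Returns (A, total_weight) with A[i][j] in {-1, 0, 1}.
    """
    K, N = len(q_src), len(q_robot)
    if N > 2 * K:
        raise ValueError("the model assumes N <= 2K")

    def weight(j, node):
        # Nodes 0..K-1 are sources src(i); nodes K..2K-1 are sinks sink(i).
        if node < K:
            return q_src[node] - q_robot[j][node]   # w_src(i),j
        return q_robot[j][node - K]                 # w_sink(i),j

    best_nodes, best_w = None, float("-inf")
    # Each robot gets exactly one node and no node gets more than one robot.
    for nodes in permutations(range(2 * K), N):
        w = sum(weight(j, node) for j, node in enumerate(nodes))
        if w > best_w:
            best_nodes, best_w = nodes, w

    A = [[0] * N for _ in range(K)]
    for j, node in enumerate(best_nodes):
        if node < K:
            A[node][j] = 1
        else:
            A[node - K][j] = -1
    return A, best_w
```

In a two-flow, two-robot example with source queues (10, 3) and robot 1 already carrying 5 packets of flow 1, the sketch sends robot 0 to the first source and robot 1 to the first sink.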
4.4 Epoch Adaptive CBMF
According to Remark 1, we know that when the number of flows and the number
of robots are fixed, the end-to-end delay increases linearly as the epoch length
T increases when the network is stable under the CBMF algorithm. Therefore,
reducing the epoch length can help reduce network delay as long as the network is
kept stable. If the arrival rates of flows in the network are given, theoretically,
we could apply Eqs. (4.7) and (4.8) to find the best epoch length T for the CBMF
algorithm to make the network stable while maintaining a small (if not the smallest)
delay. But this requires us to know all related network settings, including the
number of flows and robots, the locations of sources and sinks, the velocity of robots and
the communication model, not all of which may be available a priori in practice.
Moreover, the arrival rate of flows is probably unknown and may change over time
as well.
Instead, we propose a heuristic approach to show how one can adapt the epoch
length T in the implementation of the CBMF algorithm to make a network stable
according to the current network condition while maintaining a small delay, without
the necessity of knowing all information, as in Algorithm 3.
The general idea of Algorithm 3 is to initially set T as a very large positive
constant $T_{th}$ to make sure the network is stable for any arrival rate. The
centralized coordinator allocates robots according to Algorithm 2, and reduces the epoch
duration T iteratively by a small step size $\delta$ every observation duration L until the
network delay cannot be decreased. Then the robots are allocated according to
Algorithm 2 for the fixed finalized T and a small network delay can be maintained.

Algorithm 3 Epoch Adaptive CBMF Algorithm
1: Initially set $T = T_{th}$ and $D_{old} = D_{new} = \infty$.
2: while $D_{old} \ge D_{new}$ do
3:   $D_{old} \leftarrow D_{new}$
4:   $T \leftarrow T - \delta$    ▷ Skip this step in the initial loop
5:   Run Algorithm 2 for a duration L    ▷ L >> T
6:   Set $D_{new}$ as current network delay
7: end while
8: $T \leftarrow T + \delta$
9: Run Algorithm 2
The algorithm works well when the arrival rate remains constant. However, the
system becomes unstable when arrival rates increase. To solve this problem, one
can keep observing the network delay while applying Algorithm 3. And when the
network delay increases, which indicates the arrival rate vector is beyond the ca-
pacity region for the current epoch duration T, one can reset the epoch duration
back to the maximum value T
th
and re-run Algorithm 3.
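The loop of Algorithm 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: `measure_delay` is a hypothetical callback standing in for "run Algorithm 2 for a duration L and report the observed delay", `toy_delay` is a made-up delay model used only to exercise the loop, and the guard against a non-positive epoch length is an added safeguard that Algorithm 3 does not contain.

```python
from typing import Callable

def adapt_epoch_length(measure_delay: Callable[[float], float],
                       t_th: float, step: float) -> float:
    """Return the epoch length selected by the loop of Algorithm 3.

    measure_delay(T) is a hypothetical callback: it runs the allocation
    with epoch length T for one observation period and returns the delay.
    """
    t = t_th
    d_old = d_new = float("inf")
    first_pass = True
    while d_old >= d_new:
        d_old = d_new
        if not first_pass:
            if t - step <= 0:      # added safeguard, not part of Algorithm 3
                return t
            t -= step              # line 4, skipped in the initial loop
        first_pass = False
        d_new = measure_delay(t)   # lines 5-6
    return t + step                # line 8: undo the step that increased delay


def toy_delay(t: float) -> float:
    """Made-up model: delay linear in T while stable, very large otherwise."""
    return 2.0 * t if t >= 30 else 1000.0

chosen = adapt_epoch_length(toy_delay, t_th=70.0, step=10.0)
```

With the toy model, the loop shrinks T in steps of 10 until the delay jumps, then backs off by one step. Re-running the function after a detected delay increase corresponds to the reset to $T_{th}$ described above.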
4.5 Structural Properties and Delay Performance
of CBMF Algorithm in a Homogeneous Network
It has been shown in Theorem 2 that the CBMF algorithm can stabilize a network for
any arrival rate vector within the capacity region $\Lambda_{IB}(v, T)$. The essence
is to find an allocation of robots that solves the optimization problem (P1) at the
beginning of each epoch, which also means that the total queue differential is
maximized. However, this does not give us any clue about the structure or patterns
of the allocations. In this section, we study a homogeneous network with a
deterministic data arrival process and aim at finding structural properties of the
CBMF algorithm that can be utilized in practice to improve performance.
4.5.1 Structural Properties
We consider a homogeneous network in the sense that every flow has the same
arrival rate, and the distance between a source and its corresponding sink is the
same among all K flows. Initially, we assume all queues are empty.

In the following discussion, in the case when the number of robots assigned to
collect data from sources is no greater than the number of flows in a network, we
assume that a collecting robot can collect all data at its source, so that the source
queue becomes empty at the end of that epoch. This assumption is reasonable and
reflects the actual performance most of the time if the arrival rate vector is within
the capacity inner bound. Moreover, if data kept accumulating at the source, the
network could no longer remain stable.
As observed, when applying the CBMF algorithm to find the robots' allocation for
each epoch, there may be multiple allocations available, all of which are solutions to
the optimization problem (P1). We therefore adopt two preferences for choosing one
allocation out of multiple possible allocations: a) based on the intuition that better
delay performance can be achieved if robots deliver data to the corresponding sinks
as soon as possible, the centralized coordinator prefers the allocation that assigns
as many robots as possible to sinks to deliver data whenever a queue differential
exists between a robot and a sink; b) if no queue differential exists between any
robot and sink, the centralized coordinator prefers the allocation that assigns as
many robots as possible to sources to collect data.
Remark 2. These two preferences only apply when multiple possible allocations
solve the optimization problem (P1). Therefore, the robotic allocation from the CBMF
algorithm with these two preferences can still stabilize the network for any arrival
rate vector within the capacity inner bound $\Lambda_{IB}(v, T)$.
With the above two preferences, we are able to study the structural properties of
the CBMF algorithm in the multi-flow homogeneous network, and we find that the
robotic allocation from the CBMF algorithm has a time-sharing structure. In the
following, we present the structural property result in two different cases depending
on the number of flows and the number of robots. First, for the case when the number
of robots is no more than the number of flows, we have the following result:
Lemma 2. In a multi-flow homogeneous network, when the number of robots is
no greater than the number of flows, i.e., $N \leq K$, the CBMF algorithm with an
allocation that exactly solves the optimization problem (P1) at the start of each
epoch is equivalent to the Robot Allocation Strategy I shown in Algorithm 4.

Algorithm 4 Robot Allocation Strategy I
1: At the beginning of the initial epoch, randomly pick N sources and allocate a
   robot to each chosen source.
2: for n = 2, 3, ... do
3:     At the beginning of epoch n
4:     if n is odd then
5:         Allocate each robot to one of the N least recently served sources to
           collect data
6:     else
7:         Allocate each robot to the sink corresponding to its previously assigned
           source to deliver data
8:     end if
9: end for
Proof. What we want to show is that, under the homogeneous condition, the Robot
Allocation Strategy I indeed follows the CBMF algorithm that solves the optimization
problem (P1) at the start of each epoch.

In the initial epoch, since all queues are empty, the centralized scheduler prefers
assigning each robot to a source to collect data, and all data from each served source
can be collected by the end of this epoch. Thus, at the end of the initial epoch, each
robot holds $\lambda T$ data in its queue corresponding to its assigned source, the
queues of the N served sources are empty, and each of the $K - N$ unserved sources
has a queue size of $\lambda T$. Thus, at the beginning of the second epoch, though the
optimization problem (P1) may have multiple solutions that all give the maximum
total queue differential $N\lambda T$, based on our preferences, the scheduler chooses
the solution that assigns all robots to sinks in order to reduce delay.

In the second epoch, robots can deliver all their data to sinks and their queues
become empty at the end. Thus, the queue differential between any robot and any
sink is 0. Additionally, each source has $\lambda T$ new data arrive during this epoch.
Thus, at the beginning of the third epoch, according to the CBMF algorithm, all robots
need to be allocated to sources, and the corresponding allocation $A_3$⁶ solves the
optimization problem (P1) with maximized total queue differential
$\sum_{i,j} w(A_3(i,j))$. During the third epoch, robots can collect all data from
their assigned sources. At the beginning of the fourth epoch, if one assigns all robots
to the sinks associated with the previously served sources, the total queue
differential is $\sum_{i,j} w(A_3(i,j)) + N\lambda T$, which is the maximum total queue
differential. This is because at the beginning of the third epoch, the total queue
differential $\sum_{i,j} w(A'_3(i,j))$ provided by any other allocation $A'_3$ is no
greater than $\sum_{i,j} w(A_3(i,j))$. During the third epoch, each source has an
amount $\lambda T$ of new data arrive, and at most N sources can have the new data
that arrived in the third epoch completely collected by their serving robots.
Therefore, at the beginning of the fourth epoch, no allocation can provide a total
queue differential greater than $\sum_{i,j} w(A_3(i,j)) + N\lambda T$. Thus, in the
fourth epoch, the allocation that assigns robots to the sinks associated with the
sources they served in the third epoch is the solution to the optimization problem
(P1).

⁶ $A_i$ represents the allocation matrix in epoch i.

If we apply the above analysis to all epochs, it can be shown that robots work
as one single serving group and serve sources and sinks in alternating cycles under
the CBMF algorithm. In addition, since all flows are homogeneous, each flow is served
for an equal amount of time. This further indicates that in every odd epoch, robots
need to be allocated to the least recently served sources to collect data, and in the
following even epoch, robots need to be assigned to the corresponding sinks to
deliver data. Thus, the robotic allocation turns out to follow the Robot
Allocation Strategy I shown in Algorithm 4.
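To make the alternating pattern of Algorithm 4 concrete, here is a small sketch that generates the epoch-by-epoch schedule for $N \leq K$. The function `strategy_one` and its output format are hypothetical; the random initial choice of sources in line 1 of Algorithm 4 is replaced by a fixed initial order purely so that the output is reproducible.

```python
from collections import deque

def strategy_one(num_flows, num_robots, num_epochs):
    """Return a list of (epoch, mode, flows) following Algorithm 4.

    mode is "collect" in odd epochs and "deliver" in even epochs; flows[r]
    is the flow whose source or sink robot r visits.
    Assumes num_robots <= num_flows.
    """
    # Flows ordered from least to most recently served. The fixed initial
    # order stands in for the random initial pick in Algorithm 4.
    order = deque(range(num_flows))
    current = []
    schedule = []
    for epoch in range(1, num_epochs + 1):
        if epoch % 2 == 1:
            # Serve the N least recently served sources.
            current = [order.popleft() for _ in range(num_robots)]
            order.extend(current)  # they are now the most recently served
            schedule.append((epoch, "collect", list(current)))
        else:
            # Deliver to the sinks of the sources served in the last epoch.
            schedule.append((epoch, "deliver", list(current)))
    return schedule

plan = strategy_one(num_flows=5, num_robots=2, num_epochs=6)
```

Note that the schedule depends only on the record of which flows were served most recently, not on any queue lengths.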
For the case when the number of robots is greater than the number of flows, the
corresponding result is as follows:

Lemma 3. In a multi-flow homogeneous network, when the number of robots is
greater than the number of flows, i.e., $K < N \leq 2K$, the CBMF algorithm with
an allocation that exactly solves the optimization problem (P1) at the start of each
epoch is equivalent to the Robot Allocation Strategy II shown in Algorithm 5.
Algorithm 5 Robot Allocation Strategy II
1: Divide the N robots into two groups. Group I contains K robots and Group II
   contains $N - K$ robots.
2: At the beginning of the initial epoch, randomly allocate each robot in Group I
   to a source, and randomly choose $N - K$ sinks and allocate each robot in Group
   II to a chosen sink
3: for n = 2, 3, ... do
4:     At the beginning of epoch n
5:     if n is odd then
6:         Allocate each robot in Group I to a source to collect data
7:         Allocate each robot in Group II to the sink corresponding to its
           previously assigned source to deliver data
8:     else
9:         Allocate each robot in Group I to the sink corresponding to its
           previously assigned source to deliver data
10:        Allocate each robot in Group II to one of the $N - K$ least recently
           served sources to collect data
11:    end if
12: end for

Proof. Similar to the proof of Lemma 2, what we are going to show is that the
Robot Allocation Strategy II stated in Lemma 3 follows the CBMF algorithm that
solves the optimization problem (P1) at the start of each epoch.

In the first epoch, since all queues in the system are empty, the allocation is to
send K robots to sources to collect data and the other $N - K$ robots to sinks to
deliver data (though the robots do not have any data during this initial epoch). The
arrival rate of each source is $\lambda$. At the end of the first epoch, each of the K
collecting robots holds $\lambda T$ data in its queue associated with its assigned
source, and all other nodes and robots have queue size 0. Thus, at the start of the
second epoch, the allocation that solves the optimization problem (P1) is to send
each previously collecting robot to the corresponding sink to deliver data and the
other $N - K$ previously delivering robots to sources to collect data.

In the second epoch, each of the $N - K$ robots can collect all data from its
assigned source and each of the K delivering robots can deliver all its collected data
to its sink. Thus, at the end of the second epoch, each of the $N - K$ robots has a
queue size of $\lambda T$ associated with its assigned source, and the queue of each of
the K delivering robots becomes empty. Among the K sources, $N - K$ sources have
been served and have empty queues, and each of the $2K - N$ unserved sources has a
queue size of $\lambda T$. Thus, at the beginning of the third epoch, based on the CBMF
algorithm, the allocation $A_3$ that solves the optimization problem (P1) with sinks
preferred is to allocate the $N - K$ robots to the corresponding sinks to deliver data
and the K robots to sources to collect data. Denote the corresponding total queue
differential as $\sum_{i,j} w(A_3(i,j))$.

In the third epoch, the $N - K$ robots move to the corresponding sinks to deliver
data and the K robots move to sources to collect data. At the end, the $N - K$ robots
finish delivery and their queues become empty. The total queue differential between
robots and the sinks associated with their served sources is
$\sum_{i,j} w(A_3(i,j)) + N\lambda T$ (or a little less if there are leftovers in the
sources, but this does not affect the allocation except for driving the $N - K$ robots
to serve sources with more leftovers), which is the maximum total queue differential.
This is because at the beginning of the third epoch, the total queue differential
$\sum_{i,j} w(A'_3(i,j))$ provided by any other allocation $A'_3$ is no greater than
$\sum_{i,j} w(A_3(i,j))$. During the third epoch, each source has an amount
$\lambda T$ of new data arrive, and at most N sources can have the new data that
arrived in the third epoch completely collected by their serving robots. Thus, at the
beginning of the fourth epoch, no allocation can provide a total queue differential
greater than $\sum_{i,j} w(A_3(i,j)) + N\lambda T$. Thus, in the fourth epoch, based on
the CBMF algorithm, the allocation that solves the optimization problem (P1) is to
assign robots to the sinks associated with the sources they served in the third epoch.

As we apply the above analysis to all epochs, it can be shown that robots are
divided into two groups of sizes K and $N - K$, respectively, and that the two groups
alternate between serving sources and sinks in complementary cycles. Similarly,
since all flows are homogeneous, each flow is served for an equal amount of time.
Thus, the robotic allocation follows the Robot Allocation Strategy II shown in
Algorithm 5.
Therefore, in a homogeneous network, the CBMF algorithm has a simple time-sharing
structure (either Algorithm 4 or 5). The centralized coordinator does not need to
collect any queue information to make an allocation decision. It only needs to
identify the least recently served flows, which can be done by keeping a record in
memory. In addition, every robot can have a finite buffer since it serves a source
and its corresponding sink in two consecutive epochs. All of these make the CBMF
algorithm easy to implement in practice.
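The two-group pattern of Algorithm 5 can be sketched in the same spirit. This is an illustrative sketch only: `strategy_two` and its dictionary output are hypothetical, Group I robot r is assumed to keep serving flow r, and the random choices in Algorithm 5 are replaced by a fixed order. In the initial epoch Group II waits at sinks without data, which the sketch records as an empty flow list.

```python
from collections import deque

def strategy_two(num_flows, num_robots, num_epochs):
    """Return per-epoch roles of Group I and Group II following Algorithm 5."""
    k = num_flows
    m = num_robots - num_flows            # size of Group II
    if not 0 < m <= k:
        raise ValueError("requires K < N <= 2K")
    order = deque(range(k))               # least to most recently served by Group II
    group2_flows = []                     # flows currently handled by Group II
    schedule = []
    for epoch in range(1, num_epochs + 1):
        if epoch % 2 == 1:
            # Group I collects from its sources; Group II delivers.
            entry = {"epoch": epoch, "group1": "collect", "group2": "deliver"}
        else:
            # Group I delivers; Group II collects from the least recently
            # served sources.
            group2_flows = [order.popleft() for _ in range(m)]
            order.extend(group2_flows)
            entry = {"epoch": epoch, "group1": "deliver", "group2": "collect"}
        entry["group2_flows"] = list(group2_flows)
        schedule.append(entry)
    return schedule

plan2 = strategy_two(num_flows=3, num_robots=5, num_epochs=4)
```

As with Strategy I, the only state the coordinator keeps is the service order, which matches the observation above that no queue information is required.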
4.5.2 Delay Analysis
According to Remark 1, the end-to-end delay scales as $O(T)$ so long as T is large
enough to make sure the system is stable for a fixed arrival rate. However, this
scaling-law perspective still does not give a clear idea of how the delay behaves
when the epoch length is finite. In a multi-flow homogeneous network, the CBMF
algorithm becomes as simple as a time-sharing robot allocation strategy (either
Algorithm 4 or 5), which enables us to find its explicit delay performance:

Theorem 3. In a homogeneous network with K flows and N robots, the network
delay under the CBMF algorithm can be explicitly bounded as follows:

If the number of robots is no greater than the number of flows, i.e., $0 < N \leq K$,

$$D \leq (c_1 + 2)T \tag{4.27}$$

If the number of robots is greater than the number of flows, i.e., $K < N \leq 2K$,

$$D \leq \begin{cases}
2T & \text{if } \lambda \leq \frac{R_{avg}}{2} \\
\left\{ \frac{R_{avg}}{2\lambda} + 1 + (c_2 + 2)\left(1 - \frac{R_{avg}}{2\lambda}\right) \right\} T & \text{if } \lambda > \frac{R_{avg}}{2}
\end{cases} \tag{4.28}$$

where $c_1 = \lfloor \frac{K}{N} \rfloor$ and $c_2 = \lfloor \frac{K}{N-K} \rfloor$, which are
constants as K and N are given.
Proof. Depending on the number of flows and the number of robots in the network,
the delay analysis falls into one of the following two cases:

Case 1: when $0 < N \leq K$, the system follows Robot Allocation Strategy I in
Algorithm 4. Since the arrival rate vector is within the capacity region
$\Lambda_{IB}(v, T)$, it takes a finite amount of time for the system to become stable.
Once it is stable, the total queue size in the system evolves in exactly the same way
every two epochs. Assume the first epoch in a two-epoch cycle is the one in which
robots move to sources to collect data and the second is the one in which robots move
to sinks to deliver data. Denote the total queue size in the system at the beginning
of a two-epoch cycle as $Q_{cycle}$, which remains the same every two epochs.

At the beginning of a new two-epoch cycle, when the N robots start to move to
sources to collect data, the queue size at a most recently served source is
$\lambda T$. This is because a most recently served source is one that was served in
the last two-epoch cycle: in the first epoch of that cycle, all data at the source was
collected by a robot, and in the second epoch, as the robot delivered data to the
corresponding sink, new data kept arriving and accumulated at the source to a size of
$\lambda T$. The number of most recently served sources is N, which is the same as the
number of robots. Similarly, the data at a second most recently served source is
$3\lambda T$, where the additional $2\lambda T$ comes from the fact that no robot
served it in the most recent two-epoch cycle. The number of second most recently
served sources is also N. Repeating this analysis gives the queue size of each
source. Denote $c_1 = \lfloor \frac{K}{N} \rfloor$, which is a constant as K and N are
fixed. The number of sources with queue size
$\lambda T, 3\lambda T, \ldots, (2c_1 - 1)\lambda T$ is N each, and the number of
sources with a queue size of $(2c_1 + 1)\lambda T$ is $K - c_1 N$. Since all robots have
empty queues at the beginning of a two-epoch cycle, and sinks always have empty
queues, the total queue in the system at the start of a new two-epoch cycle consists
only of the queues at sources, which is

$$\begin{aligned}
Q_{cycle} &= N\lambda T\,[1 + 3 + \ldots + (2c_1 - 1)] + (K - c_1 N)(2c_1 + 1)\lambda T \\
&= [(2c_1 + 1)K - (c_1^2 + c_1)N]\,\lambda T
\end{aligned} \tag{4.29}$$
In the first epoch of a two-epoch cycle, when robots move to sources to collect
data, no data is delivered to sinks and each source has an arrival rate of $\lambda$,
so the total queue backlog in the system increases as $K\lambda t$. In the second
epoch, when robots move to sinks to deliver data, each source keeps receiving data
at rate $\lambda$. But because data is being delivered from robots to sinks, the total
queue growth is no greater than $K\lambda t$. Thus, assuming the time needed for the
system to become stable is $n_1$ epochs, the total queue in the system at a time t
after the system is stable satisfies

$$Q(t) \leq Q_{cycle} + K\lambda\left(t - n_1 T - 2T\left\lfloor \frac{t - n_1 T}{2T} \right\rfloor\right) \tag{4.30}$$

Thus the total accumulation of queues of all flows during a time interval
$[0, (n_1 + 2n_2)T]$ that we are interested in becomes

$$\begin{aligned}
\int_0^{(n_1 + 2n_2)T} Q(t)\,dt &= \int_0^{n_1 T} Q(t)\,dt + \int_{n_1 T}^{(n_1 + 2n_2)T} Q(t)\,dt \\
&\leq \int_0^{n_1 T} Q(t)\,dt + n_2 \int_0^{2T} (Q_{cycle} + K\lambda t)\,dt
\end{aligned} \tag{4.31}$$

where the last inequality comes from Eq. (4.30).
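Therefore, the time-average total queue in the system satisfies

$$\begin{aligned}
\bar{Q} &= \lim_{n_2 \to \infty} \frac{1}{(n_1 + 2n_2)T} \int_0^{(n_1 + 2n_2)T} Q(t)\,dt \\
&\leq \lim_{n_2 \to \infty} \frac{1}{(n_1 + 2n_2)T} \left[ \int_0^{n_1 T} Q(t)\,dt + n_2 \int_0^{2T} (Q_{cycle} + K\lambda t)\,dt \right] \\
&\leq \frac{1}{2T} \int_0^{2T} (Q_{cycle} + K\lambda t)\,dt
\end{aligned} \tag{4.32}$$

where the last inequality indicates that the time-average total queue is upper
bounded by the time-average queue in a stable two-epoch cycle.

Substituting Eq. (4.29) into Eq. (4.32) gives

$$\bar{Q} \leq [(2c_1 + 2)K - (c_1^2 + c_1)N]\,\lambda T \tag{4.33}$$

According to Little's Theorem, the corresponding delay is
$D = \frac{\bar{Q}}{K\lambda}$, thus we have

$$D \leq \left[ 2c_1 + 2 - \frac{c_1^2 + c_1}{K/N} \right] T \tag{4.34}$$

Further, since $c_1 = \lfloor \frac{K}{N} \rfloor$ implies $\frac{K}{N} < c_1 + 1$, we have

$$D \leq (c_1 + 2)T \tag{4.35}$$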
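As a quick numerical sanity check of the Case 1 algebra, the sketch below sums the per-source backlogs directly and compares the result with the closed form of Eq. (4.29), and evaluates the bounds of Eqs. (4.34) and (4.35). The function names and the example values (K = 20, N = 6, T = 50) are made up for illustration; `lam` denotes the per-flow arrival rate.

```python
def q_cycle_direct(k, n, lam, t):
    """Sum the source backlogs at the start of a cycle, level by level."""
    c1 = k // n
    # N sources at each backlog level lam*T, 3*lam*T, ..., (2*c1 - 1)*lam*T
    levels = sum(n * (2 * j - 1) * lam * t for j in range(1, c1 + 1))
    # the remaining K - c1*N sources hold (2*c1 + 1)*lam*T each
    rest = (k - c1 * n) * (2 * c1 + 1) * lam * t
    return levels + rest

def q_cycle_closed(k, n, lam, t):
    """Closed form on the second line of Eq. (4.29)."""
    c1 = k // n
    return ((2 * c1 + 1) * k - (c1 ** 2 + c1) * n) * lam * t

def delay_bounds_case1(k, n, t):
    """Return the bounds of Eq. (4.34) and Eq. (4.35), in that order."""
    c1 = k // n
    tight = (2 * c1 + 2 - (c1 ** 2 + c1) * n / k) * t
    loose = (c1 + 2) * t
    return tight, loose

direct = q_cycle_direct(20, 6, 1.0, 1.0)
closed = q_cycle_closed(20, 6, 1.0, 1.0)
tight, loose = delay_bounds_case1(20, 6, 50.0)
```

When K is a multiple of N, the last group of $K - c_1 N$ sources is empty and the two bounds differ by exactly one epoch length.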
Case 2: when $K < N \leq 2K$, the system follows Robot Allocation Strategy II
in Algorithm 5. According to Algorithm 5, robots are divided into two groups
containing K and $N - K$ robots, respectively. In addition, the serving patterns of
these two groups are complementary: in an epoch when the group of K robots collects
data from sources, the group of $N - K$ robots delivers data to sinks, and in an epoch
when the group of K robots delivers data to sinks, the group of $N - K$ robots
collects data from sources. Instead of randomly allocating a robot in Group I to a
source to collect data in odd epochs as in Algorithm 5, one feasible solution is to
keep each robot in Group I serving the same source in all odd epochs⁷. Thus,
depending on the arrival rate, the corresponding delay of the network can be analyzed
in two different subcases as follows.
Subcase 1: when $0 \leq \lambda \leq \frac{R_{avg}}{2}$, the group of K robots alone can
keep the network stable, because they can take all data from sources in the collecting
epoch, and all the data can be delivered to sinks in the following delivering epoch.
The other group of $N - K$ robots can help reduce delay in the network, but their
different possible assignments result in the same network delay. Similarly, when the
system is stable, the total queue size evolves in exactly the same way every two
epochs. Let the first epoch of the two-epoch stable cycle be the one in which the
group of K robots finishes delivering data to sinks and moves to sources to collect
data, while the other group of $N - K$ robots finishes collecting data from sources and
moves to sinks to deliver data; and let the second epoch be the one in which the group
of K robots finishes collecting and moves to sinks to deliver data, while the other
group of $N - K$ robots finishes delivering and moves to sources to collect data. Since
every source has $\lambda T$ data that has arrived but has not yet been delivered to
its sink, the total queue length in the system at the beginning of a two-epoch service
cycle is $Q_{cycle} = K\lambda T$.

Since in a two-epoch cycle there are always some robots delivering data, the
total queue growth rate is less than $K\lambda$. Similar to the argument in Case 1 that
the time-average total queue in the system is upper bounded by the time-average
queue in a stable two-epoch cycle, we have

$$\bar{Q} \leq \frac{1}{2T} \int_0^{2T} (Q_{cycle} + K\lambda t)\,dt \leq 2K\lambda T \tag{4.36}$$

and

$$D \leq 2T \tag{4.37}$$

⁷ The centralized coordinator can arbitrarily fix the robotic allocation of Group I in
Algorithm 5 in such a way that each robot in Group I serves the same source in all odd
epochs. There might be other allocations of robots in Group I providing better delay
performance, but the theoretical result derived in Theorem 3 still holds.
Subcase 2: when $\lambda > \frac{R_{avg}}{2}$, after the group of K robots collects data
from sources, there can be leftovers at the sources, which drive the group of
$N - K$ robots to serve the K flows in the same manner as Algorithm 4.

Therefore, the network can be considered as formed by two sub-networks. In the
first sub-network, there are K flows with arrival rate $\frac{R_{avg}}{2}$ and each flow
has a designated robot serving it. In the second sub-network, there are K flows with
arrival rate $\lambda - \frac{R_{avg}}{2}$ and $N - K$ robots providing service according
to the Robot Allocation Strategy I in Algorithm 4. The serving patterns of these two
sub-networks are complementary.

When the network is stable, it evolves in exactly the same way every two epochs.
Let the first epoch of the two-epoch stable cycle be the one in which K robots move
to sources to collect data in the first sub-network while $N - K$ robots move to sinks
to deliver data in the second sub-network, and let the second epoch be the one in
which K robots move to sinks to deliver data in the first sub-network while $N - K$
robots move to sources to collect data in the second sub-network. Let $m = N - K$
and $\lambda_2 = \lambda - \frac{R_{avg}}{2}$. The queue of the first sub-network at the
beginning of the two-epoch stable cycle is $Q_{cycle_1} = K \frac{R_{avg}}{2} T$.

For the second sub-network, following an analysis similar to Case 1, we can get
the queue size of each source at the beginning of the epoch preceding a new two-epoch
cycle. Denote the constant $c_2 = \lfloor \frac{K}{m} \rfloor$; then the number of sources
with queue size $\lambda_2 T, 3\lambda_2 T, \ldots, (2c_2 - 1)\lambda_2 T$ is m each, and
the number of sources with a queue size of $(2c_2 + 1)\lambda_2 T$ is $K - c_2 m$ at the
beginning of the prior epoch. Taking the new data arriving within the prior epoch into
consideration, the total queue in the second sub-network at the start of a new
two-epoch cycle is

$$\begin{aligned}
Q_{cycle_2} &= m\lambda_2 T\,[1 + 3 + \ldots + (2c_2 - 1)] + (K - c_2 m)(2c_2 + 1)\lambda_2 T + K\lambda_2 T \\
&= [(2c_2 + 2)K - (c_2^2 + c_2)m]\,\lambda_2 T
\end{aligned} \tag{4.38}$$
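Thus, the total queue in the system at the beginning of a new two-epoch cycle
is

$$\begin{aligned}
Q_{cycle} &= Q_{cycle_1} + Q_{cycle_2} \\
&= K \frac{R_{avg}}{2} T + [(2c_2 + 2)K - (c_2^2 + c_2)m]\,\lambda_2 T
\end{aligned} \tag{4.39}$$

Since in a two-epoch cycle there are always some robots delivering data, the total
queue growth rate is less than $K\lambda$. Similar to the previous argument that the
time-average total queue in the system is upper bounded by the time-average queue in
a stable two-epoch cycle, we have
$\bar{Q} \leq \frac{1}{2T} \int_0^{2T} (Q_{cycle} + K\lambda t)\,dt$, which yields

$$\bar{Q} \leq K \frac{R_{avg}}{2} T + [(2c_2 + 2)K - c_2^2 m - c_2 m]\,\lambda_2 T + K\lambda T \tag{4.40}$$

and

$$D \leq \left\{ \frac{R_{avg}}{2\lambda} + \left[ (2c_2 + 2) - \frac{c_2^2 + c_2}{K/m} \right] \frac{\lambda_2}{\lambda} + 1 \right\} T \tag{4.41}$$

Similarly, since $c_2 = \lfloor \frac{K}{m} \rfloor$ implies $\frac{K}{m} < c_2 + 1$, and
$\lambda_2 = \lambda - \frac{R_{avg}}{2}$, we further have

$$D \leq \left\{ \frac{R_{avg}}{2\lambda} + 1 + (c_2 + 2)\left(1 - \frac{R_{avg}}{2\lambda}\right) \right\} T \tag{4.42}$$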
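The bounds of Theorem 3 are simple enough to evaluate directly. The sketch below is one possible way to do so; the function name and the example values are hypothetical, `lam` is the per-flow arrival rate, and `r_avg` stands for the quantity $R_{avg}$ used in the theorem, whose definition is taken as given here.

```python
def theorem3_delay_bound(k, n, t, lam, r_avg):
    """Evaluate the delay upper bound of Theorem 3 (Eqs. 4.27 and 4.28).

    k: number of flows, n: number of robots, t: epoch length,
    lam: per-flow arrival rate, r_avg: the R_avg of the theorem.
    """
    if 0 < n <= k:
        c1 = k // n
        return (c1 + 2) * t                      # Eq. (4.27)
    if k < n <= 2 * k:
        if lam <= r_avg / 2:
            return 2 * t                         # first branch of Eq. (4.28)
        c2 = k // (n - k)
        ratio = r_avg / (2 * lam)
        # second branch of Eq. (4.28)
        return (ratio + 1 + (c2 + 2) * (1 - ratio)) * t
    raise ValueError("Theorem 3 covers 0 < N <= 2K only")

bound_10_robots = theorem3_delay_bound(20, 10, 50.0, 0.1, 0.6)
bound_30_low = theorem3_delay_bound(20, 30, 50.0, 0.2, 0.6)
bound_30_high = theorem3_delay_bound(20, 30, 50.0, 0.6, 0.6)
```

In every branch the bound is a constant multiple of T, which is consistent with the linear growth of delay in the epoch length noted in Remark 1.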
4.6 Simulation And Evaluation
We first present numerical simulation results for a network containing twenty flows
and thirty robots. The sources and sinks of the twenty flows are randomly located in
a 200 × 200 2D plane, and all mobile robots initially start from the center of the
plane, i.e., the location (100, 100). Packets arrive at sources following the same
deterministic process. In the simulation, we use a typical distance-rate function to
represent the transmission rate. A reasonable first-order model can be obtained from
Shannon's formula [29] for the capacity of an AWGN channel as a function of its SNR.
If the transmit power is kept constant, the SNR in turn depends upon how the signal
power varies with distance. To account for near-field effects, we assume that this
decay happens as $\frac{P_{max}}{1 + d^{\eta}}$, where $\eta$ is the path loss exponent,
and $P_{max}$ is the maximum received power. Thus the transmission rate as a function
of distance is $R(d) = \log\left(1 + \frac{C}{1 + d^{\eta}}\right)$, where C is a constant
that takes into account the transmitter power as well as the noise variance at the
receiver. We use $\eta = 2$ and $C = 1$; v, T and $\lambda$ are varied as shown in
Figure 4.3.
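For reference, the distance-rate function above can be written as a short function. This is a sketch under assumptions: the text does not state the base of the logarithm, so it is exposed here as a parameter (natural log by default), and the names `transmission_rate`, `c`, `eta` and `base` are illustrative.

```python
import math

def transmission_rate(d, c=1.0, eta=2.0, base=math.e):
    """R(d) = log(1 + C / (1 + d**eta)).

    The logarithm base is an assumption; changing it only rescales the rate.
    """
    return math.log(1.0 + c / (1.0 + d ** eta), base)

rate_at_zero = transmission_rate(0.0)
rate_near = transmission_rate(1.0)
rate_far = transmission_rate(10.0)
```

The rate is largest when the robot is co-located with the static node and decays as the distance grows, so the amount of data a robot can exchange in an epoch depends on how quickly it can get close to its target.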
In Figure 4.3 (left) we show the average end-to-end delay (obtained by measuring
the average total queue size in the simulations and dividing by the total arrival
rate, as per Little's Theorem [47]) versus the arrival rate as we fix the epoch
duration T and vary the velocity v, plotted wherever CBMF results in stable queues;
we find that we are able to get converging, bounded delays (indicative of stability)
even beyond the inner-bound capacity line. Also marked on the figure is the lower
(inner) bound of capacity, for rates below which CBMF is provably stable. We see
that as the velocity increases, so does the capacity, and at the same time the delay
decreases. Thus improvement in robot velocity benefits both the throughput and the
delay performance of CBMF, as may be expected. Figure 4.3 (right), in which the
velocity v is kept constant across curves but the epoch duration T is varied, is
somewhat similar but with one striking difference: as the epoch duration increases,
so does the capacity, but at the same time the average delay also increases (for
the same arrival rates, so long as stability is maintained). Thus, increasing the
scheduling epoch duration improves throughput but hurts delay performance.
[Figure 4.3, left panel: delay versus input rate λ, with curves for v=2*sqrt(2)
(capacity inner bound=0.1371), v=8*sqrt(2) (capacity inner bound=0.5968) and
v=100*sqrt(2) (capacity inner bound=0.7377). Right panel: delay versus input rate λ,
with curves for T=25 (capacity inner bound=0.1371), T=50 (capacity inner
bound=0.4434) and T=200 (capacity inner bound=0.6734).]
Figure 4.3: Delay as we vary v for T = 100 (left) and delay as we vary T for
$v = 8\sqrt{2}$ (right) for the 20-Flows-30-Robots network
Next, to show the performance of the Epoch Adaptive CBMF in Algorithm 3,
we conduct a simulation of a network whose settings are the same as the previous
20-Flows-30-Robots network. We use $T_{th} = 70$, $\delta = 10$ and $L = 5000$. Two
different algorithms have been implemented to set the epoch length T: one applies
the Epoch Adaptive CBMF in Algorithm 3, and the other is a non-adaptive scheme
where T is fixed at $T_{th}$ all the time. We present the delay performance of these
two approaches (Figure 4.4, left) together with the corresponding epoch length at
each time (Figure 4.4, right). Initially the arrival rate is set to $\lambda = 0.3$. As
shown in Figure 4.4, the Epoch Adaptive CBMF reduces the epoch duration to improve
network delay while keeping the network stable. When the arrival rate increases to
$\lambda = 0.45$, the Epoch Adaptive CBMF algorithm can detect this change and adapt
T accordingly to keep the network stable while maintaining a small delay. Overall,
applying our heuristic algorithm to adjust the epoch length according to the current
network condition can provide better delay performance.
[Figure 4.4, left panel: delay versus time (×10^4) for the Epoch Adaptive CBMF and
the non-adaptive approach, with the arrival rate changing from λ=0.3 to λ=0.45.
Right panel: epoch length T versus time (×10^4) for the same two schemes.]
Figure 4.4: Delay (left) and Epoch Duration (right) comparison of the Epoch Adaptive
CBMF Algorithm with a non-adaptive scheme for the 20-Flows-30-Robots network
Finally, we evaluate the performance of the CBMF algorithm in a multi-flow
homogeneous network. In particular, we are interested in how delay changes with
respect to the epoch duration T, which is tunable to meet different network conditions
when applying the CBMF algorithm in practice. We consider two different network
settings: a 20-Flows-10-Robots network and a 20-Flows-30-Robots network. As can be
seen in Figure 4.5, the delay observed in simulation is bounded by our theoretical
analysis. We also present how delay changes with the epoch length T in Figure 4.6.
As shown, the delay grows linearly with the epoch length T in both simulation and
theory. In addition, while running our simulation, we observed the allocation of
robots at each epoch and confirmed that the CBMF algorithm indeed matches the robot
allocation strategy in Algorithm 4 or 5, which indicates that the CBMF algorithm has
a simple time-sharing structure.
[Figure 4.5, both panels: delay versus input rate λ, comparing simulation ("sim")
with the theoretical bound ("thm") for T=25, T=50 and T=200.]
Figure 4.5: Delay as we vary λ for a homogeneous 20-Flows-10-Robots network
(left) and 20-Flows-30-Robots network (right)
[Figure 4.6, both panels: delay versus epoch length T, comparing the delay in
simulation with the delay upper bound.]
Figure 4.6: Delay as we vary T for a homogeneous 20-Flows-10-Robots network
(left) and 20-Flows-30-Robots network (right)
4.7 Conclusion
In this chapter, we have addressed three fundamental questions in robotic message
ferrying for wireless networks: What is the throughput capacity region of such
systems? How can they be scheduled to ensure stable operation, even without prior
knowledge of arrival rates? And what is the corresponding delay performance, and
is there a way to maintain stability while having small delay in realistic network
settings?
We have mathematically characterized the capacity region of such systems in
both ideal and realistic settings. A dynamic CBMF algorithm has been proposed
to schedule robots so as to guarantee network stability even without knowing the
arrival rates. We have derived an inner bound of the capacity region together with
the corresponding delay performance that the CBMF algorithm can achieve. The
fact that the delay scales linearly with the epoch duration has guided us to design
a heuristic approach that adapts the epoch duration at run time to improve the delay
performance while keeping the network stable. It has also been shown that in a
multi-flow homogeneous network, the CBMF algorithm with two additional preferences
has a simple time-sharing structure, which is easy to implement in practice.
Chapter 5
The Optimism Principle: A Unified Framework
for Optimal Robotic Network Deployment in An
Unknown Obstructed Environment
In this chapter, we study the problem of deploying a team of robots in an unknown,
obstructed environment to form a multi-hop communication network. As a solution,
we present a unified framework, onLinE rObotic Network formAtion (LEONA),
that is general enough to permit optimizing the communication network for different
utility functions in non-convex environments. LEONA adopts the principle of
"optimism in the face of uncertainty" to allow the team of robots to form optimal
network configurations efficiently and rapidly without having to map link qualities
in the entire area. We demonstrate and evaluate this framework on two specific
scenarios concerning the formation of a multi-hop communication path between
fixed end-points. Our simulation-based evaluation shows that the use of the optimism
principle can significantly reduce the resources spent on exploring and mapping
the entire region prior to network optimization. We also present a mathematical
model of how the searched area scales with various relevant parameters.
The work in this chapter is based on [82].
5.1 Problem Formulation
We consider a team of $m \geq 3$ robots (robots 1 to m) performing a task in an
unknown walled environment. Robot 1 works as a source that transmits information to
a destination, robot m. Our goal is to find an optimal configuration of the remaining
relay robots so that they can form a multi-hop communication path from the source
to the destination with optimized performance.
5.1.1 Link Quality Metric
The walled environment is represented as a 2-D $L \times L$ grid, and each pixel in the grid
is either a possible location for a robot or occupied by a wall. Among all the $m$
robots, robot 1 and robot $m$ work as a source-destination pair; they are static and
their positions are known a priori. The rest of the robots can move around in the
space to enable and improve the communication between the source and destination.
When a robot $i$ communicates with a robot $j$, the strength of the signal decreases as
it travels through the air. Moreover, if there are walls between the communicating
pair, additional signal attenuation can occur. Taking the signal attenuation caused by
travelling distance and walls into consideration, the received signal power (in dB) at the
receiver $j$ from transmitter $i$ can be expressed as [19]:
$$\mathrm{Pr}_{i,j} = \mathrm{Pr}_{0} - 10\,\eta \log\left(\frac{d_{i,j}}{d_0}\right) - n_{i,j} W \tag{5.1}$$
where $d_{i,j}$ is the distance between robots $i$ and $j$; $n_{i,j}$ is the number of walls between
them; $\mathrm{Pr}_{0}$ is the received power strength at a reference point a small distance
$d_0$ from the transmitter; $\eta$ is the path loss parameter indicating the rate at which
the attenuation increases with distance; and $W$ is the attenuation effect of a single wall.
Let the noise power spectral density be $N_0$ and the spectrum bandwidth be $B$. Then
the Signal-to-Noise Ratio (SNR) at the receiver $j$ is defined to be:
$$\gamma_{i,j} = \frac{\mathrm{Pr}_{i,j}}{N_0 B} \tag{5.2}$$
We define the link quality metric $l_{i,j}$ of the communication link $(i, j)$ as a strictly
increasing function of the SNR at receiver $j$ corresponding to transmitter $i$:
$$l_{i,j} = f(\gamma_{i,j}) \tag{5.3}$$
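As a rough illustration of Eqs. (5.1) and (5.2), the following Python sketch evaluates the received power and the SNR of a single link. The function names, the base-10 logarithm, the conversion from dB to linear scale before dividing by the noise power (as done later in Section 5.3.1), and all default parameter values are assumptions made for this example, not part of the original formulation.

```python
import math


def received_power_db(d, n_walls, p0_db=20.0, d0=1.0, eta=3.3, wall_db=20.0):
    """Eq. (5.1): log-distance path loss plus a fixed loss per wall, in dB.

    All default values are illustrative assumptions.
    """
    return p0_db - 10.0 * eta * math.log10(d / d0) - n_walls * wall_db


def snr(d, n_walls, noise_psd=1e-14, bandwidth=2e6, **link_params):
    """Eq. (5.2), with the received power first converted from dB to linear scale."""
    pr_linear = 10.0 ** (received_power_db(d, n_walls, **link_params) / 10.0)
    return pr_linear / (noise_psd * bandwidth)


if __name__ == "__main__":
    # One extra wall removes wall_db decibels, i.e. a factor 10**(wall_db/10) in SNR.
    print(snr(10.0, 0), snr(10.0, 1))
```

Any strictly increasing function of the returned SNR can then play the role of the link quality metric in Eq. (5.3).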
5.1.2 Mobility, Sensing and Environment Assumptions
Time is divided into discrete time steps of unit duration. At each time step, a
mobile robot can move to one of its four neighbor positions (up, down, left, right)
or stay at its current position. We assume each robot has the ability to sense and
detect walls within one moving step range, which helps a robot avoid colliding with
walls and other robots. The moving decision is made by each robot itself. We
assume the unknown environment is connected, i.e., given any two pixels in the
grid that are not occupied by walls, there always exists a path between them. This
assumption ensures all the available positions in the environment could be reached
by robots.
5.1.3 Objective Function
Our goal is to design an optimal configuration of mobile relay robots such that they
finally form a communication path (footnote 1) connecting the source and destination with a
maximized utility $U$, which is a monotonic function of all link qualities:
$$\max_{(x_1, \ldots, x_m) \in \mathcal{P}} U(l_{1,2}, \ldots, l_{m-1,m}) \tag{5.4}$$
where $x_k$ ($1 \le k \le m$) is the position of the $k$-th robot and $\mathcal{P}$ is the set of all
possible configurations.
Footnote 1: Configuration, communication path and solution are used interchangeably in the rest of the
chapter.
5.2 OnLinE RObotic Network FormAtion (LEONA)
There are three challenges inherent in the problem we address: 1) The environment
is unknown, so robots need to dynamically combine exploration with
configuration optimization; 2) The signal attenuation caused by walls results in a
non-metric space, so prior metric-based algorithms cannot be applied to our
problem; 3) The objective function is in general not convex, so potential-based and
convex optimization methods do not work. Therefore, one natural question to ask
is: Given an unknown environment, is it possible to find an optimal configuration
without fully exploring the whole space? The answer is yes, and in the following
we propose a graph-based online approach that is guaranteed to find the optimal
solution with only partial exploration of the environment.
A high-level structure of the proposed unified framework LEONA is presented in
Algorithm 6, with related steps detailed in Algorithms 7-10. To begin with, each
mobile robot maintains a communication graph, which is represented as a complete
directed graph $G(V, E)$. The vertices in the graph are all pixels in the grid, and
a directed edge $(i, j) \in E$ represents a communication link with the transmitter at
pixel $i$ and the receiver at pixel $j$. The graph is complete in the sense that every two
vertices are connected by a pair of directed edges. The weight of an edge $(i, j)$ is
set as an optimistic prediction of the link quality $l_{i,j}$. We assume robots can share
environment information with each other through the communication path between
the source and destination, so all of them maintain the same knowledge of the
environment.
Algorithm 6 Online Robotic Network Formation (LEONA)
1: ▷ Initialization
2: Robots start from current initial positions $p' = (x_1, \ldots, x_m)$ and initialize the utility
   of the current path $p'$ as $U' = 0$. Initialize the measured SNR set and the detected
   walls' positions set as $\mathcal{S} = \emptyset$ and $\mathcal{W} = \emptyset$ respectively
3: ▷ Update communication graph
4: $G \leftarrow \mathrm{UpdateGraph}(G(V, E), \mathcal{S}, \mathcal{W})$
5: ▷ Find best possible path and its utility
6: $(p, U) \leftarrow \mathrm{FindPath}(G)$
7: while $U' < U$ do
8:    ▷ Robots move to form the best possible path
9:    $(p', \mathcal{W}) \leftarrow \mathrm{Move}(p', p)$ (footnote 2)
10:   ▷ Measure SNR of each link on current path $p'$
11:   $\mathcal{S} \leftarrow \mathrm{MeasureSNR}(p')$
12:   ▷ Set the utility of current path $p'$
13:   $U' \leftarrow \mathrm{SetUtility}(p', \mathcal{S})$
14:   ▷ Probe walls on the current path $p'$
15:   $\mathcal{W} \leftarrow \mathrm{ProbeWall}(p')$
16:   ▷ Update communication graph
17:   $G \leftarrow \mathrm{UpdateGraph}(G(V, E), \mathcal{S}, \mathcal{W})$
18:   ▷ Find best possible path and its utility
19:   $(p, U) \leftarrow \mathrm{FindPath}(G)$
20: end while
Footnote 2: This $\mathrm{Move}(p', p)$ algorithm follows the same idea as the Robotic Routing Protocol [40], except that
it allows a robot to use either the Righthand Traversal Rule or the Lefthand Traversal Rule in
Recovery Mode. A robot first greedily moves toward its goal position in Forwarding Mode. When
it meets a wall, it switches to Recovery Mode and avoids the wall by following the Righthand Traversal
Rule if its goal position is on the right side of its current position, or the Lefthand Traversal Rule
if its goal position is on the left side of its current position. When it arrives at a position that is
closer to its goal position than the position where Recovery Mode started, the robot switches back to
Forwarding Mode. A robot can also record the positions of walls sensed while moving and add them
to $\mathcal{W}$. More advanced movement methods can also be applied.
Footnote 3: An element in $\mathcal{S}$ is denoted as $(i, j, \gamma_{i,j})$, where $i$ is the transmitter's position, $j$ is the receiver's
position and $\gamma_{i,j}$ is the corresponding SNR of link $(i, j)$. An element in $\mathcal{W}$ is a detected wall
position.
Algorithm 7 UpdateGraph($G(V, E)$, $\mathcal{S}$, $\mathcal{W}$)
1: for each directed edge $(i, j) \in E$ do
2:    if $(i, j, \gamma_{i,j}) \in \mathcal{S}$ (footnote 3) then
3:       Update the edge weight as $l_{i,j} = f(\gamma_{i,j})$
4:    else
5:       Predict $n_{i,j}$ according to $\mathcal{W}$
6:       Calculate $\gamma_{i,j}$ according to Eqs. (5.1) and (5.2)
7:       Update the edge weight as $l_{i,j} = f(\gamma_{i,j})$
8:    end if
9: end for
10: return $G$
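Algorithm 8 MeasureSNR($p'$)
1: for each receiver robot $k$ on $p'$, where $k \in \{2, \ldots, m\}$ do
2:    Measure its SNR with transmitter robot $k-1$: $\gamma_{k-1,k}$
3:    $\mathcal{S} \leftarrow \mathcal{S} \cup \{(x_{k-1}, x_k, \gamma_{k-1,k})\}$
4: end for
5: return $\mathcal{S}$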
Algorithm 9 SetUtility($p'$, $\mathcal{S}$)
1: for each communication link $(x_{k-1}, x_k)$ along path $p'$, where $k \in \{2, \ldots, m\}$ do
2:    Find its corresponding $\gamma_{k-1,k}$ from $\mathcal{S}$
3:    Update the edge weight as $l_{k-1,k} = f(\gamma_{k-1,k})$
4: end for
5: $U' \leftarrow U(l_{1,2}, \ldots, l_{m-1,m})$ evaluated at $(x_1, \ldots, x_m) = p'$
6: return $U'$
Algorithm 10 ProbeWall($p'$)
1: for each robot $k$ on $p'$, where $k \in \{1, \ldots, m\}$ do
2:    Send a probing signal to each neighbor robot
3:    Detect the position of the closest wall $w_d$ that reflects the probing signal back to
      robot $k$: $\mathcal{W} \leftarrow \mathcal{W} \cup \{w_d\}$
4: end for
5: return $\mathcal{W}$
The environment information includes SNR measurements and detected wall
positions. Each robot constructs its communication graph $G(V, E)$ based on its
current information. For each $(i, j) \in E$, if its link quality has been measured, the
edge weight is set to the measured value. Otherwise, the edge weight is predicted according
to Eqs. (5.1)-(5.3) based on the information about explored walls. Thus, if all walls along
link $(i, j)$ are fully explored, the predicted edge weight is the same as the actual link
quality; if some wall information is still missing, the predicted edge weight
is optimistic, i.e., overestimated. Based on the current communication graph $G$ with
optimistic predictions, robots apply FindPath($G$) to find the best possible communication
path. After moving to form the best possible communication path, robots
take measurements to find the actual link qualities along the path, and update their
communication graph $G$ based on the new information. They run FindPath($G$) again
to find the best possible communication path. If the measured utility of the current
communication path is as good as that of the best possible path, the algorithm terminates.
Otherwise, robots move to form the new best possible communication path, and
repeat the above procedure until the termination condition is met.
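The loop described above can be sketched in a heavily simplified form. The Python sketch below is not Algorithm 6 itself: it assumes a small finite set of candidate configurations, a hypothetical table of optimistic utility estimates (each no smaller than the true utility), and a measurement that only reveals the utility of the configuration being visited, whereas LEONA also refines the estimates of other paths through detected walls. The names `optimistic_search`, `optimistic` and `true_utility` are illustrative.

```python
def optimistic_search(upper_bound, measure):
    """Search by trusting optimistic estimates until one is confirmed.

    upper_bound: dict mapping configuration -> optimistic utility (>= true utility)
    measure:     callable returning the true utility of a configuration
    Returns (configuration, measured utility, number of configurations measured).
    """
    estimate = dict(upper_bound)       # current belief about every configuration
    measured = {}                      # configurations whose true utility is known
    current, current_utility = None, float("-inf")
    while True:
        best = max(estimate, key=estimate.get)   # role of FindPath(G)
        if current_utility >= estimate[best]:    # termination test of the while loop
            return current, current_utility, len(measured)
        current = best                           # role of Move
        current_utility = measure(best)          # role of MeasureSNR and SetUtility
        measured[best] = current_utility
        estimate[best] = current_utility         # role of UpdateGraph


# Hypothetical example: four candidate configurations.
true_utility = {"A": 3.0, "B": 5.0, "C": 4.0, "D": 1.0}
optimistic = {"A": 6.0, "B": 5.5, "C": 4.5, "D": 2.0}

if __name__ == "__main__":
    print(optimistic_search(optimistic, true_utility.__getitem__))
```

In this toy instance the search stops after measuring only A and B: once B's measured utility is at least as large as every remaining optimistic estimate, C and D cannot be better and never need to be visited.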
Theorem 4. The robotic network configuration obtained from LEONA is optimal.
Proof. In LEONA, by construction, when the communication graph $G$ is updated at
each step based on measurements, the predicted weight of each edge is always no
worse (footnote 4) than its actual weight. Thus the communication graph is always optimistic.
Since the utility is a monotonic function of the link qualities, the estimated utility of
every configuration is no worse than its actual utility.
When LEONA terminates, the actual utility of the final configuration is at least as good
as the estimated utility of the best possible configuration, which indicates that the actual utility of
the final configuration is at least as good as the actual utilities of all other possible
configurations. Therefore, the final configuration is optimal.
Footnote 4: The predicted edge weight is the same as the actual weight if the link quality has been
measured or all related wall information has been detected; it is overestimated if some
related wall information is still missing.
LEONA provides a unified framework for finding an optimal communication path
that combines environment exploration and configuration optimization. It
is general enough to permit optimizing for different utility functions in non-convex
environments. In the following, we provide two specific case studies in which we
apply LEONA with FindPath($G$) instantiated to find optimal configurations with
respect to two different metrics.
5.3 Case Study I: Finding Minimized ETX Path
One commonly used metric of link quality is the expected number of
transmissions per successfully delivered packet (ETX), which can be modeled as the
inverse of the successful packet transmission rate $\rho_{i,j}$ over a link $(i, j)$. Once the specifics
of a communication system (modulation and coding scheme, etc.) are fixed, $\rho_{i,j}$ can
typically be expressed in terms of either a Q function or an exponential function of
$\gamma_{i,j}$ [92]. We use the Q function as an example here, and the corresponding successful
packet transmission rate of link $(i, j)$ is
$$\rho_{i,j} = \left(1 - Q\left(\sqrt{c\,\gamma_{i,j}}\right)\right)^{h} \tag{5.5}$$
where $h$ is the length of a packet and $c$ is some positive constant.
The corresponding link quality, defined as the ETX, is
$$l_{i,j} = \frac{1}{\left(1 - Q\left(\sqrt{c\,\gamma_{i,j}}\right)\right)^{h}} \tag{5.6}$$
The ETX of a path is the sum of all link ETXs, and our goal is to let
robots form a communication path between the source and destination that has
the minimized ETX or, equivalently, the maximized utility:
$$\max_{(x_1, \ldots, x_m) \in \mathcal{P}} \; -\sum_{k=2}^{m} l_{k-1,k} \tag{5.7}$$
The optimal configuration of robots can be found by applying LEONA with
FindPath($G$) implemented as a Shortest Path Algorithm (e.g., the Bellman-Ford
algorithm) run on the current graph $G$, with the weight of each edge set to its
associated ETX and with the constraint that the total number of hops along a path is at
most $m - 1$.
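One way to realize a hop-constrained shortest path search is a Bellman-Ford style relaxation that is run for exactly as many rounds as hops are allowed. The sketch below is a minimal illustration under assumed inputs: the function name `min_etx_path`, the edge-dictionary representation, and the toy weights are hypothetical and are not taken from the thesis.

```python
import math


def min_etx_path(weights, source, target, max_hops):
    """Cheapest source->target path that uses at most max_hops edges.

    weights: dict {(u, v): cost} of directed edge costs, e.g. per-link ETX.
    Returns (total cost, list of nodes); cost is math.inf if no such path exists.
    """
    nodes = {u for edge in weights for u in edge} | {source, target}
    # best[v] = (cost, path) of the cheapest walk to v found with the hops used so far
    best = {v: (math.inf, None) for v in nodes}
    best[source] = (0.0, [source])
    for _ in range(max_hops):
        nxt = dict(best)
        for (u, v), w in weights.items():
            cost_u, path_u = best[u]          # read only from the previous round
            if cost_u + w < nxt[v][0]:
                nxt[v] = (cost_u + w, path_u + [v])
        best = nxt
    return best[target]


# Hypothetical toy graph: a long direct link versus shorter multi-hop routes.
weights = {("s", "a"): 1.0, ("a", "b"): 1.0, ("b", "t"): 1.0,
           ("s", "t"): 5.0, ("a", "t"): 3.5}

if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(hops, min_etx_path(weights, "s", "t", hops))
```

Reading costs only from the previous round is what enforces the hop limit: after $h$ rounds, every stored path has at most $h$ edges. With $m$ robots, the limit would be $m - 1$.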
5.3.1 Analysis of the Sufficient Searched Area
As the optimality of LEONA is guaranteed, one additional question to ask is:
How much space needs to be explored in order to find an optimal configuration?
Assume the distance and the number of walls between the source robot 1 and
destination robot $m$ are denoted as $d$ and $n$ respectively. Then the following theorem
provides an upper bound on the sufficient searched area.
Theorem 5. When applying LEONA, the size of the sufficient searched area $\mathcal{A}$
that guarantees robots to find the optimal configuration with minimized ETX is
$$\mathcal{A} = O\left(d^{2}\, 10^{\frac{nW}{5\eta}}\right) \tag{5.8}$$
Further, if the distribution of the $n$ walls along the straight line connecting the source
and destination allows each communication pair to have the same number of walls when
robots are evenly spaced along the straight line, the sufficient searched area
reduces to $O\left(d^{2}\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{5\eta}}\right)$.
Remark 3. 1) The sufficient searched area scales polynomially with the distance
between the source and destination and exponentially with the number of walls along
the straight line connecting them. Any wall that is not along the straight
line has no effect on the size of the sufficient searched area; 2) When the distance
and the number of walls along the straight line are fixed, the distribution of walls
plays a crucial role in the size of the searched area. The searched area becomes large
when walls cluster close to each other, and it becomes small when walls are spread out
along the straight line; 3) Given an unknown environment, one possible way
to reduce the size of the searched area is to send more robots into the space, which
gives a better chance of having the walls separated among communication pairs, so that
each pair suffers less signal attenuation and the path has a better ETX.
Before proving Theorem 5, let us first simplify Eq. (5.5). According to [88],
the packet reception rate of a link $(i, j)$ can be closely approximated as a sigmoidal
function of distance, $\rho_{i,j} = 1 - \frac{1}{1 + e^{-\alpha (d_{i,j} - \beta)}}$, where $\alpha, \beta \in \mathbb{R}^{+}$ are shape and center
parameters that depend on the communication range and the variance of the environmental
fading. Figure 5.1 illustrates how good the approximation is by comparing the
sigmoidal function with the packet reception rate function derived directly from
either a Q function or an exponential function.
[Figure 5.1 plot: horizontal axis "Distance" (0 to 20), vertical axis labeled "ETX" (0 to 1); curves: Q func, EXP func, Approx.]
Figure 5.1: Comparisons of $\rho_{i,j}$ derived from sigmoidal, Q and exponential functions
When link $(i, j)$ suffers from wall attenuation, the wall attenuation can be equivalently
converted to distance attenuation. According to Eqs. (5.1) and (5.2), the
received SNR at robot $j$ is
$$\frac{10^{\frac{P_0}{10}} \left(\frac{d_{i,j}}{d_0}\right)^{-\eta} 10^{-\frac{n_{i,j} W}{10}}}{N_0 B},$$
where $n_{i,j}$ is the number of walls along
the link. Consider an equivalent case in which there are no walls between robots $i$
and $j$ but the signal attenuation is the same as in the obstructed case. Let the equivalent
wall-free distance between robots $i$ and $j$ be $D(\cdot)$, which is a function of $n_{i,j}$
and $d_{i,j}$. The received SNR at robot $j$ in the equivalent wall-free case can
be represented as
$$\frac{10^{\frac{P_0}{10}} \left(\frac{D(n_{i,j},\, d_{i,j})}{d_0}\right)^{-\eta}}{N_0 B}.$$
Setting it equal to the received SNR in the
obstructed case, the distance of a wall-free link with the same received SNR (or
signal attenuation) can be expressed as:
$$D(n_{i,j}, d_{i,j}) = d_{i,j}\, 10^{\frac{n_{i,j} W}{10\eta}} \tag{5.9}$$
By converting wall attenuation to distance attenuation via Eq. (5.9), the
packet reception rate of link $(i, j)$ in the obstructed case can be expressed as
$$\rho_{i,j} = 1 - \frac{1}{1 + e^{-\alpha \left(d_{i,j}\, 10^{\frac{n_{i,j} W}{10\eta}} - \beta\right)}} \tag{5.10}$$
and the corresponding ETX of the link becomes
$$\omega_{i,j} = \frac{1}{\rho_{i,j}} = 1 + e^{\alpha \left(d_{i,j}\, 10^{\frac{n_{i,j} W}{10\eta}} - \beta\right)} \tag{5.11}$$
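The conversion from walls to extra distance, and the resulting reception rate and ETX, can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the default values for the wall attenuation, the path loss parameter, and the sigmoid's shape and center parameters are assumptions chosen for the example.

```python
import math


def equivalent_distance(d, n_walls, wall_db=20.0, eta=3.3):
    """Eq. (5.9): wall-free distance with the same attenuation as d plus n_walls walls."""
    return d * 10.0 ** (n_walls * wall_db / (10.0 * eta))


def reception_rate(d, n_walls, alpha=0.2, beta=8.0, **wall_params):
    """Eq. (5.10): sigmoidal packet reception rate at the equivalent distance."""
    x = alpha * (equivalent_distance(d, n_walls, **wall_params) - beta)
    return 1.0 - 1.0 / (1.0 + math.exp(-x))


def link_etx(d, n_walls, alpha=0.2, beta=8.0, **wall_params):
    """Eq. (5.11): ETX of the link, the inverse of the reception rate."""
    x = alpha * (equivalent_distance(d, n_walls, **wall_params) - beta)
    return 1.0 + math.exp(x)


if __name__ == "__main__":
    for walls in (0, 1):
        print(walls, equivalent_distance(5.0, walls), link_etx(5.0, walls))
```

The identity $1 - \frac{1}{1 + e^{-x}} = \frac{1}{1 + e^{x}}$ is what makes the ETX collapse to the simple form $1 + e^{x}$.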
Now, let us move on to analyze the scenario of $m$ robots in an obstructed
environment. We consider placing all $m - 2$ relay robots equally spaced along the
straight line connecting the source and destination as a benchmark case and, without
loss of generality, assume robots can be located at positions that are occupied by walls.
If the distance and the number of walls between the source and destination are fixed,
the following lemma shows how the distribution of walls affects the ETX.
Lemma 4. In the benchmark scenario, given a fixed flow distance and a fixed
number of walls causing signal attenuation to the flow, the distribution of walls that
maximizes the ETX is the one in which all walls gather together within one communication
link, and the distribution that minimizes the ETX is the one in which walls are evenly distributed
among communication links.
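Proof. Let $n_{k-1,k}$ denote the number of walls between robots $x_{k-1}$ and $x_k$, where
$2 \le k \le m$. As the distance between any two adjacent robots is fixed as $\frac{d}{m-1}$ in the
benchmark case, the total ETX of the flow is
$$\sum_{k=2}^{m} \left(1 + e^{\alpha \left(\frac{d}{m-1}\, 10^{\frac{n_{k-1,k} W}{10\eta}} - \beta\right)}\right) \tag{5.12}$$
which is a convex function of the $m - 1$ variables $n_{k-1,k}$ ($2 \le k \le m$).
First, let us consider the minimization problem. For simplicity, we use $y_{k-1}$ to
represent $n_{k-1,k}$ ($2 \le k \le m$). Finding a wall distribution that minimizes
the ETX is then equivalent to solving the following convex optimization problem:
$$\begin{aligned}
\min_{y} \quad & g(y) = \sum_{i=1}^{m-1} \left(1 + e^{\alpha \left(\frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}} - \beta\right)}\right) \\
\text{s.t.} \quad & \sum_{i=1}^{m-1} y_i = n \\
& y_i \ge 0, \quad \forall\, 1 \le i \le m-1
\end{aligned} \tag{5.13}$$
where $y = (y_1, \ldots, y_{m-1})$ is the allocation of walls among links.
From the KKT conditions [8], we have $\frac{\partial g(y)}{\partial y_i} = \lambda$, $\forall i \in \{1, \ldots, m-1\}$, where
$\lambda$ is the Lagrange multiplier. Since every term of $g$ has the same form and its derivative
is strictly increasing in $y_i$, equal derivatives imply equal $y_i$. Therefore, the optimal solution
is to set $y_i = \frac{n}{m-1}$, $\forall i \in \{1, \ldots, m-1\}$.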
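Next, let us focus on the maximization problem, which we prove by
induction. Consider the following problem, which is equivalent to the maximization
problem when $k = m - 1$:
$$\begin{aligned}
\max_{y} \quad & g_k(y) = k + e^{-\alpha\beta} \sum_{i=1}^{k} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}}} \\
\text{s.t.} \quad & \sum_{i=1}^{k} y_i = n \\
& y_i \ge 0, \quad \forall\, 1 \le i \le k
\end{aligned} \tag{5.14}$$
When $k = 2$, we have $g_2(y) = 2 + e^{-\alpha\beta} \sum_{i=1}^{2} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}}}$ with $y_1 + y_2 = n$. As $g_2(y)$
is a convex function, its maximum is achieved at one of its boundary points, either
$(n, 0)$ or $(0, n)$.
Assume that, $\forall k \le K$, $g_k(y)$ is maximized when one variable takes the value $n$ and
all others are 0. Then when $k = K + 1$, we have $g_{K+1}(y) = K + 1 + e^{-\alpha\beta} \sum_{i=1}^{K+1} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}}}$, where
$\sum_{i=1}^{K+1} y_i = n$. Then, we have
$$\begin{aligned}
\max_{y}\, g_{K+1}(y)
&= \max_{0 \le y_{K+1} \le n} \Bigg( 1 + e^{-\alpha\beta} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_{K+1} W}{10\eta}}} + \max_{\sum_{i=1}^{K} y_i = n - y_{K+1}} \Big\{ K + e^{-\alpha\beta} \sum_{i=1}^{K} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}}} \Big\} \Bigg) \\
&= \max_{0 \le y_{K+1} \le n} \Bigg( 1 + e^{-\alpha\beta} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_{K+1} W}{10\eta}}} + K + e^{-\alpha\beta} \Big( (K-1)\, e^{\alpha \frac{d}{m-1}} + e^{\alpha \frac{d}{m-1}\, 10^{\frac{(n - y_{K+1}) W}{10\eta}}} \Big) \Bigg)
\end{aligned} \tag{5.15}$$
where the last equality comes from the fact that $K + e^{-\alpha\beta} \sum_{i=1}^{K} e^{\alpha \frac{d}{m-1}\, 10^{\frac{y_i W}{10\eta}}}$ under
the condition $\sum_{i=1}^{K} y_i = n - y_{K+1}$ takes its maximum when one variable is $n - y_{K+1}$
and all the others are 0. Because of convexity, Eq. (5.15) achieves its
maximum when $y_{K+1}$ equals either 0 or $n$, which means that when $k = K + 1$, $g_{K+1}(y)$
achieves its maximum if one variable equals $n$ and all the others are 0.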
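The lemma can be sanity-checked by brute force on a small instance. The sketch below enumerates every integer allocation of walls to links and compares the total ETX of the benchmark path; the function names, the instance size (6 walls, 3 links, path length 30), and all parameter values are assumptions picked for this illustration.

```python
import itertools
import math


def total_etx(walls_per_link, d, alpha=0.2, beta=8.0, wall_db=3.3, eta=3.3):
    """Total ETX of the evenly spaced benchmark path for a given wall allocation."""
    hop = d / len(walls_per_link)          # equal spacing between adjacent robots
    return sum(
        1.0 + math.exp(alpha * (hop * 10.0 ** (y * wall_db / (10.0 * eta)) - beta))
        for y in walls_per_link
    )


def allocations(n_walls, n_links):
    """All ways to place n_walls walls on n_links links (non-negative integers)."""
    for combo in itertools.product(range(n_walls + 1), repeat=n_links):
        if sum(combo) == n_walls:
            yield combo


allocs = list(allocations(6, 3))
worst = max(allocs, key=lambda y: total_etx(y, d=30.0))
best = min(allocs, key=lambda y: total_etx(y, d=30.0))

if __name__ == "__main__":
    print("largest ETX:", worst, "smallest ETX:", best)
```

On this instance the largest ETX occurs when all six walls sit on a single link and the smallest when each link carries two walls, in line with the lemma.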
When running LEONA, after the first run of FindPath($G$) with the communication
graph $G$ constructed under the assumption that the environment is wall-free, robots
move to be evenly spaced along the straight line connecting the source and destination.
If this is not the optimal configuration, the robots' exploration starts from
here and expands outwards. If the area robots have searched is large enough that
any communication path outside the searched area, even without wall attenuation,
has a worse ETX than that of the evenly-spaced-along-the-straight-line benchmark
case, there is no need to search further and the optimal communication path
is guaranteed to be found in the searched area. According to [88], it may be the
case that not all relay robots are needed to form the path. However, the benchmark
case gives an ETX no better than the optimal case, and thus the sufficient
searched area is still large enough to guarantee finding the optimal configuration.
Therefore, based on the benchmark case and Lemma 4, Theorem 5 can be proved
by considering the size of the sufficient searched area in two edge cases; for any
other case, the searched area size falls in between.
Proof. On the one hand, when the $n$ walls gather together on one communication
link, which yields the worst ETX, the corresponding ETX is
$(m-2)\left(1 + e^{\alpha \left(\frac{d}{m-1} - \beta\right)}\right) + 1 + e^{\alpha \left(\frac{d}{m-1}\, 10^{\frac{nW}{10\eta}} - \beta\right)}$.
Consider a wall-free path of length $D$ whose ETX can be expressed
as $(m-1)\left(1 + e^{\alpha \left(\frac{D}{m-1} - \beta\right)}\right)$. When $D \ge d\, 10^{\frac{nW}{10\eta}}$, even though it is wall-free, the ETX of
the free path is still no better than that of the obstructed case where walls gather together.
Thus, the sufficient explored area is $O(D^{2})$. Therefore, the size of the sufficiently
large exploration area to find an optimal solution in this case is $\mathcal{A} = O\left(d^{2}\, 10^{\frac{nW}{5\eta}}\right)$,
which is also an upper bound on the size of the sufficient searched area.
On the other hand, when walls are equally distributed among all communication
links, which gives the best ETX, the corresponding ETX is
$(m-1)\left(1 + e^{\alpha \left(\frac{d}{m-1}\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{10\eta}} - \beta\right)}\right)$,
considering that the number of walls on each communication link is an integer.
Similarly, when a free path has a distance satisfying $D \ge d\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{10\eta}}$, the corresponding
ETX is no less than that of the benchmark case in the obstructed space. Thus
the sufficient explored area is $\mathcal{A} = O\left(d^{2}\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{5\eta}}\right)$. Further, if $m = O(n)$, the explored
area decreases to $\mathcal{A} = O(d^{2})$, which means the size of the searched area only
depends on the distance between the source and destination.
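To get a feel for the two edge cases in the proof, the wall-free path length beyond which no path can beat the benchmark can be computed directly. The sketch below is illustrative: the function name `sufficient_distance` and the default parameter values are assumptions, and the sufficient searched area then scales as the square of the returned length.

```python
import math


def sufficient_distance(d, n_walls, n_robots=None, wall_db=20.0, eta=3.3):
    """Wall-free path length D beyond which a path cannot beat the benchmark.

    n_robots=None: worst case, all walls on a single link.
    Otherwise:     walls spread evenly, ceil(n_walls / (n_robots - 1)) per link.
    """
    if n_robots is None:
        walls_on_link = n_walls
    else:
        walls_on_link = math.ceil(n_walls / (n_robots - 1))
    return d * 10.0 ** (walls_on_link * wall_db / (10.0 * eta))


if __name__ == "__main__":
    d = math.dist((3, 3), (48, 48))   # straight-line distance between two grid points
    for n in (1, 2, 3):
        clustered = sufficient_distance(d, n)
        spread = sufficient_distance(d, n, n_robots=11)
        print(n, clustered, spread)
```

With 11 robots and at most 10 walls, the evenly spread case never places more than one wall on a link, so its bound stays fixed while the clustered bound grows exponentially in the number of walls.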
5.3.2 Simulation Results
We present numerical simulation results for a network containing 11 robots in a
$50 \times 50$ environment. Robot 1 and robot 11, as the source and destination,
are statically located at $(3, 3)$ and $(48, 48)$ respectively. The rest are mobile robots
that move around in order to form a communication path between the
source and destination with minimized ETX. The initial locations of the mobile robots
are set to be as equally spaced as possible along the straight line connecting the
source and destination (footnote 5). We use $P_0 = 20$, $d_0 = 1$, $\eta = 3.3$, $W = 20$, $N_0 = 10^{-14}$,
and $B = 2 \times 10^{6}$. When taking wall attenuation into account, the packet reception
rate of a link $(i, j)$ is set as in Eq. (5.10), which accommodates different kinds of specifics
of a communication system. We use this model in our simulations and set $\alpha = 0.2$
and $\beta = 8$.
We fix the shape and size of walls, and use the number of walls as an indicator
of the complexity of an unknown environment. We randomly deploy walls in the
space, and the number of walls is varied as shown in Figure 5.2. The performance
result for each case in Figure 5.2 is averaged over ten runs. We compare LEONA
with an Offline Algorithm, which is guaranteed to find the optimal configuration.
In the Offline Algorithm, robots first fully explore the environment (footnote 6) to obtain a
prior map of the area and build the corresponding communication graph $G$.
Then they apply the Shortest Path Algorithm to find the optimal configuration. In
addition to presenting the ETX of the optimal path, we also present the total
number of moving steps robots take before finding the optimal path, which serves
as an indicator of the searched area size. As can be seen in Figure 5.2, LEONA
always finds optimal configurations; and as the number of walls (or the complexity of
the environment) increases, the number of steps taken during exploration increases
under both schemes, but LEONA takes far fewer.
Figure 5.2: ETX (left) and moving steps (right)
Footnote 5: Mobile robots can initially start from any locations in the environment, and they still find
the optimal configurations when LEONA terminates.
Footnote 6: In the simulation, we let robots search the environment column by column and use the
Righthand Traversal Rule to avoid walls.
In Figure 5.3, we present the optimal robot configurations obtained in an environment
with 12 walls, with single wall attenuation effect $W = 20$ in (a) and $W = 3.3$
in (b). In Figure 5.3 (a), wall attenuation is strong, so the optimal configuration
is a path free from walls, along which robots are roughly evenly spaced. In
Figure 5.3 (b), the optimal path still has walls affecting it, since the wall attenuation
is too weak to cause robots to move away from the straight line. However,
robots are no longer evenly spaced, and those communicating robot pairs suffering
from wall attenuation compensate by having a shorter link distance. As can be seen
from both cases, robots only explore part of the environment before finding the
optimal configuration. Note that, because the
utility in the ETX case is a unimodal (increasing-then-decreasing) function of the
number of robots, not all available robots are necessarily required in the optimal path
configuration, although more robots are generally needed as the number of walls
increases.
Figure 5.3: Illustration of robot configurations: strong wall attenuation (left) and
weak wall attenuation (right)
5.4 Case Study II: Finding Maximized Transmission-Rate Path
Another important metric of link quality is transmission rate. From the
classic Shannon-Hartley Formula, the transmission rate of a link $(i, j)$ is a function
of the SNR at receiver $j$ corresponding to transmitter $i$:
$$l_{i,j} = B \log(1 + \gamma_{i,j}) \tag{5.16}$$
The transmission rate of a communication path formed by multiple links is determined
by the transmission rate of its bottleneck link. Thus, to have a maximized
transmission rate between the source and destination, the $m - 2$ mobile robots need
to form a transmission path that has a maximized bottleneck rate. Therefore, the
objective function in (5.4) is instantiated as
$$\max_{(x_1, \ldots, x_m) \in \mathcal{P}} \min(l_{1,2}, \ldots, l_{m-1,m}) \tag{5.17}$$
The optimal configuration of robots in this case can be found by LEONA with
FindPath($G$) implemented as the Widest Path Algorithm [31] with the constraint that
the total number of hops along a path is at most $m - 1$.
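A hop-constrained widest path search can be written as a round-based relaxation in which the path value is the minimum edge rate rather than the sum of edge costs. The sketch below is one simple way to do this and is not necessarily the algorithm of [31]; the function name `widest_path`, the edge-dictionary input, the assumption of strictly positive rates, and the toy rates are all illustrative.

```python
import math


def widest_path(rates, source, target, max_hops):
    """Path from source to target with the largest bottleneck rate, using <= max_hops edges.

    rates: dict {(u, v): rate} of directed link rates, assumed strictly positive.
    Returns (bottleneck rate, list of nodes); (0.0, None) if no such path exists.
    """
    nodes = {u for edge in rates for u in edge} | {source, target}
    best = {v: (0.0, None) for v in nodes}      # (bottleneck, path) per node
    best[source] = (math.inf, [source])         # empty path has no bottleneck yet
    for _ in range(max_hops):
        nxt = dict(best)
        for (u, v), r in rates.items():
            width_u, path_u = best[u]           # read only from the previous round
            if path_u is None:
                continue                        # u is not reachable yet
            width = min(width_u, r)
            if width > nxt[v][0]:
                nxt[v] = (width, path_u + [v])
        best = nxt
    return best[target]


# Hypothetical toy graph: more hops allow a wider bottleneck.
rates = {("s", "t"): 1.0, ("s", "a"): 4.0, ("a", "t"): 3.0,
         ("a", "b"): 4.0, ("b", "t"): 4.0}

if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(hops, widest_path(rates, "s", "t", hops))
```

In the toy graph the bottleneck rate improves each time one more hop is allowed, which mirrors the observation that additional relays can raise the achievable path rate.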
5.4.1 Analysis of the Sufficient Searched Area
In the case of finding the maximized transmission-rate path, we have the same sufficient
searched area result:
Theorem 6. When applying LEONA, the size of the sufficient searched area $\mathcal{A}$ that
guarantees robots to find the optimal configuration with maximized transmission rate
is
$$\mathcal{A} = O\left(d^{2}\, 10^{\frac{nW}{5\eta}}\right) \tag{5.18}$$
Further, if the distribution of the $n$ walls along the straight line connecting the source
and destination allows each communication pair to have the same number of walls when
robots are evenly spaced along the straight line, the sufficient searched area
reduces to $O\left(d^{2}\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{5\eta}}\right)$.
Proof. As before, we consider placing robots equally spaced along the straight
line connecting the source and destination as a benchmark case. Since the distance between
any two communicating robots is $\frac{d}{m-1}$, the bottleneck link, which determines
the transmission rate of the benchmark path, is the communication pair that
has the most walls between them. The more walls there are between the bottleneck
pair, the lower the transmission rate of the benchmark path is. Let
the number of walls affecting the bottleneck link be denoted as $n_b$. According to
Eq. (5.9), the equivalent wall-free distance of the bottleneck communication pair
is $D\left(n_b, \frac{d}{m-1}\right)$. Thus, the size of the sufficiently large exploration area to find
an optimal solution is $\mathcal{A} = O\left(d^{2}\, 10^{\frac{n_b W}{5\eta}}\right)$. On the one hand, if all the walls gather on the
bottleneck link, $n_b$ takes its maximum $n$, which indicates that the walls' attenuation
is maximized, and the size of the explored area becomes $\mathcal{A} = O\left(d^{2}\, 10^{\frac{nW}{5\eta}}\right)$. On the
other hand, if walls are evenly distributed among the communicating robot pairs, $n_b$
takes its minimum $\lceil \frac{n}{m-1} \rceil$, which indicates that the walls' attenuation is minimized,
and the size of the explored area becomes $\mathcal{A} = O\left(d^{2}\, 10^{\lceil \frac{n}{m-1} \rceil \frac{W}{5\eta}}\right)$. Additionally, if
$m = O(n)$, the explored area decreases to $\mathcal{A} = O(d^{2})$.
5.4.2 Simulation Results
We conduct simulations in the same network scenario as Case Study I. The Offline
Algorithm is the same as in Case Study I, with the only difference that robots apply the
Widest Path Algorithm to find the optimal configuration. As seen in Figure 5.4,
similar results are found: LEONA always finds the optimal configuration and
requires fewer movements and less exploration than the offline scheme. We also
present robot configurations in an environment with 12 walls, with single wall attenuation
$W = 20$ in Figure 5.5 (a) and $W = 3.3$ in Figure 5.5 (b). In the strong wall
attenuation case, robots form a wall-free optimal path on which they are roughly
evenly spaced. However, in the weak wall attenuation case, the optimal path still
has walls along it, and communication pairs suffering from wall attenuation compensate
by having a shorter link distance. One difference from Case Study I is that,
since the maximized transmission rate of a path increases with the number of robots
when the environmental configuration is fixed, all 9 relay robots are
required in the optimal configuration.
Figure 5.4: Transmission rate (left) and moving steps (right)
Figure 5.5: Illustration of robot configurations: strong wall attenuation (left) and
weak wall attenuation (right)
5.5 Conclusion
We have shown in this chapter how the adoption of an iterative online search
combined with a graph-based approach can allow for the formation of optimal
robotic network configurations in unknown environments with obstructions. We
have illustrated our general LEONA framework with two specific case studies. It
is straightforward to incorporate many other utility functions and constraints into
this framework. In this chapter, we assumed a simple path loss model with wall
attenuation in order to make and update predictions. However, the framework is
flexible enough to accommodate other models and approaches to prediction as long
as an optimistic estimate can be generated at each step.
For many path and network optimization problems such as the ones considered
in this chapter, it is possible to obtain a global solution in polynomial time. So long
as the network of robots moves in such a way as to ensure connectivity is maintained
at all times, they can exchange their measurements in an online fashion, and the
predicted graph can be updated in a consistent manner by all robots in the network,
allowing each of them to compute its own optimal predicted location in a
parallel fashion. It may be possible to adopt and interleave more sophisticated
message passing mechanisms with the iterations of the online algorithm to further
improve the robustness and efficiency of the system.
Chapter 6
Deep Reinforcement Learning for Dynamic
Multichannel Access in Wireless Networks
In this chapter, we consider a dynamic multichannel access problem, where multiple
correlated channels follow an unknown joint Markov model. The problem is
formulated as a partially observable Markov decision process (POMDP) with unknown
system dynamics. To overcome the challenges of unknown system dynamics
as well as prohibitive computation, we apply the concept of reinforcement learning
and implement a Deep Q-Network (DQN) that can deal with a large state space
without any prior knowledge of the system dynamics. We provide an analytical
study of the optimal policy for fixed-pattern channel switching with known system
dynamics and show through simulations that DQN can achieve the same optimal
performance without knowing the system statistics. We compare the performance
of DQN with a Myopic policy and a Whittle Index-based heuristic through both
simulations and a real-data trace, and show that DQN achieves near-optimal
performance in more complex situations. We also show that DQN has the potential to
tackle more complicated and practical multi-user scenarios. Finally, we propose an
adaptive DQN approach with the capability to adapt its learning in time-varying,
dynamic scenarios.
The work in this chapter is based on [83, 84].
6.1 Problem Formulation
Consider a dynamic multichannel access problem where a single user
dynamically chooses one out of $N$ channels to transmit packets. Each channel can
be in one of two states: good (1) or bad (0). Since channels may be correlated, the
whole system can be described as a $2^{N}$-state Markov chain. At the beginning of each
time slot, the user selects one channel to sense and transmit a packet. If the channel
quality is good, the transmission succeeds and the user receives a positive reward
($+1$). Otherwise, the transmission fails and the user receives a negative reward
($-1$). The objective is to design a policy that maximizes the expected long-term
reward.
Let the state space of the Markov chain be $\mathcal{S} = \{s_1, \ldots, s_{2^N}\}$. Each state $s_i$
($i \in \{1, \ldots, 2^N\}$) is a length-$N$ vector $[s_{i1}, \ldots, s_{iN}]$, where $s_{ik}$ is the binary
representation of the state of channel $k$: good (1) or bad (0). The transition matrix of the
Markov chain is denoted as $P$. Since the user can only sense one channel and
observe its state at the beginning of each time slot, the full state of the system,
i.e., the states of all channels, is not observable. However, the user can infer the
system state according to his sensing decisions and observations. Thus, the dynamic
multichannel access problem falls into the general framework of POMDP. Let
$\Omega(t) = [\omega_{s_1}(t), \ldots, \omega_{s_{2^N}}(t)]$ represent the belief vector maintained by the user,
where $\omega_{s_i}(t)$ is the conditional probability that the system is in state $s_i$ given all
previous decisions and observations. Given the sensing action $a(t) \in \{1, \ldots, N\}$
representing which channel to sense at the beginning of time slot $t$, the user can observe
the state of channel $a(t)$, denoted as $o(t) \in \{0, 1\}$. Then, based on this observation, he can
update the belief vector at time slot $t$, denoted as
$\hat{\Omega}(t) = [\hat{\omega}_{s_1}(t), \ldots, \hat{\omega}_{s_{2^N}}(t)]$. The
belief of each possible state $\hat{\omega}_{s_i}(t)$ is updated as follows:
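$$
\hat{\omega}_{s_i}(t) =
\begin{cases}
\dfrac{\omega_{s_i}(t)\,\mathbb{1}(s_{ik}(t) = 1)}{\sum_{j=1}^{2^N} \omega_{s_j}(t)\,\mathbb{1}(s_{jk}(t) = 1)}, & a(t) = k,\ o(t) = 1 \\[2ex]
\dfrac{\omega_{s_i}(t)\,\mathbb{1}(s_{ik}(t) = 0)}{\sum_{j=1}^{2^N} \omega_{s_j}(t)\,\mathbb{1}(s_{jk}(t) = 0)}, & a(t) = k,\ o(t) = 0
\end{cases}
\tag{6.1}
$$
where $\mathbb{1}(\cdot)$ is the indicator function.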
Combining the newly updated belief vector $\hat{\Omega}(t)$ for time slot $t$ with the system
transition matrix $P$, the belief vector for time slot $t + 1$ can be updated as:
$$
\Omega(t + 1) = \hat{\Omega}(t)\,P \tag{6.2}
$$
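As a minimal illustration of Eqs. (6.1) and (6.2), the following Python sketch performs one belief update on an explicitly enumerated state list. The function names, the 0-based channel index `k`, and the list-of-lists representation of $P$ are assumptions made here for illustration; they are not taken from the original implementation.

```python
from typing import List, Sequence


def condition_belief(belief: Sequence[float],
                     states: Sequence[Sequence[int]],
                     k: int, o: int) -> List[float]:
    """Eq. (6.1): keep states whose channel-k bit equals observation o, renormalize."""
    weights = [w if s[k] == o else 0.0 for w, s in zip(belief, states)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [w / total for w in weights]


def propagate_belief(belief_hat: Sequence[float],
                     P: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (6.2): multiply the row vector belief_hat by the transition matrix P."""
    n = len(P)
    return [sum(belief_hat[i] * P[i][j] for i in range(n)) for j in range(n)]
```

For example, with two channels the states can be listed as `[(0, 0), (0, 1), (1, 0), (1, 1)]`; sensing channel 0 and observing "good" zeroes out the first two entries of the belief before the result is pushed through `P`.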
A sensing policy $\pi: \Omega(t) \to a(t)$ is a function that maps the belief vector $\Omega(t)$
to a sensing action $a(t)$ at each time slot $t$. Given a policy $\pi$, the long-term reward
considered in this problem is the expected accumulated discounted reward over an
infinite time horizon, defined as:
$$
\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_{\pi(\Omega(t))}(t) \,\middle|\, \Omega(1)\right] \tag{6.3}
$$
where $0 \le \gamma < 1$ is a discount factor, $\pi(\Omega(t))$ is the action (i.e., which channel
to sense) at time $t$ when the current belief vector is $\Omega(t)$, and $R_{\pi(\Omega(t))}(t)$ is the
corresponding reward.
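To make the objective in Eq. (6.3) concrete, the sketch below evaluates the discounted sum for one finite reward sequence, i.e., a truncated sample of the infinite-horizon quantity rather than its expectation. The function name and the example rewards are illustrative assumptions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**(t-1) * r_t for t = 1..len(rewards), evaluated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With rewards `[+1, -1, +1]` and `gamma = 0.9`, the sum is $1 - 0.9 + 0.81 = 0.91$.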
If no information about the initial distribution of the system state is available,
one can assume the initial belief vector $\Omega(1)$ to be the stationary distribution of the
system. Our objective is to find a sensing policy $\pi^*$ that maximizes the expected
accumulated discounted reward over an infinite time horizon:
$$
\pi^* = \arg\max_{\pi}\ \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_{\pi(\Omega(t))}(t) \,\middle|\, \Omega(1)\right] \tag{6.4}
$$
As the dynamic multichannel access problem is a POMDP, the optimal sensing
policy $\pi^*$ can be found by considering its belief space and solving an augmented
MDP instead. Let $\mathcal{B}$ represent the belief space, and let $V^*(b)$ be the maximum
expected accumulated discounted reward from the optimal policy $\pi^*$ with initial
belief $b$. Then for all beliefs $b \in \mathcal{B}$, we have the following Bellman optimality
equation
$$
V^*(b) = \max_{k=1,\ldots,N} \Bigg\{ \sum_{i=1}^{2^N} \omega_{s_i}\,\mathbb{1}(s_{ik} = 1)
+ \gamma \sum_{i=1}^{2^N} \omega_{s_i}\,\mathbb{1}(s_{ik} = 1)\,V^*\big(T(b \mid a = k, o = 1)\big)
+ \gamma \sum_{i=1}^{2^N} \omega_{s_i}\,\mathbb{1}(s_{ik} = 0)\,V^*\big(T(b \mid a = k, o = 0)\big) \Bigg\}
\tag{6.5}
$$
where $T(b \mid a, o)$ is the updated belief given the action $a$ and observation $o$, as
in Eq. (6.2).
In theory, the value function $V^*(b)$ together with the optimal policy $\pi^*$ can be
found via the value iteration approach. However, since there are multiple channels and
they might be correlated, the belief space becomes a high-dimensional space. For
instance, in a typical multichannel WSN based on the widely used IEEE 802.15.4-2015
standard [35], nodes have to choose one out of 16 available channels to sense at
each time slot. If we consider the potential correlations among channels and simplify
each channel's condition to be in only two states, good or bad, the state space size
becomes $2^{16}$. As the belief represents a probability distribution over all
possible states, it also becomes high dimensional, which increases computation cost.
Even worse, the infinite size of the continuous belief space and the impact of
the current action on the future reward make POMDP PSPACE-hard, which is
even less likely to be solved in polynomial time than NP-hard problems [63]. To
exemplify the time complexity of solving such a POMDP problem, we simulate the
multichannel access problem with known system dynamics and use a POMDP solver
called SolvePOMDP [66] to find its optimal solution. In Figure 6.1, we show the
run-time as we increase the number of channels in the system. When the number of
channels is higher than 5, the POMDP solver cannot converge after a long interval,
and it gets terminated when the run-time exceeds the time limit.
Figure 6.1: Running time (seconds) of the POMDP solver as we vary the number of channels in the system
Figure 6.2: Gilbert-Elliot channel model
All these factors make it impossible to find the optimal solution to a POMDP in
general, and many existing works [2, 12, 13, 50, 51, 62, 74, 97] attempt to address
this challenge of prohibitive computation by considering either simpler models or
approximation algorithms.
6.2 Myopic Policy and Whittle Index
In the domain of dynamic multichannel access, there are many existing works on
finding the optimal or near-optimal policy with low computation cost when the
channels are independent and the system statistics ($P$) are known. The Myopic policy
and the Whittle Index policy are two effective and easy-to-implement approaches
for this setting.
6.2.1 Myopic Policy
A Myopic policy only focuses on the immediate reward obtained from an action
and ignores its effects in the future. Thus the user always tries to select a channel
which gives the maximum expected immediate reward:
$$
\hat{a}(t) = \arg\max_{k=1,\ldots,N} \sum_{i=1}^{2^N} \omega_{s_i}(t)\,\mathbb{1}(s_{ik}(t) = 1) \tag{6.6}
$$
The Myopic policy is not optimal in general. Researchers in [2, 97] have studied
its optimality when the $N$ channels are independent and statistically identical
Gilbert-Elliot channels that follow the same 2-state Markov chain with the transition matrix
$\begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}$, as illustrated in Figure 6.2. It is shown that the Myopic policy is
optimal for any number of channels when the channel state transitions are positively
correlated, i.e., $p_{11} \ge p_{01}$. The same optimal result still holds for two or three
channels when channel state transitions are negatively correlated, i.e., $p_{11} < p_{01}$.
In addition, the Myopic policy has a simple robust structure that follows a
round-robin channel selection procedure.
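The Myopic rule in Eq. (6.6) can be sketched in a few lines of Python: for each channel, sum the belief mass of the states in which that channel is good, then pick the channel with the largest sum. The function name, the 0-based channel indices, and the tie-breaking rule (lowest index wins) are assumptions introduced here for illustration.

```python
from typing import Sequence


def myopic_action(belief: Sequence[float],
                  states: Sequence[Sequence[int]],
                  num_channels: int) -> int:
    """Return the channel with the highest probability of being good (Eq. 6.6)."""
    def prob_good(k: int) -> float:
        return sum(w for w, s in zip(belief, states) if s[k] == 1)

    # max() returns the first maximizer, so ties go to the lowest channel index.
    return max(range(num_channels), key=prob_good)
```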
6.2.2 Whittle Index Based Heuristic Policy
When channels are independent, the dynamic multichannel access problem can also
be considered as a restless multi-armed bandit problem (RMAB) if each channel is
treated as an arm. An index policy assigns a value to each arm based on its current
state and chooses the arm with the highest index at each time slot. Like the Myopic
policy, an index policy has no optimality guarantee in general.
In [51], the Whittle Index is introduced for the case when $P$ is known and all
channels are independent but may follow different 2-state Markov chain models. In
this case, the Whittle Index policy can be represented as a closed-form solution,
and it has the same optimality result as the Myopic policy: the Whittle Index policy
is optimal for any number of channels when channels are identical and positively
correlated, or for two or three channels when channels are negatively correlated.
In addition, when channels follow identical distributions, the Whittle Index policy
has the same round-robin structure as the Myopic policy.
When channels are correlated, the Whittle Index cannot be defined and thus
the Whittle Index policy cannot be directly applied to our problem. To leverage its
simplicity, we propose a heuristic that ignores the correlations among channels and
uses the joint transition matrix $P$ and Bayes' Rule to compute the 2-state Markov
chain for each individual channel. Assume that for channel $k$, the transition matrix
is represented as $p(c_k^{t+1} = m \mid c_k^{t} = n)$, where $m, n \in \{0, 1\}$ (bad or good). Then, based
on Bayes' Rule we have,
$$
p(c_k^{t+1} = m \mid c_k^{t} = n)
= \frac{p(c_k^{t+1} = m,\ c_k^{t} = n)}{p(c_k^{t} = n)}
= \frac{\sum_{j=1}^{2^N} \sum_{i=1}^{2^N} p(s_j \mid s_i)\,p(s_i)\,\mathbb{1}(s_{jk} = m)\,\mathbb{1}(s_{ik} = n)}{\sum_{i=1}^{2^N} p(s_i)\,\mathbb{1}(s_{ik} = n)}
\tag{6.7}
$$
where $p(s_i)$ is the stationary distribution and $p(s_j \mid s_i)$ is the transition probability
from state $s_i$ to state $s_j$ defined in $P$. After each channel model is found, we can
apply the Whittle Index policy.
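The marginalization in Eq. (6.7) can be sketched as follows. The sketch takes the stationary distribution as an input rather than computing it, and it assumes every channel is good and bad with non-zero stationary probability (otherwise the division fails). The function name, argument layout, and 0-based channel index are illustrative assumptions.

```python
from typing import List, Sequence


def channel_transition_matrix(P: Sequence[Sequence[float]],
                              stationary: Sequence[float],
                              states: Sequence[Sequence[int]],
                              k: int) -> List[List[float]]:
    """Return T with T[n][m] = p(c_k^{t+1} = m | c_k^t = n), following Eq. (6.7)."""
    joint = [[0.0, 0.0], [0.0, 0.0]]  # joint[n][m] = p(c_k^{t+1} = m, c_k^t = n)
    marginal = [0.0, 0.0]             # marginal[n] = p(c_k^t = n)
    for i, s_i in enumerate(states):
        n = s_i[k]
        marginal[n] += stationary[i]
        for j, s_j in enumerate(states):
            joint[n][s_j[k]] += P[i][j] * stationary[i]
    return [[joint[n][m] / marginal[n] for m in (0, 1)] for n in (0, 1)]
```

For instance, with two reachable joint states `(1, 0)` and `(0, 1)` that swap with probability 0.9, channel 0 alone looks like a 2-state chain that flips with probability 0.9, which is exactly the information the heuristic keeps while discarding the coupling between the two channels.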
The Myopic policy and the Whittle Index policy are easy to implement in prac-
tice, as both of them have polynomial run-time. And in the case of independent
channels, the Myopic and the Whittle Index policies can achieve optimality under
certain conditions. However, so far to the best of our knowledge there is no easy-
to-implement policy applicable to the general case where channels are correlated.
Moreover, both policies require the prior knowledge of the system's transition ma-
trix, which is hard to obtain beforehand in practice. Thus, we need to come up
with a new approach that copes with these challenges.
6.3 Deep Reinforcement Learning Approach
When channels are correlated and system dynamics are unknown, there are two
main approaches to tackle the dynamic multichannel access problem: (i) Model-based
approach: first estimate the system model from observations and then
either solve it by following the dynamic programming method in Section 6.1 or
apply some computationally efficient heuristic algorithm such as the Myopic policy
or the Whittle Index policy (which have polynomial run-time); (ii) Model-free
approach: learn the policy directly through interactions with the system without
estimating the system model. The model-based approach is less favored since the
user can only observe one channel per time slot, and this limited observation
capability may result in a bad system model estimation. Even worse, even if the system
dynamics are well estimated, solving a POMDP in a large state space is always a
bottleneck as the dynamic programming method has exponential time complexity
(as explained in Section 6.1) and the heuristic approaches do not have any
performance guarantee in general. All these challenges motivate us to follow the
model-free approach, which, by incorporating the idea of Reinforcement Learning,
can learn directly from observations without the necessity of finding an estimated
system model and can be easily extended to very large and complicated systems.
We focus on the reinforcement learning paradigm, Q-learning [86] specifically, to
incorporate learning in the solution for the dynamic multichannel access problem.
In the context of dynamic multichannel access, the problem can be converted
to an MDP when considering the belief as the state space, and Q-learning can be
directly applied. However, this approach is impractical since the belief update
requires knowing the system transition matrix $P$ a priori, which is hardly
available in practice. Instead, we apply Q-learning by directly considering the
history of observations and actions. We define the state for Q-learning at time
slot $t$ as a combination of historically selected channels as well as their observed
channel conditions over the previous $M$ time slots, i.e.,
$x_t = [a_{t-1}, o_{t-1}, \ldots, a_{t-M}, o_{t-M}]$.
Then we can execute the online learning following Eq. (2.15) to find the sensing
policy. Intuitively, the more historical information we consider (i.e., the larger $M$
is), the better Q-learning can learn.
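A minimal sketch of tabular Q-learning on such a history state is given below. It uses the standard Q-learning update $Q(x, a) \leftarrow Q(x, a) + \alpha\,[r + \gamma \max_{a'} Q(x', a') - Q(x, a)]$, which is assumed here to correspond to Eq. (2.15); the helper names, the step size `alpha = 0.1`, and the tuple encoding of $x_t$ (most recent pair first) are illustrative assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

History = Tuple[Tuple[int, int], ...]  # ((a_{t-1}, o_{t-1}), ..., (a_{t-M}, o_{t-M}))


def make_q_table(num_channels: int) -> DefaultDict[History, List[float]]:
    """Look-up table that lazily creates a zero Q-vector for each unseen state."""
    return defaultdict(lambda: [0.0] * num_channels)


def shift_history(state: History, action: int, obs: int) -> History:
    """Prepend the newest (action, observation) pair and drop the oldest one."""
    return ((action, obs),) + state[:-1]


def q_update(q: DefaultDict[History, List[float]], state: History, action: int,
             reward: float, next_state: History,
             alpha: float = 0.1, gamma: float = 0.9) -> None:
    """One standard Q-learning step on the look-up table."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
```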
Q-learning works well when the state-action space is small. However, since we
directly use the previous historical observations and actions as the state, the state
space for Q-learning becomes exponentially large when we increase the considered
historical information. The state space size of Q-learning in this dynamic
multichannel problem can be represented as $(2N)^M$, which grows exponentially with $M$.
This is because the state of Q-learning is defined as a combination of observations
and actions over the past $M$ time slots. In a single time slot, the number of possible
observations is $2N$, as the user can only sense one out of $N$ channels and each
channel has 2 possible states. We do not separately count the size of the action space,
as action information is implicitly included in the observation. Thus, the state size of
Q-learning is the number of all possible combinations of observations over the previous $M$
time slots, which is $(2N)^M$. As we mentioned before, the number of previous time
slots $M$ is also required to be large so that Q-learning can capture enough system
information and learn better. This can cause the state space of Q-learning to become
very large, which prohibits using a traditional look-up table approach. Therefore,
in this chapter, encouraged by the success of deep learning in many fields, we
implement a DQN framework for the dynamic multichannel access problem to handle
the large state space and show the promising performance of this end-to-end approach.
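To give a sense of scale, the count $(2N)^M$ can be evaluated directly. The short sketch below does so for the 16-channel setting with $M = N = 16$ used later in this chapter; the function name is an illustrative assumption.

```python
def q_table_states(num_channels: int, history_len: int) -> int:
    """Number of distinct history states, (2N)^M, for a look-up-table Q-learner."""
    return (2 * num_channels) ** history_len


# N = 16 channels and M = 16 past slots give 32**16 = 2**80 (about 1.2e24) states,
# each of which would need N = 16 Q-values in a look-up table.
states_16 = q_table_states(16, 16)
```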
6.4 Optimal Policy for Known Fixed-Pattern Channel
Switching
To study the performance of DQN, we first consider a situation in which all the $N$
channels in the system can be divided into several independent subsets and these
subsets take turns to be activated following a fixed pattern. We assume that at each
time slot only a single subset is activated, such that all channels in the activated
subset are good and all channels in inactivated subsets are bad. At each time slot,
with probability $p$ ($0 \le p \le 1$) the next subset in the order is activated, and with
probability $1 - p$ the current subset remains activated. We assume the activation
order of the subsets is fixed and does not change over time.
In this section, we assume that the subset activation order, the activation switching
probability $p$, and the initially activated subset are known a priori. The
optimal policy can be found analytically and is summarized in Theorem 7. This
serves as a baseline to evaluate the performance of DQN in the next section.
Theorem 7. When the system follows a fixed-pattern channel switching, if the
activation order, switching probability $p$ and the initial activation subset are known,
the optimal channel access policy follows Algorithm 11 or Algorithm 12 depending
on the value of $p$.
Algorithm 11 Optimal Policy when $0.5 \le p \le 1$
  At the beginning of time slot 0, choose a channel in the initial activated subset $C_1$
  for n = 1, 2, ... do
    At the beginning of time slot n,
    if the previously chosen channel is good then
      Choose a channel in the next activated subset according to the subset activation order
    else
      Stay in the same channel
    end if
  end for

Algorithm 12 Optimal Policy when $0 \le p < 0.5$
  At the beginning of time slot 0, choose a channel in the initial activated subset $C_1$
  for n = 1, 2, ... do
    At the beginning of time slot n,
    if the previously chosen channel is good then
      Stay in the same channel
    else
      Choose a channel in the next activated subset according to the subset activation order
    end if
  end for
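The two algorithms differ only in whether a good observation triggers a move to the next subset, so they can be sketched as one decision rule plus a small simulator. This is an illustrative sketch under stated assumptions: subsets are identified by 0-based indices, one channel per subset stands in for "any channel in the subset", and the names `next_subset` and `simulate` are hypothetical.

```python
import random


def next_subset(prev_subset: int, prev_good: bool, p: float, num_subsets: int) -> int:
    """Subset to access next: Algorithm 11 if p >= 0.5, Algorithm 12 otherwise."""
    advance = prev_good if p >= 0.5 else not prev_good
    return (prev_subset + 1) % num_subsets if advance else prev_subset


def simulate(p: float, num_subsets: int, num_slots: int, seed: int = 0) -> float:
    """Average per-slot reward (+1 good, -1 bad) of the policy on a simulated chain."""
    rng = random.Random(seed)
    active = 0   # index of the currently activated subset (C_1 at slot 0)
    choice = 0   # slot 0: access a channel in the initially activated subset
    total = 0
    for _ in range(num_slots):
        good = (choice == active)
        total += 1 if good else -1
        choice = next_subset(choice, good, p, num_subsets)
        if rng.random() < p:  # the activation moves on with probability p
            active = (active + 1) % num_subsets
    return total / num_slots
```

In the two deterministic corner cases, `p = 1` and `p = 0`, the rule never misses a good channel, so the average reward is exactly 1.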
Proof. Assume the current activated subset is known a priori. Then the problem
can be modeled as a fully observable MDP, and the corresponding optimal policy
can be found by finding and comparing the Q values of all possible state-action pairs.
Assume all $N$ channels in the system form $M$ independent subsets, so that there
are $M$ states in total. The subsets are indexed according to their fixed activation
order as $C_1, C_2, \ldots, C_M$, where $C_1$ is the initial activation subset at the start of the
system. Note the channel subset activation order is circular so that $C_M$ is followed
by $C_1$ in the order. Suppose the current system state is $S_i$ ($1 \le i \le M$) when the
current activated subset is $C_i$, and $S_i$ is fully observable as we assume $C_i$ is known.
Let $p(S_j \mid S_i)$ be the transition probability from state $S_i$ to state $S_j$ ($i, j \in \{1, \ldots, M\}$)
of the Markov chain, and we have:
$$
p(S_j \mid S_i) =
\begin{cases}
p, & j = i + 1 \\
1 - p, & j = i
\end{cases}
\tag{6.8}
$$
Then the corresponding Q-value of the optimal policy starting with state $S_i$ and
action $a$ is:
$$
Q^*(S_i, a) = \sum_{j=1}^{M} p(S_j \mid S_i)\,\big[R(S_i, a) + \gamma V^*(S_j)\big] \tag{6.9}
$$
where $a$ represents which channel to choose in the following time slot, and $R(S_i, a)$
is the immediate reward, i.e., either $+1$ if the chosen channel is good or $-1$ if the
chosen channel is bad. $V^*(S_j)$, defined as $\max_a Q^*(S_j, a)$, represents the expected
accumulated discounted reward given by an optimal policy over an infinite time horizon
with initial state $S_j$.
Substituting Eq. (6.8) into Eq. (6.9), we have
$$
Q^*(S_i, a) =
\begin{cases}
p \cdot 1 + (1 - p) \cdot (-1) + c, & a \in C_{i+1} \\
p \cdot (-1) + (1 - p) \cdot 1 + c, & a \in C_i \\
-1 + c, & \text{otherwise}
\end{cases}
=
\begin{cases}
2p - 1 + c, & a \in C_{i+1} \\
1 - 2p + c, & a \in C_i \\
-1 + c, & \text{otherwise}
\end{cases}
\tag{6.10}
$$
where $c = \gamma\,[p V^*(S_{i+1}) + (1 - p) V^*(S_i)]$, which does not depend on the action.
Since the optimal action $a^*(S_i)$ for each state $S_i$ is $a^*(S_i) = \arg\max_a Q^*(S_i, a)$,
the optimal action to maximize the Q value of a given state $S_i$ in Eq. (6.10) is
$$
a^*(S_i) =
\begin{cases}
\text{any channel in } C_{i+1}, & 0.5 \le p \le 1 \\
\text{any channel in } C_i, & 0 \le p < 0.5
\end{cases}
\tag{6.11}
$$
All the above analysis holds based on the assumption that the current activated
subset is known. Since we assume the initially activated channel subset is known,
the user can initially choose a channel in this activated subset and then follow
Eq. (6.11) afterward. Based on the observation of the chosen channel the user is
guaranteed to know what the current state is: if the chosen channel is good, the
currently activated subset is the subset containing the chosen channel; otherwise,
the currently activated subset is the subset prior to the chosen channel's subset in
the activation order. Thus, the current state of the MDP is fully observable, and
the optimality of the policies in Algorithm 11 and Algorithm 12 is achieved.
It turns out that the optimal policy for the fixed-pattern channel switching
shares a similarly simple and robust structure with the Myopic policy in [97]: the
optimal policy has a round-robin structure (in terms of the channel subset activation
order) and does not require knowing the exact value of $p$, only whether it is
above or below 0.5. This semi-universal property makes the optimal policy easy to
implement in practice and robust to mismatches of system dynamics.
6.5 Experiment and Evaluation of Learning for
Unknown Fixed-Pattern Channel Switching
Having derived the optimal policy for fixed-pattern channel switching when one has
full knowledge of the system statistics in the previous section, we implement a
DQN in this section and study how it performs in the fixed-pattern channel switching
even without any prior knowledge of the system statistics. We first present
details of our DQN implementation and then evaluate its performance through
three experiments.
6.5.1 DQN Architecture
We design a DQN by following the Deep Q-learning with Experience Replay
Algorithm [54] and implement it in TensorFlow [1]. The structure of our DQN is
finalized as a fully connected neural network with each of the two hidden layers
containing 200 neurons^1. The activation function of each neuron is the Rectified
Linear Unit (ReLU), which computes the function $f(x) = \max(x, 0)$. The state of
the DQN is defined as the combination of previous actions and observations over the
previous $M$ time slots, which serves as the input to the DQN. The considered
number of historical time slots is the same as the number of channels in the system,
i.e., $M = N$. A vector of length $N$ is used to represent the observation at a
time slot, where each item in the vector indicates the quality of the corresponding
channel. If channel $i$ is selected, the value of the $i$th item in the vector is 1 if the
channel quality is good or $-1$ if the channel quality is bad; otherwise, we use 0 to
indicate that channel $i$ is not selected. This vector implicitly contains action
information, as a non-zero item in the vector indicates the corresponding channel
is selected. The output of the DQN is a vector of length $N$, where the $i$th item
represents the Q value of a given state if channel $i$ is selected. We apply the $\epsilon$-greedy
policy with $\epsilon$ fixed as 0.1 to balance exploration and exploitation, i.e., with
probability 0.1 the agent selects an action uniformly at random, and with probability
0.9 the agent chooses the action that maximizes the Q value of a given state. At
each time slot $t$ during training, when an action $a_t$ is taken in state $x_t$,
the user gains a corresponding reward $r_t$, the state is updated to $x_{t+1}$, and a
record $(x_t, a_t, r_t, x_{t+1})$ is stored into the replay memory. When updating the weights of
the DQN, a minibatch of 32 samples is randomly selected from the replay memory
to compute the loss function, and then the recently proposed Adam algorithm [42] is
used to conduct the stochastic gradient descent to update the weights (details on
the hyperparameters are listed in Table 6.1). In the following experiment settings,
we consider a system of 16 channels, i.e., $N = 16$, which is a typical multichannel
WSN. All experiments are conducted on a machine with a quad-core 2.4GHz
Intel(R) Xeon CPU and the training times vary from 0.5 to 20 hours.

^1 Generally speaking, deciding the number of hidden layers and the number of neurons in a
layer needs many trials and errors. But we follow some general guidance provided in [32]. We
choose a two-hidden-layer neural network as it "can represent an arbitrary decision boundary to
arbitrary accuracy with rational activation functions and can approximate any smooth mapping
to any accuracy." To decide the number of neurons in each layer, one rule of thumb
is that "The number of hidden neurons should be between the size of the input layer and
the size of the output layer." We have tried different numbers of neurons between 16 (output
layer size) and 256 (input layer size), and the network structure with 200 neurons provided a good
performance with small training time.

Table 6.1: List of DQN hyperparameters
Hyperparameter            Value
$\epsilon$                0.1
Minibatch size            32
Optimizer                 Adam
Activation function       ReLU
Learning rate             $10^{-4}$
Experience replay size    1,000,000
$\gamma$                  0.9
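The input encoding and the $\epsilon$-greedy rule described in this section can be sketched in plain Python, independently of the neural network itself. The helper names and the 0-based channel indices are assumptions made for illustration; the network producing `q_values` is not shown and is not reproduced from the original TensorFlow code.

```python
import random
from typing import List, Sequence, Tuple


def encode_observation(channel: int, good: bool, num_channels: int) -> List[int]:
    """Length-N vector: +1 or -1 at the sensed channel, 0 for unselected channels."""
    vec = [0] * num_channels
    vec[channel] = 1 if good else -1
    return vec


def encode_state(history: Sequence[Tuple[int, bool]], num_channels: int) -> List[int]:
    """Concatenate the per-slot vectors of the last M slots into one flat DQN input."""
    flat: List[int] = []
    for channel, good in history:
        flat.extend(encode_observation(channel, good, num_channels))
    return flat


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With $N = M = 16$, `encode_state` yields a vector of length $16 \times 16 = 256$, which matches the input layer size quoted in the footnote above.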
6.5.2 Single Good Channel, Round Robin Switching Situation
We first consider a situation where there is only one good channel in the system
at any time slot. The channels take turns to become good with some probability
in a sequential round-robin fashion. In other words, if at time slot $t$ channel $k$
is good and all other channels are bad, then in the following time slot $t + 1$, with
probability $p$ the following channel $k + 1$ becomes good and all others bad, and
with probability $1 - p$ channel $k$ remains good and all others bad. In this situation,
the inherent dependence and correlation between channels are high. In fact, this
is the fixed-pattern channel switching in which each independent subset contains a
single channel and the subsets are activated in sequential order. In Figure 6.3, we provide a
pixel illustration to visualize how the states of channels change in the 16-channel
system that follows a single good channel, round-robin situation over 50 time slots.
The x-axis is the index of each channel, and the y-axis is the time slot number.
A white cell indicates that the corresponding channel is good, and a black cell
indicates that the corresponding channel is bad.
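A trace of the kind visualized in Figure 6.3 can be generated with a few lines of Python. This is an illustrative sketch, not the simulator used for the experiments; the function name, the seed handling, and the choice of channel 0 as the initially good channel are assumptions.

```python
import random
from typing import List


def simulate_round_robin(num_channels: int, p: float, num_slots: int,
                         seed: int = 0) -> List[List[int]]:
    """Per-slot channel states (1 = good, 0 = bad) with exactly one good channel."""
    rng = random.Random(seed)
    good = 0  # assumed starting point: channel 0 is good in the first slot
    trace = []
    for _ in range(num_slots):
        trace.append([1 if k == good else 0 for k in range(num_channels)])
        if rng.random() < p:  # move to the next channel with probability p
            good = (good + 1) % num_channels
    return trace
```

Each row of the returned list corresponds to one time slot (one row of the pixel illustration), and each column to one channel.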
We compare the DQN with two other policies: the Whittle Index heuristic policy
and the optimal policy with known system dynamics from Section 6.4. The
optimal policy has full knowledge of the system dynamics and serves as a performance
upper bound. In the Whittle Index heuristic, the user assumes all channels
are independent. For each channel, the user observes it for 10,000 time slots and
uses Maximum Likelihood Estimation (MLE) to estimate the corresponding 2-state
Markov chain transition matrix. Once the system model is estimated, the Whittle
Index can be applied. As can be seen in Figure 6.4, as the switching probability $p$
varies, DQN remains robust and achieves the same performance as the optimal policy
in all five cases, and performs significantly better than the Whittle Index
heuristic. This lies in the fact that DQN can implicitly learn the system dynamics,
including the correlation among channels, and finds the optimal policy accordingly.
On the contrary, the Whittle Index heuristic simply assumes the channels are
independent and is not able to find or make use of the correlation among channels.
Moreover, as the switching probability $p$ increases, the accumulated reward from
DQN also increases because there is more certainty in the system, which leads to an
increase in the optimal reward.
Figure 6.3: A capture of a single good channel, round robin switching situation over 50 time slots
Figure 6.4: Average discounted reward as we vary the switching probability p in the single good channel, round robin switching
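For a 2-state Markov chain, the maximum likelihood estimate of the transition matrix is the table of normalized transition counts. The sketch below shows this estimator for one channel's observed 0/1 sequence; the function name and the uniform fallback for a state that never appears are assumptions made here, not details from the experiments.

```python
from typing import List, Sequence


def estimate_two_state_chain(samples: Sequence[int]) -> List[List[float]]:
    """MLE of T[n][m] = p(next = m | current = n) from a sequence of 0/1 samples."""
    counts = [[0, 0], [0, 0]]
    for prev, nxt in zip(samples, samples[1:]):
        counts[prev][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: fall back to a uniform row if a state was never observed.
        matrix.append([c / total for c in row] if total else [0.5, 0.5])
    return matrix
```

In the single good channel setting with 16 channels, each channel is bad most of the time, so the per-channel estimate mainly captures how rarely a channel is good and not when it becomes good, which is consistent with the heuristic's inability to exploit the switching order.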
Figure 6.5: A capture of a single good channel, arbitrary switching situation over 50 time slots
Figure 6.6: Average discounted reward as we vary the switching order in the single good channel, arbitrary switching
6.5.3 Single Good Channel, Arbitrary Switching Situation
Next, we study a situation in which there is still only one good channel in
any time slot. However, unlike the previous situation, the channels become good in
an arbitrary order. Figure 6.5 shows a pixel illustration of the 16-channel system
in this situation.
In the experiment, the channel-switching probability $p$ is fixed as 0.9, and we
randomly choose 8 different arbitrary channel switching orders. As can be seen from
Figure 6.6, DQN achieves the optimal performance and significantly outperforms the
Whittle Index heuristic in all cases.
Figure 6.7: A capture of a multiple good channels situation over 50 time slots
Figure 6.8: Average discounted reward as we increase the number of good channels in the multiple good channels situation
6.5.4 Multiple Good Channels Situation
In this section, we investigate the situation in which there may be more than one good
channel in a time slot. The 16 channels are evenly divided into several subsets,
where each subset contains the same number of channels. At any time slot, there is
only one subset activated where all channels in this subset are good, and channels in
other inactivated subsets are bad. The subsets take turns to become available with
a switching probability fixed at 0.9. This is the fixed-pattern channel switching
in which each independent subset contains one or more channels. Figure 6.7 shows a
pixel illustration of the 16-channel system in a multiple good channels situation.
We vary the number of channels in a subset as 1, 2, 4 and 8 in the experiment,
and present the experimental result in Figure 6.8. The 16 channels in the system
are in order and the subsets are activated in a sequential round-robin order in
the upper graph in Figure 6.8, while the channels are arranged arbitrarily and the
activation order of subsets is also arbitrary in the bottom graph in Figure 6.8. As
can be seen, DQN always achieves the optimal performance, and the training time
decreases as the number of good channels increases. This is because there is more
chance to find a good channel when more good channels are available at a time slot,
and the learning process becomes easier so that the DQN agent can take less time
exploring and is able to find the optimal policy more quickly. This also explains
why the Whittle Index heuristic performs better when there are more good channels.
However, DQN significantly outperforms the Whittle Index heuristic in all cases.
6.6 Experiment and Evaluation of DQN for More
Complex Situations
From the results in Section 6.5, we can see that DQN outperforms the Whittle Index
heuristic and achieves optimal performance in the unknown fixed-pattern channel
switching. Another question to ask is: can DQN achieve a good or even optimal
performance in more complex and realistic situations? To answer this question
and at the same time provide a better and deeper understanding of DQN, we have
re-tuned our neural network structure to become a fully connected neural network
with each hidden layer containing 50 neurons (and the learning rate is set as $10^{-5}$)^2,
and considered more complex simulated situations as well as real data traces.
In this section, in addition to the Whittle Index heuristic, we also compare
DQN with a Random Policy in which the user randomly selects one channel with
equal probability at each time slot. Since the optimal policy, even with full
knowledge of the system statistics, is in general computationally prohibitive to obtain (by solving
the Bellman optimality equation in the belief state space), we implement the
Myopic policy as it is simple, robust and can achieve optimal performance in
some situations. However, one cannot consider the Myopic policy in general when
the system statistics are unknown, since a single user is not able to observe the states
of all channels at the same time and therefore cannot provide an estimate of
the transition matrix of the entire system. Moreover, even if we allow the user to
observe the states of all channels, the state space of the full system is too large
to estimate and one would easily run out of memory when storing such a large
transition matrix. Therefore, in the following simulation, we only consider cases
in which $P$ is sparse and easy to access, and implement the Myopic policy as a genie
(knowing the system statistics a priori) and evaluate its performance.
^2 We have tried the same DQN structure as that in Section 6.5.1, but it does not perform well.
One intuition is that the number of parameters in the two-hidden-layer DQN with each layer
containing 200 neurons is very large, which may require careful and longer training. Additionally,
the two-hidden-layer neural network may also not be able to provide a good approximation of Q values
in more complex problems. Therefore, we decide to add one more hidden layer and reduce the
number of neurons to 50. This deeper DQN with fewer neurons has the ability to approximate
a more complicated Q-value function, and in the meanwhile requires less time to train before finding
a good policy.
6.6.1 Perfectly correlated scenario
We consider a highly correlated scenario. In a 16-channel system, we assume only
two or three channels are independent, and other channels are exactly identical or
opposite to one of these independent channels. This is the case when some channels
are perfectly correlated, i.e., the correlation coefficient $\rho$ is either 1 or $-1$.
During the simulation, we arbitrarily set the independent channels to follow the
same 2-state Markov chain with $p_{11} \ge p_{01}$. When the correlation coefficient $\rho = 1$,
the user can ignore those channels that are perfectly correlated with independent
channels and only select a channel from the independent channels. In this case,
the multichannel access problem becomes selecting one channel from several i.i.d.
channels that are positively correlated, i.e., $p_{11} \ge p_{01}$. Then, as shown in
previous work [2, 97], the Myopic policy with known $P$ is optimal and has a simple
round-robin structure alternating among independent channels. In the case when
$\rho = -1$, the Myopic policy with known $P$ also has a simple structure that alternates
between two negatively perfectly correlated channels. Though more analysis needs
to be done in the future to show whether the Myopic policy is optimal or near-optimal
when $\rho = -1$, it can still serve as a performance benchmark as the Myopic policy
is obtained with full knowledge of the system dynamics.
In Figure 6.9 we present the performance of all four policies: (i) DQN, (ii)
Random, (iii) Whittle Index heuristic, and (iv) Myopic policy with known $P$. In
the first three cases (x-axis 0, 1 and 2), the correlation coefficient $\rho$ is fixed as 1
and in the last three cases (x-axis 3, 4 and 5), $\rho$ is fixed as $-1$. We also vary the
set of correlated channels to make the cases different. The Myopic policy in the first
three cases is optimal, and in the last three cases is conjectured to be near-optimal.
As shown in Figure 6.9, the Myopic policy, which is implemented based on
full knowledge of the system, is the best in all six cases and serves as an upper
bound. DQN provides a performance very close to the Myopic policy without any
knowledge of the system dynamics. The Whittle Index policy performs worse than
DQN in all cases.
Figure 6.9: Average discounted reward for 6 different cases. Each case considers a different set of correlated channels
Figure 6.10: Average maximum Q-value of a set of randomly selected states in 6 different simulation cases
In addition, we collect the Q-values predicted by the DQN to show that DQN does
learn and improve its performance. Given a state $x$, the maximum Q-value over all
actions, i.e., $\max_a Q(x, a)$, represents the estimate of the maximum expected
accumulated discounted reward starting from $x$ over an infinite time horizon. For
each simulation case, we fix a set of randomly selected states and plot the average
maximum Q-value over these states as training proceeds. As shown in Figure 6.10, in
all cases the average maximum Q-value first increases and then becomes stable, which
indicates that DQN learns from experience to improve its performance and converges
to a good policy. As the environments differ from case to case, DQN may take a
different amount of time to find a good policy, which appears in the figure as a
different number of training iterations before the Q-values become stable.
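The convergence metric described above can be sketched in a few lines of Python. The sketch assumes the learned Q-function is available as a callable `q_function(state, action)`; that interface, and the name `average_max_q`, are hypothetical and chosen only for illustration.

```python
def average_max_q(q_function, states, actions):
    """Average over `states` of the maximum Q-value across `actions`.

    `q_function(state, action)` is an assumed interface to the trained network.
    """
    if not states:
        raise ValueError("need at least one evaluation state")
    total = 0.0
    for x in states:
        total += max(q_function(x, a) for a in actions)
    return total / len(states)


if __name__ == "__main__":
    # Toy Q-function standing in for a trained DQN.
    toy_q = lambda x, a: x * a
    print(average_max_q(toy_q, states=[1, 2], actions=[0, 1, 2]))
```

Evaluating this quantity on the same fixed set of states after every training iteration gives a curve of the kind plotted in Figure 6.10.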
6.6.2 Real data trace
We use a real data trace collected from our indoor testbed Tutornet (see footnote 3)
to train and evaluate the performance of DQN on real systems. The testbed is
composed of TelosB nodes with IEEE 802.15.4 radios. We programmed a pair of motes,
placed approximately 20 meters apart, to act as transmitter and receiver. The
transmitter continually transmits one packet on each of the 16 available channels
periodically (every 4 milliseconds), and the receiver records the successful and
failed attempts. The transmitter switches between channels so quickly that the time
difference can be ignored, and the states of the 16 channels measured in each period
can be considered to belong to the same time slot. After transmitting one packet on
each of the 16 channels, the transmitter waits for 700 milliseconds and then repeats
the transmission process. Both nodes are synchronized to avoid packet loss due to
frequency mismatch, and the other motes on the testbed are not in use. The
interference comes from surrounding Wi-Fi networks and multi-path fading. There are
8 Wi-Fi access points on the same floor and dozens of people working in the
environment, which creates a very dynamic scenario for multichannel access. The
data collection starts on a weekday morning and lasts for around 17 hours. Due to
the configuration of the Wi-Fi central channels, there are 8 channels whose
conditions are significantly better than the others. Randomly selecting one of
these good channels and continuing to use it can lead to good performance. Thus, in
order to create a more adverse scenario and test the learning capability of the
DQN, we ignore all these good channels and only use the data trace from the
remaining 8 channels.

Footnote 3: More information about the testbed is available at
http://anrg.usc.edu/www/tutornet/
In the experiment, we create a WSN environment simulator that mimics the actual
testbed by cycling through the real data trace, which contains 5,200 data samples
(i.e., it iterates through the data trace and starts over from the beginning when it
reaches the end), and by adding some randomness to the data trace to avoid repeating
exactly the same trace. The simulated environment allows different methods to
interact with it in real time. We compare the performance of the DQN policy, the
Whittle Index heuristic policy and the Random policy. The Myopic policy is not
considered, as finding the transition matrix of the entire system is
computationally expensive. The average accumulated discounted reward of each policy
is listed in Table 6.2. It can be seen that DQN performs best in this complicated
real scenario. We also present the channel utilization of each policy in Figure
6.11 to illustrate the differences among them. It shows that DQN benefits from
using other channels when the two best channels (used by the Whittle Index
heuristic all the time) may not be in good states.
Table 6.2: Performance on real data trace

Method           Reward
DQN              0.9473
Whittle Index    0.7673
Random Policy    -2.1697

Figure 6.11: Channel utilization of 8 channels in the testbed
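The trace-driven simulator described in this subsection can be sketched as follows. The thesis only says that "some randomness" is added to the trace; the sketch assumes, purely for illustration, that the observed state of the selected channel is flipped with a small probability `flip_prob`, and that the reward is +1 for a good channel and -1 for a bad one. The class name `TraceEnvironment` and its interface are hypothetical.

```python
import random


class TraceEnvironment:
    """Replay a recorded channel trace in a loop, with optional random flips.

    `trace` is a list of rows; row[t][c] is 1 if channel c was good in slot t.
    The flip-based noise model is an assumption, not the thesis's exact method.
    """

    def __init__(self, trace, flip_prob=0.0, seed=0):
        if not trace:
            raise ValueError("trace must contain at least one sample")
        self.trace = trace
        self.flip_prob = flip_prob
        self.rng = random.Random(seed)
        self.t = 0

    def step(self, channel):
        """Return the reward for using `channel` in the current slot."""
        row = self.trace[self.t]
        # Wrap around to the beginning once the end of the trace is reached.
        self.t = (self.t + 1) % len(self.trace)
        good = row[channel]
        if self.rng.random() < self.flip_prob:
            good = 1 - good
        return 1 if good == 1 else -1


if __name__ == "__main__":
    env = TraceEnvironment([[1, 0], [0, 1]], flip_prob=0.05, seed=42)
    print([env.step(0) for _ in range(6)])
```

Any policy that maps past observations to a channel index can be plugged into `step`, which is what allows different methods to be compared on the same trace.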
6.6.3 Multi-User Scenario
In WSNs, there are more realistic and complicated scenarios, such as multi-user,
multi-hop and simultaneous transmissions. The DQN framework can be directly
extended to take these practical factors into account. For example, in the
multi-user situation, to avoid interference and collisions among users, we can
adopt a centralized approach: assume there is a centralized controller that selects
a subset of non-interfering channels at each time slot and assigns one to each user
to avoid collisions. By redefining the action as selecting a subset of
non-interfering channels, the DQN framework can be directly used for this
multi-user scenario. As the action space becomes large when selecting multiple
channels, the current DQN structure requires careful re-design and may require a
very long training interval before finding a reasonable solution. Instead, we
consider the multi-user situation in a smaller system that contains 8 channels,
where at any time slot 6 channels become good and channel conditions change in a
round-robin pattern. The number of users varies from 2 to 4. As shown in Figure
6.12, DQN can still achieve good performance in the multiple-user case.

Figure 6.12: Average discounted reward as we vary the number of users in the
multiple-user situation (x-axis: number of users; y-axis: accumulated reward;
curves: Whittle index heuristic, DQN, optimal policy with known system dynamics)
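To make the growth of the centralized action space concrete, the following Python sketch enumerates the joint actions as subsets of channels, one channel per user. It assumes that an action is an unordered set of distinct channels and that every such subset is admissible; the function name `centralized_actions` is hypothetical.

```python
from itertools import combinations
from math import comb


def centralized_actions(num_channels, num_users):
    """List every way to pick `num_users` distinct channels out of `num_channels`."""
    if num_users > num_channels:
        raise ValueError("cannot give every user a distinct channel")
    return list(combinations(range(num_channels), num_users))


if __name__ == "__main__":
    for users in (2, 3, 4):
        actions = centralized_actions(8, users)
        print(users, "users:", len(actions), "joint actions")
    # A larger, purely illustrative system.
    print("16 channels, 4 users:", comb(16, 4), "joint actions")
```

With 8 channels the number of joint actions grows from 28 for 2 users to 70 for 4 users, and with 16 channels and 4 users it already reaches 1820, which illustrates why the output layer of a single centralized DQN becomes hard to train.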
As mentioned above, when the number of users in the network becomes large, the
proposed centralized approach becomes too computationally expensive. In the future,
we plan to study a more practical distributed approach in which each user learns a
channel selection policy independently. One intuitive idea is to implement a DQN at
each user independently. Users can then learn their channel selection policies in
parallel and avoid interference and conflicts by making proper channel-selection
decisions based on the information gained from observations and rewards. However,
whether a good or optimal policy can be learned, and whether an equilibrium exists,
are unknown and need further investigation.
6.6.4 Practical Issues
When discussing the channel access problem, we focus on a single user and simply
assume that the user can always observe the actual state of its selected channel at
each time slot. In practice, there are two entities involved, the sender and the
receiver. They must be synchronized and use the same channel to communicate at all
times. In a time slot when the sender selects a channel to transmit a packet, the
receiver learns the condition of the selected channel from whether it receives the
packet or not, and the sender learns it from the acknowledgement (ACK) or
negative-acknowledgement (NAK) message sent back by the receiver. If the receiver
successfully receives the packet, it knows the channel is good and sends back an
ACK, so the sender also knows the channel is good; if the receiver does not receive
any packet, it knows the channel is bad and sends back a NAK, and thus the sender
also knows the channel is bad. Therefore, when applying DQN in practice, we need to
make sure that the sender and the receiver always select the same channel at each
time slot, to guarantee their communication, and that they have the same
information about channel conditions through ACKs and NAKs.
One approach is to run DQNs with the same structure at the sender and the receiver
separately. The two DQNs start with the same default channel and are trained
concurrently. We need to make sure the two DQNs have the same trained parameters
and select the same channels at all times during training. Even though the ACK/NAK
method can guarantee that the sender and receiver have the same channel
observations, and thus the same training samples, there are still two factors that
may cause the channel selection at the sender and the receiver to differ. First, in
the exploration step, since each DQN randomly selects a channel, the two DQNs may
select different channels. Second, in the back-propagation step, each DQN randomly
selects a set of data samples from its experience replay to update its parameters.
This may cause the parameters of the two DQNs to diverge, which further results in
different channel selection policies. To resolve the possible mismatch, we can use
the same random seed on both sides to initialize the pseudorandom number generator
in the implementation. In this way, the two DQNs always select the same random
channel during exploration and use the same set of data samples to update
parameters. Thus, we can ensure that the two DQNs always select the same channel
and that the final learned policy is the same.
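The shared-seed idea above can be illustrated with a small Python sketch that covers only the two sources of randomness named in the text, exploration and replay sampling. It uses Python's standard `random.Random` generator as a stand-in for whatever generator the real implementation uses; the class `SeededAgent` and its methods are hypothetical and omit the neural network entirely.

```python
import random


class SeededAgent:
    """Holds the two random choices that must match at sender and receiver."""

    def __init__(self, num_channels, seed):
        self.num_channels = num_channels
        self.rng = random.Random(seed)  # private generator, seeded identically on both sides
        self.replay = []

    def explore(self):
        """Pick a random channel, as in the exploration step."""
        return self.rng.randrange(self.num_channels)

    def record(self, sample):
        """Store one (observation, action, reward) sample in the replay buffer."""
        self.replay.append(sample)

    def minibatch(self, size):
        """Draw the samples used for one parameter update."""
        return self.rng.sample(self.replay, size)


if __name__ == "__main__":
    sender = SeededAgent(num_channels=16, seed=2018)
    receiver = SeededAgent(num_channels=16, seed=2018)
    for t in range(20):
        sample = (t, t % 16, 1 if t % 3 else -1)  # identical thanks to ACK/NAK feedback
        sender.record(sample)
        receiver.record(sample)
    print(sender.explore() == receiver.explore())
    print(sender.minibatch(4) == receiver.minibatch(4))
```

The two sides stay aligned only as long as they make the same sequence of calls on their generators and store the same samples, which is why a lost ACK or NAK can still break the synchronization.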
The channel mismatch problem can still happen when an ACK or NAK is lost (due to
noise and/or interference), so that the sender and receiver have different
observations of the selected channel condition and may therefore select different
channels later. This inconsistent channel observation not only causes loss of
communication, but also results in different learned DQN models at the sender and
receiver, which give different channel selection policies. One possible approach is
to let the sender and the receiver become aware of the time when a channel mismatch
happens and try to recover in time. Since the sender expects to receive an ACK or
NAK after each message is sent, it can detect a mismatch event if neither an ACK
nor a NAK is received. Once the sender detects a possible channel mismatch event,
it stops updating its DQN model and its training dataset, and transmits future data
using one single channel, or a small set of channels known so far to have better
channel conditions [41]. In addition, along with the original data messages, the
sender also sends the timestamp at which the channel mismatch was perceived. The
sender keeps sending this channel mismatch time information until an ACK is
received, which indicates that the receiver is on the same channel again and has
received the channel mismatch information. The receiver can then set its DQN model
and its observation training dataset back to the state right before the channel
mismatch happened (assuming the receiver uses additional memory to store different
states of trained parameters and data samples), which guarantees that the sender
and the receiver have the same DQN models and training datasets. They can resume
operating and training thereafter.

Suppose the sender uses only the current best channel to send the channel mismatch
timestamp, and let $p_{good}$ be the probability of this channel being good in a
time slot, $p_{ack}$ be the probability of an ACK or NAK being lost, and $N$ be the
number of channels in the system. As the receiver keeps training its DQN model
before becoming aware of the channel mismatch problem, it applies the
$\epsilon$-greedy exploration policy (explained in Section 6.5.1) during the
training phase. Therefore, with probability $\epsilon$, the receiver randomly picks
a channel. Thus, after a channel mismatch happens, the probability that the sender
and the receiver meet again on the same good channel and, at the same time, the ACK
is successfully received is $\frac{\epsilon \, p_{good} (1 - p_{ack})}{N}$. Once
they meet on the same good channel, they can re-synchronize. Based on the above
approach, the expected number of time slots required for re-syncing after a channel
mismatch is $\frac{N}{\epsilon \, p_{good} (1 - p_{ack})}$. Since the ACK packet is
very small, the probability of loss is small [15]. As long as the sender and the
receiver can re-synchronize after a channel mismatch, the effectiveness of the
proposed policy is guaranteed and the performance will not be affected too much on
average.
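The step from the per-slot success probability to the expected re-sync time can be checked numerically. If each slot succeeds independently with probability $q = \epsilon \, p_{good} (1 - p_{ack}) / N$, the number of slots until the first success is geometrically distributed with mean $1/q$. The Python sketch below evaluates both quantities; the independence across slots is an assumption of this sketch, and the function names and the example parameter values are hypothetical.

```python
def resync_success_probability(epsilon, p_good, p_ack, num_channels):
    """Per-slot probability that sender and receiver meet and the ACK gets through."""
    return epsilon * p_good * (1.0 - p_ack) / num_channels


def expected_resync_slots(epsilon, p_good, p_ack, num_channels):
    """Mean of the geometric waiting time, i.e. 1 / (per-slot success probability)."""
    q = resync_success_probability(epsilon, p_good, p_ack, num_channels)
    if q <= 0.0:
        raise ValueError("re-synchronization is impossible with these parameters")
    return 1.0 / q


if __name__ == "__main__":
    # Illustrative values only; none of them come from the thesis experiments.
    slots = expected_resync_slots(epsilon=0.1, p_good=0.9, p_ack=0.05, num_channels=16)
    print("expected slots until re-sync:", round(slots, 1))
```

The expression makes the trade-off visible: a smaller exploration rate or a larger number of channels lengthens the recovery time proportionally.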
6.7 Adaptive DQN for Unknown, Time-Varying Environments
The studies in previous sections all focus on stationary situations, in which DQN
performs well in learning good or even optimal dynamic multichannel access
policies. However, in practice, real systems are often dynamic across time, and our
DQN framework from previous sections cannot perform well in such situations. This
is because we keep evaluating the newly learned policy after each training
iteration, and once a good policy is learned (see footnote 4), our DQN framework
stops learning and keeps following this good policy. Thus, it lacks the ability to
discover a change and re-learn if needed. To make DQN more applicable in realistic
situations, we have designed an adaptive algorithm, Algorithm 13, that makes DQN
aware of system changes and able to re-learn if needed. The main idea is to let DQN
periodically evaluate the performance (i.e., the accumulated reward) of its current
policy; if the performance degrades by a certain amount, the DQN can infer that the
environment has changed and start re-learning.
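The periodic evaluate-and-retrain idea can be sketched in Python as a plain control loop. The sketch assumes two user-supplied callables, `train()` returning a policy and `evaluate(policy)` returning its accumulated reward, and it interprets "reduced by a given threshold" as a drop relative to the reward measured right after the last training. Both the interface and that interpretation are assumptions of this sketch rather than details given in the thesis.

```python
def adaptive_dqn(train, evaluate, num_periods, threshold):
    """Monitor a policy and re-train it when its reward drops too far.

    train():           returns a freshly trained policy (assumed interface)
    evaluate(policy):  returns the accumulated reward of `policy` (assumed interface)
    Returns the final policy and the list of periods in which re-training happened.
    """
    policy = train()
    baseline = evaluate(policy)  # reward right after training
    retrain_periods = []
    for n in range(1, num_periods + 1):
        reward = evaluate(policy)
        if baseline - reward > threshold:
            # Reward degraded by more than the threshold: assume the environment changed.
            policy = train()
            baseline = evaluate(policy)
            retrain_periods.append(n)
        # Otherwise keep using the current policy.
    return policy, retrain_periods


if __name__ == "__main__":
    class ToyEnvironment:
        """Environment whose best policy switches from 0 to 1 after a few evaluations."""

        def __init__(self):
            self.calls = 0

        def mode(self):
            return 0 if self.calls < 4 else 1

        def train(self):
            return self.mode()

        def evaluate(self, policy):
            self.calls += 1
            return 1.0 if policy == self.mode() else 0.0

    env = ToyEnvironment()
    print(adaptive_dqn(env.train, env.evaluate, num_periods=5, threshold=0.5))
```

The threshold controls the trade-off between reacting to real changes and re-training on ordinary reward fluctuations, which is why it is left to the user.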
On the other hand, the Whittle Index heuristic cannot detect an environment change
by simply observing the change in reward. This is because the policy given by the
Whittle Index heuristic is far from the optimal policy, and it may perform poorly
in both the old and new environments, so that there is no significant change in the
reward to support the claim that the environment has changed. In addition, even if
the Whittle Index heuristic could detect the change, the new policy may still
perform badly, as the Whittle Index heuristic ignores the correlations among
channels and cannot correctly estimate the system dynamics due to its limited
partial observation ability.
In the experiment, we make the system initially follow one of the fixed-pattern
channel switching cases from Section 6.5, and after some time it changes to another
case. We consider both single good channel and multiple good channel situations.
We let DQN operate automatically according to Algorithm 13, while we manually
re-train the Whittle Index heuristic when there is a change in the environment.
Figure 6.13 compares the reward of both the old and new policies learned by DQN and
the Whittle Index heuristic in the new environment as we vary the pattern changes.
As can be seen, DQN is able to find an optimal policy for the new environment,
matching the genie optimal policy, while the Whittle Index heuristic is not.

Algorithm 13 Adaptive DQN
  First train DQN to find a good policy to operate with
  for n = 1, 2, ... do
    At the beginning of period n:
    Evaluate the accumulated reward of the current policy
    if the reward is reduced by a given threshold (see footnote 5) then
      Re-train the DQN to find a new good policy
    else
      Keep using the current policy
    end if
  end for

Figure 6.13: Average discounted reward as we vary the channel switching pattern
situations

Figure 6.14: Average discounted reward in real time during training in unknown
fixed-pattern channel switching

Footnote 4: In this chapter, we manually check the evaluation performance and stop
the learning when a policy is good enough. More advanced techniques such as the
Secretary Problem [21] (by considering each learned policy as a secretary) can be
used to decide when to accept a policy and stop learning.

Footnote 5: The threshold is set by the user according to her preference.
We also provide the real-time accumulated reward during the learning process of DQN
and the Whittle Index heuristic in one of the above pattern-changing situations in
Figure 6.14. The system initially starts with an environment that has 8 channels
being good at each time slot for the first 10 iterations. As can be seen, both DQN
and the Whittle Index heuristic are able to quickly find a good channel access
policy, but only DQN achieves the optimal performance. At iteration 11, the
environment changes to having only 1 channel being good at each time slot. As there
is a significant drop in the reward, DQN detects the change and starts re-learning.
At iteration 70, DQN finds the optimal policy, and our system keeps following the
optimal policy thereafter. On the other hand, even though we manually enable the
Whittle Index heuristic to detect the change, re-estimate the system model and find
a new policy, its performance is still unsatisfactory, as it cannot make use of the
correlation among channels.
6.8 Conclusion
In this chapter, we have considered the dynamic multichannel access problem in a
more general and practical scenario in which channels are correlated and system
statistics are unknown. As the problem, in general, is an unknown POMDP without any
tractable solution, we have applied an end-to-end DQN approach that directly
utilizes historical observations and actions to find the access policy via online
learning. In the fixed-pattern channel switching case, we have been able to
analytically find the optimal access policy that is achieved by a genie with known
system statistics and full observation ability. Through simulations, we have shown
that DQN is able to achieve the same optimal performance even without knowing any
system statistics. We have re-tuned the DQN implementation and shown, from both
simulations and a real data trace, that DQN can achieve near-optimal performance in
more complex scenarios. In addition, we have shown that DQN can be directly applied
to handle more practical cases such as multi-user networks. Finally, we have
proposed an adaptive DQN and shown from numerical simulations that it is able to
detect system changes and re-learn in non-stationary dynamic environments to
provide good performance.
Chapter 7
Conclusion and Open Questions
In this thesis, we have presented a study of applying autonomous robots and online
learning, specifically deep reinforcement learning, to overcome the challenges in
today's complex, heterogeneous, dynamic and uncertain wireless networks. The
potential and benefits of learning, adaptation and control given by this
AI-assisted approach have been explored in the following settings:
• robotic message ferrying for multi-flow wireless networks
• robotic network deployment in unknown obstructed environments
• dynamic multichannel access in wireless networks
In our first study of robotic message ferrying for multi-flow wireless networks, we
have proposed a dynamic Coarse-grained Backpressure Message Ferrying algorithm
(CBMF) to schedule and control mobile robots, which guarantees network stability
even without knowing the arrival rates. We have also designed an adaptive approach
that lets CBMF adapt itself in time-varying, dynamic environments. More practical
issues and structural properties have been studied to improve the efficiency of
CBMF in practice.

In our second study, we have considered robotic network deployment in unknown
obstructed environments. We have presented a unified framework, onLinE rObotic
Network formAtion (LEONA), that is general enough to permit optimizing the
communication network for different purposes while, at the same time, reducing the
overhead or time associated with deploying the robotic network.

While our first two studies have validated the fact that the controllable mobility
of robots provides us with a new dimension to improve wireless network performance,
in our third study we have focused on the promising application of deep
reinforcement learning in wireless networks. We have considered the dynamic
multichannel access problem in a more general and practical setting, and applied an
end-to-end DQN approach that directly utilizes historical observations and actions
to find the access policy via online learning. We have shown, through various
simulations as well as a real data trace, that DQN is able to achieve optimal or
near-optimal performance even without knowing any system statistics. Additionally,
we have shown how DQN can be improved so that it is able to detect system changes
and re-learn in non-stationary dynamic environments.
While the results in this thesis have provided useful insights into AI-assisted
wireless network problems, there are a number of interesting open questions and
extension directions to be explored in the future.
7.1 Extensions on Robotic Message Ferrying
Our current CBMF algorithm makes it possible to dynamically allocate robots and
control their movements according to current network information to enhance
connectivity and assist communication in wireless networks. However, because of the
coarse-grained nature of our CBMF algorithm, in which the robotic allocation
decision is made once every epoch, each robot can only be assigned to one node for
the whole epoch. Even if some robots finish their tasks earlier than the others,
they cannot be re-assigned to serve other nodes in that epoch. This
underutilization of robotic message ferries adversely affects both capacity and
delay performance.

To support the entire capacity region without delay inefficiency, the design of a
much finer-grained allocation and motion control algorithm is required. A
generalization of [24] can take care of this scenario by making the allocation
decision at each time slot. This algorithm is conceptually similar to the
traditional backpressure routing algorithm. However, the capacity analysis in the
traditional backpressure formulation is based on the assumption that the link
states and network topology are independent of the nodes' allocation, whereas in
this problem the robots' allocations can change the link states (i.e., the
communication rates). This coupling between allocation and link rate makes the
problem more complicated, and the problem does not follow the standard analysis
using the backpressure formulation. Therefore, the problem becomes non-trivial and
requires further investigation.
7.2 Extensions on Robotic Network Deployment
Our current LEONA framework has several limitations. First, for LEONA to work
successfully, all robots must have the same up-to-date map. Thus, in every
iteration, robots need to broadcast their newly learnt environmental information to
the others to keep everyone updated. However, this flooding of messages can consume
a large amount of communication resources. Second, in every iteration of the LEONA
framework, robots are required to compute the best possible communication path by
running the FindPath(G) algorithm on the communication graph G. Though the
FindPath(G) algorithm can achieve polynomial run time in many cases, the size of
the graph G can become very large since it is completely connected, especially in a
large environment. This makes it difficult to implement LEONA in practice when
considering robots' limited on-board computational ability and power. All these
factors call for the design of new distributed algorithms that can reduce the
resources needed for communication and computation.

In the previous study of robotic network deployment, though the environment is
unknown, we assume robots are aware of their exact positions. In many realistic
scenarios, because of the lack of global position information, robots do not know
their positions either when deployed in an unknown environment. This is related to
the field of Simultaneous Localization and Mapping (SLAM) [17], which has been
actively developed in the robotics community. The SLAM problem asks whether it is
possible for a mobile robot to be placed at an unknown location in an unknown
environment and to incrementally build a consistent map of this environment while
simultaneously determining its location within this map. However, the problem of
robotic network deployment is more than just SLAM. Though SLAM is required, the
ultimate goal, instead of mapping and localization, is to find optimal
configurations of robots that achieve the best networking and communication
performance in an unknown environment, even without a global positioning system.
7.3 Extensions on Dynamic Multichannel Access
In addition to the problems discussed in Section 6.6.3 on applying DQN in more
realistic and complicated scenarios, there are still a number of open directions
suggested by our previous work. First, more advanced deep reinforcement learning
approaches, such as Deep Deterministic Policy Gradient (DDPG) [49], should also be
studied in the future, especially for the challenge of large and even continuous
state-action spaces. Second, as training a deep neural network is expensive, a more
efficient way is to study the structure and properties of the policy learned by DQN
and then design heuristics that can perform well in practice without the burden of
a long learning period. Third, as DQN is not easy to tune and may easily get stuck
in local optima, improving the DQN implementation and considering other deep
reinforcement learning approaches are needed to see whether they can reach the
optimal performance in general situations, and to study the tradeoff between
implementation complexity and performance guarantees. Also, as a way to test the
full potential of DQN (or Adaptive DQN) and other deep reinforcement learning
technologies in the problem of dynamic multichannel access, we encourage the
networking community to work together and create an open source dataset that
contains different practical channel access scenarios, so that researchers can
benchmark the performance of different approaches. We have published all the
channel access environments and the real data trace considered in Chapter 6 (see
footnote 1). This might serve as a useful benchmark dataset for researchers.

Footnote 1: https://github.com/ANRGUSC/MultichannelDQN-channelModel
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard,
Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G.
Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin
Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale
machine learning. USENIX Conference on Operating Systems Design and
Implementation (OSDI), 2016.
[2] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari. Optimality
of myopic sensing in multichannel opportunistic access. IEEE Transactions
on Information Theory, 55(9):4040–4050, Sept 2009.
[3] John-Alexander M Assael, Niklas Wahlström, Thomas B Schön, and Marc Peter
Deisenroth. Data-efficient learning of feedback policies from image pixels
using deep dynamical models. arXiv preprint arXiv:1510.02173, 2015.
[4] E. Athanasopoulou, L. X. Bui, T. Ji, R. Srikant, and A. Stolyar.
Back-pressure-based packet-by-packet adaptive routing in communication networks.
IEEE/ACM Transactions on Networking, 21(1):244–257, Feb 2013.
[5] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition
with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[6] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton,
NJ, USA, 1 edition, 1957.
[7] A. A. Bobtsov and A. S. Borgul. Multiagent aerial vehicles system for ecological
monitoring. International Conference on Intelligent Data Acquisition and
Advanced Computing Systems (IDAACS), 2013.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge
University Press, New York, NY, USA, 2004.
[9] Darius Braziunas. POMDP solution methods. Technical report, 2003.
[10] J. Burgess, B. Gallagher, D. Jensen, and B. N. Levine. Maxprop: Routing
for vehicle-based disruption-tolerant networks. International Conference on
Computer Communications (INFOCOM), 2006.
[11] B. Burns, O. Brock, and B. N. Levine. MV routing and capacity building in
disruption tolerant networks. Joint Conference of the IEEE Computer and
Communications Societies (INFOCOM), 2005.
[12] W. Dai, Y. Gai, and B. Krishnamachari. Efficient online learning for
opportunistic spectrum access. International Conference on Computer
Communications (INFOCOM), 2012.
[13] W. Dai, Y. Gai, and B. Krishnamachari. Online learning for multi-channel
opportunistic access over unknown markovian channels. IEEE International
Conference on Sensing, Communication, and Networking (SECON), 2014.
[14] Z. Dawy, W. Saad, A. Ghosh, J. G. Andrews, and E. Yaacoub. Toward massive
machine type cellular communications. IEEE Wireless Communications,
24(1):120–128, February 2017.
[15] Douglas S. J. De Couto, Daniel Aguayo, John Bicket, and Robert Morris. A
high-throughput path metric for multi-hop wireless routing. Wireless Networks,
11(4):419–434, July 2005.
[16] M. Dunbabin and L. Marques. Robots for environmental monitoring: Significant
advancements and applications. IEEE Robotics Automation Magazine,
19(1):24–39, March 2012.
[17] H. Durrant-Whyte and T. Bailey. Simultaneous localization and mapping:
part i. IEEE Robotics Automation Magazine, 13(2):99–110, June 2006.
[18] Amit Dvir and Athanasios V. Vasilakos. Backpressure-based routing protocol
for dtns. ACM SIGCOMM.
[19] D. Faria. Modeling signal attenuation in IEEE 802.11 wireless lans. vol. 1.
Technical Report TR-KP06-0118, Stanford University, 2005.
[20] J. A. Fax and R. M. Murray. Information flow and cooperative control of
vehicle formations. IEEE Transactions on Automatic Control, 49(9):1465–1476,
Sept 2004.
[21] T.S. Ferguson. Who solved the secretary problem? Statistical Science,
4:282–296, 1989.
[22] E. Ferranti and N. Trigoni. Robot-assisted discovery of evacuation routes in
emergency scenarios. International Conference on Robotics and Automation
(ICRA), 2008.
[23] J. Fink, A. Ribeiro, and V. Kumar. Motion planning for robust wireless
networking. International Conference on Robotics and Automation (ICRA), 2012.
[24] A. Gasparri and B. Krishnamachari. Throughput-optimal robotic message
ferrying for wireless networks using backpressure control. International
Conference on Mobile Ad Hoc and Sensor Systems (MASS), 2014.
[25] V. Gazi and K. M. Passino. Stability analysis of swarms. IEEE Transactions
on Automatic Control, 48(4):692–697, April 2003.
[26] Leonidas Georgiadis, Michael J. Neely, and Leandros Tassiulas. Resource
allocation and cross-layer control in wireless networks. Foundations and Trends
in Networking, 1(1):1–144, April 2006.
[27] Stephanie Gil, Swarun Kumar, Dina Katabi, and Daniela Rus. Adaptive
communication in multi-robot systems using directionality of signal strength.
International Journal of Robotics Research, 34(7):946–968, June 2015.
[28] David Kiyoshi Goldenberg, Jie Lin, A. Stephen Morse, Brad E. Rosen, and
Y. Richard Yang. Towards mobility as a network control primitive.
International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc),
2004.
[29] Andrea Goldsmith. Wireless Communications. Cambridge University Press,
New York, NY, USA, 2005.
[30] M. Grossglauser and D. N. C. Tse. Mobility increases the capacity of ad hoc
wireless networks. IEEE/ACM Transactions on Networking, 10(4):477–486,
Aug 2002.
[31] R. Guerin and A. Orda. Computing shortest paths for any number of hops.
IEEE/ACM Transactions on Networking, 10(5):613–620, Oct 2002.
[32] Jeff Heaton. Introduction to Neural Networks for Java, 2nd Edition. Heaton
Research, Inc., 2nd edition, 2008.
[33] Andrew Howard, Lynne E. Parker, and Gaurav S. Sukhatme. Experiments
with a large heterogeneous mobile robot team: Exploration, mapping, deployment
and detection. International Journal of Robotics Research, 25(5-6):431–447,
May 2006.
[34] L. Huang, S. Moeller, M. J. Neely, and B. Krishnamachari. LIFO-backpressure
achieves near-optimal utility-delay tradeoff. IEEE/ACM Transactions on
Networking, 21(3):831–844, June 2013.
[35] 802.15.4-2015 - IEEE Standard for Low-Rate Wireless Personal Area Networks
(WPANs), 2015.
[36] David Jea, Arun Somasundara, and Mani Srivastava. Multiple controlled mo-
bile elements (data mules) for data collection in sensor networks. International
Conference on Distributed Computing in Sensor Systems (DCOSS), 2005.
[37] B. Ji, C. Joo, and N. B. Shroff. Delay-based back-pressure scheduling in
multihop wireless networks. IEEE/ACM Transactions on Networking,
21(5):1539–1552, Oct 2013.
[38] Philo Juang, Hidekazu Oki, Yong Wang, Margaret Martonosi, Li Shiuan Peh,
and Daniel Rubenstein. Energy-efficient computing for wildlife tracking:
Design tradeoffs and early experiences with ZebraNet. International Conference
on Architectural Support for Programming Languages and Operating Systems
(ASPLOS X), 2002.
[39] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots.
IEEE International Conference on Robotics and Automation (ICRA), 1985.
[40] D. Kim. Grid-based geographic routing for mobile ad-hoc networks. Technical
Report PhD. Thesis, Stanford University, 2007.
[41] Jae-Young Kim, Sang-Hwa Chung, and Yu-Vin Ha. A fast joining scheme
based on channel quality for IEEE 802.15.4e TSCH in severe interference en-
vironment. Ubiquitous and Future Networks (ICUFN), 2017.
[42] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
[43] R. Knopp and P.A. Humblet. Information capacity and power control in single-
cell multiuser communications. International Conference on Communications
(ICC), 1995.
[44] H. W. Kuhn. The Hungarian method for the assignment problem.
Naval Research Logistics Quarterly, pages 83–97, 1955.
[45] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. A survey
of computation offloading for mobile systems. Mobile Networks and
Applications, 18(1):129–140, February 2013.
[46] Vijay Kumar, Daniela Rus, and Gaurav S. Sukhatme. Networked robots. In
Springer Handbook of Robotics, pages 943–958. 2008.
[47] Alberto Leon-Garcia. Probability, Statistics, and Random Processes for Elec-
trical Engineering. Pearson/Prentice Hall, Upper Saddle River, NJ, third edi-
tion, 2008.
[48] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end
training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
[49] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom
Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with
deep reinforcement learning. arXiv preprint arXiv:1509.02971 [cs.LG], 2015.
[50] H. Liu, K. Liu, and Q. Zhao. Logarithmic weak regret of non-bayesian restless
multi-armed bandit. International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2011.
[51] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and
optimality of Whittle index for dynamic multichannel access. IEEE Transactions
on Information Theory, 56(11):5547–5567, Nov 2010.
[52] N. C. Luong, D. T. Hoang, P. Wang, D. Niyato, D. I. Kim, and Z. Han. Data
collection and wireless communication in Internet of Things (IoT) using
economic analysis and pricing models: A survey. IEEE Communications Surveys &
Tutorials, 18(4):2546–2590, Fourth Quarter 2016.
[53] Mingzhe Chen, Ursula Challita, Walid Saad, Changchuan Yin, and Mérouane
Debbah. Machine learning for wireless networks with artificial intelligence: A
tutorial on neural networks. arXiv preprint arXiv:1710.02913 [cs.IT], 2017.
[54] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep
reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[55] Scott Moeller, Avinash Sridharan, Bhaskar Krishnamachari, and Omprakash
Gnawali. Routing without routes: The backpressure collection protocol. In-
ternational Conference on Information Processing in Sensor Networks (IPSN),
2010.
[56] M. J. Neely. Order optimal delay for opportunistic scheduling in multi-user
wireless uplinks and downlinks. IEEE/ACM Transactions on Networking,
16(5):1188–1199, Oct 2008.
[57] M. J. Neely. Intelligent packet dropping for optimal energy-delay tradeoffs in
wireless downlinks. IEEE Transactions on Automatic Control, 54(3):565–579,
March 2009.
[58] M. J. Neely, E. Modiano, and C. P. Li. Fairness and optimal stochastic control
for heterogeneous networks. IEEE/ACM Transactions on Networking,
16(2):396–409, April 2008.
[59] M. J. Neely and R. Urgaonkar. Opportunism, backpressure, and stochastic
optimization with the wireless broadcast advantage. Asilomar Conference on
Signals, Systems and Computers (ASILOMAR), 2008.
[60] R. Olfati-Saber. Flocking for multi-agent dynamic systems: algorithms and
theory. IEEE Transactions on Automatic Control, 51(3):401–420, March 2006.
[61] Reza Olfati-Saber and Richard M. Murray. Distributed cooperative control of
multiple vehicle formations using structural potential functions. IFAC World
Congress, 2002.
[62] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless
Markov bandits. International Conference on Algorithmic Learning Theory
(ALT), 2012.
[63] Christos Papadimitriou and John N. Tsitsiklis. The complexity of Markov
decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[64] T. Park, N. Abuzainab, and W. Saad. Learning how to communicate in the
Internet of Things: Finite resources and heterogeneity. IEEE Access,
4:7063–7073, 2016.
[65] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[66] Solvepomdp. http://erwinwalraven.nl/solvepomdp/.
[67] A. Sridharan, S. Moeller, and B. Krishnamachari. Implementing backpressure-
based rate control in wireless networks. In Information Theory and Applica-
tions Workshop (ITA), 2009.
[68] E. Stevens-Navarro, Y. Lin, and V. W. S. Wong. An MDP-based vertical
handoff decision algorithm for heterogeneous wireless networks. IEEE Transactions
on Vehicular Technology, 57(2):1243–1254, March 2008.
[69] E. Stump, A. Jadbabaie, and V. Kumar. Connectivity management in mobile
robot teams. International Conference on Robotics and Automation (ICRA),
2008.
[70] S. Supittayapornpong and M. J. Neely. Achieving utility-delay-reliability
tradeoff in stochastic network optimization with finite buffers. International
Conference on Computer Communications (INFOCOM), 2015.
[71] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learn-
ing with neural networks. International Conference on Neural Information
Processing Systems (NIPS), 2014.
[72] A. R. Syed, K. L. A. Yau, H. Mohamad, N. Ramli, and W. Hashim. Channel
selection in multi-hop cognitive radio network using reinforcement learning:
An experimental study. International Conference on Frontiers of Communi-
cations, Networks and Applications (ICFCNA), 2014.
[73] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing
systems and scheduling policies for maximum throughput in multihop radio
networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, Dec
1992.
[74] C. Tekin and M. Liu. Online learning in opportunistic spectrum access: A
restless bandit approach. In International Conference on Computer Commu-
nications (INFOCOM), 2011.
[75] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives
logarithmic regret for irreducible MDPs. International Conference on Neural
Information Processing Systems (NIPS), 2008.
[76] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang.
Energy harvesting wireless communications: A review of recent advances.
IEEE Journal on Selected Areas in Communications, 33(3):360–381, March
2015.
[77] Amin Vahdat and David Becker. Epidemic Routing for Partially Connected
Ad Hoc Networks. Technical report, July 2000.
[78] P. Venkatraman, B. Hamdaoui, and M. Guizani. Opportunistic bandwidth
sharing through reinforcement learning. IEEE Transactions on Vehicular
Technology, 59(6):3148–3153, July 2010.
[79] M. A. M. Vieira, R. Govindan, and G. S. Sukhatme. Towards autonomous
wireless backbone deployment in highly-obstructed environments. In Interna-
tional Conference on Robotics and Automation (ICRA), 2011.
[80] S. Wang, A. Gasparri, and B. Krishnamachari. Robotic message ferrying for
wireless networks using coarse-grained backpressure control. In 2013 IEEE
Globecom Workshops (GC Wkshps), pages 1386–1390, Dec 2013.
[81] S. Wang, A. Gasparri, and B. Krishnamachari. Robotic message ferrying for
wireless networks using coarse-grained backpressure control. IEEE Transactions
on Mobile Computing, 16(2):498–510, Feb 2017.
[82] S. Wang, B. Krishnamachari, and N. Ayanian. The optimism principle: A
unified framework for optimal robotic network deployment in an unknown
obstructed environment. In International Conference on Intelligent Robots
and Systems (IROS), 2015.
[83] S. Wang, H. Liu, P. Gomes, and B. Krishnamachari. Deep reinforcement
learning for dynamic multichannel access in wireless networks. International
Conference on Computing, Networking and Communications (ICNC), 2017.
[84] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari. Deep reinforcement
learning for dynamic multichannel access in wireless networks. IEEE Transactions
on Cognitive Communications and Networking, PP(99):1–1, 2018.
[85] A. Warrier, S. Janakiraman, S. Ha, and I. Rhee. DiffQ: Practical differential
backlog congestion control for wireless networks. In International Conference
on Computer Communication (INFOCOM), 2009.
[86] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine
Learning, pages 279–292, 1992.
[87] Christopher J. Watkins and P. Dayan. Technical note: Q-learning. Machine
Learning, 8(3–4), 1992.
[88] R. K. Williams, A. Gasparri, and B. Krishnamachari. Route swarm: Wire-
less network optimization through mobility. In International Conference on
Intelligent Robots and Systems (IROS), 2014.
[89] Hao Wu, Richard Fujimoto, Randall Guensler, and Michael Hunter. MDDV:
A mobility-centric data dissemination algorithm for vehicular networks. Inter-
national Workshop on Vehicular Ad Hoc Networks (VANET), 2004.
[90] Y. Yan and Y. Mostofi. Robotic router formation in realistic communication
environments. IEEE Transactions on Robotics, 28(4):810–827, Aug 2012.
[91] I. Yaqoob, E. Ahmed, I. A. T. Hashem, A. I. A. Ahmed, A. Gani, M. Imran,
and M. Guizani. Internet of things architecture: Recent advances, taxonomy,
requirements, and open challenges. IEEE Wireless Communications,
24(3):10–16, June 2017.
[92] Marco Zúñiga Zamalloa and Bhaskar Krishnamachari. An analysis of
unreliability and asymmetry in low-power wireless links. ACM Transactions on
Sensor Networks, 3(2), June 2007.
[93] M. M. Zavlanos, M. B. Egerstedt, and G. J. Pappas. Graph-theoretic
connectivity control of mobile robot networks. Proceedings of the IEEE,
99(9):1525–1540, Sept 2011.
[94] M. M. Zavlanos, A. Ribeiro, and G. J. Pappas. Mobility and routing control in
networks of robots. IEEE Conference on Decision and Control (CDC), 2010.
[95] Y. Zhang, Q. Zhang, B. Cao, and P. Chen. Model free dynamic sensing order
selection for imperfect sensing multichannel cognitive radio networks: A
Q-learning approach. International Conference on Communications (ICC), 2014.
[96] J. Zhao and G. Cao. VADD: Vehicle-assisted data delivery in vehicular ad hoc
networks. IEEE Transactions on Vehicular Technology, 57(3):1910–1922, May
2008.
[97] Qing Zhao, Bhaskar Krishnamachari, and Keqin Liu. On myopic sensing for
multi-channel opportunistic access: structure, optimality, and performance.
IEEE Transactions on Wireless Communications, 7(12):5431–5440, Dec 2008.
[98] W. Zhao, M. Ammar, and E. Zegura. Controlling the mobility of multiple data
transport ferries in a delay-tolerant network. Joint Conference of the IEEE
Computer and Communications Societies (INFOCOM), 2005.
[99] Wenrui Zhao and M. H. Ammar. Message ferrying: proactive routing in
highly-partitioned wireless ad hoc networks. Workshop on Future Trends of
Distributed Computing Systems (FTDCS), 2003.
Asset Metadata
Creator: Wang, Shangxing (author)
Core Title: Learning, adaptation and control to enhance wireless network performance
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 07/02/2018
Defense Date: 04/12/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: deep reinforcement learning, Internet of Things, OAI-PMH Harvest, online learning, robotics, wireless network
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Krishnamachari, Bhaskar (committee chair)
Creator Email: shangxiw@usc.edu, shx.wang10@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-508199
Unique identifier: UC11268322
Identifier: etd-WangShangx-6367.pdf (filename), usctheses-c40-508199 (legacy record id)
Legacy Identifier: etd-WangShangx-6367.pdf
Dmrecord: 508199
Document Type: Dissertation
Rights: Wang, Shangxing
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA