Acceleration of Deep Reinforcement Learning: Efficient Algorithms and Hardware Mapping
by
Chi Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Chi Zhang
Abstract
Despite the recent success of Deep Reinforcement Learning (DRL) in game playing, robotics manipulation, and data center cooling, training DRL agents takes a tremendous amount of time and computational resources. This is because convergence requires collecting a large amount of data by interacting with the environment and performing a large number of policy updates via Stochastic Gradient Descent (SGD).
To reduce the amount of data to collect, existing work adopts model-based DRL that learns a world
model using the data collected by interacting with the environment. Then, it uses the world model to
generate synthetic data to perform policy updates. State-of-the-art approaches generate synthetic data by
uniformly sampling initial states. This generates a large amount of similar data and makes each policy
update less efficient. To accelerate policy updates, state-of-the-art hardware mappings of DRL propose efficient customized hardware designs on FPGA. However, most of this work is only applicable to a specific range of input parameters. To further increase the speed of policy updates when the input batch size of the neural network is large, existing works split the input batch into multiple sub-batches and adopt multiple learners, each processing one sub-batch concurrently. However, the synchronization overhead, including data transfer and gradient averaging, significantly impairs the scalability of existing approaches.
In this work, we address these limitations by developing efficient algorithms and hardware mappings. First, we propose Maximum Entropy Model Rollouts (MEMR), which generates diverse synthetic data by prioritized sampling of the initial states such that the entropy of the generated synthetic data is maximized. We mathematically derive the maximum entropy sampling criterion assuming that the synthetic data distribution is Gaussian. To satisfy this criterion, we utilize a Prioritized Replay Buffer. Second, we propose a framework for mapping DRL algorithms with a Prioritized Replay Buffer onto heterogeneous platforms consisting of a multi-core CPU, a GPU and an FPGA. We develop specific accelerators for each primitive on CPU, FPGA and GPU. Given the input parameters of a DRL algorithm, our design space exploration automatically chooses the optimal mapping of the various primitives based on an analytical performance model. Finally, we propose Scalable Policy Optimization (SPO), which improves the scalability of existing multi-learner DRL by reducing the synchronization overhead via local Stochastic Gradient Descent. Our experimental evaluations on widely used benchmark environments suggest that i) MEMR reduces the number of policy updates needed to converge compared with state-of-the-art model-based DRL; ii) our framework for hardware mapping achieves superior policy updates per second compared with other mapping methods; iii) SPO achieves nearly linear scalability as the number of learners increases.
Dedication
To my parents for their sacrifices and support. To Yuxin Liu for her company during the hardest time of Covid-19.
Acknowledgements
First and foremost, I would like to thank my advisor Dr. Viktor Prasanna for offering me the chance to pursue the PhD degree in Computer Science at the University of Southern California. His mentorship greatly shaped the way I find interesting research problems, come up with novel ideas, and successfully present solutions, making me a better researcher. I would also like to thank Dr. Sanmukh Rao Kuppannagari for his constructive discussions and supportive guidance when I started my PhD. Moreover, I would like to thank my defense committee Dr. Aiichiro Nakano and Dr. Paul Bogdan, and my qualifying committee Dr. Jyotirmoy Vinay Deshmukh, Dr. Bistra Dilkina, Dr. Ashutosh Nayyar, Dr. Paul Bogdan and Dr. Kannan Rajgopal for their interest in reviewing my work and for providing insightful feedback.
I would like to thank my colleague Yuan Meng for two co-authored papers. I would also like to thank my colleagues and friends Hanqing Zeng, Ta-Yang Wang, Sasindu Wijeratne, Tian Ye, Bingyi Zhang, Pengmiao Zhang, Hongkuan Zhou, and Jason Lin for constructive discussions of research topics. I would like to thank Hongkuan Zhou, Pengmiao Zhang and Bingyi Zhang for their help setting up the server machines to conduct the experiments.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation to Accelerate Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Success of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Computational Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction to Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.4 Model-free Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Model-based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Prioritized Replay Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Key Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Heterogeneous Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Policy Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.2 Sample Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.3 Learning Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.4 Computation Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.5 Scaling Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Challenges in Accelerating Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . 13
1.6.1 Low Learning Efficiency in MBRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.2 Dynamic Optimal Mapping of Primitives . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.3 Synchronization Overhead of Synchronous Parallel Stochastic Gradient Descent . 14
1.7 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Model-free Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Model-based Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Distributed Reinforcement Learning Framework . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Hardware Mapping of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 19
Chapter 3: Maximum Entropy Model Rollouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Model-based Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Maximum Entropy Sampling Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 K-ary Sum Tree for Prioritized Replay Buffer . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Practical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 4: A Framework for Mapping DRL Algorithms with Prioritized Replay Buffer onto
Heterogeneous Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Target Deep Reinforcement Learning Algorithms . . . . . . . . . . . . . . . . . . . 35
4.1.2 Motivation for Heterogeneous Platform . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Accelerator Design for Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.2 Neural Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Prefix Sum Index Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Priority Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Overall System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Data-dependency Relaxed Training Loop . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Data Transfer System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.1 Profiling the Performance of the Primitives . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 Mapping Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.3 Template Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.2 DRL Algorithm Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.3 Primitive Acceleration Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.4 System Mapping and Performance Analysis . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 5: Scalable Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 The Effect of Batch Size on Learning Efficiency . . . . . . . . . . . . . . . . . . . . 63
5.1.2 Computation Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.3 Existing ApeX with Multiple Learners . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.4 Issues of Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Scalable Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.1 Local Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Overall Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.3 Analytical Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.4 Applicability of local SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.5 Comparisons with Other Parallel Stochastic Gradient Descent Methods . . . . . . 75
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 6: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.1 FPGA-accelerated Environment Emulation . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.2 Automatic Decomposition of Primitives . . . . . . . . . . . . . . . . . . . . . . . . 84
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
List of Tables
4.1 Processing Unit for all the DRL Components . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Overview of benchmark environments, DRL algorithms and neural network architectures 54
4.3 System DSE result of various configurations from system design space exploration . . . . 56
4.4 Design parameters and resource allocation from FPGA Architecture Exploration . . . . . . 60
5.1 The neural network architecture representing the Q network for the Arcade Learning Environment [5]. A is the dimension of the action space. . . . . . . . . . . . . . . . . . . . . 63
5.2 Notations Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
List of Figures
1.1 Real world applications of Deep Reinforcement Learning. . . . . . . . . . . . . . . . . . . . 1
1.2 Training time of various environments versus the size of the state space . . . . . . . . . . . 2
1.3 Two Reinforcement Learning paradigms. Left: Model-free Reinforcement Learning (MFRL). Right: Model-based Reinforcement Learning (MBRL). . . . . . . . . . . . . . . . . . . . . . 5
1.4 Diagram of target heterogeneous platform . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 The overall structure of a 4-ary sum tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Illustration of the process when updating the value in the K-ary sum tree with fanout=4 as shown in Algorithm 1. The blue node denotes the leaf node that holds the priority. The green nodes denote the intermediate sums that are updated by propagating the change of the priority from the leaf to the root. The red dotted arrow shows the direction of the value propagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Illustration of the process when sampling an index according to the priority in the K-ary sum tree with fanout=4 as shown in Algorithm 1. Starting from the root node, the green nodes denote the cutoff nodes during traversal and the blue node denotes the leaf node sampled. The red dotted arrow shows the direction of the tree traversal. . . . . . . . . . . . 28
3.4 Left: Segmented replay buffer for model-generated rollouts. Each segment contains data sampled from the same environment state distribution. The importance weights are only valid for data within the same segment and do not apply across segments, since the segments are sampled from different state distributions. When performing updates, we uniformly sample a segment index and then uniformly sample within that segment. Right: model rollouts. Circles represent states and arrows represent actions. The objective is to spread the actions expanded from the same state as far apart as possible so that the model rollouts are "diverse". . 31
3.5 Training curves of MEMR and two baselines. Solid curves depict the mean of five trials and
shaded regions correspond to standard deviation among trials. The first row depicts the
policy quality vs. the total number of environment interactions. The second row shows the
policy quality vs. the number of policy updates. The third row shows the policy quality vs.
number of model rollouts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 The results of the ablation study. ModelDatasetSize refers to the size of D_model shown in Algorithm 2. PolicyUpdates refers to the number of policy updates per environment step. It indicates how informative the model rollouts are to the SAC agent. Prioritization strength α is the exponent used to calculate the probability of states being sampled. The smaller it is, the more uniform the distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Overview of existing parallel reinforcement learning frameworks [28]. . . . . . . . . . . . 36
4.2 Learner Module Architecture: An example pipeline for a 2-layer neural network. L_i means the i-th Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Illustration of computing the Prefix Sum Index. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Replay Samplers and Updaters. τ (τ +K) is the index of the left(right)-most child node in
the current level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Illustration of updating the K-ary Sum Tree. . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Interactions among various Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 Design space exploration and automation workflow . . . . . . . . . . . . . . . . . . . . . . 49
4.8 Reward of RL agents trained in benchmark environments with various priority update
delay. The reward is normalized by the reward of an expert agent. . . . . . . . . . . . . . . 56
4.9 Execution time (in milliseconds) of a single neural network training step in benchmark
environments of various batch sizes on various hardware platforms. . . . . . . . . . . . . . 57
4.10 Execution time (in milliseconds) of a single Prefix Sum Index computation and priority
update of various batch sizes on various hardware platforms. The number after "cpu"
denotes the number of threads used in OpenMP [57]. . . . . . . . . . . . . . . . . . . . . . 57
4.11 The training throughput in gradient steps per second (GPS) of various batch sizes in various
mappings. The mapping "X-Y" denotes that the learner is mapped onto "X" and Replay
Management Module is mapped onto "Y". The line plot shows the theoretical optimal
performance given environment and batch size. . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1 The convergence rate of various batch sizes in Deepmind Control Suite [76] benchmark.
Each run is repeated for 3 different random seeds and the curve shows the mean value and
the shaded area shows the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Execution time of sampling and priority update with various batch sizes. The number after
cpu indicates the number of threads used for parallel computing. . . . . . . . . . . . . . . . 64
5.3 The execution time of the forward propagation of a 4-layer fully-connected neural network
for ApeX-TD3 with 1024 hidden units and ReLU activation [2] with various batch sizes. . . 64
5.4 The execution time of weight averaging of a 4-layer fully-connected neural network for
ApeX-TD3 with 1024 hidden units and ReLU activation [2] with various learners. . . . . . 64
5.5 The execution time of the forward propagation of a standard Convolutional Neural Network
for ApeX-DQN [54] shown in Table 5.1 with various batch sizes. . . . . . . . . . . . . . . . 64
5.6 The execution time of weight averaging of a standard Convolutional Neural Network for
ApeX-DQN [54] shown in Table 5.1 with various learners. . . . . . . . . . . . . . . . . . . 65
5.7 Top: the execution flow of existing multi-learner data parallel ApeX. Bottom: the execution
flow of multi-learner ApeX with local Stochastic Gradient Descent, also known as Scalable Policy Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8 Overall Diagram of Scalable Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . 69
5.9 Benchmark Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.10 Learning efficiency of SPO-DQN on Atari benchmarks [5] of various number of learners.
The number of actors is 16 and the weight averaging frequency is 10. Each data point in the
curve represents the average episode returns at test time of 20 episodes. Each curve shows
the result of a single random seed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.11 Learning efficiency of SPO-TD3 on Mujoco benchmarks [75] of various number of learners.
The number of actors is 16 and the weight averaging frequency is 10. Each data point in the
curve represents the average episode returns at test time of 30 episodes. Each curve shows
the average of 5 independent runs with different random seeds. The shaded area shows the
standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.12 Learning efficiency of SPO-DQN on Atari benchmarks of various weight averaging
frequencies. The number of actors is 16 and the number of learners is 4. Each data point in
the curve represents the average episode returns at test time of 20 episodes. Each curve
shows the result of a single random seed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.13 Learning efficiency of SPO-TD3 on Mujoco benchmarks of various weight averaging
frequencies. The number of actors is 16 and the number of learners is 4. Each data point in
the curve represents the average episode returns at test time of 30 episodes. Each curve
shows the average of 5 independent runs with different random seeds. The shaded area
shows the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.14 Scaling factor of SPO-TD3 of various number of learners and weight averaging frequencies. 81
5.15 Scaling factor of SPO-DQN of various number of learners and weight averaging frequencies. 82
Chapter 1
Introduction
1.1 Motivation to Accelerate Deep Reinforcement Learning
1.1.1 Success of Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) has shown great success in a wide range of applications including
board games [69], robotics manipulation [49] and energy systems [41] as shown in Figure 1.1.
1.1.2 Computational Requirements
Despite the success of Deep Reinforcement Learning, it takes a considerable amount of time to train a
Deep Reinforcement Learning agent to converge. We show the training time versus the size of the state
space of three popular environments used in DRL training in Figure 1.2. On the Mujoco benchmark [10], a physics engine that simulates robotics, biomechanics, etc., it takes around 3 hours to train an agent using PyTorch [58] on a 4-core machine with a GTX 1060 GPU. On Atari [10], a game simulator, it takes around 12 hours to train on the same machine. The state-of-the-art DRL algorithm for playing Go, AlphaGo Zero [70], was trained on 4 TPUs [33] for 21 days. Thus, acceleration of Deep Reinforcement Learning is an important research direction.

Figure 1.1: Real world applications of Deep Reinforcement Learning.

Figure 1.2: Training time of various environments versus the size of the state space (x-axis: size of the state space, log scale; y-axis: training time in days; data points for Mujoco [10], Atari Games [10], and AlphaGo Zero [70]).
1.2 Introduction to Reinforcement Learning

Reinforcement Learning [74] aims to solve discrete-time sequential decision making problems modeled as Markov Decision Processes [59].
1.2.1 Markov Decision Process

A Markov Decision Process (MDP) [59] is defined as a tuple (S, A, R, P, µ), where S is the set of states, A is the set of actions, R(s, a, s′): S × A × S → R defines the intermediate reward when the agent transits from state s to s′ with action a, P(s′|s, a): S × A × S → [0, 1] defines the probability that the agent transits from state s to s′ with action a, and µ: S → [0, 1] defines the starting state distribution. A stationary policy π: S → P(A) is a map from states to probability distributions over actions. We denote by π(a|s) the probability of taking action a in state s and denote all stationary policies by Π. The objective of reinforcement learning is to select a policy π ∈ Π such that

    J(\pi) = \mathbb{E}_{\tau \sim p(\tau)}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \right]    (1.1)

is maximized, where J(π) is the infinite-horizon discounted total reward, γ ∈ [0, 1) is the discount factor, τ denotes a trajectory (s_0, a_0, s_1, ···), and p(τ) = p(s_0) ∏_{t=0}^{∞} π(a_t|s_t) P(s_{t+1}|s_t, a_t), where a_t ∼ π(·|s_t) and s_{t+1} ∼ P(·|s_t, a_t). The discount factor γ denotes how much we care about future rewards (e.g. γ = 0 yields a greedy policy).

We define the on-policy state-action value as Q^π(s, a) = E_{τ∼p(τ)}[ Σ_{t=0}^{∞} γ^t R(τ) | s_0 = s, a_0 = a ] and the on-policy value function as V^π(s) = E_{τ∼p(τ)}[ Σ_{t=0}^{∞} γ^t R(τ) | s_0 = s ] = E_{a∼π}[Q^π(s, a)]. The advantage function is defined as A^π(s, a) = Q^π(s, a) − V^π(s).
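To make Equation 1.1 concrete, the following is a minimal Python sketch of estimating J(π) by Monte-Carlo rollouts; the Gym-style env.reset()/env.step() interface, the finite rollout horizon, and the helper names are illustrative assumptions rather than part of this dissertation's implementation.

    # Hypothetical sketch: Monte-Carlo estimate of the discounted return J(pi).
    def estimate_return(env, policy, gamma=0.99, num_episodes=10, horizon=1000):
        total = 0.0
        for _ in range(num_episodes):
            s, ret, discount = env.reset(), 0.0, 1.0
            for _ in range(horizon):
                a = policy(s)                    # a_t ~ pi(.|s_t)
                s, r, done, _ = env.step(a)      # s_{t+1}, R(s_t, a_t, s_{t+1})
                ret += discount * r              # accumulate gamma^t * r_t
                discount *= gamma
                if done:
                    break
            total += ret
        return total / num_episodes              # sample average approximates J(pi)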
1.2.2 Reinforcement Learning

Reinforcement Learning (RL) [74] is a general learning paradigm to solve discrete-time sequential decision making problems, which are often modeled as Markov Decision Processes (MDPs) [59]. At each time step, the agent receives an observation from the environment, makes a decision, and acts in the environment to receive the next observation and an intermediate reward. The agent repeats this process until it reaches a terminal state. The objective of the agent is to maximize the expected discounted accumulated rewards by selecting appropriate actions at each state.
Many real world applications can be studied under this model. State-of-the-art Deep Reinforcement Learning algorithms have shown success in computer games [69, 77], where the states are the RAM state and the actions are keyboard inputs. In robotics [75], the states are the current positions and velocities of each joint, and the actions are the forces applied to each joint. In data center HVAC control [83], the states are temperatures, system node statistics and outdoor air statistics. The actions are the system node
setpoints and the fan speed. Besides games and classic control systems, Reinforcement Learning can also
be applied to dialogue systems. The state is the current users’ sentences and the action is the response.
The agent receives positive reward if the goal is accomplished (e.g. the users purchased the items) and
negative reward otherwise. Each use case has different constraints and emphasis: in games, samples can
be obtained easily, but the inference speed has to meet the online requirements; in robotics, the sample
efficiency becomes the bottleneck because collecting data with robots in the real world is expensive; in
building HVAC control and dialogue systems, safety is the critical issue because such systems can’t afford
catastrophic failure caused by exploration with unknown outcomes.
In general, there are two Reinforcement Learning paradigms, known as model-free Reinforcement Learning (MFRL) and model-based Reinforcement Learning (MBRL). In MFRL, the agent directly interacts with the environment to collect data and updates its policy using the collected data. In policy gradient methods, the update step estimates the gradient of the RL objective w.r.t. the parameters of the policy using on-policy samples. In approximate dynamic programming, the update step learns the Q network and the actor network using off-policy samples. In MBRL, the agent learns a dynamics model using the collected data and generates model rollouts (synthetic data) using the learned dynamics model to update the policy. It relies on the generalization ability of the learned dynamics model to unseen states and actions to improve the sample efficiency.
1.2.3 Deep Reinforcement Learning

Traditional Reinforcement Learning often refers to algorithms that mainly focus on tabular cases [74] with small discrete state and action spaces. Deep Reinforcement Learning refers to Reinforcement Learning algorithms that use function approximators such as neural networks to represent either policies or models. Function approximators are unavoidable when the state space is large (e.g. images [10]) or continuous [75].
Figure 1.3: Two Reinforcement Learning paradigms. Left: Model-free Reinforcement Learning (MFRL). Right: Model-based Reinforcement Learning (MBRL).
Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning, in an unsupervised manner, from data that is unstructured or unlabeled, using deep neural networks [26].
1.2.4 Model-free Reinforcement Learning

Model-free Reinforcement Learning (MFRL) solves an MDP without trying to understand the dynamics of the environment. It formalizes the idea of trial-and-error. Let the policy be π_θ, where θ denotes the parameters of the policy (e.g. a neural network). The core procedure is:

1. Collect trajectories {(s_{i,1}, a_{i,1}, s_{i,2}, a_{i,2}, ···, s_{i,T}, a_{i,T})}_{i=1}^{N} using some policy π_exp and add them to a dataset D.

2. Optimize the policy π_θ using the data in D, such that the probability of "good" actions is increased and the probability of "bad" actions is decreased. Repeat until convergence.
1.2.4.1 Policy Gradient

Policy gradient algorithms directly optimize the RL objective J(π) defined in Equation 1.1 by taking the gradient w.r.t. the parameters θ of the policy. The simplest form is known as the REINFORCE algorithm [74], and the policy gradient is:

    \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]    (1.2)

The expectation is estimated via Monte-Carlo sampling [52] using trajectories collected in the dataset D.
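As an illustration of Equation 1.2, the following is a minimal PyTorch sketch of the REINFORCE loss for a single trajectory; the PolicyNet architecture and the discrete (categorical) action space are assumptions made for this example only, not the setup used elsewhere in this dissertation.

    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        # Hypothetical small policy network for a discrete action space.
        def __init__(self, state_dim, num_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, num_actions))

        def forward(self, states):
            # Returns the action distribution pi_theta(.|s) for a batch of states.
            return torch.distributions.Categorical(logits=self.net(states))

    def reinforce_loss(policy, states, actions, rewards):
        # states: [T, state_dim], actions: [T], rewards: [T] from one trajectory.
        log_probs = policy(states).log_prob(actions)      # log pi_theta(a_t|s_t)
        # Monte-Carlo estimate of Equation 1.2: minimize the negated objective
        # so that gradient descent performs gradient ascent on J(theta).
        return -(log_probs.sum() * rewards.sum())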
1.2.4.2 Approximate Dynamic Programming

Learning a value function is another common approach in Reinforcement Learning. We can rewrite the definition of the Q function in a recursive form:

    Q^{\pi}(s_t, a_t) = r_t + \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right]    (1.3)

For standard Q-learning [74], π*(·|s) = argmax_a Q(s, a), and Equation 1.3 becomes truly off-policy: we can estimate the Q function via dynamic programming (DP) using samples collected from any policy (e.g. old policies). Although Q-learning is guaranteed to converge to the optimal policy under reasonable assumptions on the exploration behavior, such proofs only apply to the tabular form with finite state and action spaces. Learning is particularly challenging if the Q function is approximated using a neural network because function approximators are not stationary outside the training data. In practice, a variety of tricks are introduced to stabilize training, and we review two major works in this category.
Deep Q Network (DQN). Deep Q Network (DQN) [54] works with discrete action spaces. Directly applying the Q-learning procedure with a neural-network-approximated Q function fails because i) online updates with a single sample are too noisy, and ii) sequential updates using transitions on the same trajectory cause the Q network values to explode due to the state dependency. To stabilize the update, DQN [54] proposes to use a Replay Buffer [49] to store all historical transitions and to update using uniformly sampled mini-batch data. To break the trajectory dependency, target networks are introduced to serve as a consistent backup for Q updates:

    Q^{\pi_{\theta}}(s_t, a_t) = r_t + \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi_{\theta'}}(s_{t+1}, a_{t+1}) \right]    (1.4)

where Q_θ is the Q network and Q_{θ'} is the target network. The target network is synchronized to the Q network every N steps, where N is a hyper-parameter. The Q network is updated by minimizing the Mean Square Error (MSE) between the current Q values and the target Q values:

    \min_{\theta} \; \mathbb{E}_{(s_t, a_t, s_{t+1}, r_t) \sim \mathcal{D}}\left[ \left( Q^{\pi_{\theta}}(s_t, a_t) - \left( r_t + \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi_{\theta'}}(s_{t+1}, a_{t+1}) \right] \right) \right)^{2} \right]    (1.5)

Furthermore, the temporal difference (TD) error is defined as:

    \text{TD Error} = Q^{\pi_{\theta}}(s_t, a_t) - \left( r_t + \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi_{\theta'}}(s_{t+1}, a_{t+1}) \right] \right)    (1.6)
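A minimal PyTorch sketch of the DQN target, loss and TD error in Equations 1.4-1.6 follows; the network interfaces, the explicit discount factor γ (left implicit in the equations above), and the done mask are assumptions of this example, not the exact implementation evaluated later in this dissertation.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
        # Q_theta(s_t, a_t) for the actions actually taken.
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Greedy backup from the target network Q_theta' (Equation 1.4).
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * (1.0 - dones) * next_q
        td_error = q_values - targets                      # Equation 1.6
        return F.mse_loss(q_values, targets), td_error     # Equation 1.5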
Deep Deterministic Policy Gradient (DDPG). Deep Deterministic Policy Gradient (DDPG) [48] is the variant of DQN [54] that works with continuous action spaces. For discrete action spaces, the optimal action can be inferred directly from the Q function by iterating over all the actions. However, this is impossible for continuous action spaces. Instead, a separate actor network µ_ϕ is utilized to approximate the maximizer of the Q network:

    \max_{\phi} \; \mathbb{E}_{s \sim \mathcal{D}}\left[ Q_{\psi}(s, \mu_{\phi}(s)) \right]    (1.7)

where µ_ϕ is a deterministic function. It is updated to maximize the Q function using gradient ascent:

    \phi \leftarrow \phi + \alpha \nabla_{\phi} Q_{\psi}(s, \mu_{\phi}(s))    (1.8)

where α is the learning rate.
Twin-delayed Deep Deterministic Policy Gradients (TD3). DDPG suffers from overestimation of Q values. Twin-delayed Deep Deterministic Policy Gradients (TD3) [21] solves this issue by learning two separate Q networks, Q_{ψ1} and Q_{ψ2}. When computing the target values, the minimum of the Q values computed by Q_{ψ1} and Q_{ψ2} is used.
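The following sketch illustrates the DDPG actor update (Equations 1.7-1.8) and the TD3 target that takes the minimum of two critics; the actor, critic and optimizer objects are assumed to be standard PyTorch modules, and the snippet is only a sketch of the idea, not the implementation used in later chapters.

    import torch

    def ddpg_actor_update(actor, critic, states, actor_optimizer):
        # Gradient ascent on E[Q_psi(s, mu_phi(s))], written as descent on its negative.
        loss = -critic(states, actor(states)).mean()
        actor_optimizer.zero_grad()
        loss.backward()
        actor_optimizer.step()

    def td3_target(actor_target, critic1_target, critic2_target,
                   rewards, next_states, dones, gamma=0.99):
        with torch.no_grad():
            next_actions = actor_target(next_states)
            q1 = critic1_target(next_states, next_actions)
            q2 = critic2_target(next_states, next_actions)
            # Use the minimum of the two target critics to reduce overestimation.
            return rewards + gamma * (1.0 - dones) * torch.min(q1, q2)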
1.2.5 Model-based Reinforcement Learning

Model-based Reinforcement Learning (MBRL) has strong connections with optimal control [7]. Classic optimal control works with handcrafted forward dynamics models (e.g., differential equations) and solves them with model-based strategies (e.g., model predictive control (MPC) [16] or trajectory optimization [8]) with performance guarantees. Model-based Reinforcement Learning is a data-driven control approach, where
the forward dynamics models are learned using data. Then, we apply classic optimal control methods.
Sometimes we prefer learning a policy using model generated data because optimal control methods require
solving computationally expensive optimization problems online. We show a general diagram of MBRL in
Figure 1.3.
1.2.5.1 Learning dynamics models

The dynamics model f_θ is learned via Stochastic Gradient Descent using a dataset D = {(s_t, a_t, s_{t+1})} collected by sampling from the environment. It is argued in [13] that a stochastic model is superior to a deterministic model even in non-stochastic environments due to model uncertainties. Model uncertainties include aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty captures the inherent variance of the environment and the observed data. Epistemic uncertainty captures the model error due to lack of data and overfitting. Aleatoric uncertainty is handled by making the dynamics model a probabilistic function (e.g. a Gaussian distribution that outputs the mean and the variance of the next state given the current state and action as input: s_{t+1} = f_θ(s_t, a_t) = N(µ_θ(s_t, a_t), σ_θ(s_t, a_t))) fitted using Maximum Likelihood Estimation (MLE). Epistemic uncertainty is handled by training bootstrap ensembles.
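A minimal sketch of fitting such a probabilistic dynamics model by maximum likelihood is shown below; it assumes a network that outputs a mean and a log-variance per state dimension, which is one common parameterization rather than the exact model used in this work.

    import torch

    def dynamics_nll_loss(model, states, actions, next_states):
        # model(.) returns the mean and log-variance of a diagonal Gaussian over s_{t+1}.
        mean, log_var = model(torch.cat([states, actions], dim=-1))
        inv_var = torch.exp(-log_var)
        # Negative log-likelihood of the observed next states (up to a constant).
        nll = ((next_states - mean) ** 2 * inv_var + log_var).sum(dim=-1)
        return nll.mean()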
1.2.5.2 Model rollouts generation

A model rollout is a trajectory generated by the learned dynamics model. Given the learned dynamics model f̂, the current state s_t and the policy π, the objective of state propagation is to estimate p(s_{t+1}), p(s_{t+2}), ..., p(s_T) by recursively sampling from f̂(·|s, a) and π(·|s) using previously sampled states (e.g. a_t ∼ π(·|s_t), ŝ_{t+1} ∼ f̂(·|s_t, a_t), a_{t+1} ∼ π(·|ŝ_{t+1}), ŝ_{t+2} ∼ f̂(·|ŝ_{t+1}, a_{t+1})). Due to model errors and inherent uncertainties, the discrepancy between the ground truth states and the predicted states grows as the propagation horizon increases. Various trajectory sampling approaches are proposed to mitigate this issue. We refer to [13] for more details.
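A minimal sketch of this recursive state propagation for a single rollout is given below; policy.sample and model.sample are hypothetical interfaces standing in for π(·|s) and f̂(·|s, a).

    def generate_rollout(model, policy, init_state, horizon):
        # Recursively sample a_t ~ pi(.|s_t) and s_{t+1} ~ f_hat(.|s_t, a_t).
        states, actions = [init_state], []
        s = init_state
        for _ in range(horizon):
            a = policy.sample(s)
            s = model.sample(s, a)
            actions.append(a)
            states.append(s)
        return states, actions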
1.2.5.3 Learning policies

Model predictive control (MPC). The policy of model predictive control (MPC) [16] is the solution to an optimization problem at each step. Given the learned dynamics model f̂, the current state s_t and the planning horizon T, the objective of the MPC controller is to execute the first action a_t returned from a sequence of actions a_t, a_{t+1}, ..., a_{t+T−1} that maximizes the expected rewards under model uncertainties:

    a_t = \arg\max_{a_t, a_{t+1}, \ldots, a_{t+T-1}} \; \sum_{\tau=t}^{t+T-1} \mathbb{E}_{\hat{f}}\left[ r(s_{\tau}, a_{\tau}) \right]    (1.9)

The longer the horizon T we foresee, the less greedy the controller is and the better performance we obtain. However, this comes at the risk of higher prediction error using the learned dynamics and the cost of more computation. For linear dynamics and a quadratic reward function, optimization problem 1.9 can be solved analytically using the linear quadratic regulator [3]. For general non-linear functions, local approximations are made [44]. In practice, numerical methods are often used as fast approximate solutions, including shooting methods [61] and cross-entropy methods [9]. More advanced approaches such as model predictive path integral control (MPPI) [80] are left for the reader to explore.
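As one concrete instance of such numerical methods, the following sketch approximately solves optimization problem 1.9 with the cross-entropy method; the rollout_return helper (which scores an action sequence under the learned model) and all hyper-parameters are illustrative assumptions.

    import numpy as np

    def cem_plan(rollout_return, action_dim, horizon, iters=5, pop=500, n_elite=50):
        # rollout_return(action_seq) -> predicted return of the sequence under f_hat.
        mean = np.zeros((horizon, action_dim))
        std = np.ones((horizon, action_dim))
        for _ in range(iters):
            seqs = mean + std * np.random.randn(pop, horizon, action_dim)
            returns = np.array([rollout_return(seq) for seq in seqs])
            elite = seqs[np.argsort(returns)[-n_elite:]]        # keep the best sequences
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        return mean[0]                                          # MPC executes the first action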
Policy optimization. Due to the computational cost of MPC, learning an explicit policy is often desired in many practical scenarios. Unfortunately, directly applying either policy gradient or Q-learning on model-generated rollouts often yields a poor-quality policy. This is because the objective the policy optimizes diverges too much from the true objective. Despite this fact, model-based policy optimization [32] can match its model-free counterparts in terms of performance if we generate model rollouts carefully. Many existing approaches have been developed and we refer to [34, 39, 19] for more details. Most MBRL algorithms that explicitly learn a policy suffer from long training times as they require a large number of policy updates.
1.3 Prioritized Replay Buffer

The Prioritized Replay Buffer [63] is a data structure that supports efficient prioritized sampling and priority update. It is used to implement the Maximum Entropy Model Rollouts proposed in Chapter 3. We also discuss its acceleration on FPGA and on GPU in Chapter 4. A Prioritized Replay Buffer consists of a Data Storage and a Replay Management Module (RMM). The Data Storage is used to store the data, and the Replay Management Module manages the priority P_i associated with the i-th data item in the Data Storage.
1.3.1 Key Operations

The Prioritized Replay Buffer supports sampling data and priority update. Sampling from the Prioritized Replay Buffer decides which indices are sampled according to the probability distribution induced by the priorities. During sampling, a data point x_i is selected according to the probability distribution Pr(i) = P(i) / Σ_i P(i), i ∈ [0, N), where P(i) is the priority of data point i and N is the total number of data points in the Prioritized Replay Buffer. According to the Inverse Transform Sampling procedure, we first sample x ∼ U(0, 1). Then, we use the cumulative density function (cdf(i) = Σ_{j=1}^{i} Pr(j), i ∈ [0, N)) to compute the sample index i = cdf^{-1}(x). This is equivalent to finding the minimum index i such that the prefix sum of the priorities up to i is greater than or equal to x times the total priority, the target prefix sum value:

    \min_{i} \; i \quad \text{such that} \quad \sum_{j=1}^{i} P(j) \geq x \cdot \sum_{j=1}^{N} P(j)    (1.10)

Such an index i is known as the Prefix Sum Index. Priority Update requires updating the current priorities using newly computed priorities.
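A minimal sketch of this sampling procedure over a plain array of priorities (the linear-scan counterpart of the sum-tree query described next) is given below; it is illustrative only.

    import numpy as np

    def sample_prefix_sum_index(priorities, rng=np.random.default_rng()):
        x = rng.uniform(0.0, 1.0)                 # x ~ U(0, 1)
        target = x * priorities.sum()             # target prefix sum value
        prefix = np.cumsum(priorities)
        # Minimum index whose prefix sum reaches the target (Equation 1.10).
        return int(np.searchsorted(prefix, target))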
1.3.2 Data Structure

Existing Prioritized Replay Buffer implementations use a Binary Sum Tree to store the priorities so that sampling and priority update can be performed in O(log_2 N), where N is the total number of data points in the Prioritized Replay Buffer. In Chapter 3, we generalize it to a K-ary Sum Tree that performs sampling in O(K log_K N) and priority update in O(log_K N).
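A minimal Python sketch of such a K-ary sum tree is shown below; the array layout and leaf padding are illustrative choices, not the exact data structure designs evaluated in Chapters 3 and 4.

    import numpy as np

    class KarySumTree:
        # Hypothetical sketch of a K-ary sum tree for prioritized sampling.
        def __init__(self, capacity, k=4):
            self.k = k
            depth = 0
            while k ** depth < capacity:
                depth += 1
            # A complete k-ary tree with k**depth leaves has (k**depth - 1)/(k - 1) internal nodes.
            self.leaf_start = (k ** depth - 1) // (k - 1)
            self.tree = np.zeros(self.leaf_start + k ** depth)

        def update(self, idx, priority):
            # Priority update: propagate the change from leaf to root, O(log_K N).
            node = self.leaf_start + idx
            delta = priority - self.tree[node]
            while True:
                self.tree[node] += delta
                if node == 0:
                    break
                node = (node - 1) // self.k

        def sample(self, x):
            # Prefix Sum Index query for x in [0, 1): scan up to K children per level, O(K log_K N).
            target = x * self.tree[0]
            node = 0
            while node < self.leaf_start:
                child = node * self.k + 1
                for c in range(child, child + self.k):
                    if target <= self.tree[c] or c == child + self.k - 1:
                        node = c
                        break
                    target -= self.tree[c]
            return node - self.leaf_start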
1.4 Heterogeneous Platform

We show a high-level diagram of our target heterogeneous platform used to accelerate DRL algorithms in Figure 1.4. It consists of a multi-core CPU, a GPU and an FPGA. Each core inside the CPU contains an L1 and L2 cache, and all the cores share the same L3 cache and the DDR memory. The GPU consists of multiple streaming multiprocessors (SMs) and GDDR memory. Each SM consists of a shared memory and multiple streaming processors. The FPGA consists of a number of re-configurable compute units (DSP), arithmetic units (ALU or LUT), and large distributed on-chip SRAM.

Figure 1.4: Diagram of target heterogeneous platform
1.5 Performance Metrics

In this section, we discuss the key performance metrics used to evaluate a Deep Reinforcement Learning algorithm.

1.5.1 Policy Quality

The quality of the learned policy is evaluated by the accumulated rewards obtained by an agent, as shown in Equation 1.1.

1.5.2 Sample Efficiency

Sample efficiency is defined as the number of interactions with the environment when the accumulated rewards obtained by an agent converge. Convergence is defined as the accumulated rewards obtained by an agent surpassing a preset threshold.

1.5.3 Learning Efficiency

Learning efficiency is defined as the number of policy updates when the accumulated rewards obtained by an agent converge.

1.5.4 Computation Efficiency

Computation efficiency is defined as the number of policy updates performed per second.

1.5.5 Scaling Factor

The scaling factor is defined as the speedup in computation efficiency per unit of computation resource when using multiple computation resources, compared with the computation efficiency obtained using a single unit of computation resource.
1.6 Challenges in Accelerating Deep Reinforcement Learning

1.6.1 Low Learning Efficiency in MBRL
In model-based Reinforcement Learning, the agent uses synthetic data generated by the model (model
rollouts) to update the policy. However, the current synthetic data generation method generates a large
amount of “similar” data. This causes low learning efficiency.
1.6.2 Dynamic Optimal Mapping of Primitives

The computation primitives of a Deep Reinforcement Learning algorithm typically include environment emulation, neural network inference and neural network training. If the algorithm utilizes a Prioritized Replay Buffer [63], the computation primitives also include sampling and priority update from the Prioritized Replay Buffer. The input parameters of a Deep Reinforcement Learning algorithm typically include the architecture of the neural network, the batch size and the size of the Prioritized Replay Buffer. Performing light-weight neural network inference and training is more efficient on CPU than on FPGA or GPU, while performing computationally intensive neural network inference and training is more efficient on GPU than on CPU or FPGA. Performing sampling and priority update from the Prioritized Replay Buffer is more efficient on CPU than on FPGA when the batch size is small, and more efficient on FPGA than on CPU when the batch size is large. The optimal mapping of the primitives varies significantly when the input parameters change. Thus, it is impossible for a fixed mapping to achieve optimal performance for all input parameters.
1.6.3 Synchronization Overhead of Synchronous Parallel Stochastic Gradient Descent

The training time increases linearly with the batch size when training large-scale neural networks such as Convolutional Neural Networks. Existing synchronous parallel Stochastic Gradient Descent splits the input data into multiple sub-batches. Each learner computes the gradient of a sub-batch concurrently. The central learner performs gradient averaging and updates the weights. However, the synchronization overhead, including gradient averaging, significantly impairs the scaling factor.
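For contrast, Scalable Policy Optimization (Chapter 5) reduces this overhead with local Stochastic Gradient Descent, where learners take several local steps and only periodically average weights. A minimal sketch of one learner's step, assuming an already-initialized torch.distributed process group, is shown below; the interfaces are illustrative, not the exact implementation of Chapter 5.

    import torch.distributed as dist

    def local_sgd_step(model, optimizer, loss_fn, batch, step, sync_every=10):
        optimizer.zero_grad()
        loss_fn(model, batch).backward()
        optimizer.step()                              # local update, no communication
        if (step + 1) % sync_every == 0:
            world_size = dist.get_world_size()
            for p in model.parameters():              # periodic weight averaging
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                p.data /= world_size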
1.7 Thesis Statement
Deep Reinforcement Learning can be effectively accelerated by efficient algorithms and hardware mappings.
We consider “effectively accelerated” as improving one or more performance metrics defined in Section 1.5
without negatively impacting the others.
1.8 Research Contributions

In this dissertation, our research aims to tackle the challenges in Section 1.6. Specifically, our contributions include:
1. To tackle the challenge in Section 1.6.1, we propose Maximum Entropy Model Rollouts (MEMR) that
generates model rollouts such that the entropy of the generated data is maximized. This encourages
the “diversity” of the generated data and significantly improves the learning efficiency.
2. To tackle the challenge in Section 1.6.2, we propose a framework for mapping Deep Reinforcement Learning with a Prioritized Replay Buffer onto a heterogeneous platform consisting of a multi-core CPU, an FPGA and a GPU.
3. To tackle the challenge in Section 1.6.3, we propose Scalable Policy Optimization (SPO) by utilizing
Local Stochastic Gradient Descent [71] that significantly reduces the synchronization overhead and
improves the scaling factor.
1.9 Thesis Outline

The rest of the dissertation is organized as follows: we discuss related work in Chapter 2. We then discuss Maximum Entropy Model Rollouts in Chapter 3, our framework for mapping Deep Reinforcement Learning with a Prioritized Replay Buffer in Chapter 4, and Scalable Policy Optimization in Chapter 5. Finally, we discuss conclusions and future work in Chapter 6.
Chapter 2
Related Work
2.1 Model-free Deep Reinforcement Learning
Model-free Deep Reinforcement Learning (MFRL) learns the optimal policy either by directly taking the gradient of the objective function or by estimating the state-action value function and deriving the optimal policy from it. REINFORCE [81] is the first policy gradient method. However, the estimated policy gradient suffers from high variance that causes instability during training. Natural Policy Gradient [35] solves this problem by constraining the total variation divergence between the updated policy and the old policy. Trust Region Policy Optimization [64] proposes to use line search to enforce the constraints of Natural Policy Gradient for policies represented as neural networks. Proximal Policy Optimization (PPO) [65] proposes a first-order approximation of Trust Region Policy Optimization. Deep Q Network (DQN) [54] is the first successful value-based DRL algorithm. Deep Deterministic Policy Gradient (DDPG) [48] is the continuous variant of DQN. Twin-delayed DDPG (TD3) [21] proposes double Q learning to stabilize DDPG. Soft actor-critic (SAC) [24] proposes a stochastic policy for better exploration.
2.2 Model-based Deep Reinforcement Learning

Model-free Deep Reinforcement Learning approaches require large amounts of environment interactions, which is not suitable for environments where interaction is expensive. On the contrary, model-based Deep Reinforcement Learning (MBRL) learns a dynamics model and either directly performs model predictive control (MPC) [55, 13, 15] or learns the policy using synthetic data generated by the learned model, also known as model rollouts [39, 32, 11, 82].
Learning an accurate dynamics model is often the bottleneck for MBRL to match the asymptotic performance of its MFRL counterparts. Although Gaussian processes are shown to be effective in the low-dimensional data regime [15, 42], they are hard to generalize to high-dimensional environments like [75, 54]. [55] first utilizes a deterministic neural network dynamics model for model predictive control in robotics, and [13] improves the idea with probabilistic ensemble models that match the asymptotic performance of MFRL baselines. [78] combines policy networks with online planning and achieves even superior performance on challenging benchmarks. Common MPC or planning methods include the shooting method [61], the cross-entropy method [9] and model predictive path integral control [80]. Such planning methods may over-exploit the learned dynamics on long-horizon predictions, which can impair performance. They are also computationally expensive to perform in real time. In such cases, learning a policy, e.g. parameterized by a neural network, is desired for better generalization.
Dyna-style MBRL utilizes the learned dynamics to generate model rollouts to learn a good policy [73]. [19] and [11] utilize the model to better estimate the value function in order to improve the sample efficiency. [39] optimizes the policy network via a policy gradient algorithm on the trajectories generated by the models. [34] proposes a video prediction network for model-based Atari games. [82] develops a theoretical framework that provides monotonic improvement to a local maximum of the expected reward for MBRL. Model-based policy optimization (MBPO) [32] achieves state-of-the-art sample efficiency and matches the asymptotic performance of MFRL approaches. MBPO optimizes a policy with soft actor-critic (SAC) [24] under the data distribution collected by unrolling the learned dynamics model using the current policy. Our proposed Maximum Entropy Model Rollouts in Chapter 3 combines [32] and [73] by proposing a non-trivial sampling approach that significantly reduces the number of policy updates and model rollouts needed to obtain asymptotic performance.
2.3 Distributed Reinforcement Learning Framework

GORILA [56] proposes a parallel architecture of DQN [54] to play Atari games [10]. It employs independent actors and learners in parallel with a global parameter server. Each learner computes gradients locally and asynchronously updates the global parameters. Although asynchronous gradient computation improves the computation efficiency, the accumulated rewards achieved by the agent at convergence are hindered by "stale" gradient updates. ApeX [28] tackles this problem by using a centralized learner to perform synchronous Stochastic Gradient Descent by sampling data from a Prioritized Replay Buffer. Built on top of ApeX, R2D2 [36] explores recurrent policies to further improve the accumulated rewards achieved by the agent at convergence. A3C [53] uses asynchronous actors to collect the data and updates the agent using actor-critic algorithms without using a replay buffer. Due to the synchronization overhead, A3C doesn't scale to a large number of actors. IMPALA [18] removes the synchronization overhead of A3C by using importance sampling to allow asynchronous updates. Seed RL [17] moves the neural network inference onto GPUs to accelerate data collection in ApeX [28] and IMPALA [18]. RLlib [46] proposes high-level abstractions for distributed reinforcement learning built on top of the Ray library [46]. [45] proposes parallel reinforcement learning using the MapReduce [14] framework with linear function approximation.
2.4 Hardware Mapping of Deep Reinforcement Learning

A few recent works have focused on hardware acceleration of DRL algorithms. An FPGA implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm is presented in [12]. In [67] and [68], a hardware architecture is developed to accelerate Trust Region Policy Optimization (TRPO) [64]. In [23], a CPU-FPGA architecture is proposed to accelerate Deep Deterministic Policy Gradient (DDPG) [48], which combines Deep Q-Learning with policy optimization methods. [51] proposes an accelerator for PPO, the first-order approximation of TRPO [64], which utilizes separate modules for the actor and critic networks. Most work focuses on a specific range of inputs, while our framework proposed in Chapter 4 can generate an optimal mapping for any input. Also, most works only consider mapping onto a single FPGA or GPU without focusing on scalability. In Chapter 5, we propose Scalable Policy Optimization, which achieves nearly linear scalability when mapping onto multiple GPUs.
Chapter 3
Maximum Entropy Model Rollouts
3.1 Overview
Model-based Reinforcement Learning (MBRL) [32, 11, 82, 13] shows competitive performance compared with the best model-free Reinforcement Learning (MFRL) algorithms [65, 64, 54, 24, 25] with significantly fewer environment samples on challenging robotics locomotion benchmarks [75]. An MFRL algorithm learns complex skills by maximizing a scalar reward designed by human engineering. However, to obtain promising performance, a large number of environment interactions are needed, which may take a long time in real-world applications. In such cases, MBRL is appealing due to its superior sample efficiency, which relies on the generalization of a learned predictive dynamics model. However, the quality of the policy trained on imagined trajectories is often worse asymptotically than that of the best MFRL counterparts due to the imperfect models.
Recently, [32] proposed model-based policy optimization (MBPO), including a theoretical framework that
encourages short-horizon model usage based on an optimistic assumption of a bounded model generalization
error given policy shift. Although empirical studies in [32] support this assumption, this property is hard
to guarantee in the whole state distribution. Moreover, uniform sampling of the environment states to
generate branched model rollouts degrades the diversity of the model dataset, especially when the policy
shift is small. The lack of diversity of the model dataset makes the policy updates inefficient.
Our main contribution is a practical algorithm, which we call Maximum Entropy Model Rollouts (MEMR), based on the aforementioned insights. The differences between MEMR and MBPO are: 1) MEMR
follows Dyna [73] that only generates single-step model rollouts while MBPO encourages generating
short-horizon model rollouts. The generalization ability of MEMR is strictly guaranteed by supervised
machine learning theory and it can be empirically estimated by validation errors [66]. 2) MEMR utilizes
a Prioritized Replay Buffer [63] to generate max-diversity model rollouts for efficient policy updates. We
validate this idea on challenging locomotion benchmarks [75] and the experimental results show that
MEMR matches the asymptotic performance and sample efficiency of MBPO [32] and significantly reduces
the number of policy updates to converge.
3.1.1 Model-based Policy Optimization
Model-based policy optimization (MBPO) [32] achieves state-of-the-art sample efficiency and matches the
asymptotic performance of MFRL approaches. MBPO optimizes a policy with soft actor-critic (SAC) [24]
under the data distribution collected by unrolling the learned dynamics model using the current policy.
However, the sample efficiency comes at the cost of 2.5x to 5x more policy updates compared with SAC [24] and a large number of model rollouts. The increased policy updates significantly decrease
the training speed. To mitigate this issue, we analyze the model usage and model rollout distribution and
propose insights on how to improve MBPO to obtain better learning efficiency.
Model usage. In MBPO, the learned dynamics model is used to generate branched model rollouts with short horizons [32]. Although [32] presented a theoretical analysis to bound the performance of the policy trained using model-generated rollouts, the over-exploitation of model generalization cannot be eliminated. In this work, one of our core ideas is that we only rely on the learned model to generate one-step rollouts. We interpret this as model-based exploration. The nice property of this model usage is the naturally bounded model generalization error, which can be estimated in practice using the validation dataset [66].
Model rollout distribution. Uniform sampling of real states* to generate model rollouts is adopted in MBPO [32]. This potentially generates a large amount of similar data when the policy and the learned model change slowly as training progresses. As a result, the efficiency of the policy updates deteriorates. In this work, we propose to sample real states to generate single-step model rollouts such that the joint entropy of the states and actions in the model dataset is maximized. The intuition is to increase the "diversity" of the model dataset. A diverse model dataset leads to efficient policy updates.

*States encountered in the real environment, as opposed to imagined states that are generated by the model.
3.2 Method
In this section, we unveil the technical details of our Maximum Entropy Model Rollouts (MEMR) for model-based policy optimization. First, we propose the Maximum Entropy Sampling Theorem to help understand the choice of our prioritization criterion. Based on the theoretical analysis, we propose a practical implementation of this idea and discuss the challenges posed by runtime complexity along with their solutions.
3.2.1 Maximum Entropy Sampling Criteria
We begin by considering the following problem definition:
Problem 3.2.1 (Maximum Entropy Sampling). Let $D_{env} = \{s_i\}_{i=1}^{N_{env}}$ be the collection of all the states in the
environment dataset. Let $D_{model} = \{(s, a)\}_{j=1}^{N_{model}}$ be the collection of all the state-action pairs in the model
dataset†. Let $D_{sample} = \{(s_i, a_i)\}_{i=1}^{N_{env}}$ be the states and actions sampled from $D_{env}$ with $a_i \sim \pi_\phi(\cdot|s_i)$. Assume
we parameterize the policy distribution derived from the model dataset as a Gaussian distribution with diagonal
covariance: $\pi_\psi(a_i|s_i) = \mathcal{N}(\mu_\psi(s_i), \Sigma_\psi(s_i))$. Let the joint entropy of the state-action pairs in the model dataset
be $H(S, A)$. Now we select $(s_k, a_k)$ from $D_{sample}$ and add it to $D_{model}$. Let the joint entropy of the new
$D'_{model} = D_{model} \cup \{(s_k, a_k)\}$ be $H(S', A')$; the optimal sampling criteria problem is to choose index $k$ such
that $H(S', A') - H(S, A)$ is maximized.
∗ States encountered in the real environment, as opposed to imagined states that are generated by the model.
† The tasks considered in this work are deterministic, so we omit $s'$ for simplicity.
Theorem 3.2.1 (Maximum Entropy Sampling Theorem). Assume $N_{model} \gg 1$ such that the state distributions
of $D_{model}$ and $D'_{model}$ are identical. Then
$$k = \arg\min_i \log\left(\sqrt{2\pi}\,\pi_\psi(a_i|s_i)\,\sigma(\pi_\psi(\cdot|s_i))\right) \quad (3.1)$$
where $\pi_\psi(a_i|s_i)$ is the probability of the model data policy at $(s_i, a_i)$ and $\sigma(\pi_\psi(\cdot|s_i))$ is the standard deviation of the
conditional distribution at $s_i$.
To prove Theorem 3.2.1, we begin with a useful lemma:
Lemma 3.2.2 (Entropy Gain of a Gaussian distribution). Suppose a random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown. Suppose we have observations $x_1, x_2, \cdots, x_N$, $N \gg 1$, and obtain an estimate of
the distribution denoted as $\mathcal{N}_1(\mu_1, \sigma_1^2)$. If we have one more observation $t$ (a variable) and obtain a new estimate
of the distribution using $x_1, x_2, \cdots, x_N, t$, denoted as $\mathcal{N}_2(\mu_2, \sigma_2^2)$, let the density of $\mathcal{N}_1$ be $f_1(x)$
and let the differential entropies of $\mathcal{N}_1, \mathcal{N}_2$ be $h_1$ and $h_2$. Let $g(t) = h_2 - h_1$. Then $g(t) = -\log(\sqrt{2\pi} f_1(t)\,\sigma_1)/N$.
Proof. According to the maximum likelihood estimation of a Gaussian distribution, we obtain $\mu_1 = (x_1 + x_2 + \cdots + x_N)/N$, $\sigma_1^2 = \sum_{i=1}^{N}(x_i - \mu_1)^2/N$, $\mu_2(t) = (x_1 + x_2 + \cdots + x_N + t)/(N+1)$, and $\sigma_2^2(t) = \left(\sum_{i=1}^{N}(x_i - \mu_2)^2 + (t - \mu_2)^2\right)/(N+1)$. Then,
$$
\begin{aligned}
g(t) &= \frac{1}{2}\log\left(\frac{\sigma_2(t)}{\sigma_1}\right)^2
= \frac{1}{2}\log\left(\frac{\sum_{i=1}^{N}(x_i - \mu_2)^2 + (t - \mu_2)^2}{\sum_{i=1}^{N}(x_i - \mu_1)^2}\cdot\frac{N}{N+1}\right) \\
&= \frac{1}{2}\log\left(\frac{\sum_{i=1}^{N}(x_i - \mu_1 + \mu_1 - \mu_2)^2 + (t - \mu_2 + \mu_1 - \mu_1)^2}{\sum_{i=1}^{N}(x_i - \mu_1)^2}\cdot\frac{N}{N+1}\right) \\
&= \frac{1}{2}\log\left(\frac{N}{N+1}\cdot\frac{N\sigma_1^2 + (N+1)(\mu_1 - \mu_2)^2 + (t - \mu_1)^2 + 2(t - \mu_1)(\mu_1 - \mu_2)}{N\sigma_1^2}\right) \\
&= \frac{1}{2}\log\left(\frac{N}{N+1}\left(1 + \frac{(t - \mu_1)^2}{N(N+1)\sigma_1^2} + \frac{(t - \mu_1)^2}{N\sigma_1^2} + \frac{2(t - \mu_1)(\mu_1 - t)}{N(N+1)\sigma_1^2}\right)\right) \\
&= \frac{1}{2}\log\left(\frac{N}{N+1}\left(1 + \frac{(t - \mu_1)^2}{(N+1)\sigma_1^2}\right)\right)
\approx \frac{1}{2}\log\left(1 + \frac{(t - \mu_1)^2}{N\sigma_1^2}\right)
\approx \frac{(t - \mu_1)^2}{2N\sigma_1^2}
= -\frac{\log(\sqrt{2\pi} f_1(t)\,\sigma_1)}{N}
\end{aligned}
\quad (3.2)
$$
where the fourth equality uses $\mu_1 - \mu_2 = (\mu_1 - t)/(N+1)$ and the fact that $\sum_{i=1}^{N}(x_i - \mu_1) = 0$.
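As a quick numerical sanity check of Lemma 3.2.2 (not part of the original derivation; the function name and constants are our own), the sketch below compares the exact entropy gain $h_2 - h_1$ of the refitted Gaussian against the closed-form approximation $-\log(\sqrt{2\pi} f_1(t)\,\sigma_1)/N$ on synthetic data. The agreement improves as $N$ grows.

```python
import numpy as np

def entropy_gain_check(N=10000, t=2.5, seed=0):
    """Compare the exact Gaussian entropy gain h2 - h1 with the approximation
    -log(sqrt(2*pi) * f1(t) * sigma1) / N from Lemma 3.2.2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=N)

    # Maximum likelihood fit before observing t.
    mu1, sigma1 = x.mean(), x.std()
    # Maximum likelihood fit after appending the new observation t.
    x2 = np.append(x, t)
    sigma2 = x2.std()

    # Differential entropy of a Gaussian: 0.5 * log(2 * pi * e * sigma^2).
    h1 = 0.5 * np.log(2 * np.pi * np.e * sigma1 ** 2)
    h2 = 0.5 * np.log(2 * np.pi * np.e * sigma2 ** 2)
    exact_gain = h2 - h1

    # Closed-form approximation from the lemma.
    f1_t = np.exp(-(t - mu1) ** 2 / (2 * sigma1 ** 2)) / (np.sqrt(2 * np.pi) * sigma1)
    approx_gain = -np.log(np.sqrt(2 * np.pi) * f1_t * sigma1) / N

    return exact_gain, approx_gain

if __name__ == "__main__":
    exact, approx = entropy_gain_check()
    print(f"exact entropy gain: {exact:.3e}")
    print(f"approximated gain:  {approx:.3e}")
```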
Then, we prove theorem 3.2.1:
Proof. We begin by expanding the state-action joint entropy of the model dataset:
$$
\begin{aligned}
H(S', A') - H(S, A) &= -\int_{s'} p(s')\log p(s')\,ds' + \int_{s'} p(s')\,H(A'|S = s')\,ds' \\
&\quad + \int_{s} p(s)\log p(s)\,ds - \int_{s} p(s)\,H(A|S = s)\,ds \\
&\approx p(s_i)\left(H(A'|s_i) - H(A|s_i)\right) \qquad (\text{since } N_{model} \gg 1,\ p(s') \approx p(s))
\end{aligned}
\quad (3.3)
$$
According to Lemma 3.2.2, we obtain
$$
H(S', A') - H(S, A) = p(s_i)\cdot\frac{-\log\left(\sqrt{2\pi}\,\pi_\psi(a_i|s_i)\,\sigma(\pi_\psi(\cdot|s_i))\right)}{C\cdot p(s_i)} = \frac{-\log\left(\sqrt{2\pi}\,\pi_\psi(a_i|s_i)\,\sigma(\pi_\psi(\cdot|s_i))\right)}{C}
\quad (3.4)
$$
Note that $N$ in Lemma 3.2.2 denotes the number of occurrences of state $s_i$ in $D_{model}$, which is equal to $p(s_i)\cdot C$, where $C$
is the model dataset size. This is a rough density estimation and more accurate methods are left for future
work. Thus,
$$
k = \arg\max_i \left(H(S', A') - H(S, A)\right) = \arg\min_i \log\left(\sqrt{2\pi}\,\pi_\psi(a_i|s_i)\,\sigma(\pi_\psi(\cdot|s_i))\right)
\quad (3.5)
$$
3.2.2 K-ary Sum Tree for Prioritized Replay Buffer
Before discussing the practical implementation, we first propose the K-ary Sum Tree for efficient sampling
and priority update of the Prioritized Replay Buffer. We show an example of a K-ary sum tree for K = 4 in
Figure 3.1. Each node has K child nodes. The value stored in a parent node is the sum of all the values
stored in its child nodes. The leaf nodes hold the actual priorities.
3.2.2.1 Priority retrieval
In order to obtain the priority for index i, we create an array of pointers, each pointing to its corre-
sponding leaf node that holds the priority value. Thus, priority retrieval using the K-ary sum tree requires
O(1) time.
Figure 3.1: The overall structure of a 4-ary sum tree. The leaf nodes at the last level hold the priorities P(1), ..., P(N); each intermediate node stores the sum of its four children, and the root stores P(1) + ... + P(N).
Figure 3.2: Illustration of the process when updating the value in the K-ary sum tree with fanout = 4 as
shown in Algorithm 1. The blue node denotes the leaf node that holds the priority. The green nodes denote
the intermediate sums that are updated by propagating the change of the priority (∆ = new_val − old_val) from the leaf to the root.
The red dotted arrow shows the direction of the value propagation.
Algorithm 1 Key operations of the K-ary sum tree
1: function updateValue(idx, value)
2:     node_idx = convertToNodeIdx(idx);
3:     ∆ = value - getValue(node_idx);
4:     while !isRoot(node_idx) do
5:         new_value = getValue(node_idx) + ∆;
6:         setValue(node_idx, new_value);
7:         node_idx = getParent(node_idx);
8:     end while
9: end function
10: function getPrefixSumIdx(prefixSum)
11:     node_idx = getRoot();
12:     while !isLeaf(node_idx) do
13:         node_idx = getLeftChild(node_idx);
14:         partialSum = 0;
15:         for i = 0; i < fan_out; i++ do
16:             sum = partialSum + getValue(node_idx);
17:             if sum ≥ prefixSum then
18:                 break;
19:             end if
20:             partialSum = sum;
21:             node_idx = getNextSibling(node_idx);
22:         end for
23:         prefixSum = prefixSum - partialSum;
24:     end while
25:     idx = convertToDataIdx(node_idx);
26:     return idx;
27: end function
3.2.2.2 Priority update
To update the priority of index i, we first obtain the leaf node holding the priority. We compute the change of
the priority by subtracting the old value from the new value. Then, we propagate the change of the priority
from the leaf node to the root node by traversing along the parent nodes. We show the detailed function in
Algorithm 1 and an example in Figure 3.2. It is easy to see that this operation requires O(log_K N) time.
3.2.2.3 Prefix Sum Index computation
Given a randomly sampled number $x \sim U(0, 1)$, the objective is to compute the index $i = \min\{\, i : \sum_{j=1}^{i} P(j) \geq x\cdot\sum_{j=1}^{N} P(j) \,\}$, as discussed in Section 1.3. The sum of all the priorities in the Replay Buffer, $\sum_{j=1}^{N} P(j)$, can
be computed in O(1) by simply retrieving the value stored in the root node. To design an algorithm that
obtains the target index, we start by proving Lemma 3.2.3 and Theorem 3.2.4:
Figure 3.3: Illustration of the process when sampling an index according to the priority in the K-ary sum tree
with fanout = 4 as shown in Algorithm 1. Starting from the root node, the green nodes denote the cutoff
nodes during traversal and the blue node denotes the leaf node sampled. The red dotted arrow shows the
direction of the tree traversal.
Lemma 3.2.3. Let the value of the i-th node at level m be $P_{m,i}$. Assume the height of the tree is H. Then,
at every level $1 \leq m \leq H$, there exists an index $j$, $1 \leq j \leq K^{m-1}$, such that $\sum_{i=1}^{j} P_{m,i} \geq x\cdot\sum_{i=1}^{N} P(i)$, for any
$x \in (0, 1)$.
Proof. According to the definition, the leaf nodes hold the priority values. Thus, $P_{H,i} = P(i)$, $\forall i = 1, 2, \ldots, K^{H-1}$. Since $x \in (0, 1)$, we obtain $x\cdot\sum_{i=1}^{N} P(i) \leq \sum_{i=1}^{N} P(i) \leq \sum_{i=1}^{K^{H-1}} P_{H,i}$. Because the
priority values are non-negative, there must exist an index $j$, $1 \leq j \leq K^{H-1}$, such that $\sum_{i=1}^{j} P_{H,i} \geq x\cdot\sum_{i=1}^{N} P(i)$. According to the property of the sum tree, the value of a parent is the sum of all its
children. Thus, $\sum_{i=1}^{K^{H-1}} P_{H,i} = \sum_{i=1}^{K^{m-1}} P_{m,i}$, $\forall m = 1, 2, \cdots, H-1$. Therefore, the same argument holds
for each level. This concludes the proof of Lemma 3.2.3.
Theorem 3.2.4. Let $j_m = \min j'$ such that $\sum_{i=1}^{j'} P_{m,i} \geq x\cdot\sum_{i=1}^{N} P(i)$. Then $j_m$ is the parent node of $j_{m+1}$,
$\forall m = 1, 2, \cdots, H-1$.
Proof. The child nodes of index $j$ at level $m$ are $K\cdot(j-1)+1, \cdots, K\cdot j$ at level $m+1$. According
to the definition of the sum tree and the property of $j_m$, we obtain $\sum_{i=1}^{j_m - 1} P_{m,i} = \sum_{i=1}^{K\cdot(j_m-1)} P_{m+1,i} < x\cdot\sum_{i=1}^{N} P(i)$. Thus, the index of the cutoff node at level $m+1$ must satisfy $j_{m+1} \geq K\cdot(j_m - 1)+1$. Noticing
that $\sum_{i=1}^{j_m} P_{m,i} = \sum_{i=1}^{K\cdot j_m} P_{m+1,i} \geq x\cdot\sum_{i=1}^{N} P(i)$, the index of the cutoff node at level $m+1$
satisfies $j_{m+1} \leq K\cdot j_m$. Combining $K\cdot(j_m - 1)+1 \leq j_{m+1} \leq K\cdot j_m$, we obtain that $j_m$ is the parent node of
$j_{m+1}$.
We refer to such a node $j_m$ as the cutoff node at level $m$. The goal of sampling is to find the index of the
cutoff node at the last level of the tree. According to Theorem 3.2.4, the cutoff node at level $m$ is the parent
of the cutoff node at level $m+1$. Therefore, we can start from the root node and perform the search only
over the child nodes of the current cutoff node. To determine which child node is the cutoff node, we maintain a cumulative sum of
all the nodes to the left of the cutoff at each level. Please refer to Algorithm 1 for details. We also illustrate an
example of the process in Figure 3.3, where K = 4 and H = 6.
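To make the data structure concrete, here is a minimal Python sketch of a K-ary sum tree supporting the three operations above (priority retrieval, priority update, and prefix-sum index computation). It is an illustrative reimplementation of Algorithm 1, not the exact code used in this work; the array-based layout, class name and method names are our own.

```python
import math
import random

class KarySumTree:
    """Array-based K-ary sum tree. Leaves hold priorities; each internal node
    stores the sum of its children, so the root holds the total priority mass."""

    def __init__(self, capacity, fanout=4):
        self.fanout = fanout
        # Round the number of leaves up to a power of `fanout` for a complete tree
        # (float log may occasionally add one extra level, which is harmless).
        height = max(1, math.ceil(math.log(capacity, fanout)))
        self.num_leaves = fanout ** height
        self.leaf_start = (fanout ** height - 1) // (fanout - 1)  # index of first leaf
        self.values = [0.0] * (self.leaf_start + self.num_leaves)

    def get(self, idx):
        """Priority retrieval: O(1) lookup of leaf `idx`."""
        return self.values[self.leaf_start + idx]

    def total(self):
        return self.values[0]

    def update(self, idx, priority):
        """Priority update: propagate the change from leaf `idx` to the root, O(log_K N)."""
        node = self.leaf_start + idx
        delta = priority - self.values[node]
        while True:
            self.values[node] += delta
            if node == 0:
                break
            node = (node - 1) // self.fanout  # parent index

    def prefix_sum_index(self, prefix_sum):
        """Return the leaf index i of the cutoff node, i.e. the smallest i such that
        P(0) + ... + P(i) >= prefix_sum (Theorem 3.2.4 traversal)."""
        node = 0
        while node < self.leaf_start:             # while `node` is an internal node
            child = node * self.fanout + 1        # left-most child
            partial = 0.0
            for k in range(self.fanout):
                value = self.values[child + k]
                # Stop at the cutoff child (the last child also catches float round-off).
                if partial + value >= prefix_sum or k == self.fanout - 1:
                    node = child + k
                    break
                partial += value
            prefix_sum -= partial
        return node - self.leaf_start

    def sample(self):
        """Sample a leaf index with probability proportional to its priority."""
        return self.prefix_sum_index(random.random() * self.total())

if __name__ == "__main__":
    tree = KarySumTree(capacity=8, fanout=4)
    for i, p in enumerate([0.1, 0.4, 0.05, 0.2, 0.0, 0.15, 0.05, 0.05]):
        tree.update(i, p)
    counts = [0] * 8
    for _ in range(10000):
        counts[tree.sample()] += 1
    print(counts)  # roughly proportional to the priorities above
```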
3.2.3 Practical Implementation
Theorem 3.2.1 provides a mathematically justified criterion for selecting states from the environment dataset for
rollout generation to maximize the "diversity" of the model dataset, yet it poses several practical challenges
for implementation: 1) It requires a full sweep of all the states in the environment dataset before each sampling.
This is problematic because the number of data points in the environment dataset N_env grows linearly as training
progresses. 2) Stochastic gradient descent assumes uniform sampling of the data distribution, whereas
prioritized sampling breaks this assumption and introduces bias. 3) Training the model data distribution to
convergence is expensive but crucial before evaluating the priority. A complete algorithm that handles the
aforementioned practical challenges is presented in Algorithm 2.
Algorithm 2 Maximum Entropy Model Rollouts for Model-Based Policy Optimization
1: Initialize environment dataset D_env and model dataset D_model
2: Initialize SAC policy π_φ, predictive model p_θ and model derived policy distribution π_ψ
3: for t = 1 : total_num_steps do
4:     if t % model_update_freq == 0 then
5:         Train model p_θ on D_env via maximum likelihood
6:     end if
7:     Sample a_t ∼ π_φ(·|s_t); execute a_t in the environment and observe s_{t+1}
8:     Compute priority p_t according to Equation 3.7; add (s_t, a_t, s_{t+1}, p_t) to D_env
9:     for j = 1 : M do
10:        Sample s_j ∼ P(j) = p_j^α / Σ_i p_i^α from D_env
11:        Compute importance-sampling weight w_j = (N · P(j))^{−β} / max_i w_i
12:        Sample a_j ∼ π_φ(·|s_j); perform a one-step rollout using p_θ and obtain ŝ'_j
13:    end for
14:    Add {(s_j, a_j, ŝ'_j, w_j)}_{j=1}^{M} to the next segment in D_model
15:    Update π_ψ on {(s_j, a_j)}_{j=1}^{M} via maximum likelihood for D epochs
16:    Update the priority of s_j according to Equation 3.7 for all j
17:    for G iterations do
18:        Sample a segment index k uniformly; sample a batch of size B from segment k uniformly
19:        Update the Q network as φ_Q ← φ_Q − λ_π (1/B) Σ_{i=1}^{B} w_i · ∇_{φ_Q} J_π(φ_Q, i)
20:        Update the policy using J_π(φ) = (1/B) Σ_{i=1}^{B} [D_KL(π || exp{Q^π − V^π})]
21:    end for
22: end for
Stochastic prioritization. Inspired by [63], we only update the priority of the states that are just sampled
to avoid an expensive full sweep before each sampling. An immediate consequence of this approach is
that certain states with low priorities will not be sampled for a very long time. This potentially leads to
overfitting. Following [63], we use stochastic prioritization that interpolates between pure greedy and
uniform sampling with the following probability of sampling state i:
$$\Pr(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \quad (3.6)$$
where $p_i \geq 0$ is the priority of state and action $i$. The exponent $\alpha$ determines how much prioritization is
used, with $\alpha = 0$ corresponding to the uniform case. According to Theorem 3.2.1, we compute $p_i$ as
$$p_i = -\log\left(\sqrt{2\pi}\,\pi_\psi(a_i|s_i)\,\sigma(\pi_\psi(\cdot|s_i))\right) \quad (3.7)$$
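As an illustration of Equations 3.6 and 3.7, the sketch below (our own naming; it assumes the model-derived policy π_ψ is a diagonal Gaussian and, for a multi-dimensional action, sums the per-dimension terms, which is an assumption beyond the scalar derivation above) computes the MEMR priority of a candidate state-action pair and the resulting sampling probabilities.

```python
import numpy as np

def memr_priority(action, mean, std):
    """Priority from Eq. 3.7: p = -log(sqrt(2*pi) * pi_psi(a|s) * sigma(pi_psi(.|s))).
    For a diagonal Gaussian we evaluate the term per action dimension and sum."""
    action, mean, std = map(np.asarray, (action, mean, std))
    # Per-dimension Gaussian density of the model-derived policy at the action.
    density = np.exp(-((action - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)
    # -log(sqrt(2*pi) * density * std) per dimension, i.e. (a - mu)^2 / (2 * sigma^2) >= 0.
    per_dim = -np.log(np.sqrt(2 * np.pi) * density * std)
    return float(per_dim.sum())

def sampling_probabilities(priorities, alpha=0.6):
    """Stochastic prioritization from Eq. 3.6."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    return p / p.sum()

if __name__ == "__main__":
    # A candidate far from the policy mean gets a higher priority (more "diverse").
    print(memr_priority(action=[1.5, -0.2], mean=[0.0, 0.0], std=[0.5, 0.5]))
    print(memr_priority(action=[0.1,  0.0], mean=[0.0, 0.0], std=[0.5, 0.5]))
    print(sampling_probabilities([2.0, 0.5, 0.1]))
```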
Figure 3.4: Left: Segmented replay buffer for model generated rollouts, organized into K segments of size M each. Each segment contains data sampled
from the same environment state distribution. The importance weights are only valid for data within the
same segment and do not apply across segments since those are sampled from different state distributions.
When performing updates, we uniformly sample a segment index and then uniformly sample within that
segment. Right: model rollouts. Circles represent states and arrows represent actions. The objective is to
make the actions expanded from the same state as far apart as possible so that the model rollouts are "diverse".
Correcting the bias. Using prioritized sampling introduces bias when fitting the Q network of SAC.
Inspired by [63], we apply weighted importance sampling (IS) when calculating the loss of the Q network,
where the weight for sample i is
$$w_i = \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^{\beta} \quad (3.8)$$
Segmented replay buffer. According to Algorithm 2, we update the priorities after sampling states from
the environment dataset to perform model rollouts. Thus, the sampling distribution of every batch of M model
rollouts is different. This leads to incorrect importance weights if we randomly sample a batch
from the model dataset that contains data generated from different distributions to perform policy updates.
To fix this, we introduce a segmented replay buffer that groups every M rollouts into the same segment. During
sampling for policy updates, we randomly sample a segment index, then sample a batch from that segment.
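The following is a simplified sketch of this bookkeeping (hypothetical class and method names, not the exact implementation used in this work): each call to add_segment stores the M freshly generated rollouts together with their importance weights as one segment, and sample_batch first picks a segment uniformly and then samples uniformly within it, so importance weights are never mixed across segments.

```python
import random
from collections import deque

class SegmentedReplayBuffer:
    """Model-rollout buffer organized into segments of M transitions each.
    Importance weights are only comparable within a segment."""

    def __init__(self, max_segments):
        self.segments = deque(maxlen=max_segments)  # oldest segments are evicted

    def add_segment(self, rollouts):
        """`rollouts` is a list of M tuples (s, a, s_next, weight) generated
        from the same environment-state sampling distribution."""
        self.segments.append(list(rollouts))

    def sample_batch(self, batch_size):
        """Uniformly pick one segment, then sample the batch uniformly within it."""
        segment = random.choice(self.segments)
        return [random.choice(segment) for _ in range(batch_size)]

if __name__ == "__main__":
    buf = SegmentedReplayBuffer(max_segments=4)
    for seg_id in range(3):
        buf.add_segment([(f"s{seg_id}_{j}", None, None, 1.0) for j in range(8)])
    batch = buf.sample_batch(batch_size=5)
    print(batch)  # all transitions in the batch come from a single segment
```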
Training model derived policy distribution. Fitting π_ψ using D_model via maximum likelihood to
converge is costly since the size of D_model is large and this operation must be performed every time we
generate model rollouts. Since the data in the model buffer is swapped rapidly, we treat it as an online learning
procedure and only perform several gradient updates on the newly stored data.
Figure 3.5: Training curves of MEMR and two baselines (SAC and MBPO) on Ant-v2, Hopper-v2, HalfCheetah-v2 and Walker2d-v2. Solid curves depict the mean of five trials and
shaded regions correspond to standard deviation among trials. The first row depicts the policy quality vs.
the total number of environment interactions. The second row shows the policy quality vs. the number of
policy updates. The third row shows the policy quality vs. the number of model rollouts.
3.3 Evaluation
Our experimental evaluation aims to study the following question: How well does MEMR perform on
RL benchmarks compared to state-of-the-art model-based and model-free algorithms, in terms of sample
efficiency, asymptotic performance and learning efficiency?
Figure 3.6: The results of the ablation study. Model Dataset Size refers to the size of D_model shown in Algorithm 2.
Policy Updates refers to the number of policy updates per environment step. It indicates how informative
the model rollouts are to the SAC agent. Prioritization strength α is the exponent term used to calculate the
probability of states being sampled. The smaller it is, the more uniform the distribution would be.
3.3.1 Comparative Analysis
We evaluate MEMR on the MuJoCo benchmarks [75]. We compare our method with the state-of-the-art model-
based method, MBPO [32]. As shown in Figure 3.5, MEMR matches the asymptotic performance of MBPO
whereas MEMR only uses 1/4 of the policy updates and a fraction of the model rollouts. It indicates that MEMR is
more efficient in terms of the model rollout data used for policy updates. It also translates into a significant training
speedup. Compared with the state-of-the-art model-free method, SAC [24], MEMR matches the asymptotic
performance and the learning efficiency.
3.3.2 Ablation Study
In this section, we conduct ablation studies on our proposed method. We primarily analyze how the policy
quality of our algorithm changes by varying the size of the model dataset, the number of policy updates
per environment step and the prioritization strength α.
Model dataset size. The size of the model dataset D_model affects how fast the algorithm converges. Since
SAC is an off-policy algorithm, the same experience is expected to be revisited several times on average
[63]. A small dataset hinders the learning progress as the same transition resides in the buffer for only
a short period. On the other hand, a large model dataset decreases the sample diversity of each batch used
to perform policy updates.
Number of policy updates per environment step. As shown in Figure 3.5, MEMR converges as fast as
SAC in terms of policy updates. Surprisingly, we found that increasing the number of policy updates per
environment step does not increase the convergence speed, as shown in Figure 3.6. It indicates that
much of the computation power in [32] is wasted on less informative model rollouts that barely help the
learning of the value functions in SAC.
Prioritization strength. Strong prioritization leads to overfitting to a local optimum, whereas weak prioritiza-
tion leads to less model rollout diversity. As shown in Figure 3.6, we found that α = 0.6 works best for all
benchmarks.
Chapter 4
A Framework for Mapping DRL Algorithms with Prioritized Replay
Buffer onto Heterogeneous Platform
4.1 Overview
In this chapter, we present a framework for mapping DRL algorithms with a Prioritized Replay Buffer [63]
onto a heterogeneous platform. It aims to tackle the challenge of dynamic optimal mapping of primitives as
discussed in Section 1.6.2. We start by discussing the target DRL algorithms of our framework. Then, we
discuss the motivation for using a CPU-GPU-FPGA heterogeneous platform. Finally, we present our mapping
and design space exploration (DSE) that yields the optimal computation efficiency.
4.1.1 Target Deep Reinforcement Learning Algorithms
We show a generic view of the target parallel RL workflow [28, 17] in Figure 4.1a. It contains a data collection
loop and a training loop. The main component of the data collection loop is the actor and the main
component of the training loop is the learner. The Prioritized Replay Buffer is used to store data collected
by the actors and to sample data for the learner. The details of each component are shown in Figure 4.1b.
35
Actor
Actor
Actor
Actor
Prioritized Replay
Buffer
Learner
Data Insertion
Sample
Priority Update
Weight Synchronization
Data Collection Loop Training Loop
(a) High-level diagram
Environment
Policy Network
Local Storage
State
Action
Reward
Initial Priority
Actor
Replay Management Module
3-ary Sum Tree
Index 0 1 2 3 4 5 6 7 8
Leaf
Level 1
Root
Data Storage
0 1 2 3 4 5 6 7 8
(state, action, reward)
Insertion
Read
Learner
Data
Forward
Propagation
Loss
Backward Propagation
Gradients
Weight
Update
(b) Details of each component. The Prioritized Replay Buffer is further decomposed into Replay Management Module
and Data Storage.
Figure 4.1: Overview of existing parallel reinforcement learning frameworks [28].
36
4.1.1.1 Data Collection Loop
The data collection loop is inside each actor as shown in Figure 4.1b. Each actor contains an instance of
the environment, a policy network represented as a neural network, and a local storage. The environment
outputs the current state s. The policy network computes an action a given the current state s via neural
network inference. The action a is actuated in the environment to obtain the next state s' and a reward
r. The policy network computes the absolute value of the current temporal difference error P defined in
Equation 1.6 as the initial priority using (s, a, s', r). After observing the next state s', the policy network
computes the next action a' and the data collection loop continues until the end of the training. Each actor
contains a local storage to temporarily store the data points consisting of tuples (s, a, s', r, P) collected by
the actor. When the local storage is full, all the data points are popped out and inserted into the Prioritized
Replay Buffer [63]. The functionality of the local storage is to reduce the frequency of adding data into the
Prioritized Replay Buffer. This reduces the synchronization frequency when multiple actors add their data
concurrently.
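A minimal sketch of this data collection loop is shown below (our own simplified Python illustration with hypothetical environment, policy and replay-buffer interfaces; the initial priority is taken to be the absolute TD error, as described above).

```python
def run_actor(env, policy, replay_buffer, local_capacity=50, total_steps=100000):
    """Actor-side data collection loop: interact, compute an initial priority,
    buffer locally, and flush to the Prioritized Replay Buffer when full."""
    local_storage = []
    state = env.reset()
    for _ in range(total_steps):
        action = policy.act(state)                       # neural network inference
        next_state, reward, done, _ = env.step(action)   # environment emulation
        # Initial priority: absolute TD error of the current transition.
        priority = abs(policy.td_error(state, action, reward, next_state))
        local_storage.append((state, action, next_state, reward, priority))
        # Flush in bulk to reduce synchronization on the shared replay buffer.
        if len(local_storage) == local_capacity:
            replay_buffer.insert_many(local_storage)
            local_storage = []
        state = env.reset() if done else next_state
```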
4.1.1.2 Training Loop
The training loop occurs between the learner and the Prioritized Replay Buffer. Following [28, 17], we use a
centralized learner to perform policy updates. At each step, the learner i) samples a batch of indices via the
Replay Management Module using the probability distribution proportional to the priorities; ii) accesses
the actual data points in the Data Storage using the sampled indices; iii) performs forward propagation to
compute the loss; iv) performs backward propagation to compute the gradients; v) updates the weights of
the neural network using the gradients via Stochastic Gradient Descent (SGD) [62]; vi) sets the priorities of
the sampled batch data to the new priorities after the SGD update via the Replay Management Module.
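The sketch below illustrates one iteration of this training loop in PyTorch-style Python (a simplified illustration with hypothetical replay-buffer and loss interfaces, not the framework's actual code); the new priorities computed from the forward pass are written back in the last step.

```python
import torch

def learner_step(replay_buffer, model, optimizer):
    """One policy-update step of the centralized learner."""
    # i) sample indices proportionally to the priorities, ii) fetch the data points.
    indices, batch, is_weights = replay_buffer.sample()

    # iii) forward propagation: the per-sample loss also yields the new priorities.
    per_sample_loss = model.loss(batch)                  # shape: [batch_size]
    loss = (torch.as_tensor(is_weights) * per_sample_loss).mean()

    # iv) backward propagation and v) weight update via SGD.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # vi) write the new priorities back through the Replay Management Module.
    new_priorities = per_sample_loss.detach().abs().cpu().numpy()
    replay_buffer.update_priorities(indices, new_priorities)
```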
Learner abstraction. Note that the learner can be instantiated by different DRL algorithms such as
DQN [54], DDPG [48], TD3 [21] and SAC [24]. The original ApeX framework uses DQN [54] and DDPG
[48]. R2D2 uses DQN [54] with recurrent neural networks. Nevertheless, the learner can be abstracted as
performing a single iteration of Stochastic Gradient Descent. This significantly simplifies our mapping
methodology, as we only need to profile the execution time of forward and backward propagation without
caring about the underlying algorithm.
Data dependency. It is worth noting that the priorities of the sampled batch data must be updated with
the new priorities before the sampling of the next batch of indices. Otherwise, the next batch of indices will
be sampled from the old probability distribution, which potentially causes convergence issues.
4.1.2 Motivation for Heterogeneous Platform
Heterogeneous platforms consisting of CPU, FPGA and GPU are promising for accelerating DRL algorithms.
This is because the speed of running the key DRL primitives varies significantly among different DRL
algorithms. For instance, training a small neural network is faster on FPGA than on CPU or GPU, whereas
training a large neural network is faster on GPU than on CPU or FPGA, as shown
in Section 4.5.3. Sampling with a small batch size is faster on FPGA than on CPU or GPU, whereas
sampling with a large batch size is faster on GPU than on CPU or FPGA, as shown in Section 4.5.3.
4.2 Accelerator Design for Primitives
In this section, we discuss the accelerator design of various primitives including actor, neural network
training, Prefix Sum Index computation and priority update on multi-core CPU, GPU and FPGA.
4.2.1 Actor
Each actor is mapped onto a CPU core that performs environment emulation and neural network inference.
Accelerating environment emulation is out of the scope of this paper. Our benchmark environments utilize
existing open-source software in OpenAI gym [10]. Following [28], neural network inference is performed
on a CPU core after each environment interaction.
4.2.2 Neural Network Training
4.2.2.1 Algorithm Description
We consider SGD training algorithms composed of forward propagation (FW), loss computation (LOSS),
backward propagation (BW), weight aggregation (WA) and weight update (WU) steps.
4.2.2.2 Training on CPU and GPU
We use PyTorch [58] to train neural networks on CPU and GPU. On CPU, PyTorch utilizes OpenMP [57] to
exploit intra-operation parallelism. On GPU, PyTorch utilizes the cuDNN backend to exploit the massive SIMT of
the GPU.
4.2.2.3 Training on FPGA
We carefully design the Learner Module with the goal of minimizing the execution time of each gradient
step. The design principle of the Learner Module is to support both pipelining across different layers of the
neural network and data parallelism (e.g., a batch of data points is split into smaller batches and processed
concurrently). Based on this principle, we design a Multi-Pipeline Dataflow architecture composed of a
learner pipeline as shown in Fig. 4.2.
Figure 4.2: Learner Module Architecture: an example pipeline for a 2-layer neural network. L_i means the i-th layer.
The learner pipeline for an L-layer neural network model consists of n = 3 × (L − 1) stages: FW through
the (L − 1) layers of the policy and value networks, computing the LOSS, BW through (L − 2) layers, and WA for all
the (L − 1) weight tensors. Each of these stages is mapped to a unique Tensor Unit, TU_i, i ∈ [0, ..., n − 1].
Each TU is a systolic array of Multiply-Accumulate elements. We express all the FW, BW and WA stages as
general matrix multiplications (for CNNs, we apply the im2col [38] algorithm). A TU for FW (BW) exploits
parallelism among different output neurons (accesses different weights in parallel). A TU for WA
takes the activations generated by FW and activation gradients generated by BW, and outputs the weight
gradients for accumulation. It exploits parallelism along the neurons in two adjacent layers. The WU
modules update the weight buffers in FW (BW) TUs after the accumulation of weight gradients from all
the samples in a batch is completed. To realize data streaming between stages in a pipeline, these modules
(TUs) are connected by FIFOs.
Let B denote the batch size. A total of DP such pipelines are allocated to exploit data
parallelism within a batch. Each pipeline processes a sub-batch of B/DP data points, and a reduction buffer is used
to average the weight gradients obtained in each pipeline. Conceptually, for a given batch size, a higher DP
achieves higher throughput for the FW-BW-WA stages, but causes larger time or area overhead for the reduction
over all the pipelines. A high DP can also lead to low effective hardware utilization in each pipeline if the
resulting sub-batch size is too small to saturate the concurrency provided by all the stages. The Data Parallel
Factor (DP) needs to be carefully chosen to achieve the best performance under the constraints of a given
FPGA device. The design space exploration process for searching for the optimal DP is described in Section
4.4.1.2.
Figure 4.3: Illustration of computing the Prefix Sum Index (using a 4-ary sum tree; batch size 3 in the illustration).
4.2.3 Prefix Sum Index Computation
4.2.3.1 Algorithm Description
We illustrate an example of finding the Prefix Sum Index in Figure 4.3 and Algorithm 3. The inputs to the
algorithm are the current node values v and the batch size B. The outputs are B sampled indices drawn from the
probability distribution defined by the node values. First, we compute the Prefix Sum as rand() × v[root],
where rand() samples a random number uniformly from [0, 1]. In order to find the Prefix Sum Index, we
traverse from the root node to a leaf node level by level as shown in Figure 4.3. During the traversal of
each level, we find the child node such that the cumulative sum of the values stored in the child nodes is
greater than or equal to the target prefix sum value, and continue to expand that child node until it is a
leaf node, which is the node to sample. The time complexity of finding the Prefix Sum Index is O(K log_K N),
where N is the number of elements in the replay buffer and K is the number of child nodes of each parent
in the tree.
Algorithm 3 Prefix Sum Index Computation on CPU
1: Input: node values v, batch size B
2: root = getRoot();
3: #pragma omp parallel for
4: for i = 0; i < B; i++ do

TP′, according to Equation 4.2, the achievable system throughput
equals TP′. (2) Else, if TP^{platform1}_{RMM} ≤ TP′, the achievable system throughput is bound by the RMM and is thus
lower than TP′.
Step 2. We then map the RMM to the platform yielding max_{platform1}(B / TP^{platform1}_{RMM}). Note that if
any of TP^{CPU}_{RMM} or TP^{FPGA}_{RMM} ≥ TP′, and the platform for the learner is GPU in the previous step, we
prune out the choice of GPU for the RMM. This is because letting the RMM and the learner share GPU resources would
reduce TP′ from B / TP^{GPU}_{train} to B / (TP^{GPU}_{train} + TP^{GPU}_{RMM}). On CPU and FPGA, this problem does not occur.
The pre-defined hardware constraint for the RMM (1 thread on CPU, 1 SLR on FPGA) ensures that the RMM does not
share hardware resources with the learner, and thus cannot affect TP′.
Step 3. Steps 1 and 2 produce an assignment vector a where each element of a is the device
assignment for a primitive. We now decide the device assignment of the memory component, i.e., the Data
Storage. The total data traffic to the Data Storage during each gradient step is B words of sampling indices
from the RMM, B × E words of sampled data to the learner (E is the size of each data point stored in the
Data Storage), and N_actor × E words of inserted data from the actors. We use a one-hot encoding vector, v,
to represent the assignment of the Data Storage. For a computation primitive i ∈ [Replay Sampling, Replay
Update, Training], v[i] = 0 if and only if the Data Storage is assigned to the device a[i], and v[i] = 1 if and
only if the Data Storage is not assigned to the device a[i]. We place the Data Storage on the platform that yields
the minimum total data traffic: v = argmin{v[0] × E, v[1] × B × E, v[2] × N_actor × E}.

Table 4.2: Overview of benchmark environments, DRL algorithms and neural network architectures

Environment | Algo | |S|   | |A| | NN Architecture              | MACs
CartPole    | DDPG | 4     | 1   | 3-layer MLP, hidden size 8   | 137
Hopper      | DDPG | 11    | 3   | 3-layer MLP, hidden size 256 | 70.1K
Pong        | DQN  | 84×84 | 6   | ConvNet in [54]              | 18.8M
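To make the three-step mapping procedure concrete, here is a highly simplified Python sketch of the design space exploration (our own illustration; the profiled execution-time tables, Equation 4.2 and the exact pruning rules of the actual framework are abstracted behind hypothetical dictionaries and helper names).

```python
def map_primitives(train_time, rmm_time, traffic):
    """train_time[dev], rmm_time[dev]: profiled per-step execution times (ms)
    for the learner and the Replay Management Module on each device.
    traffic[role]: data traffic to the Data Storage generated by that role."""
    # Step 1: place the learner on the device with the fastest training step.
    learner_dev = min(train_time, key=train_time.get)

    # Step 2: place the RMM on the fastest device, but avoid sharing the GPU with
    # the learner, since that would lower the achievable training throughput.
    rmm_candidates = {d: t for d, t in rmm_time.items()
                      if not (d == "gpu" and learner_dev == "gpu")}
    rmm_dev = min(rmm_candidates, key=rmm_candidates.get)

    # Step 3: place the Data Storage next to the role that generates the most
    # traffic, so the largest transfers stay on-device (actors run on the CPU).
    heaviest_role = max(traffic, key=traffic.get)
    storage_dev = {"rmm": rmm_dev, "learner": learner_dev, "actors": "cpu"}[heaviest_role]

    return {"learner": learner_dev, "rmm": rmm_dev, "data_storage": storage_dev}


if __name__ == "__main__":
    # Hypothetical profiling numbers for a medium-sized MLP with a large batch.
    assignment = map_primitives(
        train_time={"cpu": 9.0, "gpu": 1.2, "fpga": 2.5},
        rmm_time={"cpu": 0.8, "gpu": 0.6, "fpga": 0.1},
        traffic={"rmm": 16384, "learner": 16384 * 64, "actors": 16 * 64},
    )
    print(assignment)  # e.g., learner on GPU, RMM on FPGA, storage with the learner
```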
4.4.3 Template Instantiation
We develop a code base composed of (1) the complete CPU multi-thread host program template with the
Inter-Processing Unit Data Transfer system, (2) the host program for interfacing a host CPU thread with
a Processing Kernel on the accelerator (Intra-Processing Unit Data Transfer) under its various primitive
mappings, and (3) the kernel programs on the accelerators under various mapping options (GPU and FPGA).
As shown in Fig. 4.7, a Template Instantiater takes the device assignment result of the primitive mapping
from the Mapping Algorithm, and uses it to obtain the parameterized code snippets (2) and (3) for each
assigned accelerator from the code base. After compiling (3) to generate kernel executables (or bitstreams),
the host code snippets in (2) are then filled into (1), the host program template, for an end-to-end complete
implementation.
4.5 Experiments
Our experiments aim to demonstrate i) the DRL performance in terms of rewards achieved by our framework
is the same as that of the serial version of the corresponding DRL algorithm; ii) the execution time of each primitive
varies w.r.t. the input parameters, including batch size and neural network architecture; iii) the superiority of the
mapping generated by our framework compared with other mappings on various benchmark environments
solved by different DRL algorithms.
4.5.1 Experimental Setup
Hardware Platform and Toolchain. Our experiments are conducted using an Intel(R) Xeon(R) Gold 5120
CPU @ 2.20GHz with 56 cores, an Nvidia TITAN Xp GPU and a Xilinx Alveo U200 accelerator board. We
develop a parameterized FPGA kernel template using High-Level Synthesis (HLS) for quick customization
and easy integration with domain-specific frameworks (e.g., PyTorch [58]). We follow the VITIS hardware
development flow for bitstream generation. OpenCL is used to implement the data transfer between the
host and the FPGA.
Benchmark DRL Environments. We select 3 benchmark environments including the classic control task
CartPole, the MuJoCo task Hopper and the Atari game Pong. The DRL algorithms used to solve each environment
and the neural network architectures are shown in Table 4.2, where |S| denotes the dimension of the state
space and |A| denotes the dimension of the action space.
Hyper-parameters. The number of actors is set to 16 for all environments according to existing work
[28]. The number of child nodes per parent node K used in the Sum Tree implementation is set to 2 for
simplicity.
4.5.2 DRL Algorithm Performance
As discussed in Section 4.3.1, the training throughput benefits from the data-dependency relaxed training
loop, which removes the data dependency between sampling of the next batch data and the priority update
of the previous batch data. In order to empirically verify that this design decision has little impact on
the DRL performance in terms of rewards, we train several agents with different random seeds for each
benchmark environment using the optimal mapping. We show the achieved normalized reward in each
environment for various priority update delays in Figure 4.8. The priority update delay D is defined as the
number of sampling operations performed ahead of the priority update. This can be controlled by setting the size of the
Data Queue in Figure 4.6 to D. We observe that the performance degradation caused by priority update
delay is almost negligible even with D = 200 in all benchmark environments. It empirically verifies the
feasibility of relaxing the data dependency in the training loop. Intuitively, the probability distribution used
for sampling does not change rapidly when the training gradually converges. Thus, sampling using the
old probability distribution does not significantly impact the DRL performance. However, it significantly
improves the training throughput.

Figure 4.8: Reward of RL agents trained in the benchmark environments (CartPole, Hopper, Pong) with various priority update delays (1, 10, 50, 100, 200). The reward is normalized by the reward of an expert agent.

Table 4.3: System DSE result of various configurations from system design space exploration

Environment (DNN MACs) | Batch size | RMM  | Learner | Data storage
CartPole (137)         | 32         | FPGA | FPGA    | FPGA
CartPole (137)         | 256        | FPGA | FPGA    | FPGA
CartPole (137)         | 2048       | FPGA | FPGA    | FPGA
CartPole (137)         | 16384      | FPGA | FPGA    | FPGA
Hopper (70K)           | 32         | FPGA | FPGA    | FPGA
Hopper (70K)           | 256        | CPU  | GPU     | GPU
Hopper (70K)           | 2048       | CPU  | GPU     | GPU
Hopper (70K)           | 16384      | FPGA | GPU     | GPU
Pong (18M)             | 32         | CPU  | GPU     | GPU
Pong (18M)             | 64         | CPU  | GPU     | GPU
Pong (18M)             | 256        | CPU  | GPU     | CPU
Figure 4.9: Execution time (in milliseconds) of a single neural network training step in the benchmark environ-
ments (CartPole with 137 MACs, Hopper with 70K MACs, Pong with 18M MACs) for various batch sizes on CPU, GPU and FPGA.
Figure 4.10: Execution time (in milliseconds) of a single Prefix Sum Index computation and priority update
for various batch sizes on various hardware platforms. The number after "cpu" denotes the number of
threads used in OpenMP [57].
Figure 4.11: The training throughput in gradient steps per second (GPS) for various batch sizes under various
mappings. The mapping "X-Y" denotes that the learner is mapped onto "X" and the Replay Management Module
is mapped onto "Y". The line plot shows the theoretical optimal performance given the environment and batch
size.
4.5.3 Primitive Acceleration Performance
We profile the performance of various primitives including neural network training in Figure 4.9, Prefix
Sum Index computation and priority update in Figure 4.10.
Training primitive. The number of CPU threads used for training on CPU is 16. We observe that the
training performance of FPGA dominates over CPU and GPU when the arithmetic intensity is low (e.g.,
when the size of the neural network is small as in CartPole, or when the batch size is less than 128 in Hopper
and less than 16 in Pong). The training performance of GPU dominates over CPU and FPGA when the
arithmetic intensity is high (e.g., when the batch size in Hopper is larger than 1024 and the batch size
in Pong is larger than 16). This is because the kernel and memory overhead of GPU are not negligible
when the arithmetic intensity is low. The high memory access latency with low data re-use in small-
batch training leaves the SIMT computation power of the GPU severely under-utilized. The superiority of
the training performance on FPGA arises from our high-throughput customized hardware design. However,
the training primitive on GPU starts to outperform that on FPGA when the batch size
increases, due to the then-negligible kernel launch overhead and higher clock frequency.
Prefix Sum Index computation. On CPU, we observe that the execution time of Prefix Sum Index com-
putation decreases as the number of CPU threads increases. This is because Prefix Sum Index computation
is a read-only operation, so each computation inside a batch can be executed in parallel. On GPU, the
execution time is almost the same when the batch size increases. This indicates that the computation of
the Prefix Sum Index is bound by the kernel launch overhead and fails to saturate the GPU's SIMT power. On
FPGA, we observe a linear increase in the execution time as the batch size increases above 1024, as the number
of sampling and update pipelines reaches the resource bound of an SLR. For batch sizes smaller than 1024, the
increase in RMM operation latency is not as severe because we scale up bank parallelism with increasing
batch sizes to make the performance more scalable. For Prefix Sum Index computation, GPU is more scalable
to larger batch sizes since there is no data dependency between obtaining samples within a batch and
they can be executed in a SIMD manner. Compared to GPU, the FPGA Prefix Sum performance ranges
from a 41× speedup to a 4.6× slowdown over all the batch sizes.
Priority update. Unlike Prefix Sum Index computation, where different samples in a batch are
independent, in the priority update computation multiple update requests pose write-after-write data access
dependencies at the root. The execution time improvement of priority update is almost negligible or
negatively affected when the number of CPU threads increases. This is due to the poor cache performance
caused by memory access conflicts as discussed in Section 4.2.4.2. The execution time of priority update
on GPU increases linearly as the batch size increases due to the inevitable serial execution at the root node and
the high memory access latency that cannot be hidden by computation. On the other hand, the FPGA-based
implementation features hardware pipelining and single-cycle on-chip data accesses to maximize the
throughput of the serial execution. This leads to consistently superior priority update performance across all
the batch sizes (11∼98× speedup compared to GPU) as shown in Figure 4.10.
Overall, we observe that the optimal mapping of all the primitives varies as the input algorithm configu-
rations change. This makes a fixed mapping of DRL algorithms inefficient and motivates the necessity to
automatically generate the mapping based on the inputs.
4.5.4 System Mapping and Performance Analysis
In the bar plots of Fig. 4.11, we show the achieved system throughput for the three benchmarks under
different batch sizes and mappings, and in the line plot we show the theoretical optimal training throughput.
The theoretical optimal training throughput is the maximum throughput of the training primitive among
the various mappings of the learner. It is achieved when i) the learner is mapped onto the platform with the lowest
execution time of the neural network training primitive; ii) the mapping of the RMM does not slow down the
training throughput; iii) the overhead from thread-level synchronization of the data transfer queues is
negligible to the system throughput. As shown in the line plot of Fig. 4.11, the difference between the performance
achieved using our system DSE and the theoretical optimum is within 5%.

Table 4.4: Design parameters and resource allocation from FPGA Architecture Exploration

FPGA Hardware Parameters: Pipeline Factor (PI) = B/2, Data Parallel Factor (DP) = 2,
RMM Bank Parallelism (S) = 1∼32, #SLR Constraint (RMM, Learner) = [1, 2]

Module       | SRAM          | REG        | LUT        | DSP
RMM (S=32)   | 4.5MB (12.8%) | 263K (11%) | 181K (15%) | 1280 (18%)
DQN Learner  | 17.6MB (51%)  | 994K (42%) | 615K (52%) | 4315 (64%)
DDPG Learner | 12.3MB (35%)  | 782K (33%) | 721K (46%) | 2557 (38%)
For each configuration in Fig. 4.11, Table 4.3 shows the optimal mapping returned
by the system DSE Mapping Algorithm, and Table 4.4 shows the hardware parameters returned by the
Architecture Exploration when the FPGA is used for mapping the RMM or the Learner. For the CartPole benchmark,
when both the DNN size and batch size are small, both training and RMM operations are mapped on FPGA
as it outperforms GPU for these primitives. As the batch size increases (i.e., batch size B > 2048), the
learner gradient step execution time becomes larger, such that the latency of RMM operations can be hidden
using either GPU or FPGA. Our Mapping Algorithm chooses the mapping that minimizes the total number
of devices and the amount of data communication, so it still maps both RMM and learner to the FPGA.
In the Hopper benchmark, the same observation also applies to the medium-sized DNN when the batch
size is small. As the batch size further increases, GPU outperforms FPGA on training due to its superior
amount of parallel resources and higher frequency. As the learner is mapped to GPU for B > 256, although
FPGA outperforms CPU for the RMM operations, the learner is the bottleneck and assigning the RMM to
either CPU or FPGA yields the same overall throughput. However, when the batch size reaches a threshold
(B = 16384) where the CPU RMM operation latency can no longer be hidden by the learner, mapping the
RMM to FPGA has obvious improvement over other baselines. For large DNN and batch size in the Pong
game benchmark, the learner is consistently mapped to GPU due to its superior training performance. Even
with slower RMM performance on CPU than FPGA, the learner remains the bottleneck of the system, so
we map the RMM to CPU to minimize the number of devices and the communication requirement.
We observe that all three devices (CPU-GPU-FPGA) should be used for optimal system performance
when the DNN is small and the batch size is large (e.g., the Hopper benchmark using a 3-layer MLP with batch
size 16384). This is because large-batch training favors GPU over CPU and FPGA, while large-batch RMM
operations can only be effectively accelerated without reducing system throughput on the FPGA (mapping
RMM to CPU will shift the bottleneck from training to RMM operations, and mapping RMM to GPU
will take over the resources for the learner, lowering the training throughput by about 30%). Compared
with baseline mappings, our CPU-GPU-FPGA system achieves up to 11× higher throughput. In all the
benchmarks and baselines, our framework achieves up to 997.3× speedup.
Chapter 5
Scalable Policy Optimization
In Chapter 4, we focused on the hardware mapping of the existing parallel Deep Reinforcement Learning frame-
work ApeX [28]. In this chapter, we focus on improving the scalability of ApeX [28] by utilizing parallel
learners. We start by analyzing the motivation for using parallel learners and the computation challenges it
poses.
Figure 5.1: The convergence rate for various batch sizes (256 to 4096) on the Deepmind Control Suite [76] benchmarks
Acrobot-swingup-v1, Humanoid-run-v1 and Humanoid-stand-v1. Each
run is repeated for 3 different random seeds; the curve shows the mean value and the shaded area shows
the standard deviation.
Table 5.1: The neural network architecture representing the Q network for the Arcade Learning Environment
[5]. A is the dimension of the action space.

Layer          | Input Shape | Kernel Shape | Stride | Padding
Convolution 2D | 4×84×84     | 32×8×8       | 4      | 4
Convolution 2D | 32×22×22    | 64×4×4       | 2      | 2
Convolution 2D | 64×12×12    | 64×3×3       | 1      | 1
Flatten        | 64×12×12    | N/A          | N/A    | N/A
Dense          | 9216        | 9216×512     | N/A    | N/A
Dense          | 512         | 512×A        | N/A    | N/A
5.1 Motivation
5.1.1 The Effect of Batch Size on Learning Efficiency
We first investigate the effect of increasing the batch size in existing ApeX [28] implementations on the learning
efficiency. As shown in [72], increasing the batch size improves the learning efficiency when the learner in
ApeX is DQN [54] (denoted as ApeX-DQN) on the Atari benchmarks. In this work, we conduct preliminary
experiments using the widely used Deepmind Control Suite [76] benchmark to study how the batch size
affects the learning efficiency when the learner in ApeX is TD3 [21] (denoted as ApeX-TD3). The results
are shown in Figure 5.1. We observe that as the batch size increases, the learning efficiency consistently
improves in all 3 benchmark environments. It empirically indicates that we can increase the batch size
to improve the learning efficiency.
Remark 5.1.1. Note that training deep neural networks via supervised learning using large batch sizes
leads to worse generalization compared with using small batch sizes [37]. In deep reinforcement learning, the
environment used for training is exactly the environment used for testing. Thus, DRL agents in principle have
access to all the data by interacting with the environment and can train models using the full dataset without
worrying about generalization.
Figure 5.2: Execution time of sampling and priority update with various batch sizes. The number after "cpu"
indicates the number of threads used for parallel computing.
Figure 5.3: The execution time of the forward propagation of a 4-layer fully-connected neural network for
ApeX-TD3 with 1024 hidden units and ReLU activation [2] for various batch sizes.
Figure 5.4: The execution time of weight averaging of a 4-layer fully-connected neural network for ApeX-
TD3 with 1024 hidden units and ReLU activation [2] for various numbers of learners.
Figure 5.5: The execution time of the forward propagation of a standard Convolutional Neural Network for
ApeX-DQN [54] shown in Table 5.1 for various batch sizes.
Figure 5.6: The execution time of weight averaging of a standard Convolutional Neural Network for ApeX-
DQN [54] shown in Table 5.1 for various numbers of learners.
5.1.2 Computation Challenges
Although increasing the batch size leads to better learning efficiency, it poses computation challenges as
the batch size increases. A policy update consists of i) sampling a batch of data from the Prioritized Replay
Buffer; ii) performing forward propagation to obtain the loss and the new priorities; iii) performing backward
propagation to obtain the gradients; iv) updating the weights of the neural network using the gradients; v)
performing the priority update of the sampled data using the new priorities from step ii. In order to gain an intuitive
understanding of how the computation time of each step scales with the batch size, we profile the execution
time of sampling and priority update for various batch sizes in Figure 5.2. We also profile the execution
time of forward propagation for various batch sizes and of weight averaging for various numbers of learners
in Figure 5.3, Figure 5.5, Figure 5.4 and Figure 5.6. Note that the execution time of backward propagation
is the same as that of the forward propagation, and the execution time of the weight update does not scale as the batch
size increases. The neural network architecture for ApeX-TD3 [21] is a 4-layer fully-connected neural
network with 1024 units at each hidden layer and ReLU activation [2]. The neural network architecture
for ApeX-DQN is shown in Table 5.1. We observe that the execution time of sampling and priority update
scales linearly as the batch size increases. The execution time of forward propagation is constant when the
batch size is smaller than a threshold and scales linearly when the batch size is larger than the threshold.
The threshold is typically determined by the peak computation throughput of the device. This indicates that continually
increasing the batch size eventually degrades the computation efficiency.
Figure 5.7: Top: the execution flow of existing multi-learner data parallel ApeX. Bottom: the execution flow
of multi-learner ApeX with local Stochastic Gradient Descent, also known as Scalable Policy Optimization.
5.1.3 Existing ApeX with Multiple Learners
5.1.3.1 Procedure
Existing ApeX [28] solves this issue via data parallelism as shown in the top of Figure 5.7. We call it
multi-learner data parallel ApeX. It utilizes a central learner and L local learners. The central learner
samples a batch of data from the Prioritized Replay Buffer and splits it into L sub-batches. The central
learner sends one sub-batch to each local learner. Then, each local learner processes its sub-batch, performing
forward propagation to obtain the sub-new-priorities and backward propagation to obtain the sub-gradients.
All the sub-new-priorities and sub-gradients are then transferred to the central learner, which performs
gradient averaging to obtain the gradients. The gradients are used to perform the weight update, and the
sub-new-priorities are used to perform the priority update. The updated weights are transferred back to
each local learner. We define synchronization overhead as the additional workload of multi-learner
data parallel ApeX compared with single learner ApeX. The synchronization overhead of multi-learner
data parallel ApeX includes i) transferring each sub-batch from the central learner to the corresponding
local learner; ii) transferring the gradients and priorities from each local learner to the central learner; iii)
performing gradient averaging; iv) transferring the updated weights from the central learner to the local
learners.
Table 5.2: Notations Table

Symbol | Meaning
L | Number of local learners
B | Batch size of single learner ApeX
E | Number of training epochs
b = B/L | Batch size of each learner in multi-learner data parallel ApeX
T^{B}_{sample} | Time to sample data with batch size B
T^{Sub-batch}_{Trans} | Time to transfer a sub-batch from the central learner to one local learner
T^{b}_{for} | Time to perform forward propagation with batch size b
T^{b}_{back} | Time to perform backward propagation with batch size b
T^{Gradient}_{Trans} | Time to transfer the gradients from one local learner to the central learner
T^{L}_{Average} | Time to perform gradient averaging with L learners
T_{Update} | Time to perform weight update using calculated gradients
T^{Weight}_{Trans} | Time to transfer weights from the central learner to one local learner
T^{B}_{Priority} | Time to perform priority update with batch size B
T^{SPO}_{total} | Time to perform a policy update of SPO
T^{ApeX}_{total} | Time to perform a policy update of single-learner ApeX
T^{multi-ApeX}_{total} | Time to perform a policy update of multi-learner data parallel ApeX
SF_{multi-ApeX} | Scaling factor of multi-learner data parallel ApeX
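To make the per-update synchronization overhead explicit, the procedure of multi-learner data parallel ApeX can be summarized in the following Python sketch. The buffer and learner method names (sample, forward_backward, apply_gradients, get_weights, set_weights, update_priorities) are hypothetical interfaces used only for illustration; they are not names from the actual implementation.

import torch

def average_gradients(per_learner_grads):
    # Element-wise average of the per-learner gradient lists (gradient averaging).
    return [torch.stack(tensors).mean(dim=0) for tensors in zip(*per_learner_grads)]

def data_parallel_apex_update(buffer, central_learner, local_learners, batch_size):
    """One policy update of multi-learner data parallel ApeX (illustrative sketch).
    Every step marked (sync) is part of the synchronization overhead defined above."""
    idx, batch = buffer.sample(batch_size)                     # centralized sampling
    sub_batches = torch.chunk(batch, len(local_learners))      # (sync) sub-batch transfer
    results = [learner.forward_backward(sub)                   # forward/backward on each learner
               for learner, sub in zip(local_learners, sub_batches)]
    grads = average_gradients([r["grads"] for r in results])   # (sync) gradient averaging
    priorities = torch.cat([r["priorities"] for r in results])
    central_learner.apply_gradients(grads)                     # weight update
    new_weights = central_learner.get_weights()
    for learner in local_learners:
        learner.set_weights(new_weights)                       # (sync) weight transfer
    buffer.update_priorities(idx, priorities)                  # centralized priority update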
5.1.3.2 Analytical Performance Model
We propose an analytical performance model to quantitatively analyze how the synchronization overhead
impacts the scaling factor. To simplify the analysis, we assume that the time for each local learner to process
each sub-batch is identical. Then, the total time to perform a policy update of multi-learner ApeX is:
T^{multi-ApeX}_{total} = T^{B}_{sample} + T^{Sub-batch}_{Trans} + T^{b}_{for} + T^{b}_{back} + T^{Gradient}_{Trans} + T^{L}_{Average} + T_{Update} + T^{Weight}_{Trans} + T^{B}_{Priority}    (5.1)
The total time to perform a policy update in single learner ApeX is:
T^{ApeX}_{total} = T^{B}_{sample} + T^{B}_{for} + T^{B}_{back} + T_{Update} + T^{B}_{Priority}    (5.2)
The notations used in Equation 5.1 and Equation 5.2 are shown in Table 5.2.
5.1.4 Issues of Scalability
In order to analyze the scaling factor of multi-learner data parallel ApeX, we make the following simplifications and assumptions:
• The execution time of forward propagation is identical to the execution time of backward propagation: T_{for} = T_{back}.
• The execution time to perform forward propagation with batch size B is L times larger than the execution time to perform forward propagation with batch size b = B/L: T^{B}_{for} = L T^{b}_{for}. This is valid when the batch size is large enough, according to Figure 5.5.
• According to our measurement, the execution time to perform weight update of the target neural network is less than 0.05 ms, while the execution time to perform other operations, including forward/backward propagation and replay operations, is larger than 1 ms. As the execution time of weight update is much smaller than the execution time of forward/backward propagation and replay operations, the execution time to perform weight update is ignored in our analysis.
Let T^{B}_{replay} = T^{B}_{sample} + T^{B}_{Priority} and T^{B}_{Trans} = T^{Sub-batch}_{Trans} + T^{Gradient}_{Trans} + T^{Weight}_{Trans}. The scaling factor of multi-learner data parallel ApeX is:

SF_{multi-ApeX} = T^{ApeX}_{total} / T^{multi-ApeX}_{total} = 1 - 1 / (1 + 2T^{b}_{for} / (T^{B}_{replay} + T^{B}_{Trans} + T^{L}_{Avg}))    (5.3)
Figure 5.8: Overall Diagram of Scalable Policy Optimization. Actors 1 to M (each with an environment, a policy network and local storage producing state, action, reward and initial priority) feed a Prioritized Replay Buffer organized into banks 1 to L through a queue; learners 1 to L exchange data and new priorities with their designated banks, and a central weight server performs weight synchronization and averaging of the learner weights.
where the notations are defined in Table 5.2. We observe that as the number of local learners L increases,
the scaling factor decreases. For example, when the neural network is a 4-layer multi-layer perceptron
and the batch size is 4096, the improvement in computation efficiency of multi-learner data parallel ApeX
compared with single-learner ApeX when L = 2 is 1.25×, and the scaling factor is 0.63. When L = 4, the
improvement in computation efficiency is 1.34×, and the scaling factor is 0.34. The decreasing scaling factor
when increasing the number of local learners L is mainly caused by two factors: i) the sampling and priority
update are performed by the central learner, and their execution time scales linearly with the batch size; ii) the
data transfer and gradient averaging are performed at each policy update, and their execution time cannot be
ignored compared to the execution time of forward and backward propagation.
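Once the component times in Table 5.2 have been measured, the analytical model can be evaluated with a few lines of Python. The sketch below computes Equations 5.1 and 5.2 and, following the worked example above (a 1.25× speedup with L = 2 corresponds to a scaling factor of 0.63), reports the scaling factor as the speedup divided by the number of local learners. The example timings are placeholders, not the profiled values behind Figures 5.2 to 5.6.

def multi_apex_scaling_factor(times, L):
    """Evaluate Equations 5.1 and 5.2 from measured per-step times (in ms) and
    return (T_single, T_multi, speedup, scaling factor). `times` holds placeholder
    measurements keyed by the notation of Table 5.2."""
    t_multi = (times["sample_B"] + times["subbatch_trans"] + times["for_b"]
               + times["back_b"] + times["grad_trans"] + times["average_L"]
               + times["update"] + times["weight_trans"] + times["priority_B"])
    t_single = (times["sample_B"] + times["for_B"] + times["back_B"]
                + times["update"] + times["priority_B"])
    speedup = t_single / t_multi
    # Scaling factor taken as speedup / L, matching the numeric example in the text.
    return t_single, t_multi, speedup, speedup / L

# Hypothetical component times in milliseconds, for illustration only.
times = dict(sample_B=1.2, priority_B=1.0, for_B=4.0, back_B=4.0,
             for_b=2.0, back_b=2.0, update=0.05,
             subbatch_trans=0.5, grad_trans=0.5, weight_trans=0.5, average_L=0.4)
print(multi_apex_scaling_factor(times, L=2))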
5.2 Scalable Policy Optimization
As we motivate in Section 5.1, the scaling factor decreases as the number of local learners increases due to i) centralized
sampling and priority update; ii) the synchronization overhead caused by data transfer and gradient
averaging. To tackle the first issue, we propose the banked Prioritized Replay Buffer that splits the data storage
into multiple banks. This allows multiple learners to sample in parallel and avoids transferring
each sub-batch to the local learner. To tackle the second issue, we apply local SGD [71], which only performs
weight averaging after several local SGD iterations performed at each learner in parallel. This significantly
reduces the synchronization overhead and improves the scaling factor. We call our approach Scalable Policy
Optimization (SPO). Like ApeX, when the learner is DQN [54], our approach is named SPO-DQN, and
when the learner is TD3 [21], our approach is named SPO-TD3.
Algorithm 5 Local SGD [50]
1: Input: training horizon T, number of learners L, initial weight w_0, objective function f, local batch size b, dataset D, weight synchronization horizon H.
2: for t in 0, ..., T − 1 do
3:     for worker k in 0, ..., L − 1 in parallel do
4:         (x_i, y_i)_{i=1}^{b} ∼ D {sample}
5:         w^k_t ← w^k_t − γ(t) (1/b) Σ_{i=1}^{b} ∇_{w^k_t} f(x_i, y_i; w^k_t)
6:         if t mod H == 0 then
7:             w^k_t ← (1/L) Σ_{j=1}^{L} w^j_t {weight averaging}
8:         end if
9:     end for
10: end for
5.2.1 Local Stochastic Gradient Descent
In standard Stochastic Gradient Descent [62], the learner samples a mini-batch of data of size B, performs
forward propagation to compute the loss, performs backward propagation to compute the gradients and
performs a weight update using the SGD optimizer. In local SGD, as described in Algorithm 5, each learner
samples a mini-batch of size b from the dataset and performs SGD using its local copy of the weights. Every
H training iterations, the local weights are synchronized to the average of the weights stored in all the workers.
Usually, the local batch size is b = B/L, where B is the size of the mini-batch in standard SGD and L is the
number of local learners.
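As a concrete illustration of Algorithm 5, the following PyTorch sketch simulates L local learners inside a single process on a synthetic regression problem and averages their weights every H iterations. It is a minimal sketch of local SGD, not the distributed implementation used in SPO; the dataset, learning rate and horizon are placeholders.

import copy
import torch
import torch.nn as nn

L, H, b, steps = 4, 10, 32, 100            # learners, averaging horizon, local batch, iterations
torch.manual_seed(0)
X, Y = torch.randn(4096, 8), torch.randn(4096, 1)   # synthetic stand-in for the dataset D

model = nn.Linear(8, 1)                              # shared initial weights w_0
learners = [copy.deepcopy(model) for _ in range(L)]
optimizers = [torch.optim.SGD(m.parameters(), lr=1e-2) for m in learners]

for t in range(steps):
    for m, opt in zip(learners, optimizers):         # one local SGD step per learner
        idx = torch.randint(0, X.shape[0], (b,))
        loss = nn.functional.mse_loss(m(X[idx]), Y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    if (t + 1) % H == 0:                             # weight averaging every H steps
        with torch.no_grad():
            avg = {name: torch.stack([m.state_dict()[name] for m in learners]).mean(dim=0)
                   for name in learners[0].state_dict()}
            for m in learners:
                m.load_state_dict(avg)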
Applicability
According to Algorithm 5, local SGD can only be applied when the target value y used in the loss function
f is stationary. However, when training the Q network using Equation 1.5, the target value is determined
by the target Q network, and the weights of the target Q network are synchronized to the Q network after
every F steps. Thus, we need to ensure that the target Q network is fixed while each local learner performs
local SGD, and we only update the weights of the target Q network after each weight averaging.
Algorithm 6 SPO Actor
1: Input: environment instance env, local neural network weights θ, local buffer lb, global buffer gb with N banks gb_1, ..., gb_N, initial bank index i = rand() % N, weight server s, weight synchronization frequency T_sync.
2: θ ← s.getWeights() {Initialize weights}
3: o = env.reset() {Reset local environment}
4: for t in 0, ..., T − 1 do
5:     a = π_θ(o) {Select action}
6:     o′, r, d = env.step(a) {Execute action}
7:     lb.add(o, a, o′, r, d) {Add data to local buffer}
8:     if lb.full() then
9:         d = lb.get() {Remove data from local buffer}
10:        p = computePriority(d)
11:        gb_i.add(d, p) {Add data to global buffer}
12:        i = (i + 1) % N {Choose next bank}
13:    end if
14:    if t mod T_sync == 0 then
15:        θ ← s.getWeights()
16:    end if
17: end for
5.2.2 Overall Diagram
We show an overview of the system architecture of Scalable Policy Optimization in Figure 5.8. It consists
of parallel actors collecting data concurrently, parallel learners performing local SGD concurrently, a central
Prioritized Replay Buffer with multiple banks and a central weight server. The actors in SPO are almost
identical to those in ApeX, except that the actors update their weights from a central weight server instead of
from the learner. A sketch of the SPO actor is shown in Algorithm 6.
Algorithm 7 SPO-DQN Learner
1: Input: global buffer gb with bank i, neural network weights θ, target network weights θ̂, weight server s, weight averaging interval H, target update frequency F, loss function L_Q, update function U.
2: θ ← s.getWeights() {Initialize weights}
3: for e in 0, ..., E − 1 do
4:     for t in 0, ..., H − 1 do
5:         idx, B = gb_i.sample() {Sample batch}
6:         l_t, p_t = L_Q(B, θ_t, θ̂_t) {Compute loss, priority}
7:         θ_{t+1} = U(l_t, θ_t) {Update weights}
8:         gb_i.updatePriority(idx, p_t) {Update priority}
9:     end for
10:    s.send(θ_t) {Send current weights to server}
11:    θ_t = s.recv() {Receive averaged weights}
12:    if e mod F == 0 then
13:        θ̂ ← θ {Update target network}
14:    end if
15: end for
5.2.2.1 Banked Prioritized Replay Buffer
In order to alleviate the issues caused by centralized sampling and priority update, we propose to split
the Prioritized Replay Buffer into L banks, where L is the number of local learners. Each learner only
samples from its designated bank as shown in Figure 5.8. This enables parallel sampling and priority update.
However, each learner samples from a different probability distribution in the banked Prioritized Replay Buffer
than in the centralized Prioritized Replay Buffer. To compensate for the difference, an importance weight needs to be
multiplied after the loss is computed. The importance weight of the i-th data point sampled in the j-th bank
can be computed as:
w_{i,j} = (p_{j}(i) / \sum_{i,j} p_{j}(i)) · (\sum_{i} p_{j}(i) / p_{j}(i)) ≈ 1/N    (5.4)
where N is the total number of data points in all the banks and p_j(i) is the priority of the i-th data point in
the j-th bank. Note that the approximation is based on the assumption that the sum of all the priorities in each
bank is identical.
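A minimal sketch of the banked buffer is shown below, assuming proportional prioritized sampling within each bank; the actual design uses a sum-tree-based Prioritized Replay Buffer per bank, and the round-robin insertion mirrors Algorithm 6. Class and method names are illustrative.

import numpy as np

class Bank:
    """One bank: a prioritized store sampled only by its designated learner."""
    def __init__(self, capacity):
        self.capacity, self.next, self.size = capacity, 0, 0
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity, dtype=np.float64)

    def add(self, item, priority):
        self.data[self.next] = item
        self.priorities[self.next] = priority
        self.next = (self.next + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        p = self.priorities[:self.size]
        idx = np.random.choice(self.size, size=batch_size, p=p / p.sum())
        return idx, [self.data[i] for i in idx]

    def update_priority(self, idx, new_priorities):
        self.priorities[idx] = new_priorities

class BankedReplayBuffer:
    """L banks; actors insert round-robin, learner j samples only from bank j."""
    def __init__(self, num_banks, capacity_per_bank):
        self.banks = [Bank(capacity_per_bank) for _ in range(num_banks)]
        self._next_bank = 0

    def add(self, item, priority):
        self.banks[self._next_bank].add(item, priority)
        self._next_bank = (self._next_bank + 1) % len(self.banks)

    def sample(self, learner_id, batch_size):
        return self.banks[learner_id].sample(batch_size)

    def update_priority(self, learner_id, idx, new_priorities):
        self.banks[learner_id].update_priority(idx, new_priorities)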
Algorithm 8 SPO-TD3 Learner
1: Input: global buffer gb with bank i, Q network weights θ_Q, policy network weights θ_P, target Q network weights θ̂_Q, target policy network weights θ̂_P, weight server s, weight averaging interval H, target update frequency F, Q loss function L_Q, policy loss function L_P, Q update function U_Q, policy update function U_P.
2: θ_Q, θ_P ← s.getWeights() {Initialize weights}
3: for e in 0, ..., E − 1 do
4:     data = []
5:     for t in 0, ..., H − 1 do
6:         idx, B = gb_i.sample() {Sample batch}
7:         data.append(B)
8:         l_t, p_t = L_Q(B, θ_{t,Q}, θ̂_Q) {Compute loss, priority}
9:         θ_{t+1,Q} = U_Q(l_t, θ_{t,Q}) {Update weights}
10:        gb_i.updatePriority(idx, p_t) {Update priority}
11:    end for
12:    s.send(θ_{t,Q}) {Send current Q weights to server}
13:    θ_{t,Q} = s.recv() {Receive averaged weights}
14:    for d in data do
15:        l_t = L_P(d, θ_{t,P}, θ_{t,Q}) {Compute policy loss}
16:        θ_{t+1,P} = U_P(l_t, θ_{t,P}) {Update local policy}
17:    end for
18:    s.send(θ_{t,P}) {Send current policy weights to server}
19:    θ_{t,P} = s.recv() {Receive averaged weights}
20:    if e mod F == 0 then
21:        θ̂_Q ← θ_Q {Update target Q network}
22:        θ̂_P ← θ_P {Update target policy network}
23:    end if
24: end for
5.2.2.2 SPO-TD3
In this section, we discuss the implementation of SPO-TD3. A sketch of the SPO-TD3 learner is shown in
Algorithm 8.
Optimizing the Q network Directly applying local SGD using the Q loss function in Equation 1.5 causes
divergence. This is because the target value of the objective in Equation 1.5 changes after every weight
update. To tackle this issue, the weights of the target Q network need to be fixed before each weight
averaging. After each weight averaging, the weights of the target Q network are copied from the updated Q
network.
Optimizing the policy network Standard TD3 [21] optimizes the policy network using Equation 1.7
after each weight update of the Q network. However, directly optimizing the local policy network using the
local Q network and averaging the weights of the local policy networks causes divergence. This is because
the objective of each local policy network is different. In SPO-TD3, we split optimizing the Q network and
the policy network into two phases. In the first phase, each learner only updates the local Q network and
performs Q network weight averaging after H steps. In the second phase, each learner only updates the
local policy network using the Q network obtained after weight averaging and performs policy network weight
averaging after H steps. This guarantees that the local models always optimize a stationary objective, which
leads to convergence.
5.2.2.3 SPO-DQN
We build SPO on top of ApeX-DQN to solve tasks with a discrete action space. We use double Q learning [27],
the dueling architecture [79] and the same neural network architecture as in [54]. Note that using double Q
learning causes the target value to fluctuate during the training of each local model, violating the assumption
of local SGD. To fix this, we use the last averaged Q network to select target actions. A sketch of the SPO-DQN
learner is shown in Algorithm 7.
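Concretely, the fix amounts to computing the double Q-learning target with the last averaged Q network for action selection, so that the target stays stationary between weight averaging steps. A minimal PyTorch sketch is shown below; the network handles, the batch layout and the discount factor are assumptions for illustration.

import torch

@torch.no_grad()
def spo_dqn_target(batch, averaged_q_net, target_q_net, gamma=0.99):
    """Double Q-learning target where actions are selected by the last averaged
    Q network (kept fixed during local SGD) and evaluated by the target network.
    `batch` is assumed to hold tensors next_obs [B, obs_dim], reward [B], done [B]."""
    next_obs, reward, done = batch["next_obs"], batch["reward"], batch["done"]
    next_actions = averaged_q_net(next_obs).argmax(dim=1, keepdim=True)   # action selection
    next_q = target_q_net(next_obs).gather(1, next_actions).squeeze(1)    # action evaluation
    return reward + gamma * (1.0 - done) * next_q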
5.2.3 Analytical Performance Analysis
By using the banked Prioritized Replay Buffer, we reduce the time to perform sampling from T^{B}_{sample} to
T^{b}_{sample} = T^{B}_{sample}/L and the time to perform priority update from T^{B}_{Priority} to T^{b}_{Priority} = T^{B}_{Priority}/L, where L is
the number of local learners. Meanwhile, it avoids transferring sub-batch data from the centralized learner
to the local learners. By applying local SGD [71], we reduce the time to perform weight transfer and weight
averaging per policy update from T^{Gradient}_{Trans} and T^{L}_{Average} to T^{Gradient}_{Trans}/H and T^{L}_{Average}/H. Thus, the total time
to perform a policy update in SPO is:
T^{SPO}_{total} = T^{b}_{sample} + T^{b}_{for} + T^{b}_{back} + T_{Update} + T^{b}_{Priority} + (1/H)(T^{Weight}_{Trans} + T^{L}_{Average} + T^{Weight}_{Trans})    (5.5)
Note that the dimension of the weights is always equal to that of the gradients. Thus, T^{Weight}_{Trans} = T^{Gradient}_{Trans}, and the scaling factor of SPO is:
scaling factor of SPO is:
SF
SPO
=
T
ApeX
total
T
SPO
total
=
1
1+
(2T
Weight
Trans
+T
L
Average
)L
/(T
ApeX
total
·H)
(5.6)
As we observe from Equation 5.6, the scaling factor of SPO increases as the weight averaging frequency H
increases. However, the accuracy of the Q values decreases as the weight averaging frequency H increases.
Thus, we have to carefully choose the weight averaging frequency H such that the policy quality degradation
is negligible while the scaling factor is maximized.
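In practice, this trade-off can be explored with Equation 5.6 before launching training. The snippet below sweeps candidate values of H for a set of hypothetical measured times; the numbers are placeholders rather than profiled values.

def sf_spo(T_apex_total, T_weight_trans, T_average_L, L, H):
    # Scaling factor of SPO following Equation 5.6 (all times in ms).
    overhead = (2 * T_weight_trans + T_average_L) * L / (T_apex_total * H)
    return 1.0 / (1.0 + overhead)

# Hypothetical measurements: single-learner update 10 ms, weight transfer 1 ms,
# weight averaging with L learners 0.5 ms.
for H in (1, 5, 10, 25):
    print(H, round(sf_spo(T_apex_total=10.0, T_weight_trans=1.0,
                          T_average_L=0.5, L=4, H=H), 3))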
5.2.4 Applicability of Local SGD
The approach proposed in this work is only applicable when the batch size is large enough such that the
execution time to perform forward and backward propagation is proportional to the batch size. Otherwise,
the computation efficiency of multi-learner data parallel ApeX may be even worse than the computation
efficiency of single-learner ApeX.
5.2.5 Comparisons with Other Parallel Stochastic Gradient Descent Methods
Besides synchronous SGD and local SGD, asynchronous SGD [4] is another major approach to accelerate
Deep Reinforcement Learning. The major advantage of synchronous SGD is that its algorithmic structure is
unmodified from single-process SGD. The major disadvantage of synchronous SGD is the synchronization
overhead, which causes a low scaling factor as discussed in Section 5.1.4. The major advantage of local SGD is
the reduced synchronization overhead compared with synchronous SGD. However, local SGD only works
for non-convex functions such as neural networks when the data distribution used to train each local learner
is similar [71]. The major advantage of asynchronous SGD is its linear scalability, as no synchronization
is required between local learners. However, the “stale” gradient problem causes convergence issues, as
shown in existing DRL frameworks [56].
5.3 Experiments
We conduct experiments using simulated benchmark environments to answer the following questions:
• How does SPO compare with ApeX in terms of learning efficiency?
• How does SPO compare with multi-learner data parallel ApeX in terms of scaling factor?
• How do the weight averaging frequency H and the number of learners L impact the learning
efficiency and the scaling factor of SPO?
5.3.1 Experimental Setup
5.3.1.1 Hardware Platform
We conduct experiments on an AMD EPYC 7763 CPU with 256 cores @ 1.5 GHz. The platform also contains 4 Nvidia
RTX A5000 GPUs and 2 Nvidia RTX A4000 GPUs. We map each local learner onto 1 GPU and randomly
select a GPU for the central learner.
Figure 5.9: Benchmark Environments. (a) Mujoco [75] tasks with high dimensional state and action space, including Humanoid-v3, Pen-v0, Hammer-v0 and Door-v0. (b) Arcade Learning Environment [5] tasks, including Pong, Seaquest and MsPacman.
5.3.1.2 Software
We implement Scalable Policy Optimization using the Ray framework [46]. We use PyTorch [58] to implement
neural network training.
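A heavily simplified skeleton of how the components might be wired together with Ray is shown below. The class interfaces and the training-loop body are stubs written for illustration under assumed names; they are not the code of this work.

import ray

ray.init()

@ray.remote
class WeightServer:
    """Stores the latest averaged weights (stub)."""
    def __init__(self):
        self.weights = None
    def send(self, weights):
        self.weights = weights
    def get_weights(self):
        return self.weights

@ray.remote
class Learner:
    """One local learner bound to one replay bank (stub)."""
    def __init__(self, learner_id, server):
        self.learner_id, self.server = learner_id, server
    def train(self, epochs):
        for _ in range(epochs):
            # H local SGD steps on bank `learner_id` would go here, followed by
            # sending the local weights to the server and receiving the average.
            pass
        return self.learner_id

server = WeightServer.remote()
learners = [Learner.remote(i, server) for i in range(4)]
ray.get([l.train.remote(10) for l in learners])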
5.3.1.3 Benchmark Environments
We conduct experiments using 3 environments from the Arcade Learning Environment (Atari) [5] for SPO-DQN
and 4 environments from Mujoco [75] for SPO-TD3.
Mujoco Environments We evaluate SPO-TD3 using Mujoco [75] tasks with high dimensional state and
action space, including Humanoid-v3 (|S| = 376, |A| = 17) and three Adroit tasks [60] involving controlling
a 24-DoF simulated Shadow Hand robot tasked with hammering a nail (Hammer-v0, |S| = 46, |A| = 26),
opening a door (Door-v0, |S| = 39, |A| = 28) and twirling a pen (Pen-v0, |S| = 45, |A| = 24). A screenshot
is taken for each task as shown in Figure 5.9a. For each Mujoco task, we train our agent for 3 hours.
Arcade Learning Environment We evaluate SPO-DQN on three Arcade Learning Environments [5]
including Pong, Seaquest and MsPacman. The version is NoFrameskip-v4. We apply the same preprocessing
steps as in [54]. A screenshot is taken for each task as shown in Figure 5.9b. For each Atari task, we train
our agent for 5 hours.
5.3.2 Evaluation
5.3.2.1 Learning Efficiency
Impact of number of local learners We show the learning efficiency of SPO-DQN on Atari benchmarks [5]
for various numbers of learners in Figure 5.10 and the learning efficiency of SPO-TD3 on Mujoco
benchmarks [75] for various numbers of learners in Figure 5.11. First, we observe that increasing the number
of learners improves the computation efficiency, as shown by the total number of policy updates performed.
Second, we observe that the learning efficiency is almost identical for most tasks. However, we observe that the
learning efficiency decreases when the number of learners increases in the Seaquest, MsPacman and Hammer
tasks. We hypothesize that this is because the distance between the weights after weight averaging
and the local optimum becomes larger as the number of local learners increases.
Impact of weight averaging frequency We show the learning efficiency of SPO-DQN on Atari benchmarks [5]
for various weight averaging frequencies in Figure 5.12 and the learning efficiency of SPO-TD3 on
Mujoco benchmarks [75] for various weight averaging frequencies in Figure 5.13. First, we observe that
increasing the number of learners improves the computation efficiency, as shown by the total number of
policy updates performed. Second, we observe that increasing the weight averaging frequency typically
lowers the learning efficiency due to a larger discrepancy between the Q values after weight averaging and
the true Q values. However, increasing the weight averaging frequency improves the computation efficiency.
For most of the environments, the time used to converge decreases.
Figure 5.10: Learning efficiency of SPO-DQN on Atari benchmarks [5] (SeaquestNoFrameskip-v4, PongNoFrameskip-v4 and MsPacmanNoFrameskip-v4; x-axis: policy updates, y-axis: average test episode return) for 1, 2, 4 and 6 learners. The number of actors is 16 and the weight averaging frequency is 10. Each data point in the curve represents the average episode returns at test time over 20 episodes. Each curve shows the result of a single random seed.
Figure 5.11: Learning efficiency of SPO-TD3 on Mujoco benchmarks [75] (Humanoid-v3, Door-v0, Hammer-v0 and Pen-v0; x-axis: policy updates, y-axis: average test episode return) for 1, 2, 4 and 6 learners. The number of actors is 16 and the weight averaging frequency is 10. Each data point in the curve represents the average episode returns at test time over 30 episodes. Each curve shows the average of 5 independent runs with different random seeds. The shaded area shows the standard deviation.
Figure 5.12: Learning efficiency of SPO-DQN on Atari benchmarks (PongNoFrameskip-v4, SeaquestNoFrameskip-v4 and MsPacmanNoFrameskip-v4) for weight averaging frequencies H = 5, 10 and 25. The number of actors is 16 and the number of learners is 4. Each data point in the curve represents the average episode returns at test time over 20 episodes. Each curve shows the result of a single random seed.
5.3.2.2 Scaling Factor
We show the scaling factor of SPO-TD3 for various numbers of learners and weight averaging frequencies in
Figure 5.14 and the scaling factor of SPO-DQN for various numbers of learners and weight averaging frequencies
in Figure 5.15. We observe that as the number of local learners increases, the scaling factor decreases due to
the increased synchronization overhead of weight averaging. As the weight averaging frequency increases,
the scaling factor improves due to the reduced synchronization overhead per policy update. When the number
of operations in training increases (e.g., from a multi-layer perceptron to a Convolutional Neural Network), the
scaling factor increases. Thus, Scalable Policy Optimization is mostly applicable for environments with a
large neural network training workload, including more operations and larger batch sizes. Note that
the scaling factor is close to 1 when the batch size is 8192 and the neural network is a multi-layer perceptron,
and when the batch size is 4096 and the neural network is a CNN.
Figure 5.13: Learning efficiency of SPO-TD3 on Mujoco benchmarks (Humanoid-v3, Door-v0, Hammer-v0 and Pen-v0) for weight averaging frequencies H = 5, 10 and 25. The number of actors is 16 and the number of learners is 4. Each data point in the curve represents the average episode returns at test time over 30 episodes. Each curve shows the average of 5 independent runs with different random seeds. The shaded area shows the standard deviation.
Figure 5.14: Scaling factor of SPO-TD3 for various numbers of learners (2, 4 and 6) and weight averaging frequencies (5, 10 and 25), for batch sizes 2048, 4096 and 8192 with a multi-layer perceptron.
Figure 5.15: Scaling factor of SPO-DQN for various numbers of learners (2, 4 and 6) and weight averaging frequencies (5, 10 and 25), for batch sizes 256, 1024 and 4096 with a Convolutional Neural Network.
Chapter 6
Conclusion and Future Work
In this chapter, we conclude the dissertation by discussing the broader impact and future directions.
6.1 Broader Impact
The rapid advances of Deep Reinforcement Learning algorithms, including offline learning [43] and Safe
Reinforcement Learning [22], make DRL a natural solution for many real-world problems. For example,
most state-of-the-art AI game solutions are based on DRL [6, 69]. Also, Deep Reinforcement Learning
solutions are emerging in recommendation systems [1]. Unlike supervised learning, where the data is
pre-collected, DRL learns a policy and collects data at the same time. The amount of data to collect depends
on the state space of the environment and how well the agent explores. This indicates that training DRL agents in
general requires a large number of policy updates and takes a very long time. This makes acceleration of
DRL an interesting and important research direction.
6.2 Future Directions
6.2.1 FPGA-accelerated Environment Emulation
Environment emulation can become the bottleneck when it requires executing computationally expensive
operations on the CPU. Also, transferring data between the CPU and the accelerator used for training can be
expensive in distributed settings. GPU-based environment emulation has recently been successfully applied
in robotics [47]. GPU-based environment emulation is typically suitable for massively parallel computation
workloads. However, the majority of environment emulation workloads perform computation serially, which makes
GPU-based emulation less suitable. On the contrary, FPGAs are suitable for serially executed workloads due
to the flexible hardware pipelining they provide. Thus, FPGA-accelerated environment emulation
is a promising future direction.
6.2.2 Automatic Decomposition of Primitives
In Chapter 4, we manually decompose the computation into 3 primitives: neural network training,
sampling and priority update. The granularity of the primitive decomposition is the existing computation
unit, such as the Replay Management Module and the learner. The current mapping methodology provides optimal
computation efficiency for these primitives. However, the performance can be further improved if the
granularity of the primitive decomposition is finer. For example, when the learner is training a Convolutional
Neural Network, the current mapping methodology outputs a design where the utilization of the GPU is
100% and the utilization of the CPU is less than 10%. In this case, the neural network training time scales
linearly as the batch size increases. If we decompose the Convolutional Neural Network into convolutional
layers and fully-connected layers and map the convolutional layers onto the GPU and the fully-connected
layers onto the CPU, the training time can be further decreased and the utilization of both the CPU and the
GPU can reach 100%.
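As a toy illustration of this finer-grained mapping, the standard DQN convolutional network [54] can be split in PyTorch so that the convolutional layers run on the GPU while the fully-connected head runs on the CPU. This is only a sketch of the idea under the commonly used DQN layer sizes (an assumption, not Table 5.1), and it ignores the scheduling and overlap that would be needed to actually keep both devices busy.

import torch
import torch.nn as nn

conv = nn.Sequential(                        # convolutional layers mapped onto the GPU
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
).to("cuda")

head = nn.Sequential(                        # fully-connected layers mapped onto the CPU
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, 6),                       # 6 actions, e.g. Pong
)

x = torch.randn(256, 4, 84, 84, device="cuda")   # a batch of stacked Atari frames
q_values = head(conv(x).cpu())                    # activations cross the device boundary once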
As a future work, the current methodology steps can be improved as follows:
1. Primitive library: build a library of primitives, where each primitive only consists of basic operations that cannot be further decomposed.
2. Primitive accelerator: design an accelerator for each primitive in the library.
3. Primitive mapping: given a computational graph, map each primitive onto an accelerator such that the overall performance is maximized.
Bibliography
[1] Mohammad Mehdi Afsar, Trafford Crump, and Behrouz H. Far. “Reinforcement learning based
recommender systems: A survey”. In: CoRR abs/2101.06286 (2021). arXiv: 2101.06286.url:
https://arxiv.org/abs/2101.06286.
[2] Abien Fred Agarap. “Deep Learning using Rectified Linear Units (ReLU)”. In: ArXiv abs/1803.08375
(2018).
[3] S. Ahmad and M. O. Tokhi. “Linear Quadratic Regulator (LQR) approach for lifting and stabilizing of
two-wheeled wheelchair”. In: 2011 4th International Conference on Mechatronics (ICOM). May 2011,
pp. 1–6.doi: 10.1109/ICOM.2011.5937119.
[4] Karl Bäckström, Marina Papatriantafilou, and Philippas Tsigas. “MindTheStep-AsyncPSGD: Adaptive
Asynchronous Parallel Stochastic Gradient Descent”. In: CoRR abs/1911.03444 (2019). arXiv:
1911.03444.url: http://arxiv.org/abs/1911.03444.
[5] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. “The Arcade Learning
Environment: An Evaluation Platform for General Agents”. In: CoRR abs/1207.4708 (2012). arXiv:
1207.4708.url: http://arxiv.org/abs/1207.4708.
[6] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak,
Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Christopher Hesse, Rafal Józefowicz,
Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto,
Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever,
Jie Tang, Filip Wolski, and Susan Zhang. “Dota 2 with Large Scale Deep Reinforcement Learning”. In:
CoRR abs/1912.06680 (2019). arXiv: 1912.06680.url: http://arxiv.org/abs/1912.06680.
[7] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set. 2nd. Athena
Scientific, 2001. isbn: 1886529086.
[8] JOHN T. BETTS and WILLIAM P. HUFFMAN. “Path-constrained trajectory optimization using
sparse sequential quadratic programming”. In: Journal of Guidance, Control, and Dynamics 16.1
(1993), pp. 59–68.doi: 10.2514/3.11428. eprint: https://doi.org/10.2514/3.11428.
[9] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. “A tutorial on the
cross-entropy method”. In: ANNALS OF OPERATIONS RESEARCH 134 (2004).
[10] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. OpenAI Gym. 2016. eprint: arXiv:1606.01540.
[11] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. “Sample-Efficient
Reinforcement Learning with Stochastic Ensemble Value Expansion”. In: CoRR abs/1807.01675 (2018).
arXiv: 1807.01675.url: http://arxiv.org/abs/1807.01675.
[12] Hyungmin Cho, Pyeongseok Oh, Jiyoung Park, Wookeun Jung, and Jaejin Lee. “FA3C:
FPGA-Accelerated Deep Reinforcement Learning”. In: Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and Operating Systems. ACM. 2019,
pp. 499–513.
[13] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. “Deep Reinforcement
Learning in a Handful of Trials using Probabilistic Dynamics Models”. In: CoRR abs/1805.12114
(2018). arXiv: 1805.12114.url: http://arxiv.org/abs/1805.12114.
[14] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”. In:
Commun. ACM 51.1 (Jan. 2008), pp. 107–113.issn: 0001-0782.doi: 10.1145/1327452.1327492.
[15] Marc Peter Deisenroth and Carl Edward Rasmussen. “PILCO: A Model-Based and Data-Efficient
Approach to Policy Search”. In: Proceedings of the 28th International Conference on International
Conference on Machine Learning. ICML’11. Bellevue, Washington, USA: Omnipress, 2011,
pp. 465–472.isbn: 9781450306195.
[16] C. Ekaputri and A. Syaichu-Rohman. “Implementation model predictive control (MPC) algorithm-3
for inverted pendulum”. In: 2012 IEEE Control and System Graduate Research Colloquium. July 2012,
pp. 116–122.doi: 10.1109/ICSGRC.2012.6287146.
[17] Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. “SEED RL:
Scalable and Efficient Deep-RL with Accelerated Central Inference”. In: CoRR abs/1910.06591 (2019).
arXiv: 1910.06591.url: http://arxiv.org/abs/1910.06591.
[18] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward,
Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu.
“IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures”. In:
CoRR abs/1802.01561 (2018). arXiv: 1802.01561.url: http://arxiv.org/abs/1802.01561.
[19] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine.
“Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning”. In: CoRR
abs/1803.00101 (2018). arXiv: 1803.00101.url: http://arxiv.org/abs/1803.00101.
[20] Johannes de Fine Licht, Maciej Besta, Simon Meierhans, and Torsten Hoefler. “Transformations of
high-level synthesis codes for high-performance computing”. In: IEEE Transactions on Parallel and
Distributed Systems 32.5 (2020), pp. 1014–1029.
[21] Scott Fujimoto, H. V. Hoof, and David Meger. “Addressing Function Approximation Error in
Actor-Critic Methods”. In: ArXiv abs/1802.09477 (2018).
[22] Javier Garcia and Fernando Fernández. “A Comprehensive Survey on Safe Reinforcement Learning”.
In: J. Mach. Learn. Res. 16.1 (Jan. 2015), pp. 1437–1480.issn: 1532-4435.
[23] Ce Guo, Wayne Luk, Stanley Qing Shui Loh, Alexander Warren, and Joshua Levine. “Customisable
Control Policy Learning for Robotics”. In: 2019 IEEE 30th International Conference on
Application-specific Systems, Architectures and Processors (ASAP) . Vol. 2160. IEEE. 2019, pp. 91–98.
[24] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic: Off-Policy
Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”. In: CoRR abs/1801.01290
(2018). arXiv: 1801.01290.url: http://arxiv.org/abs/1801.01290.
[25] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan,
Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic
Algorithms and Applications”. In: CoRR abs/1812.05905 (2018). arXiv: 1812.05905.url:
http://arxiv.org/abs/1812.05905.
[26] Marshall Hargrave. Deep Learning. Apr. 2019.url:
https://www.investopedia.com/terms/d/deep-learning.asp.
[27] Hado V. Hasselt. “Double Q-learning”. In: Advances in Neural Information Processing Systems 23.
Ed. by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta. Curran Associates,
Inc., 2010, pp. 2613–2621.url: http://papers.nips.cc/paper/3964-double-q-learning.pdf.
[28] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and
David Silver. “Distributed Prioritized Experience Replay”. In: CoRR abs/1803.00933 (2018). arXiv:
1803.00933.url: http://arxiv.org/abs/1803.00933.
[29] Intel Stratix 10 MX FPGAs.url:
https://www.intel.com/content/www/us/en/products/programmable/sip/stratix-10-mx.html.
[30] Intel. GPU Memory Latency’s Impact, and Updated Test. 2021.url:
https://chipsandcheese.com/2021/05/13/gpu-memory-latencys-impact-and-updated-test/.
[31] Intel. SkyLake Specification . 2018.url: https://www.7-cpu.com/cpu/Skylake.html.
[32] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. “When to Trust Your Model:
Model-Based Policy Optimization”. In: CoRR abs/1906.08253 (2019). arXiv: 1906.08253.url:
http://arxiv.org/abs/1906.08253.
[33] Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa,
Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao,
Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb,
Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho,
Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski,
Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon,
James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean,
Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni,
Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross,
Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter,
Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,
Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. “In-Datacenter
Performance Analysis of a Tensor Processing Unit”. In: CoRR abs/1704.04760 (2017). arXiv:
1704.04760.url: http://arxiv.org/abs/1704.04760.
[34] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell,
Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi,
George Tucker, and Henryk Michalewski. “Model-Based Reinforcement Learning for Atari”. In:
CoRR abs/1903.00374 (2019). arXiv: 1903.00374.url: http://arxiv.org/abs/1903.00374.
[35] Sham Kakade. “A Natural Policy Gradient”. In: Proceedings of the 14th International Conference on
Neural Information Processing Systems: Natural and Synthetic. NIPS’01. Vancouver, British Columbia,
Canada: MIT Press, 2001, pp. 1531–1538.
[36] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. “Recurrent
Experience Replay in Distributed Reinforcement Learning”. In: May 2019.
[37] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and
Ping Tak Peter Tang. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp
Minima”. In: CoRR abs/1609.04836 (2016). arXiv: 1609.04836.url: http://arxiv.org/abs/1609.04836.
[38] Kumar Chellapilla, Sidd Puri, and Patrice Simard. “High Performance Convolutional Neural Networks for
Document Processing”. In: Tenth International Workshop on Frontiers in Handwriting Recognition. url:
https://hal.inria.fr/file/index/docid/112631/filename/p1038112283956.pdf.
[39] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. “Model-Ensemble
Trust-Region Policy Optimization”. In: CoRR abs/1802.10592 (2018). arXiv: 1802.10592.url:
http://arxiv.org/abs/1802.10592.
[40] Large FPGA methodology guide.url:
https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/ug872_largefpga.pdf.
[41] Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. “Data
center cooling using model-predictive control”. In: Advances in Neural Information Processing
Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett.
Curran Associates, Inc., 2018, pp. 3814–3823.url:
http://papers.nips.cc/paper/7638-data-center-cooling-using-model-predictive-control.pdf.
[42] Sergey Levine and Vladlen Koltun. “Guided Policy Search”. In: Proceedings of the 30th International
Conference on Machine Learning. Ed. by Sanjoy Dasgupta and David McAllester. Vol. 28. Proceedings
of Machine Learning Research 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1–9.url:
http://proceedings.mlr.press/v28/levine13.html.
[43] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. “Offline Reinforcement Learning:
Tutorial, Review, and Perspectives on Open Problems”. In: ArXiv abs/2005.01643 (2020).
[44] Weiwei Li and Emanuel Todorov. “Iterative Linear Quadratic Regulator Design for Nonlinear
Biological Movement Systems.” In: vol. 1. Jan. 2004, pp. 222–229.
[45] Yuxi Li and Dale Schuurmans. “MapReduce for Parallel Reinforcement Learning”. In: Recent
Advances in Reinforcement Learning. Ed. by Scott Sanner and Marcus Hutter. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2012, pp. 309–320.isbn: 978-3-642-29946-9.
[46] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg,
and Ion Stoica. “Ray RLLib: A Composable and Scalable Reinforcement Learning Library”. In: CoRR
abs/1712.09381 (2017). arXiv: 1712.09381.url: http://arxiv.org/abs/1712.09381.
[47] Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and
Dieter Fox. “GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning”. In: CoRR
abs/1810.05762 (2018). arXiv: 1810.05762.url: http://arxiv.org/abs/1810.05762.
[48] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. “Continuous control with deep reinforcement
learning”. In: CoRR abs/1509.02971 (2016).
[49] Long-Ji Lin. “Reinforcement Learning for Robots Using Neural Networks”. PhD thesis. USA, 1992.
[50] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. “Federated Learning
of Deep Networks using Model Averaging”. In: CoRR abs/1602.05629 (2016). arXiv: 1602.05629.url:
http://arxiv.org/abs/1602.05629.
[51] Yuan Meng, Sanmukh Kuppannagari, and Viktor Prasanna. “Accelerating proximal policy
optimization on cpu-fpga heterogeneous platforms”. In: 2020 IEEE 28th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2020, pp. 19–27.
[52] Nick Metropolis. “THE BEGINNING of the MONTE CARLO METHOD”. In.
[53] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep Reinforcement
Learning”. In: CoRR abs/1602.01783 (2016). arXiv: 1602.01783.url: http://arxiv.org/abs/1602.01783.
[54] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin A. Riedmiller. “Playing Atari with Deep Reinforcement Learning”. In:
CoRR abs/1312.5602 (2013). arXiv: 1312.5602.url: http://arxiv.org/abs/1312.5602.
[55] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. “Neural Network
Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning”. In: CoRR
abs/1708.02596 (2017). arXiv: 1708.02596.url: http://arxiv.org/abs/1708.02596.
[56] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria,
Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg,
Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. “Massively Parallel Methods for Deep
Reinforcement Learning”. In: CoRR abs/1507.04296 (2015). arXiv: 1507.04296.url:
http://arxiv.org/abs/1507.04296.
[57] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 3.0. May 2008.
url: http://www.openmp.org/mp-documents/spec30.pdf.
[58] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,
Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style, High-Performance Deep
Learning Library”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019,
pp. 8024–8035.url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-
performance-deep-learning-library.pdf.
[59] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st. USA:
John Wiley & Sons, Inc., 1994.isbn: 0471619779.
[60] A. Rajeswaran, V. Kumar, Abhishek Gupta, John Schulman, E. Todorov, and S. Levine. “Learning
Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations”. In:
ArXiv abs/1709.10087 (2018).
[61] Anil Rao. “A Survey of Numerical Methods for Optimal Control”. In: Advances in the Astronautical
Sciences 135 (Jan. 2010).
[62] H. Robbins and S. Monro. “A stochastic approximation method”. In: Annals of Mathematical Statistics
22 (1951), pp. 400–407.
[63] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. cite
arxiv:1511.05952Comment: Published at ICLR 2016. 2015.url: http://arxiv.org/abs/1511.05952.
[64] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. “Trust Region
Policy Optimization”. In: CoRR abs/1502.05477 (2015). arXiv: 1502.05477.url:
http://arxiv.org/abs/1502.05477.
[65] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy
Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347.url:
http://arxiv.org/abs/1707.06347.
[66] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to
Algorithms. USA: Cambridge University Press, 2014.isbn: 1107057132.
[67] Shengjia Shao and Wayne Luk. “Customised pearlmutter propagation: A hardware architecture for
trust region policy optimisation”. In: 2017 27th International Conference on Field Programmable Logic
and Applications (FPL). IEEE. 2017, pp. 1–6.
[68] Shengjia Shao, Jason Tsai, Michal Mysior, Wayne Luk, Thomas Chau, Alexander Warren, and
Ben Jeppesen. “Towards hardware accelerated reinforcement learning for application-specific
robotic control”. In: 2018 IEEE 29th International Conference on Application-specific Systems,
Architectures and Processors (ASAP). IEEE. 2018, pp. 1–8.
[69] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre,
George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever,
Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis.
“Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016),
pp. 484–503.url: http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.
[70] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui,
Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. “Mastering the game
of Go without human knowledge”. In: Nature 550.7676 (2017), pp. 354–359. doi: 10.1038/nature24270.
[71] Sebastian U. Stich. Local SGD Converges Fast and Communicates Little. 2019. arXiv: 1805.09767
[math.OC].
[72] Adam Stooke and Pieter Abbeel. “Accelerated Methods for Deep Reinforcement Learning”. In: (Mar.
2018).
[73] Richard S. Sutton. “Dyna, an Integrated Architecture for Learning, Planning, and Reacting”. In:
SIGART Bull. 2.4 (July 1991), pp. 160–163.issn: 0163-5719.doi: 10.1145/122344.122377.
[74] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA,
USA: A Bradford Book, 2018.isbn: 0262039249.
[75] E. Todorov, T. Erez, and Y. Tassa. “MuJoCo: A physics engine for model-based control”. In: 2012
IEEE/RSJ International Conference on Intelligent Robots and Systems. 2012, pp. 5026–5033.
[76] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel,
Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. “dm_control: Software and tasks for
continuous control”. In: Software Impacts 6 (2020), p. 100022.issn: 2665-9638.doi:
https://doi.org/10.1016/j.simpa.2020.100022.
[77] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg,
Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds,
Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard,
David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden,
Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen,
Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu,
Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II.
https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/. 2019.
[78] Tingwu Wang and Jimmy Ba. “Exploring Model-based Planning with Policy Networks”. In: CoRR
abs/1906.08649 (2019). arXiv: 1906.08649.url: http://arxiv.org/abs/1906.08649.
[79] Ziyu Wang, Nando de Freitas, and Marc Lanctot. “Dueling Network Architectures for Deep
Reinforcement Learning”. In: CoRR abs/1511.06581 (2015). arXiv: 1511.06581.url:
http://arxiv.org/abs/1511.06581.
[80] Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou. “Model Predictive Path Integral
Control using Covariance Variable Importance Sampling”. In: CoRR abs/1509.01149 (2015). arXiv:
1509.01149.url: http://arxiv.org/abs/1509.01149.
[81] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement
learning”. In: Machine Learning 8.3 (1992), pp. 229–256.doi: 10.1007/BF00992696.
[82] Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. “Algorithmic Framework
for Model-based Reinforcement Learning with Theoretical Guarantees”. In: CoRR abs/1807.03858
(2018). arXiv: 1807.03858.url: http://arxiv.org/abs/1807.03858.
[83] Chi Zhang, Sanmukh R. Kuppannagari, Rajgopal Kannan, and Viktor K. Prasanna. “Building HVAC
Scheduling Using Reinforcement Learning via Neural Network Based Model Approximation”. In:
Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities,
and Transportation. BuildSys ’19. New York, NY, USA: Association for Computing Machinery, 2019,
pp. 287–296.isbn: 9781450370059.doi: 10.1145/3360322.3360861.
Linked assets: University of Southern California Dissertations and Theses
Conceptually similar
Accelerating reinforcement learning using heterogeneous platforms: co-designing hardware, algorithm, and system solutions
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Hardware-software codesign for accelerating graph neural networks on FPGA
Accelerating scientific computing applications with reconfigurable hardware
Learning distributed representations of cells in tables
Optimal designs for high throughput stream processing using universal RAM-based permutation network
Scalable exact inference in probabilistic graphical models on multi-core platforms
Understanding dynamics of cyber-physical systems: mathematical models, control algorithms and hardware incarnations
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Graph machine learning for hardware security and security of graph machine learning: attacks and defenses
Heterogeneous graphs versus multimodal content: modeling, mining, and analysis of social network data
Emotional appraisal in deep reinforcement learning
Efficient and accurate object extraction from scanned maps by leveraging external data and learning representative context
Recording, reconstructing, and relighting virtual humans
Provable reinforcement learning for constrained and multi-agent control systems
High-throughput methods for simulation and deep reinforcement learning
Towards efficient edge intelligence with in-sensor and neuromorphic computing: algorithm-hardware co-design
Biological geometry-aware deep learning frameworks for enhancing medical cyber-physical systems
Asset Metadata
Creator: Zhang, Chi (author)
Core Title: Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-12
Publication Date: 10/18/2022
Defense Date: 08/25/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: accelerators, CPU, deep reinforcement learning, FPGA, GPU, hardware mapping, OAI-PMH Harvest
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Prasanna, Viktor (committee chair); Bogdan, Paul (committee member); Nakano, Aiichiro (committee member)
Creator Email: czhangseu@gmail.com, zhan527@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112124637
Unique identifier: UC112124637
Identifier: etd-ZhangChi-11272.pdf (filename)
Legacy Identifier: etd-ZhangChi-11272
Document Type: Dissertation
Rights: Zhang, Chi
Internet Media Type: application/pdf
Type: texts
Source: 20221019-usctheses-batch-987; University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu