High-Throughput Methods for Simulation and Deep Reinforcement Learning
by
Aleksei Petrenko
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Aleksei Petrenko
Acknowledgements
This thesis would not be possible without many people who helped me along the way. I would like to
express my deepest respect and appreciation to all those who have contributed to this work.
First and foremost, I would like to express my gratitude to my advisor, Gaurav Sukhatme, for their
guidance, support, and encouragement throughout the entire process. Thank you for creating this unique
atmosphere in the lab where people are free to experiment and work on problems they find the most
interesting. But most of all, I am grateful for your empathy and understanding that got me through difficult
times. I consider myself incredibly fortunate to have had you as my advisor.
I would also like to extend my heartfelt thank you to Vladlen Koltun, who was an advisor and mentor
during my internship at Intel Labs. Your guidance and encouragement were very empowering and helped
me believe in my ability to do world-class work. Our first collaboration helped to lay the foundation of my
entire PhD and to this day the infrastructure we created powers many projects at USC and beyond.
My gratitude also goes to the members of my thesis committee, Professors Mike Zyda, Rahul Jain, Jesse
Thomason, and Stefanos Nikolaidis for their feedback, advice, and support. I am grateful for their expertise
and guidance, which were instrumental in shaping the direction of my research and the final form of this
thesis.
I am extremely grateful to have had exceptional collaborators who worked tirelessly on our research
projects, motivated me, and from whom I learned a lot. These people include Zhehui Huang, Sumeet
Batra, Artem Molchanov, Erik Wijmans, Brennan Shacklett, Shashank Hegde, Anssi Kanervisto, Ankur
Handa, Arthur Allshire, Viktor Makoviychuk, Gavriel State, and many others. I would also like to ex-
plicitly mention many people who helped me with or contributed to my open-source projects: Edward
Beeching, Andrew Zhang, Ming Wang, Thomas Wolf, Erik Wijmans, Tushar Kumar, Costa Huang, Denys
Makoviichuk, Eugene Vinitsky, and others.
I am exceptionally grateful for all of the people who became my friends during this journey. I would
not be able to do it without you. My special thank you goes to David Millard, Gautam Salhotra, Karl
Pertsch, and Sumeet Batra, with whom I shared some of the best moments. I also want to thank Artem
Molchanov, who was there when I needed him most, and who guided me all this time, especially at the
beginning of the program. I am grateful to Jonathan Mitchell for his encouragement and humor, and for
helping me with the thesis and defense. Finally, I want to express my gratitude to Julia Abramova for her
infinite kindness, support, and understanding.
Last but not least, I want to say thank you to my family who believed in me all these years.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2: High-Throughput Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Sample Factory: Fast Asynchronous Reinforcement Learning . . . . . . . . . . . . . . . . . 5
2.1.1 Prior work in Accelerated Reinforcement Learning . . . . . . . . . . . . . . . . . . 5
2.1.2 Sample Factory: Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Sample Factory: High-Level Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Environment Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.5 Communication Between System Components . . . . . . . . . . . . . . . . . . . . 11
2.1.6 Policy Lag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.7 Multi-agent Learning and Self-play . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.8 Experiments: Computational Performance . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.9 DMLab-30 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.10 VizDoom Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.11 Self-play Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.12 Sample Factory: Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Sample Factory 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Chapter 3: High-Throughput Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Megaverse: High-Throughput Simulation Using Large Batch Rendering . . . . . . . . . . . 27
3.1.1 Prior Work: Simulation for Reinforcement Learning . . . . . . . . . . . . . . . . . 30
3.1.2 Megaverse Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Discretized Continuous Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Large Batch Simulation and Rasterization . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.5 3D Geometry Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.6 Megaverse-8 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.7 Experiments: computational performance . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.8 Single-Agent Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.9 Multi-Agent Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.10 Megaverse: Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Vectorized Physics Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Chapter 4: Simulation-Only Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.2 Related Work: Reinforcement Learning with Audio Inputs . . . . . . . . . . . . . . 50
4.1.3 ViZDoom Environment with Sound . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.4 Audio Encoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.6 Environment Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.7 Training Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.9 Training Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.10 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 DexPBT: Learning Dexterous Robotic Manipulation with Population-Based Training . . . 59
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.4 Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.5 Dual-Arm Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.6 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.7 Population-Based Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 5: Sim-to-Real Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Decentralized Control of Quadrotor Swarms with End-to-end Deep Reinforcement Learning 77
5.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Coordinated Quadrotor Flight: Related Work . . . . . . . . . . . . . . . . . . . . . 78
5.1.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.4 Simulation and Sim-to-Real Considerations . . . . . . . . . . . . . . . . . . . . . . 81
5.1.5 Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.6 Model architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1.7 Neighborhood Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1.8 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.9 Model Architecture Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.10 Attention Weights Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.11 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.12 Obstacle Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.13 Additional Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.14 Physical Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality . . . . . . . 94
5.2.1 Robotic Manipulation with Multi-Fingered Hands: Introduction . . . . . . . . . . . 95
5.2.2 Reorientation Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.3 Reinforcement Learning in Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.4 Domain Randomisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.5 Experiments and Real-World Performance . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
List of Tables
2.1 Peak throughput of various RL algorithms on System #2 in environment frames per
second and as percentage of the optimal frame rate. . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Pure sampling and training throughput with mainstream RL simulators vs. Megaverse.
The performance is reported in observations per second observed by the agent, i.e. after
frameskip. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Influence of optimized geometry and batched rendering optimizations on the overall
sampling throughput. Performance measured on a 10-core 1xGTX1080Ti system in
Megaverse-8 “Collect” scenario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Results of 1v1 matches between our agent that has access to the sound and a vision-only
agent. “Sound (dis.)” is the main agent with sound inputs disabled during the evaluation. . 57
4.2 Actor/critic observations and their dimensionality. Here N_arm = 1 for single-arm and
N_arm = 2 for dual-arm tasks. N_kp = 1 for regrasping and grasp-and-throw tasks (since
rotation tracking is not required), and N_kp = 4 for reorientation. . . . . . . . . . . . . . . 69
4.3 RL hyperparameters and reward function coefficients. Rightmost column shows final
parameter values for a single PBT experiment, dual-arm reorientation (parameters not
optimized by PBT are omitted). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Scaling up the attention policy with and without additional training. The size of the visible
local neighborhood is fixed at K = 6 drones. The metrics are averaged over 20 episodes. . 88
5.2 Observations of the policy and value networks. The input vector is 50D in size for policy
and 265D for the value function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Reward terms are computed, multiplied by their weight, and summed to produce the
reward at each timestep. d represents the rotational distance from the object’s current to
the target orientation. p_object and p_goal are the position of the object and goal respectively.
a is the current action. targ_curr and targ_prev are the current and previous joint position
targets. v_joints is the current joint velocity vector. . . . . . . . . . . . . . . . . . . . . 99
5.4 Domain randomisation parameter ranges for policy learning . . . . . . . . . . . . . . . . . 100
5.5 The results of running different models on the real robot. We run 10 trials per policy [97]
to benchmark the average consecutive successes. Individual rows within each experiment
indicate running the experiment on different days [42] and ± indicates 90% confidence
interval. Our best model was trained with ADR while non-ADR experiments had DR
ranges manually tuned. The second experiment shows results when the cube is held at a
goal for additional consecutive frames once the target cube pose is reached. . . . . . . . . 104
List of Figures
2.1 Overview of the Sample Factory architecture. N parallel rollout workers simulate k
environments each, collecting observations. These observations are processed by M
policy workers, which generate actions and new hidden states via an accelerated forward
pass on the GPU. Complete trajectories are sent from rollout workers to the learner. After
the learner completes the backpropagation step, the model parameters are updated in
shared CUDA memory and immediately fetched by the policy workers. . . . . . . . . . . . 7
2.2 a) Batched sampling enables forward pass acceleration on GPU, but rollout workers have
to wait for actions before the next environment step can be simulated, underutilizing the
CPU. b) Double-buffered sampling splits k environments on the rollout worker into two
groups, alternating between them during sampling, which practically eliminates idle time
on CPU workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Training throughput, measured in environment transitions generated per second
(considering 4-frameskip, a quarter of all transitions is observed and processed by the agent). 15
2.4 Direct comparison of wall-time performance. We show the mean and standard deviation
of four training runs for each experiment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Mean capped human-normalized training score [35] for a single-machine DMLab-30 PBT
run with Sample Factory. (Compared to cluster-scale IMPALA deployment.) . . . . . . . . 19
2.6 Training curves for standard VizDoom scenarios. We show the mean and standard
deviation for ten independent experiments conducted for each scenario. . . . . . . . . . . . 19
2.7 VizDoom Battle/Battle2 experiments. We show the mean and standard deviation for
four independent runs. Here as baselines we provide scores reported for Direct Future
Prediction (DFP) [32], and a version of DFP with additional input modalities such as depth
and segmentation masks, produced by a computer vision subsystem [148]. The latter
work only reports results for Battle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Populations of 8 agents trained in Deathmatch and Duel scenarios. On the y-axis we report
the average number of adversaries defeated in a 4-minute match. Shown are the means
and standard deviations within the population, as well as the performance of the best agent. 21
2.9 Behavior of an agent trained via self-play. Left: Agents tend to choose the chaingun to
shoot at their opponents from longer distance. Right: Agent opening a secret door to get
a more powerful weapon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Megaverse parallel architecture in the context of a reinforcement learning system.
Synchronous or asynchronous training sequentially interacts with K ≥ 1 instances of
Megaverse, each hosting N ≥ 1 environments with M ≥ 1 agents. Megaverse parallelizes
physics computations on CPU, after which the entire vector of observations is rendered in
a single pass. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 After procedurally generating the environment layout, we minimize the number of
primitives by coalescing adjacent voxels where possible. . . . . . . . . . . . . . . . . . . . 35
3.3 Overview of procedurally generated environments in Megaverse-8. At the beginning of
each episode, the visual appearance and layout of environments are sampled randomly.
See Sec. 3.1.6 for details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Single-agent performance. Megaverse-8 has a diversity of challenges, with envi-
ronments where model-free RL is able to make progress (top four) and environments
where it fails to achieve non-trivial performance (bottom four). Even for the simplest
environments, there is still considerable room for improvement. The results are reported
for 3 random seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Multi-agent performance. For the majority of tasks in Megaverse-8, increasing the
number of agents does not yield better results with our model-free RL framework. For all
tasks except tower building, a simple Team Spirit strategy of sharing rewards has either
no impact or negative impact on performance due to the increased difficulty of credit
assignment. The results are reported for 3 random seeds. . . . . . . . . . . . . . . . . . . . 43
3.6 Examples of environments supported by Isaac Gym [79] . . . . . . . . . . . . . . . . . . . 46
4.1 Trained agent follows visual and sound cues to reach the target object in a ViZDoom
environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Illustration of the network architecture and audio encoders used (see Appendix of [105]
for complete details). Here K stands for kernel size, F for number of filters, S for stride, FC
for fully connected layers and STFT for short-time Fourier transform. All convolutional
layers are followed by max-pool layers with kernel size two. . . . . . . . . . . . . . . . . . 51
4.3 Results on the main testing scenarios. Fig. 4.3a shows the comparison of different encoders
in a sound source finding task. Figs. 4.3b and 4.3c show the performance of the FFT (Fourier
transform) encoder on the Instruction and InstructionOnce environments respectively. For
each experiment we report mean and standard deviation of five independent training runs. 52
4.4 Illustrations of the Music Recognition (left) and Sound Instruction scenarios (right).
Locations of target objects and the player starting position are sampled randomly in each
episode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Tasks trained with DexPBT. Left-to-right: regrasping, throwing, single-handed reorientation,
two-handed reorientation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 An illustration of the system used to solve our complex manipulation tasks using a
combination of RL, highly parallelized robotic simulation, and Population Based Training
(PBT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Training curves with and without PBT for single-arm + hand tasks. Shaded area
is between the best and the worst policy among 8 agents in P or 8 seeds in non-PBT
experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.8 Training curves with and without PBT for dual-arm + hand tasks. Shaded area is
between the best and the worst policy among 8 agents in P or 8 seeds in non-PBT
experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Drones are trained on formations randomly sampled from one of six pre-defined scenarios
on the left. Drones employ a self- and neighbor encoder that learn a mapping from a
combination of proprioceptive and exteroceptive inputs to thrust outputs. The combined
intermediate embeddings enable policies to perform high speed, aggressive maneuvers
and learn collision avoidance behaviors e.g. formation creation, formation swaps, evader
pursuit. These are shown on the right both in simulation and physical experiments. . . . 78
5.2 Detailed neural network architectures. Left: Deepsets architecture used in simulation.
Middle: Attention architecture used in simulation. Right: Smaller deepsets architecture
deployed on Crazyflie2.0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 (Left) Evader pursuit, (Middle) N = 16 quadrotors in a dense formation after fine-
tuning (see Sec. 5.1.11), and (Right) A swarm breaking formation to avoid a collision
with an obstacle. Videos of the learned swarm behaviors in different scenarios are at:
https://sites.google.com/view/swarm-rl. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Comparison of different model architectures for N = 8 quadrotors with K = 6 neighbors
visible to each. For each architecture we show mean and standard deviation of four
independent training runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Attention weights between drones. Left: original velocities, right: velocities set to 0. . . . . 88
5.6 Time lapse of eight drones swapping goals (left) and performing a formation change (right). 90
5.7 Learning to maintain dense formations while avoiding collisions with moving external
obstacles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 (Left) Two teams of four quadrotors swap goals twice and then land at their original
starting positions. (Middle) Four quadrotors follow a moving goal, starting from the
bottom right going counter-clockwise. The goal positions are marked with red X’s. Both
plots show 3D positions over time of the four quadrotors. (Right) two teams of four
quadrotors swap goals using PID controllers and buffered Voronoi cells for collision
avoidance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.9 The DeXtreme system using an Allegro Hand in action in the real world. . . . . . . . . . . 94
5.10 High level overview of the training and inference systems. . . . . . . . . . . . . . . . . . . 97
5.11 The functioning of the Random Network Adversary . . . . . . . . . . . . . . . . . . . . . . 103
5.12 Policies trained with manual DR exhibited ‘stuck’ behaviours, where the cube remained
stuck in certain configurations and was unable to recover. An example of such behaviour
can be observed here: https://www.youtube.com/watch?v=tJgq18VbL3k. . . . . . . . . . . . . 105
Abstract
Advances in computing hardware and machine learning have enabled a data-driven approach to robotic
autonomy where control policies are discovered by analyzing raw data via interactive experience collection
and learning. In this thesis we discuss a specific implementation of this approach: we show how policies
can be trained in simulated environments using deep reinforcement learning techniques and then deployed
on real robotic systems via the sim-to-real paradigm.
We build towards this vision by developing tools for efficient simulation and learning under a con-
strained computational budget. We improve systems design of reinforcement learning algorithms and
simulators to create high-throughput GPU-accelerated infrastructure for rapid experimentation.
This learning infrastructure is then applied to continuous control problems in challenging domains. We
scale up training in a CPU-based quadrotor flight simulator to find robust policies that are able to control
physical quadrotors flying in tight formations. We then use large batch reinforcement learning in a mas-
sively parallel physics simulator IsaacGym to learn dexterous object manipulation with a multi-fingered
robotic hand and we transfer these skills from simulation to reality using automatic domain randomization.
We distill the lessons learned in these and other projects into a high-throughput learning system "Sam-
ple Factory" and release it as an open-source codebase to facilitate and accelerate further progress in the
field, as well as to democratize reinforcement learning research and make it accessible to a wider commu-
nity.
Chapter 1
Introduction
The long-standing goal of robotics research is to develop intelligent machines that can assist humans in
various tasks involving movement of objects and matter in the physical world. These machines should have
the ability to analyze high-bandwidth sensory data and make decisions in order to achieve an objective
specified by the user. Examples of such machines might include self-driving cars transporting passen-
gers to their destination, home robots doing household tasks, or quadrotor drones navigating in outdoor
environments.
The decision-making apparatus of a robot can be seen as a task-conditioned mapping from sensory obser-
vations and robot’s internal state to actions. In the most general case the robot’s observation and action
spaces can be quite complex (i.e. observing raw camera data, controlling dozens of robotic joints), which
makes the task of finding the optimal mapping extremely challenging. One popular approach to this task
is to learn the decision making mechanism by observing patterns in interactions between the robot and
the environment, e.g. by increasing probabilities of interactions that lead to desirable trajectories in state
space. This approach is known as reinforcement learning.
While interactive reinforcement learning in the physical world can be achieved, there are several com-
plicating factors. Physical robotic devices are brittle and are subject to wear and tear which limits the
amount of data that can be collected. Besides, learning in the physical world requires constant supervision
to prevent the system from getting stuck in unrecoverable states. Last but not least, the only way to ac-
celerate data collection in the real world is to increase the number of robots that simultaneously interact
with the environment. While such real-world scaling has been demonstrated [57], it increases an already
significant cost of robotic experimentation.
An alternative method involves learning robotic behavior in simulation with subsequent transfer of
learned control policies to real robots (aka sim-to-real approach). This simultaneously solves multiple chal-
lenges mentioned previously: simulated environments eliminate the danger of damaging the equipment,
can be reset to any state at any time, and can be scaled to run hundreds or thousands of experiments in
silico. At the same time, the sim-to-real approach introduces the problem of the reality gap, a discrepancy between
the simulated world in which the robot controller is trained and the physical world where it is deployed.
While this discrepancy is difficult to address, there are numerous methods designed to reduce the reality
gap, from domain randomization [97, 96] (training on a diverse distribution of simulators) to optimization
of simulator parameters (real-to-sim) [21] and fine-tuning on the real system.
In this thesis we are building essential elements of the infrastructure required for robotic learning in
simulation. We specifically focus on computational efficiency of both learning systems and simulators.
Computational performance in learning systems greatly affects the outcome of sim-to-real projects. Faster
simulators and learning algorithms can reduce experiment turnaround time, allowing the researchers to
quickly iterate on ideas. They can also reduce the cost of compute hardware required to support the
experiments, or increase the amount of collected data at the same cost thus improving the performance of
the final policy.
The following chapters break down the sim-to-real pipeline into its constituent components and dis-
cuss various methods designed by the authors to improve the simulation and learning speed and efficiency.
Outline. This thesis is structured as follows:
• Chapter 2 discusses high-throughput reinforcement learning with focus on a specific implementa-
tion, a system called "Sample Factory" [105]. We consider different types of reinforcement learning
workloads and show how synchronous and asynchronous methods can be used for efficient learning
in different scenarios.
• Chapter 3 is focused on a problem of high-throughput environment simulation with efficient uti-
lization of computational resources. We develop an immersive simulation engine called "Mega-
verse" [107] capable of rendering10
6
observations per second on a single node and introduce GPU-
accelerated simulators such as Isaac Gym [73].
• In Chapter 4 we consider simulation-only learning setups that demonstrate how various challenging
control problems can be solved via high-throughput simulation and learning.
• In Chapter 5 we apply high-throughput learning in simulation, domain randomization, and zero
shot sim-to-real transfer to control real robotic systems: a swarm of quadrotor drones achieving
coordinated flight and a dexterous multi-fingered robotic hand.
• In Chapter 6 we discuss how efficient learning systems fit into the bigger landscape of future AI in-
frastructure and discuss challenges that need to be overcome for the wide adoption of reinforcement
learning technology across domains.
Chapter 2
High-Throughput Reinforcement Learning
In this chapter we focus on tools and methods for accelerated reinforcement learning. We first present
Sample Factory (Sec. 2.1), a system focused on asynchronous parallelization of heterogeneous workloads,
i.e. data collection with CPU-based simulators and GPU-accelerated training.
We then discuss how the emergence of a new class of vectorized simulators [79, 38, 107] motivates devel-
opment of a different kind of learning system, a large-batch synchronous RL architecture that maximally
utilizes the throughput of the accelerator (i.e. GPU). We develop a new, more flexible variant of Sample Fac-
tory, version 2.0 [115] (see Sec. 2.2), that supports both asynchronous setup for heterogeneous workloads,
and streamlined synchronous approach for vectorized environments.
Parts of this chapter appeared in:
• Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav Sukhatme, and Vladlen Koltun. Sample Factory: Egocentric 3D
Control from Pixels at 100,000 FPS with Asynchronous Reinforcement Learning. In Proceedings of the 37th International
Conference on Machine Learning (ICML), 2020.
2.1 Sample Factory: Fast Asynchronous Reinforcement Learning
In this section we present and discuss "Sample Factory" [105], a fast on-policy reinforcement learning
framework that focuses on an asynchronous implementation of a policy gradient algorithm [141]. Sec-
tions below describe prior work in this area, the proposed method and architecture, and the experiments
conducted to test the system.
2.1.1 Prior work in Accelerated Reinforcement Learning
The quest for performance and scalability has been ongoing since before the advent of deep RL [70]. Higher
throughput algorithms allow for faster iteration and wider hyperparameter sweeps for the same amount
of compute resources, and are therefore highly desirable.
The standard implementation of a policy gradient algorithm is fairly straightforward. It involves a
(possibly vectorized) sampler that collects environment transitions from N_envs ≥ 1 copies of the environment
for a fixed number of timesteps T. The collected batch of experience – consisting of N_envs × T
samples – is aggregated and an iteration of SGD is performed, after which the experience can be collected
again with an updated policy. This method has acquired the name Advantage Actor-Critic (A2C) in the
literature [14]. While it is straightforward to implement and can be accelerated with batched action gen-
eration on the GPU, it has significant disadvantages. The sampling process has to halt when the actions
for the next step are being calculated, and during the backpropagation step. This leads to a significant
under-utilization of system resources during training. Other on-policy algorithms such as TRPO [121] and
PPO [123] are usually also implemented in this synchronous A2C style [31].
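As a reference point, here is a minimal sketch of this synchronous collect-then-update cycle. It assumes a hypothetical vectorized environment with a Gym-style interface and a `policy` object exposing an action distribution, a value head, and a loss function; none of these names come from an actual library.

```python
import torch

def synchronous_a2c_loop(envs, policy, optimizer, rollout_len, num_iterations):
    """Illustrative synchronous A2C-style training loop (a sketch, not a real implementation)."""
    obs = envs.reset()  # batch of observations, one per environment copy
    for _ in range(num_iterations):
        batch = []
        # Sampling phase: collect N_envs x rollout_len transitions.
        # Simulation stalls while actions are computed, and no learning happens yet.
        for _ in range(rollout_len):
            with torch.no_grad():
                action_dist, values = policy(obs)
                actions = action_dist.sample()
            next_obs, rewards, dones, _ = envs.step(actions)
            batch.append((obs, actions, rewards, dones, values))
            obs = next_obs

        # Learning phase: one SGD iteration on the aggregated batch.
        # All sampling is paused until backpropagation finishes.
        loss = policy.actor_critic_loss(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```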
Addressing the shortcomings of the naive implementation, the Asynchronous Advantage Actor-Critic
(A3C) [86] proposed a distributed scheme consisting of a number of independent actors, each with its own
copy of the policy. Every actor is responsible for environment simulation, action generation, and gradient
calculation. The gradients are asynchronously aggregated on a single parameter server, and actors query
the updated copy of the model after each collected trajectory. While the A3C design enables massive paral-
lelization, it has significant disadvantages. In a typical implementation each actor process does forward
and backward pass through the agent parameterization (a neural network) with an effective batch size of
one, which significantly limits the learning throughput. Besides, the policy that collects the experience on
the actors lags behind the central policy on the parameter server by many updates (a phenomenon known
as policy lag), making gradient estimates inaccurate at the time they reach the central policy.
GA3C [6] recognized the potential of using GPUs in an asynchronous implementation for both action
generation and learning. A separate learner component is introduced, and trajectories of experience are
communicated between the actors and the learner instead of parameter vectors. GA3C outperforms CPU-
only A3C by a significant margin, although the high communication cost between CPU actors and GPU
predictors prevents the algorithm from reaching optimal performance.
IMPALA [35] uses an architecture conceptually similar to GA3C, extended to support distributed train-
ing. An efficient implementation of GPU batching for action generation leads to increased throughput,
with reported training frame rate of 24K FPS for a single machine with 48 CPU cores, and up to 250K FPS
on a cluster with 500 CPUs. Sample Factory architecture draws inspiration from IMPALA, therefore we
thoroughly compare the two systems in a single-machine learning scenario in terms of both throughput
and sample efficiency.
The need for ever larger-scale experiments has focused attention on high-throughput reinforcement
learning in recent publications. Decentralized Distributed PPO [140] optimizes the distributed policy gra-
dient setup for multi-GPU clusters and resource-intensive environments [119] by parallelizing the learn-
ers and significantly reducing the network throughput required. Concurrent with Sample Factory, SEED
RL [34] improves upon the IMPALA architecture and achieves high throughput in both single-machine and
multi-node scenarios, although unlike our work it focuses on more expensive hardware setups involving
multiple accelerators.
Figure 2.1: Overview of the Sample Factory architecture. N parallel rollout workers simulate k envi-
ronments each, collecting observations. These observations are processed by M policy workers, which
generate actions and new hidden states via an accelerated forward pass on the GPU. Complete trajectories
are sent from rollout workers to the learner. After the learner completes the backpropagation step, the
model parameters are updated in shared CUDA memory and immediately fetched by the policy workers.
Deep RL frameworks also provide high-throughput implementations of policy gradient algorithms. RL-
lib [72], based on the distributed computation framework Ray [91], and TorchBeast [65] provide optimized
implementations of the IMPALA architecture. Rlpyt [131] implements highly-efficient asynchronous GPU
samplers that share some ideas with our work, although currently it does not include asynchronous policy
gradient methods such as IMPALA or APPO.
Methods such as APE-X [52] and R2D2 [58] demonstrate the great scalability of off-policy RL. While off-
policy algorithms exhibit state-of-the-art performance in domains such as Atari [15], they may be difficult
to extend to the full complexity of more challenging problems [138], since Q-functions may be hard to
learn for large multi-headed and autoregressive action spaces. In this work, we focused on policy gradient
methods, although there is great potential in off-policy learning. Hybrid methods such as LASER [120]
promise to combine high scalability, flexibility, and sample efficiency.
2.1.2 Sample Factory: Overview
Sample Factory is an architecture for high-throughput reinforcement learning on a single machine. The
main focus when designing the system was on making all key computations fully asynchronous, as well
as minimizing the latency and the cost of communication between components, taking full advantage of
fast local messaging.
A typical reinforcement learning experiment involves three major computational workloads: envi-
ronment simulation, model inference, and backpropagation. The overall performance of an RL system is
ultimately defined by the workload with the lowest throughput, therefore a key motivation was to build a
system in which the slowest of three workloads never has to wait for any other processes to provide the
data necessary to perform the next computation. In order to minimize the amount of time processes spend
waiting, we need to guarantee that the new portion of the input is always available, even before the next
step of computation is about to start. The system in which the most compute-intensive workload never
idles can reach the highest resource utilization, thereby approaching optimal performance.
2.1.3 Sample Factory: High-Level Design
The desire to minimize the idle time for all key computations motivates the high-level design of the system
(Fig. 2.1). We associate each computational workload with one of three dedicated types of components.
These components communicate with each other using a fast protocol based on FIFO queues and shared
memory. The queueing mechanism provides the basis for continuous and asynchronous execution, where
the next computation step can be started immediately as long as there is something in the queue to process.
The decision to assign each workload to a dedicated component type also allows us to parallelize them
independently, thereby achieving optimized resource balance. This is different from prior work [86, 35],
where a single system component, such as an actor, typically has multiple responsibilities. The three types
of components involved are rollout workers, policy workers, and learners.
Rollout workers are solely responsible for environment simulation. Each rollout worker hosts k ≥ 1
environment instances and sequentially interacts with these environments, collecting observations x_t and
rewards r_t. Note that the rollout workers do not have their own copy of the policy, which makes them very
lightweight, allowing us to massively parallelize the experience collection on modern multi-core CPUs.
The observations x_t and the hidden states h_t of the agent are then sent to the policy worker, which
collects batches of x_t, h_t from multiple rollout workers and calls the policy π, parameterized by the neural
network θ_π, to compute the action distributions µ(a_t | x_t, h_t) and the updated hidden states h_{t+1}. The
actions a_t are then sampled from the distributions µ, and along with h_{t+1} are communicated back to the
corresponding rollout worker. In turn, this rollout worker uses the actions a_t to advance the simulation
and collect the next set of observations x_{t+1} and rewards r_{t+1}.
Rollout workers save every environment transition to a trajectory buffer in shared memory. Once T
environment steps are simulated, a trajectory of observations, hidden states, actions, and rewards T =
(x_1, h_1, a_1, r_1, ..., x_T, h_T, a_T, r_T) becomes available to the learner. The learner continuously processes
batches of trajectories and updates the parameters of the actor θ_π and the critic θ_V. These parameter
updates are sent to the policy worker as soon as they are available, which reduces the amount of experience
collected by the previous version of the model, minimizing the average policy lag. This completes one
training iteration.
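To make the trajectory layout concrete, the following sketch preallocates fixed-shape tensors for a pool of trajectory slots and marks them as shared; the names and shapes are invented for illustration and do not correspond to the actual Sample Factory buffers.

```python
import torch

def make_trajectory_buffers(num_slots, rollout_len, obs_shape, hidden_size):
    """Preallocate fixed-shape trajectory storage (illustrative layout only).

    Each of the num_slots entries holds one rollout of length rollout_len:
    observations x_t, hidden states h_t, discrete actions a_t, and rewards r_t.
    """
    buffers = {
        "obs": torch.zeros(num_slots, rollout_len, *obs_shape),
        "hidden": torch.zeros(num_slots, rollout_len, hidden_size),
        "actions": torch.zeros(num_slots, rollout_len, dtype=torch.long),
        "rewards": torch.zeros(num_slots, rollout_len),
    }
    # share_memory_() moves the underlying storage to shared memory so that
    # rollout workers, policy workers, and the learner can access the same
    # tensors without copying or serializing trajectories.
    for tensor in buffers.values():
        tensor.share_memory_()
    return buffers
```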
As mentioned previously, the rollout workers do not own a copy of the policy and therefore are es-
sentially thin wrappers around the environment instances. This allows them to be massively parallelized.
Additionally, Sample Factory also parallelizes policy workers. This can be achieved because all of the
current trajectory data (x_t, h_t, a_t, ...) is stored in shared tensors that are accessible by all processes. This
allows the policy workers themselves to be stateless, and therefore consecutive trajectory steps from a
single environment can be easily processed by any of them. In practical scenarios, 2 to 4 policy worker in-
stances easily saturate the rollout workers with actions, and together with a special sampler design
(Sec. 2.1.4) allow us to eliminate this potential bottleneck.
The learner is the only component of which we run a single copy, at least as long as single-policy train-
ing is concerned (multi-policy training is discussed in section Sec. 2.1.7). We can, however, utilize multiple
accelerators on the learner through data-parallel training and Hogwild-style parameter updates [110]. To-
gether with large batch sizes typically required for stable training in complex environments, this gives
the learner sufficient throughput to match the experience collection rate, unless the policy computational
graph is highly non-trivial.
2.1.4 Environment Sampling
Rollout workers and policy workers together form the sampler - a component that generates trajectories
T . The sampling subsystem most critically affects the throughput of the RL algorithm, since it is often
the bottleneck. We propose a specific way of implementing the sampler that allows for optimal resource
utilization through minimizing the waiting time on the rollout workers.
First, note that training and experience collection are decoupled, so new environment transitions can
be collected during the backpropagation step and neither rollout nor policy workers need to wait for
the learner to finish a training iteration. There are no interruptions caused by parameter updates on the
rollout workers either, since the job of action generation is off-loaded to the policy worker. However, if not
addressed, this still leaves the rollout workers waiting for the actions to be generated by policy workers
and transferred back through interprocess communication.
To alleviate this inefficiency we use Double-Buffered Sampling (Fig. 2.2). Instead of storing
only a single environment on the rollout worker, we instead store a vector of environments E_1, ..., E_k,
where k is even for simplicity. We split this vector into two groups, {E_1, ..., E_k/2} and {E_k/2+1, ..., E_k},
and alternate between them as we go through the rollout. While the first group of environments is being
stepped through, the actions for the second group are calculated on the policy worker, and vice versa. With
a fast enough policy worker and a correctly tuned value for k we can completely mask the communication
[Figure 2.2 diagram: a) GPU-accelerated batched sampling (k envs per iteration); b) “double-buffered” sampling (k/2 envs per iteration).]
Figure 2.2: a) Batched sampling enables forward pass acceleration on GPU, but rollout workers have to
wait for actions before the next environment step can be simulated, underutilizing the CPU. b) Double-
buffered sampling splits k environments on the rollout worker into two groups, alternating between them
during sampling, which practically eliminates idle time on CPU workers.
overhead and ensure full utilization of the CPU cores during sampling, as illustrated in Fig. 2.2. For maximal
performance with double-buffered sampling we want k/2 > t_inf / t_env, where t_inf and t_env are average
inference and simulation time, respectively.
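The control flow of double-buffered sampling can be sketched as follows. This is a simplified, single-process rendering of the idea; `policy_worker` with its `request_actions`/`receive_actions` methods is a hypothetical stand-in for the real asynchronous policy worker.

```python
def double_buffered_rollout(envs, policy_worker, rollout_len):
    """Alternate between two halves of the environment vector (illustrative only).

    While one group of environments is being stepped on the CPU, the policy
    worker computes actions for the other group, hiding inference latency.
    """
    assert len(envs) % 2 == 0, "k is assumed to be even"
    half = len(envs) // 2
    groups = [envs[:half], envs[half:]]
    observations = [[env.reset() for env in group] for group in groups]

    # Queue inference for both groups so the first actions are ready when needed.
    for g in (0, 1):
        policy_worker.request_actions(group_id=g, obs=observations[g])

    for _ in range(rollout_len):
        for g in (0, 1):
            # Actions for group g were requested one half-iteration ago and were
            # computed while the other group was being stepped.
            actions = policy_worker.receive_actions(group_id=g)
            observations[g] = [env.step(a)[0] for env, a in zip(groups[g], actions)]
            # Request the next actions for this group before switching to the other.
            policy_worker.request_actions(group_id=g, obs=observations[g])
```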
2.1.5 Communication Between System Components
The key to unlocking the full potential of the local, single-machine setup is to utilize fast communica-
tion mechanisms between system components. As suggested by Fig. 2.1, there are four main pathways
for information flow: two-way communication between rollout and policy workers, transfer of complete
trajectories to the learner, and transfer of parameter updates from the learner to the policy worker. For
the first three interactions we use a mechanism based on PyTorch [101] shared memory tensors. We note
that most data structures used in an RL algorithm can be represented as tensors of fixed shape, whether
they are trajectories, observations, or hidden states. Thus we preallocate a sufficient number of tensors in
system RAM. Whenever a component needs to communicate, we copy the data into the shared tensors,
and send only the indices of these tensors through fast FIFO queues [106], making messages tiny compared
to the overall amount of data transferred.
For the parameter updates we use memory sharing on the GPU. Whenever a model update is required,
the policy worker simply copies the weights from the shared memory to its local copy of the model.
Unlike many popular asynchronous and distributed implementations, we do not perform any kind of
data serialization as a part of the communication protocol. At full throttle, Sample Factory generates and
consumes more than 1 GB of data per second, and even the fastest serialization/deserialization mechanism
would severely hinder throughput.
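A self-contained toy example of this pattern is shown below: tensors are preallocated in shared memory once, and the queues carry only small index tuples. It illustrates the general technique rather than Sample Factory's actual communication code; all names are invented.

```python
import torch
import torch.multiprocessing as mp

def rollout_worker(worker_id, shared_obs, free_slots, ready_slots):
    """Writes fake observations into preallocated shared tensors and sends only slot indices."""
    for _ in range(4):
        slot = free_slots.get()              # index of a free buffer slot
        shared_obs[slot].normal_()           # "collect" an observation in place
        ready_slots.put((worker_id, slot))   # the message itself is tiny

def main():
    num_slots, obs_size, num_workers = 8, 128, 2
    shared_obs = torch.zeros(num_slots, obs_size)
    shared_obs.share_memory_()               # visible to all worker processes

    free_slots, ready_slots = mp.Queue(), mp.Queue()
    for slot in range(num_slots):
        free_slots.put(slot)

    workers = [mp.Process(target=rollout_worker, args=(i, shared_obs, free_slots, ready_slots))
               for i in range(num_workers)]
    for w in workers:
        w.start()

    for _ in range(num_workers * 4):
        worker_id, slot = ready_slots.get()
        observation = shared_obs[slot]       # data is already local, nothing was serialized
        free_slots.put(slot)                 # recycle the slot

    for w in workers:
        w.join()

if __name__ == "__main__":
    main()
```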
2.1.6 Policy Lag
Policy lag is an inherent property of asynchronous RL algorithms, a discrepancy between the policy
that collected the experience (behavior policy) and the target policy that is learned from it. The exis-
tence of this discrepancy conditions the off-policy training regime. Off-policy learning is known to be
hard for policy gradient methods, in which the model parameters are usually updated in the direction of
∇ log µ(a_s | x_s) q(x_s, a_s), where q(x_s, a_s) is an estimate of the policy state-action value. The bigger the pol-
icy lag, the harder it is to correctly estimate this gradient using a set of samples x_s from the behavior policy.
Empirically this gets more difficult in learning problems that involve recurrent policies, high-dimensional
observations, and complex action spaces, in which even very similar policies are unlikely to exhibit the
same performance over a long trajectory.
Policy lag in an asynchronous RL method can be caused either by acting in the environment using an
old policy, or collecting more trajectories from parallel environments in one iteration than the learner can
ingest in a single minibatch, resulting in a portion of the experience becoming off-policy by the time it is
processed. We deal with the first issue by immediately updating the model on policy workers, as soon as
new parameters become available. In Sample Factory the parameter updates are cheap because the model
is stored in shared memory. A typical update takes less than 1 ms, therefore we collect a very minimal
amount of experience with a policy that is different from the most up-to-date copy.
It is however not necessarily possible to eliminate the second cause. It is beneficial in RL to collect
training data from many environment instances in parallel. Not only does this decorrelate the experiences,
it also allows us to utilize multi-core CPUs, and with larger values for k (environments per core), take full
advantage of the double-buffered sampler. In one “iteration” of experience collection, n rollout workers,
each running k environments, will produce a total of N_iter = n × k × T samples. Since we update the
policy workers immediately after the learner step, potentially in the middle of a trajectory, this leads to
the earliest samples in trajectories lagging behind N_iter / N_batch − 1 policy updates on average, while the
newest samples have no lag.
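As a purely illustrative calculation (the numbers below are not taken from any experiment in this thesis), consider n = 12 rollout workers with k = 16 environments each, a rollout of T = 32 steps, and a minibatch of N_batch = 2048 samples:

\[
N_{\text{iter}} = n \times k \times T = 12 \times 16 \times 32 = 6144,
\qquad
\frac{N_{\text{iter}}}{N_{\text{batch}}} - 1 = \frac{6144}{2048} - 1 = 2,
\]

so the earliest samples in such an iteration would lag roughly two policy updates behind the learner, while the newest samples would not lag at all.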
One can minimize the policy lag by decreasing T or increasing the minibatch size N_batch. Both have
implications for learning. We generally want larger T, in the 2^5–2^7 range for backpropagation through
time with recurrent policies, and large minibatches may reduce sample efficiency. The optimal batch size
depends on the particular environment, and larger batches were shown to be suitable for complex problems
with noisy gradients [82].
Additionally, there are two major classes of techniques designed to cope with off-policy learning. The
first idea is to apply trust region methods [121, 123]: by staying close to the behavior policy during learning,
we improve the quality of gradient estimates obtained using samples from this policy. Another approach
is to use importance sampling to correct the targets for the value function V^π to improve the approxima-
tion of the discounted sum of rewards under the target policy [48]. IMPALA [35] introduced the V-trace
algorithm that uses truncated importance sampling weights to correct the value targets. This was shown
to improve the stability and sample-efficiency of off-policy learning.
Both methods can be applied independently, as V-trace corrects our training objective and the trust
region guards against destructive parameter updates. Thus we implemented both V-trace and PPO clipping
in Sample Factory. Whether to use these methods or not can be considered a hyperparameter choice for a
specific experiment. We find that a combination of PPO clipping and V-trace works well across tasks and
yields stable training, therefore we decided to use both methods in all experiments reported in the paper.
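For reference, the two corrections take their standard published forms, reproduced here from [35] and [123]; the exact expressions in the implementation may differ in minor details such as truncation constants:

\[
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
\quad
\rho_t = \min\Big(\bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big),
\quad
c_i = \min\Big(\bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big),
\]
\[
L^{\text{CLIP}} = \mathbb{E}_t \Big[ \min\big( r_t(\theta)\hat{A}_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t \big) \Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid x_t)}{\mu(a_t \mid x_t)}.
\]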
2.1.7 Multi-agent Learning and Self-play
Some of the most advanced recent results in deep RL have been achieved through multi-agent reinforce-
ment learning and self-play [9, 17]. Agents trained via self-play are known to exhibit higher levels of skill
than their counterparts trained in fixed scenarios [54]. As policies improve during self-play they generate
a training environment of gradually increasing complexity, naturally providing a curriculum for the agents
and allowing them to learn progressively more sophisticated skills. Complex behaviors (e.g. cooperation
and tool use) have been shown to emerge in these training scenarios [8].
There is also evidence that populations of agents training together in multi-agent environments can
avoid some failure modes experienced by regular self-play setups, such as early convergence to local optima
or overfitting. A diverse training population can expose agents to a wider set of adversarial policies and
produce more robust agents, reaching higher levels of skill in complex tasks [138, 54].
To unlock the full potential of our system we add support for multi-agent environments, as well as
training populations of agents. Sample Factory naturally extends to multi-agent and multi-policy learning.
Since the rollout workers are mere wrappers around the environment instances, they are totally agnostic
to the policies providing the actions. Therefore to add more policies to the training process we simply
spawn more policy workers and more learners to support them. On the rollout workers, for every agent
in every multi-agent environment we sample a random policy π_i from the population at the beginning of
each episode. The action requests are then routed to their corresponding policy workers using a set of
FIFO queues, one for every π_i. The population-based setup that we use in this work is explained in more
detail in Sec. 2.1.10.
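A toy, single-process sketch of this per-episode policy assignment and routing is shown below; the class and method names are invented, and the real system uses separate processes with shared-memory queues rather than in-process queues.

```python
import random
from queue import Queue

class MultiPolicyRouter:
    """Assigns a random policy to each agent at episode start and routes its
    action requests to the corresponding per-policy FIFO queue (illustrative)."""

    def __init__(self, num_policies):
        self.num_policies = num_policies
        self.request_queues = [Queue() for _ in range(num_policies)]
        self.agent_to_policy = {}

    def on_episode_start(self, env_id, agent_ids):
        # Sample pi_i independently for every agent in the multi-agent episode.
        for agent_id in agent_ids:
            self.agent_to_policy[(env_id, agent_id)] = random.randrange(self.num_policies)

    def request_action(self, env_id, agent_id, observation):
        # Route the request to the queue of the policy assigned to this agent.
        policy_id = self.agent_to_policy[(env_id, agent_id)]
        self.request_queues[policy_id].put((env_id, agent_id, observation))
```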
2.1.8 Experiments: Computational Performance
[Figure 2.3 plots: Atari, VizDoom, and DMLab throughput on System #1 (top row) and System #2 (bottom row). X-axis: number of environments; y-axis: FPS at frameskip 4. Methods: SampleFactory APPO, SeedRL V-trace, RLlib IMPALA, DeepMind IMPALA, rlpyt PPO.]
Figure 2.3: Training throughput, measured in environment transitions generated per second (considering
4-frameskip, a quarter of all transitions is observed and processed by the agent).
Since increasing throughput and reducing experiment turnaround time was the major motivation be-
hind our work, we start by investigating the computational aspects of system performance. We measure
training frame rate on two hardware systems that closely resemble commonly available hardware setups
in deep learning research labs. In our experiments, System #1 is a workstation-level PC with a 10-core
CPU and a GTX 1080 Ti GPU. System #2 is equipped with a server-class 36-core CPU and a single RTX
2080 Ti.
As our testing environments we use three simulators: Atari [15], VizDoom [60], and DeepMind Lab [12].
While the Atari Learning Environment is a collection of 2D pixel-based arcade games, VizDoom and DM-
Lab are based on the rendering engines of immersive 3D first-person games, Doom and Quake III. Both
VizDoom and DMLab feature first-person perspective, high-dimensional pixel observations, and rich con-
figurable training scenarios. For our throughput measurements in Atari we used the game Breakout, with
15
grayscale frames in84× 84 resolution and 4-framestack. In VizDoom we chose the environmentBattle de-
scribed in section 2.1.10, with the observation resolution of128× 72× 3. Finally, for DeepMind Lab we used
the environment rooms_collect_good_objects from DMLab-30, also referred to as seekavoid_arena_01 [35].
The resolution for DeepMind Lab is kept at standard96× 72× 3. We follow the original implementation
of IMPALA and use a CPU-based software renderer for Lab environments. We noticed that higher frame
rate can be achieved when using GPUs for environment rendering, especially on System#1. The reported
throughput is measured in simulated environment steps per second, and in all three testing scenarios we
used traditional 4-frameskip, where the RL algorithm receives a training sample every 4 simulation steps.
We compare performance of Sample Factory to other high-throughput policy gradient methods. Our
first baseline is an original version of the IMPALA algorithm [35]. The second baseline is IMPALA imple-
mented in RLlib [72], a high-performance distributed RL framework. Third is a recent evolution of IMPALA
from DeepMind, SeedRL [34]. Our final comparison is against a version of PPO with asynchronous sam-
pling from the rlpyt framework [131], one of the fastest open-source RL implementations. We use the
same model architecture for all methods, a ConvNet with three convolutional layers, an RNN core, and
two fully-connected heads for the actor and the critic.
Figure 2.3 illustrates the training throughput in different configurations averaged over five minutes of
continuous training to account for performance fluctuations caused by episode resets and other factors.
Aside from showing the experience collection rate we also demonstrate how the performance scales with
the increased number of environments sampled in parallel.
Sample Factory outperforms the baseline methods in most of the training scenarios. Rlpyt [131] and
SeedRL [34] follow closely, matching Sample Factory performance in some configurations with a small
number of environments. Both IMPALA implementations fail to efficiently utilize the resources in a single-
machine deployment: they hit performance bottlenecks related to data serialization and transfer. Addi-
tionally, their higher per-actor memory usage did not allow us to sample as many environments in parallel.
We omitted data points for configurations that failed due to lack of memory or other resources.
[Figure 2.4 plots: average return on VizDoom Find My Way Home and Defend the Center, plotted against environment frames (skip=4, up to 100M) in the top row and against training time in minutes in the bottom row. Legend: SampleFactory APPO, SeedRL V-trace.]
Figure 2.4: Direct comparison of wall-time performance. We show the mean and standard deviation of
four training runs for each experiment.
Figure 2.4 demonstrates how the system throughput translates into raw wall-time training perfor-
mance. Sample Factory and SeedRL implement similar asynchronous architectures and demonstrate very
close sample efficiency with equivalent sets of hyperparameters. We are therefore able to compare the
training time directly. We trained agents on two standard VizDoom environments. The plots demonstrate
a 4x advantage of Sample Factory over the state-of-the-art baseline. Note that direct fair comparison with
the fastest baseline, rlpyt, is not possible since it does not implement asynchronous training. In rlpyt the
learner waits for all workers to finish their rollouts before each iteration of SGD, therefore increasing the
number of sampled environments also increases the training batch size, which significantly affects sam-
ple efficiency. This is not the case for SeedRL and Sample Factory, where a fixed batch size can be used
regardless of the number of environments simulated.
                      Atari, FPS       VizDoom, FPS      DMLab, FPS
Pure simulation       181740 (100%)    322907 (100%)     49679 (100%)
DeepMind IMPALA         9961 (5.3%)     10708 (3.3%)      8782 (17.7%)
RLlib IMPALA           22440 (12.3%)    12391 (3.8%)     13932 (28.0%)
SeedRL V-trace         39726 (21.9%)    34428 (10.7%)    34773 (70.0%)
rlpyt PPO              68880 (37.9%)    73544 (22.8%)    32948 (66.3%)
SampleFactory APPO    135893 (74.8%)   146551 (45.4%)    42149 (84.8%)
Table 2.1: Peak throughput of various RL algorithms on System #2, in environment frames per second and
as a percentage of the optimal frame rate.
Finally, we also analyzed the theoretical limits of RL training throughput. By stripping away all compu-
tationally expensive workloads from our system we can benchmark a bare-bones sampler that just executes
a random policy in the environment as quickly as possible. The framerate of this sampler gives us an upper
bound on training performance, emulating an ideal RL algorithm with infinitely fast action generation and
learning. Table 2.1 shows that Sample Factory gets significantly closer to this ideal performance than the
baselines. This experiment also shows that further optimization may be possible. For VizDoom, for exam-
ple, the sampling rate is so high that the learner loop completely saturates the GPU even with relatively
shallow models. Therefore performance can be further improved by using multiple GPUs in data-parallel
mode, or, alternatively, we can train small populations of agents, with learner processes of different policies
spread across GPUs (see Sec. 2.1.11).
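For reference, the bare-bones sampler used for this measurement is conceptually very simple; a schematic single-process version is sketched below (the actual benchmark runs many such loops in parallel worker processes, and the function name is illustrative).

```python
import time

def measure_env_fps(make_env, num_steps=100_000):
    """Upper bound on throughput: step the environment with a random policy, no inference or learning."""
    env = make_env()
    env.reset()
    start = time.time()
    for _ in range(num_steps):
        _, _, done, _ = env.step(env.action_space.sample())  # classic Gym step API
        if done:
            env.reset()
    return num_steps / (time.time() - start)  # simulated environment steps per second
```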
2.1.9 DMLab-30 Experiment
IMPALA [35] showed that with sufficient computational power it is possible to move beyond single-task
RL and train one agent to solve a set of 30 diverse pixel-based environments at once. Large-scale multi-
task training can facilitate the emergence of complex behaviors, which motivates further investment in
this research direction. To demonstrate the efficiency and flexibility of Sample Factory we use our system
to train a small population of four agents on DMLab-30 (Figure 2.5). While the original implementation
[Figure 2.5 plot: mean capped normalized score (%) on DMLab-30 vs. environment frames (skip=4, up to 1e10), showing the population mean, the population best, and the DeepMind IMPALA reference.]
Figure 2.5: Mean capped human-normalized training score [35] for a single-machine DMLab-30 PBT run
with Sample Factory. (Compared to cluster-scale IMPALA deployment.)
relied on a distributed multi-server setup, our agents were trained on a single 36-core 4-GPU machine and
the final performance matched the reported score of the cluster-scale IMPALA deployment.
2.1.10 VizDoom Experiments
[Figure 2.6 plots: average return vs. environment frames (skip=4, up to 400M) for six standard VizDoom scenarios — Find My Way Home, Deadly Corridor, Defend the Center, Health Gathering, Health Gathering Supreme, and Defend the Line — comparing SampleFactory and A2C.]
Figure 2.6: Training curves for standard VizDoom scenarios. We show the mean and standard deviation
for ten independent experiments conducted for each scenario.
We further use Sample Factory to train agents on a set of VizDoom environments. VizDoom provides
challenging scenarios with very high potential skill cap. It supports rapid experience collection at fairly
high input resolution. With Sample Factory, we can train agents on billions of environment transitions in
a matter of hours (see Fig. 2.3). Despite substantial effort put into improving VizDoom agents, including
several years of AI competitions, the best reported agents are still far from reaching expert-level human
performance [143]. This means there is still room for improvement, and we are highly motivated to set
new benchmarks in this challenging domain.
We start by examining agent performance in a set of basic environments included in the VizDoom
distribution (Figure 2.6). Our algorithm matches or exceeds the performance reported in prior work on the
majority of these tasks [14]. Note that we were able to collect 10 data points for each environment, which
required simulating a total of $3 \times 10^{10}$ environment transitions (27 years of gameplay) on a single server.
[Figure 2.7 plots: kills per episode vs. environment frames (skip=4) for Battle and Battle2, comparing SampleFactory, DFP, and DFP+CV.]
Figure 2.7: VizDoom Battle/Battle2 experiments. We show the mean and standard deviation for four inde-
pendent runs. Here as baselines we provide scores reported for Direct Future Prediction (DFP) [32], and
a version of DFP with additional input modalities such as depth and segmentation masks, produced by a
computer vision subsystem [148]. The latter work only reports results for Battle.
We then investigate the performance of Sample Factory agents in four advanced single-player game
modes: Battle, Battle2, Duel, and Deathmatch. In Battle and Battle2, the goal of the agent is to defeat
adversaries in an enclosed maze while maintaining health and ammunition. The maze in Battle2 is a lot
more complex, with monsters and healthpacks harder to find. The action set in the battle scenarios includes
five independent discrete action heads for moving, aiming, strafing, shooting, and sprinting. As shown in
Fig. 2.7, our final scores on these environments significantly exceed those reported in prior work [32, 148].
We also introduce two new environments, Duel and Deathmatch, based on popular large multiplayer
maps often chosen for competitive matches between human players. Single-player versions of these envi-
ronments include scripted in-game opponents (bots) and can thus emulate a full Doom multiplayer game-
play while retaining high single-player simulation speed. We used in-game opponents that are included in
standard Doom distributions. These bots are programmed by hand and have full access to the environment
state, unlike our agents, which only receive pixel observations and auxiliary info such as the current levels
of health and ammunition.
[Figure 2.8 plots: kills per episode vs. environment frames (skip=2, up to 2.5e9) for Deathmatch vs bots and Duel vs bots. Legend: population mean, population best, avg. scripted bot, best scripted bot.]
Figure 2.8: Populations of 8 agents trained in Deathmatch and Duel scenarios. On the y-axis we report
the average number of adversaries defeated in a 4-minute match. Shown are the means and standard
deviations within the population, as well as the performance of the best agent.
For Duel and Deathmatch we extend the action space to also include weapon switching and object
interaction, which allows the agent to open doors and call elevators. The augmented action space fully
replicates a set of controls available to a human player. This brings the total number of possible actions to
$\sim 1.2 \times 10^{4}$, which makes the policies significantly more complex than those typically used for Atari or
DMLab. We find that better results can be achieved in these environments when we repeat actions for two
consecutive frames instead of the traditional four [15], allowing the agents to develop precise movement
and aim. In Duel and Deathmatch experiments we use a 36-core PC with four GPUs to harness the full
power of Sample Factory and train a population of 8 agents with population-based training. The final
agents beat the in-game bots on the highest difficulty in 100% of the matches in both environments. In
Deathmatch our agents defeat scripted opponents with an average score of 80.5 versus 12.6. In Duel the
average score is 34.7 to 3.6 frags per episode (Figure 2.8).
2.1.11 Self-play Experiments
Using the networking capabilities of VizDoom we created a Gym interface [19] for full multiplayer versions
of Duel and Deathmatch environments. In our implementation we start a separate environment instance
for every participating agent, after which these environments establish network connections using UDP
sockets. The simulation proceeds one step at a time, synchronizing the state between the game instances
Figure 2.9: Behavior of an agent trained via self-play. Left: Agents tend to choose the chaingun to shoot at
their opponents from longer distance. Right: Agent opening a secret door to get a more powerful weapon.
connected to the same match through local networking. This environment allows us to evaluate the ulti-
mate configuration of Sample Factory, which includes both multi-agent and population-based training.
We use this configuration to train a population of eight agents playing against each other in 1v1
matches in a Duel environment, using a setup similar to the “For The Win” (FTW) agent described in [54].
As in scenarios with scripted opponents, within one episode our agents optimize environment reward
based on game score and in-game events, including positive reinforcement for scoring a kill or picking up
a new weapon and penalties for dying or losing armor. The agents are meta-optimized through hyperpa-
rameter search via population-based training. The meta-objective in the self-play case is simply winning,
with a reward of +1 for outscoring the opponent and 0 for any other outcome. This is different from our
experiments with scripted opponents, where the final objective for PBT was based on the total number of
kills, because agents quickly learned to win 100% of the matches against scripted bots and simple binary
reward was not a useful objective anymore.
During population-based training we randomly mutate the bottom 70% of the population every $5 \times 10^{6}$
environment frames, altering hyperparameters such as learning rate, entropy coefficient, and reward
weights. If the win rate of the policy is less than half of the best-performing agent’s win rate, we sim-
ply copy the model weights and hyperparameters from the best agent to the underperforming agent and
continue training.
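A simplified sketch of this exploit-and-mutate step is shown below; the data structures and function names are illustrative, and the policies are assumed to be PyTorch modules.

```python
import copy
import random

def pbt_update(population):
    """population: list of dicts {'policy': torch.nn.Module, 'hyperparams': dict, 'win_rate': float}."""
    ranked = sorted(population, key=lambda p: p['win_rate'], reverse=True)
    best = ranked[0]
    num_bottom = int(0.7 * len(ranked))

    for rank, agent in enumerate(ranked):
        if rank >= len(ranked) - num_bottom:
            mutate(agent['hyperparams'])  # bottom 70%: perturb lr, entropy coef, reward weights, ...
        if agent['win_rate'] < 0.5 * best['win_rate']:
            # Underperformer: copy weights and hyperparameters from the best agent, keep training.
            agent['policy'].load_state_dict(best['policy'].state_dict())
            agent['hyperparams'] = copy.deepcopy(best['hyperparams'])

def mutate(hyperparams, factor=1.2):
    """Illustrative perturbation: scale every (numeric) hyperparameter up or down at random."""
    for key in hyperparams:
        hyperparams[key] *= random.choice((factor, 1.0 / factor))
```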
As in our experiments with scripted opponents, each of the eight agents was trained for $2.5 \times 10^{9}$
environment frames on a single 36-core 4-GPU server, with the whole population consuming ∼18 years of
simulated experience. We observe that despite a relatively small population size, a diverse set of strategies
emerges. We then simulated 100 matches between the self-play (FTW) agent and the agent trained against
scripted bots, selecting the agent with the highest score from both populations. The results were 78 wins for
the self-play agent, 3 losses, and 19 ties. This demonstrates that population-based training resulted in more
robust policies (Figure 2.9), while the agent trained against bots ultimately overfitted to a single opponent
type. Video recordings of our agents can be found at https://sites.google.com/view/sample-factory.
2.1.12 Sample Factory: Discussion
In the chapters above we discussed an efficient high-throughput reinforcement learning architecture that
can process more than $10^{5}$ environment frames per second on a single machine. The main goal when
creating this system was to democratize deep RL and make it possible to train whole populations of agents
on billions of environment transitions using widely available commodity hardware. We believe this is an
important area of research, as it can benefit any project that leverages model-free RL. With our system
architecture, researchers can iterate on their ideas faster, thus accelerating progress in the field.
We also want to point out that maximizing training efficiency on a single machine is equally important
for distributed systems. In fact, Sample Factory can be used as a single node in a distributed setup, where
each machine has a sampler and a learner. The learner computes gradients based on locally collected
experience only, and learners on multiple nodes can then synchronize their parameter updates after every
training iteration, akin to DD-PPO [140].
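A minimal sketch of such a synchronization step using PyTorch's distributed primitives is shown below (one learner process per node, gradients averaged with an all-reduce after the local backward pass). This mirrors the general idea rather than the exact DD-PPO implementation.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across all learner processes after the local backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical use inside the training loop (process group initialized elsewhere):
#   loss.backward()
#   allreduce_gradients(model)
#   optimizer.step()
```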
We showed the potential of our architecture by training highly capable agents for a multiplayer con-
figuration of the immersive 3D game Doom. We chose the most challenging scenario that exists in first-
person shooter games – a duel. Unlike multiplayer deathmatch, which tends to be chaotic, the duel mode
requires strategic reasoning, positioning, and spatial awareness. Despite the fact that our agents were able
to convincingly defeat scripted in-game bots of the highest difficulty, they are not yet at the level of expert
human players. One of the advantages human players have in a duel is the ability to perceive sound. An
expert human player can hear the sounds produced by the opponent (ammo pickups, shots fired, etc.) and
can integrate these signals to determine the opponent’s position (learning with multiple sensory systems
with high-throughput RL is further investigated in [49]). Recent work showed that RL agents can beat
humans in pixel-based 3D games in limited scenarios [54], but the task of defeating expert competitors in
a full game, as played by humans, requires additional research.
2.2 Sample Factory 2.0
Sample Factory (Sec. 2.1) focuses on a particular kind of learning system: a heterogeneous design where
simulation is performed on the CPU and inference/backpropagation is done on GPU. This motivates the
asynchronous execution protocol: running components in separate processes allows both CPU and GPU
to be used at the same time which maximizes hardware utilization and overall throughput.
Asynchronous learning is not without its disadvantages: it introduces policy lag, which can affect sample
efficiency of the RL method. Besides, with the new types of accelerated environments [107, 79, 38]
(see Chapter 3) we can keep computational landscape completely homogeneous, i.e. run all calculations
entirely on the GPU. Thus the optimal mode of operation for an RL algorithm can actually become syn-
chronous: we just chain all calculations (simulation, inference, backpropagation) on the same accelerator,
with no downtime, delays, or policy lag introduced by asynchronicity.
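Schematically, such a fully synchronous GPU-side loop simply alternates batched simulation, batched inference, and backpropagation on the same device. The sketch below assumes a hypothetical vectorized environment that exposes observations, rewards, and done flags as CUDA tensors, and a hypothetical compute_loss helper; it is not the API of any specific simulator.

```python
import torch

def synchronous_training_loop(env, policy, optimizer, compute_loss, rollout_len=32, num_iterations=1000):
    """Simulation, inference, and backpropagation all run on the same GPU; no policy lag."""
    obs = env.reset()                                     # [num_envs, obs_dim] CUDA tensor
    for _ in range(num_iterations):
        trajectory = []
        for _ in range(rollout_len):
            with torch.no_grad():
                actions = policy(obs)                     # batched inference for all envs at once
            next_obs, rewards, dones = env.step(actions)  # batched, entirely GPU-side simulation
            trajectory.append((obs, actions, rewards, dones))
            obs = next_obs
        loss = compute_loss(policy, trajectory)           # e.g. a PPO surrogate objective
        optimizer.zero_grad()
        loss.backward()                                   # learning on the same device, no data transfer
        optimizer.step()
```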
We can thus distinguish two different types of learning systems:
• Type1: Heterogeneous, with CPU-based scalar simulation (simulating a single scene/environment at
a time) and GPU-accelerated learning. This is a typical "classic" RL scenario, e.g. when the environ-
ment is derived from an existing game engine [15, 60].
• Type 2: Homogeneous, with vectorized GPU-accelerated simulation and learning.
Type 1 systems work best with an asynchronous algorithm design, while Type 2 systems can reap the
benefits of simple synchronous learning, such as reduced policy lag. While there are RL libraries that
implement efficient asynchronous [131, 72, 34] and GPU-accelerated synchronous methods [78], we would
like to have a codebase that supports both.
We extend Sample Factory to support seamless switching between synchronous and asynchronous
learning, accelerating both Type 1 and Type 2 systems. This new version Sample Factory 2.0 [115] is
available as an open-source codebase with the following features:
• Asynchronous regime for heterogeneous learning systems with CPU-based simulators.
• We also support a regime where experience collection within the rollout is asynchronous, but learn-
ing and sampling are never done in parallel. In many tasks this mode can strike the perfect balance
between sample efficiency and throughput.
• Efficient, entirely GPU-side synchronous regime for homogeneous learning systems with vectorized
simulators.
• Seamless switching between parallel and serial execution for easy debugging. GPU-accelerated syn-
chronous RL can run in a single process without any loss of performance.
• Various improvements in the learning algorithm: added observation and return normalizations,
adaptive learning rate, value bootstrapping, etc.
• Single- & multi-agent training, self-play, supports training multiple policies at once on one or many
GPUs.
• Population-Based Training [55].
• Discrete, continuous, hybrid action spaces.
• Vector-based, image-based, dictionary observation spaces.
• Automatically creates a model architecture by parsing action/observation space specification. Sup-
ports custom model architectures.
• Library is designed to be imported into other projects; custom environments are first-class citizens (see the usage sketch after this list).
• Detailed WandB and Tensorboard summaries, custom metrics.
• HuggingFace integration (upload trained models and metrics to the Hub)
• Customizable, detailed documentation, example environment integrations, CI, etc. See more details
at https://samplefactory.dev
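As an illustration of the intended workflow, a custom Gym-compatible environment can be registered and trained in a few lines. The sketch below is modeled on the example integrations in the documentation; the exact module paths and helper names (register_env, parse_sf_args, parse_full_cfg, run_rl) should be treated as assumptions and checked against the current docs at samplefactory.dev.

```python
import gym

# Module paths follow the documented example integrations and may differ between releases.
from sample_factory.envs.env_utils import register_env
from sample_factory.cfg.arguments import parse_sf_args, parse_full_cfg
from sample_factory.train import run_rl

def make_my_env(full_env_name, cfg=None, env_config=None, render_mode=None):
    # Any Gym-compatible environment can be returned here.
    return gym.make("CartPole-v1", render_mode=render_mode)

def main():
    register_env("my_custom_env_v1", make_my_env)
    parser, _ = parse_sf_args(evaluation=False)   # standard Sample Factory command-line arguments
    cfg = parse_full_cfg(parser)
    return run_rl(cfg)                            # launches training in the configured regime

if __name__ == "__main__":
    main()
```

The environment would then be selected on the command line (e.g. --env=my_custom_env_v1), together with the execution regime and the usual training hyperparameters.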
By releasing Sample Factory 2.0 we hope to accelerate the progress in deep reinforcement learning by
giving a broad community of researches access to exciting large-scale RL experiments on limited hardware,
regardless of their chosen simulation platform.
Chapter 3
High-Throughput Simulation
In this chapter we focus on the simulation part of the sim-to-real learning pipeline. A traditional
approach to simulation involves running scalar simulation engines (typically repurposed game engines
or physics engines), meaning one instance of the engine simulates a single stream of experience for a
single agent or a few agents at a time. This approach optimizes the computational task of simulation for
human consumption, as humans are typically interacting with a single simulated world at a time. An
alternative approach isvectorized simulation. In a vectorized simulation engine hundreds or thousands of
virtual scenes and agents can be simulated at any given time, taking full advantage of the parallel nature
of modern compute hardware.
3.1 Megaverse: High-Throughput Simulation Using Large Batch Rendering
While efficient RL architectures enable large-scale learning on limited hardware, the throughput of the
whole training system is ultimately limited by the speed at which the experience can be generated. Al-
though certain environments, such as VizDoom [60], allow us to train agents at more than 140,000 FPS, or
Parts of this chapter appeared in:
• Aleksei Petrenko, Erik Wijmans, Brennan Shacklett, and Vladlen Koltun. Megaverse: Simulating Embodied Agents at
One Million Experiences per Second. In Proceedings of the 38th International Conference on Machine Learning (ICML),
2021.
35,000 observations per second when adjusted for 4-frameskip (see Fig. 2.3), such high-throughput training
requires expensive server-grade CPUs since the rendering is not GPU-accelerated. Besides, VizDoom does
not support full 3D rasterization: graphics in Doom-derived games is considered 2.5D as it relies on 2D
sprites and ray casting and cannot handle vertical camera rotation, or parts of the environment that are
on top of one another. Simulators with full 3D rendering are preferred when studying embodied intelli-
gence as they are conceptually much closer to the real world, but even the fastest environments used in
RL research, such as DeepMind Lab [12], will limit the overall throughput of a learning system. Tab. 2.1
demonstrates that throughput of Sample Factory when training agents in DeepMind Lab is almost entirely
determined by the speed of the simulator.
The majority of RL environments used in contemporary Deep RL research with high-dimensional
observations are based on repurposed game engines [15, 60, 12]. While reusing existing graphics pipelines
reduces development costs, it can be a suboptimal basis for RL environments.
Game engines have been designed and optimized to render a single complex scene at resolutions and
framerates tuned for human perception. The simulation of game physics proceeds in very small steps such
that the resulting motion and animations appear smooth when the end user interacts with the game world
in real time. Environments for RL research, however, have drastically different requirements. State-of-the-
art learning algorithms typically consume large amounts of low-resolution observations rendered much
faster than real time. Turning a game engine into a high-throughput simulator for RL thus requires using
OS-level parallelism, where individual game instances are run at the same time in parallel processes. This
prevents the engines from efficiently sharing resources, increases memory consumption, and ultimately
fails to fully utilize the throughput of modern hardware accelerators. In addition, due to the high granu-
larity of physics simulation, researchers often use techniques like frame-skipping [15] to simplify credit
assignment, thereby wasting computation on synthesizing views that are never observed by agents.
We argue that a purpose-built rendering and simulation engine for RL research can eliminate inef-
ficiencies of the existing environments and therefore accelerate research in reinforcement learning and
artificial intelligence. We thus present Megaverse, a new platform for embodied AI. Built from scratch as a
lightweight simulator, it leverages batched rendering [125] to fully utilize the throughput of modern GPUs
and render up to $1.1 \times 10^{6}$ observations per second on a single 8-GPU node at 128×72×3 resolution. By
default we simulate physics at a lower frequency compared to traditional 3D engines, which eliminates the
need for frame-skipping and thus enables an experience collection rate up to ∼20x faster than Atari and
∼70x faster than DeepMind Lab on hardware commonly found in deep learning research labs.
Since Megaverse was designed to render an arbitrary number of viewpoints and scenes in parallel, it
naturally supports multi-agent simulation. While traditional RL engines either do not support multi-agent
training [15] or require a slow network setup to enable it [60], Megaverse can simulate the experiences of
dozens of agents in the same environment without loss of performance, in both collaborative and compet-
itive (self-play) settings.
We introduce Megaverse as a simulation platform that can be used to create virtual worlds for deep RL
and embodied AI research. We also implement eight environments built on top of Megaverse that cover
an array of embodied cognitive tasks and prove to be hard for modern RL algorithms. Our benchmark,
Megaverse-8, addresses challenges such as navigation, exploration, and memory. All of the environments
are procedurally generated, and can therefore be used to investigate generalization of trained agents [27].
Many of the challenges require our agents to learn nontrivial physics-based environment manipulation,
which has previously been a feature of resource-intensive high-fidelity simulators [64].
A key goal of Megaverse is to democratize deep RL research. State-of-the-art results in RL have primar-
ily been a prerogative of large research labs with access to vast computational resources. A fast and efficient
simulation engine that supports interactive immersive environments that call for advanced embodied cog-
nition can enable rapid community-wide experimentation and iteration, thus accelerating progress in the
field.
3.1.1 Prior Work: Simulation for Reinforcement Learning
The first artificial agents were typically confined to miniature grid worlds or board games [133]. These
relatively simple environments were nevertheless a challenge for early intelligent systems. They allowed
researchers to hone the foundations of reinforcement learning theory and general-purpose learning algo-
rithms [132].
With the advent of powerful function approximators, researchers turned their attention to 2D computer
games as a new challenge for artificial agents. The DQN algorithm [87] demonstrated the ability to learn
directly from high-dimensional pixel observations, matching or exceeding human-level performance on
multiple Atari games.
Rapid progress on Atari-like benchmarks led to a phase transition: a new generation of AI research
platforms brought immersive simulators. In contrast to the flatland of arcade games, the real world is
immersive and 3-dimensional. In order to successfully operate in the real world, artificial agents have to
master skills such as spatiotemporal reasoning and object manipulation. Simulators derived from first-
person 3-dimensional video games, such as DeepMind Lab [12], and MineRL [46] were among the first to
offer virtual embodiment and egocentric perception.
Immersive simulators vary in their fidelity and throughput. Advanced simulation platforms built on
top of modern 3D engines, such as Unity [64] or Unreal [33], trade simulation speed for high-fidelity
graphics. These are useful in studying sim-to-real transfer and perceptual aspects of embodied AI, but end-
to-end learning of non-trivial skills and behaviors in these environments requires massive computational
resources [140].
Other efforts focus on increasing the behavioral complexity of simulated worlds. Interactions arising
in such environments allow researchers to study non-trivial behaviors and strategies learned by artificial
agents. Environments such as Dota 2 [17] and StarCraft II [138] are complex modern strategy games, and
playing these games at a human level requires advanced long-term planning. However, these games are
not fully immersive: they provide only a structured top-down view of the environment. The OpenAI Hide-
and-Seek project [8] investigates sophisticated behaviors emerging in an environment with full physics
simulation. Unlike the strategy games mentioned above, this environment simulates egocentric perception,
although only 1D Lidar-like sensing is supported.
Capture the Flag [54], which is based on the Quake III engine, demonstrated great potential of rein-
forcement learning in immersive environments, but the project relied on complex closed-source compute
infrastructure. With Megaverse we aim to build a platform that enables the exploration of advanced em-
bodied cognition at or beyond the level of Capture the Flag and Hide-and-Seek, with full physics simulation
and high-dimensional pixel observations, without requiring tens or hundreds of servers for experience col-
lection.
Megaverse is not the first initiative that aims at building an RL simulator from scratch. For exam-
ple, MINOS [118] and its successor Habitat [119] use a purpose-built rendering engine and support high-
resolution textured scenes based on 3D scans of real environments [20]. These and other environments can
be used as testbeds for embodied challenges such as navigation [4] and rearrangement [10]. Most related
to our work is Shacklett et al. [125] who demonstrate that considerable speedups can be gained via batched
simulation, synthesizing the observations in many environments simultaneously. Whereas Shacklett et al.
[125] only examined very simplistic physics (simple collisions with static geometry), with a single agent
per environment, we apply batched simulation to multi-agent environments with complex interactivity.
The idea of batched GPU-accelerated simulation has also been applied in the context of continuous
control such as robot locomotion and dexterous manipulation [73]. Perhaps a combination of batched
Figure 3.1: Megaverse parallel architecture in the context of a reinforcement learning system. Synchronous
or asynchronous training sequentially interacts with K ≥ 1 instances of Megaverse, each hosting N ≥ 1
environments with M ≥ 1 agents. Megaverse parallelizes physics computations on the CPU, after which
the entire vector of observations is rendered in a single pass.
rendering and batched physics can pave the way for a new generation of fast high-fidelity simulators that
will become a core part of future AI research infrastructure.
3.1.2 Megaverse Overview
Megaverse is a purpose-built simulation platform for embodied AI research. Our engine can simulate
fully immersive 3D worlds with multiple agents interacting with each other and manipulating physical
objects, at more than $10^{6}$ experiences per second on a single node. The agents perceive the world through
high-dimensional observations, rendered with dynamic shading and simulated lighting.
We introduce a number of performance optimizations that are instrumental in unlocking this perfor-
mance regime. Our discretized physics approach (Section 3.1.3) allows us to streamline the computation
of non-trivial physical interactions between simulated objects. The parallel architecture built around the
batched Vulkan renderer [125] (Section 3.1.4) enables the production of hundreds of observations in a
single pass, drastically reducing the required amount of communication between hardware components.
Another algorithmic optimization is 3D geometry simplification (Section 3.1.5). The following chapters
describe these architectural choices and optimizations in more detail.
3.1.3 Discretized Continuous Physics
Full simulation of physical contact and collisions between dozens of objects can be computationally expen-
sive. Most physics engines require a small simulation step to avoid unrealistic interpenetration of objects
caused by movement interpolation. Megaverse works around this by discretizing some of the physical in-
teractions. Even though movement and collision checking for agents are fully continuous, we simplify the
simulation of more complex interactions, such as object stacking. This enables non-trivial object manipu-
lation, e.g. building staircases and bridges using physical objects, or moving objects around while solving
rearrangement tasks without prohibitively expensive high-frequency simulation of full contact forces.
Specifically, we use a voxel grid data structure: the agents are free to pick up objects anywhere and
move them continuously, but placement of objects is only allowed at discretized locations in space, similar
to MineRL [46]. This has two major benefits. First, proximity checks and other spatial queries become
trivialO(1) operations. More importantly, collision checking even with hundreds of objects also becomes
extremely fast. We take advantage of caching mechanisms based on axis-aligned bounding boxes imple-
mented in the Bullet physics engine [28]. Since most of the interactive objects reside in axis-aligned voxels,
a simple check based on bounding box intersections eliminates the vast majority of potential collision can-
didates.
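The underlying idea can be illustrated with a small sketch: a hash map keyed by integer voxel coordinates makes placement and proximity queries constant-time. This is a schematic Python illustration, not the engine's actual C++ data structure.

```python
VOXEL_SIZE = 1.0

class VoxelGrid:
    """Objects move continuously, but placement happens at discrete voxel coordinates."""

    def __init__(self):
        self.occupied = {}  # (i, j, k) integer voxel coords -> object id

    def voxel_of(self, position):
        x, y, z = position
        return (int(x // VOXEL_SIZE), int(y // VOXEL_SIZE), int(z // VOXEL_SIZE))

    def can_place(self, position):
        return self.voxel_of(position) not in self.occupied  # O(1) placement query

    def place(self, obj_id, position):
        self.occupied[self.voxel_of(position)] = obj_id

    def neighbors(self, position):
        """Constant-time proximity check: objects in the 3x3x3 block of voxels around a position."""
        i, j, k = self.voxel_of(position)
        return [self.occupied[(i + di, j + dj, k + dk)]
                for di in (-1, 0, 1) for dj in (-1, 0, 1) for dk in (-1, 0, 1)
                if (i + di, j + dj, k + dk) in self.occupied]
```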
3.1.4 Large Batch Simulation and Rasterization
One of the key design features of our implementation is an efficient parallel architecture (Fig. 3.1). A
single instance of the traditional game-based RL environment can simulate only a single virtual world
and render a single observation per simulation step. One instance of Megaverse can advance hundreds of
environments, each containing multiple agents, in parallel. This allows our engine to take full advantage
of the massive parallelism of modern computing platforms.
Formally, for environment $i$ our system maintains an internal state $s_i^{(t)} = [s_{(i,1)}^{(t)}, \dots, s_{(i,N_i^{(t)})}^{(t)}]$, where
$s_{(i,n)}^{(t)} \in \mathrm{Sim}(3)$∗ and $N_i^{(t)}$ is the number of entities in the environment $i$ at time step $t$ (i.e. interactive objects,
walls, agent bodies, cameras, etc.). Then, given a tensor of actions $A^{(t)} \in \mathcal{A}^{\mathrm{NumEnvs} \times \mathrm{NumAgents}}$, where $\mathcal{A}$
is the set of all available actions, we update the internal state to produce $S^{(t+1)} = [s_1^{(t+1)}, \dots, s_{\mathrm{NumEnvs}}^{(t+1)}]$.
Taken together, the dynamics can be summarized as follows:

$$S^{(t+1)} = \mathrm{Physics}(S^{(t)}, A^{(t)}) \qquad (3.1)$$
$$O^{(t+1)} = \mathrm{Render}(S^{(t+1)}) \qquad (3.2)$$

where $O^{(t+1)}$ is the rendered observations, while Physics and Render are batched modules that are responsible
for their own parallelization. See Fig. 3.1 for a visual depiction.
Physics simulation is parallelized on the CPU by scheduling state updates for individual environments
on a thread pool with a configurable number of threads. In order to parallelize the rasterization step we
adopt the optimized batched rendering pipeline proposed by Shacklett et al. [125]. This technique takes ad-
vantage of the fact that modern GPUs excel at rendering relatively small numbers of ultra-high-resolution
images. The renderer bundles together all the rendering commands corresponding to individual agents and
makes a single request to the GPU to render all the observations. This massively cuts down the required
amount of communication between CPU and GPU, and helps the renderer achieve high GPU utilization.
The impact of this technique on sampling performance is investigated in Tab. 3.2. For compatibility with
existing training systems, by default we transfer the resulting rendered images to CPU memory, although
it is also possible to expose them directly as PyTorch GPU-side tensors [101].
∗ The Lie group of all rotations, translations, and scalings in 3D.
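From the point of view of a training system, the whole simulator is a single batched step call. The sketch below illustrates the tensor shapes involved; it mirrors the structure of Eq. 3.1–3.2 and Fig. 3.1 and is not the actual Megaverse Python binding.

```python
import numpy as np

class BatchedSim:
    """Illustrative batched simulator: N environments with M agents each, stepped in lockstep."""

    def __init__(self, num_envs, num_agents, obs_shape=(72, 128, 3)):
        self.num_envs, self.num_agents, self.obs_shape = num_envs, num_agents, obs_shape

    def step(self, actions):
        # actions: integer array of shape [num_envs, num_agents]
        assert actions.shape == (self.num_envs, self.num_agents)
        self._physics(actions)         # S(t+1) = Physics(S(t), A(t)): parallel on a CPU thread pool
        observations = self._render()  # O(t+1) = Render(S(t+1)): one batched GPU pass for all agents
        rewards = np.zeros((self.num_envs, self.num_agents), dtype=np.float32)
        dones = np.zeros((self.num_envs, self.num_agents), dtype=bool)
        return observations, rewards, dones

    def _physics(self, actions):
        pass  # placeholder: advance every environment's state

    def _render(self):
        # One observation per agent: [num_envs, num_agents, H, W, C].
        return np.zeros((self.num_envs, self.num_agents) + self.obs_shape, dtype=np.uint8)
```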
Figure 3.2: After procedurally generating the environment layout, we minimize the number of primitives
by coalescing adjacent voxels where possible.
3.1.5 3D Geometry Optimization
Voxelized geometry allows us to speed up physics calculations, but it is not the most efficient way to repre-
sent procedurally generated environment layouts, especially in scenarios with non-trivial 3D landscapes.
A naive way to visualize such layouts would require rendering thousands of individual voxels that make
up the environment. Instead, at the beginning of every episode, after the random landscape is generated,
we merge adjacent voxels into a small number of enclosing parallelepipeds (Fig. 3.2). While finding the
optimal solution for this problem is NP-hard, a greedy O(n) algorithm (where n is the total number of
non-empty voxels) is sufficient to significantly reduce the number of geometric primitives in the environment.
We study the impact of this technique on sampling throughput in Tab. 3.2.
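The greedy pass is essentially run-length merging over the voxel grid. The sketch below shows the idea along a single axis; the full procedure additionally grows the resulting boxes along the remaining axes. It is an illustration of the algorithm, not the engine's implementation.

```python
def merge_runs_along_x(filled):
    """filled: set of (x, y, z) integer voxel coordinates. Returns runs ((x_start, x_end), y, z)."""
    boxes, visited = [], set()
    for (x, y, z) in sorted(filled):
        if (x, y, z) in visited:
            continue
        visited.add((x, y, z))
        x_end = x
        while (x_end + 1, y, z) in filled:  # greedily extend the run along +x
            x_end += 1
            visited.add((x_end, y, z))
        boxes.append(((x, x_end), y, z))    # one box replaces x_end - x + 1 individual voxels
    return boxes
```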
3.1.6 Megaverse-8 Benchmark
Using the Megaverse simulation platform we created a benchmark called Megaverse-8 (Fig. 3.3), designed
for training and evaluation of embodied agents. The benchmark comprises eight tasks, each aiming to test
different cognitive abilities of intelligent agents including exploration and navigation in 3-dimensional
spaces, tool use and object manipulation, and long-term planning and memory. The benchmark is built
with multi-agent support in mind, and all scenarios are suitable for teams of agents.
(a) HexExplore (b) Collect
(c) TowerBuilding (d) Sokoban
(e) HexMemory (f) ObstaclesEasy
(g) ObstaclesHard (h) Rearrange
Figure 3.3: Overview of procedurally generated environments in Megaverse-8. At the beginning of each
episode, the visual appearance and layout of environments are sampled randomly. See Sec. 3.1.6 for details.
The environments in the benchmark are relatively simple for human-level intelligence, yet present a
serious challenge for artificial agents (Section 3.1.7). To score well, agents must demonstrate common-
sense comprehension of the physical world, such as understanding object permanence and occlusion, and
the ability to interactively manipulate objects. Creating a research platform that provides scenarios that
elicit these skills is one of our motivations.
Megaverse-8 environments are procedurally generated and each task has a practically infinite number
of instantiations. We randomize the parameters of the task, 3D geometry of the environment, starting
positions of interactable objects and agents, etc. We randomize the visual appearance by sampling ran-
dom monochrome materials. Procedural synthesis mitigates overfitting, allows us to evaluate the perfor-
mance of the agents on unseen environments, and may facilitate the emergence of generalizable skills.
The following paragraphs provide high-level descriptions of the tasks in Megaverse-8. Please also refer to
www.megaverse.info for video demonstrations.
Megaverse “HexExplore” (Fig. 3.3a). Agents are placed in a randomized hexagonal maze and tasked
to find a target object. The episode is considered solved when the target object is touched by the agent.
This environment tests agents’ episodic exploration abilities [117].
Megaverse “Collect” (Fig. 3.3b). Agents navigate in a procedurally generated 3D landscape and collect
objects. Green objects provide positive reward, while red objects generate an equal-magnitude penalty.
The episode is considered solved when all positive-reward objects are collected. To generate the level
geometry a 2D fractal noise texture [104] is synthesized and then interpreted as an elevation map at each
location in a discretized space. This environment tests agents’ skills in traversing 3D landscapes, including
the ability to control the gaze direction both vertically and horizontally.
Megaverse “TowerBuilding” (Fig. 3.3c). Agents are challenged to construct structures made of in-
teractable boxes. Agents can pick up boxes scattered in the environment and place them in the building
zone, which is marked by a distinct color. Agents receive positive reward for placing a new block in the
building zone. The reward function grows as $r^{(t)} = 2^{h}$ with the height $h$ at which the block is placed, thus
in order to maximize the score the agents are incentivized to construct structures with as many levels as
possible. The building process is subject to realistic constraints: blocks can only be placed on top of other
blocks and can only be removed if they have no blocks above them. To build non-trivial tall structures the
agents need to maintain scaffolding pathways that allow them to carry the blocks to higher levels.
Megaverse “Sokoban” (Fig. 3.3d). Fast immersive version of the classic Sokoban puzzle, inspired by
the Mujoban environment [85]. At the beginning of every episode, a random puzzle with 4 boxes is sampled
from the Boxoban dataset [43]. The agents are required to push the boxes into target positions, marked
green. As in classic Sokoban, some of the moves are irreversible, therefore the agents must strategically
plan ahead in order to succeed.
Megaverse “HexMemory” (Fig. 3.3e). Agents are placed in a randomized hexagonal maze in front
of a reference object with randomly sampled shape and material. Smaller copies of the reference object
are scattered throughout the environment, alongside other objects that do not match the reference. The
agent’s task is to collect objects matching the reference, while avoiding other types of objects. When all
matching objects are collected, the episode is terminated and the puzzle is considered solved. The scenario
requires agents to memorize the visual appearance of an object and keep it in memory for long periods of
time, as the reference object inevitably disappears from view as the agent navigates the maze. HexMemory
challenges agents’ ability to form and retain memories. This environment is inspired by Beeching et al.
[14].
Megaverse “Obstacles” (Figures 3.3f and 3.3g). Procedurally generated 3D obstacle course presented
in two versions: ObstaclesEasy and ObstaclesHard. Agents are spawned on one side of the course and
are required to reach a target location on the other side. In order to get there they need to overcome
different types of obstacles, such as pits, lava lakes, and high walls. Good coordination and movement
is not sufficient to overcome most obstacles. For example, a wall can be too tall for the agent to jump
over. Agents must use interactive objects placed in the environment to build bridges, staircases, and other
artificial tools that help them accomplish the task. ObstaclesEasy and ObstaclesHard differ in both the
length of the obstacle course and the difficulty of individual puzzles. ObstaclesHard is particularly difficult
due to reward sparsity. As the agents are unlikely to discover the sophisticated construction behaviors by
mere random exploration, the obstacle course environments can be a good test for advanced exploration
strategies such as intrinsic curiosity [102].
Megaverse “Rearrange” (Fig. 3.3h). Inspired by the classic MIT Copy Demo, this environment chal-
lenges the agents to replicate a reference structure made out of colored objects. In order to successfully
complete the task the agents have to recognise and remember the object arrangement and replicate it in a
designated area by rearranging interactive objects in a specific way. The task is considered solved when
the reference arrangement is replicated precisely.
We expect that some of the environments, such as “Rearrange” and “ObstaclesHard”, will be too chal-
lenging for present-day end-to-end learning systems. Agents not only need to discover the low-level object
manipulation skills, but must also form appropriate internal representations and explore compositions of
skills in order to succeed.
3.1.7 Experiments: Computational Performance
We start by benchmarking the performance of the Megaverse platform. We examine pure simulation
speed, when no inference or learning are done, as well as performance of Megaverse environments as a
part of a full RL training system (Sample Factory [105]). We use three different hardware setups that are
representative of systems commonly found in deep learning research labs. We compare performance of
Megaverse to other fast environments used in reinforcement learning, namely Atari [15], VizDoom [60],
and DMLab [12].
Environment                     Simulation throughput   Training throughput

System #1 (12x CPU, 1x RTX 3090)
Atari (84×84 grayscale)         19.4K (16.8x)           15.0K (2.8x)
VizDoom (128×72 RGB)            38.1K (8.6x)            18.9K (2.3x)
DMLab (96×72 RGB)                6.1K (53.5x)            4.6K (9.3x)
Megaverse (128×72 RGB)          327K                    42.7K

System #2 (36x CPU, 4x RTX 2080 Ti)
Atari (84×84 grayscale)         47.1K (18.2x)           31.4K (2.9x)
VizDoom (128×72 RGB)            79.5K (10.8x)           38.5K (2.3x)
DMLab (96×72 RGB)               12.4K (68.7x)            7.7K (11.6x)
Megaverse (128×72 RGB)          856K                    90.1K

System #3 (48x CPU, 8x RTX 2080 Ti)
Atari (84×84 grayscale)         53.7K (21.4x)           34.6K (3.9x)
VizDoom (128×72 RGB)           100.1K (11.5x)           44.7K (3x)
DMLab (96×72 RGB)               15.8K (72.6x)            9.8K (13.7x)
Megaverse (128×72 RGB)         1148K                    134K
Table 3.1: Pure sampling and training throughput with mainstream RL simulators vs. Megaverse. The
performance is reported in observations per second observed by the agent, i.e. after frameskip.
Optimized geometry   Batched rendering [125]   Simulation throughput
✗                    ✗                         20.7K (10.1x slower)
✓                    ✗                         29.6K (7.1x slower)
✗                    ✓                         45.7K (4.6x slower)
✓                    ✓                         210K
Table 3.2: Influence of the optimized geometry and batched rendering optimizations on the overall sampling
throughput. Performance measured on a 10-core, 1x GTX 1080 Ti system in the Megaverse-8 “Collect” scenario.
We find that in simulation throughput, Megaverse is an order of magnitude faster than the next fastest
environment, VizDoom (Tab. 3.1), while supporting considerably more complex interactions. Our platform
is between 50x and 70x faster than the most comparable environment, DMLab. In end-to-end training,
Megaverse is entirely bottlenecked by learning and inference throughput and enables training speeds 2-3
times faster than VizDoom and up to 14x faster than DMLab.
Ablation study. We examine the impact of two key performance optimizations in Megaverse: batched
rendering (Sec. 3.1.4) and geometry optimization (Sec. 3.1.5). The results show that both of these
techniques are required to achieve high throughput (Tab. 3.2). Without geometry optimization the system
[Figure 3.4 legend: With CPC|A, Without CPC|A]
Figure 3.4: Single-agent performance. Megaverse-8 has a diversity of challenges, with environments
where model-free RL is able to make progress (top four) and environments where it fails to achieve non-
trivial performance (bottom four). Even for the simplest environments, there is still considerable room for
improvement. The results are reported for 3 random seeds.
would be heavily bottlenecked by physics calculations on the CPU, and without batched rendering the
communication between CPU and GPU is a major bottleneck.
3.1.8 Single-Agent Baseline
Setup. In this section we present RL training results on the Megaverse-8 benchmark. We train agents
using asynchronous proximal policy optimization (PPO) [123] with V-trace off-policy correction [35] using
the Sample Factory implementation [105]. Given the challenges of learning good representation with
model-free RL from scratch, we also experiment with using Action Conditional Contrastive Predictive
Coding (CPC|A) [44] as an auxiliary loss. We train both standard PPO and a CPC|A-augmented version
on $2 \times 10^{9}$ environment steps†. We find that CPC|A augmentation leads to considerable performance
improvements on TowerBuilding and Exploration tasks without significantly affecting other scenarios,
therefore we decided to use it in all other experiments.
† Note that each transition here is observed by an agent since there is no frameskip in Megaverse.
Results. We establish that the proposed benchmark has considerable diversity in task difficulty. While
all tasks are far from being solved, reasonable progress can be made on four of the eight (Fig. 3.4). Obstacle-
sEasy, Collect, and HexExplore require robust 3D navigation skills and basic object manipulation abilities.
Model-free RL was able to achieve non-trivial performance in these scenarios, although none of the agents
approached 100% success.
Our agents demonstrated surprisingly high level of performance in the TowerBuilding scenario. Com-
paratively dense reward allowed the agents to master object stacking and consistently construct structures
up to ten levels high during training. We also observed generalization of construction behavior: when we
gave the agent twice as many blocks compared to training it kept building higher and higher towers up to
14 levels high. Video demonstrations of agent performance can be found at www.megaverse.info.
Other scenarios in the benchmark have proven to be a much harder challenge. Both Rearrange and
Sokoban agents improved at the beginning of training, but ultimately failed to reach satisfactory perfor-
mance levels. In the Rearrange scenario, the agents learned to randomly shuffle the objects in the hope of
matching the target arrangement by accident, and never learned to pay attention to the reference arrange-
ment. In Sokoban, the agents tend to push blocks to nearby targets which happens to be sufficient to solve
some of the puzzles. Full completion of the task requires long-term planning, and the agents ultimately
failed to demonstrate the ability to do that.
HexMemory and ObstaclesHard turned out to be the most challenging scenarios. HexMemory is a hard
credit assignment and memory challenge. Agents failed to capture the relationship between temporally
distal events, such as observation of the reference object and collection of similar/dissimilar objects in
the environment. This experiment shows that GRU policy networks that we used in our experiments
are not sufficient for this type of task, although there is potential for other policy architectures, such as
transformers [99].
[Figure 3.5 legend: 4 Agents, No Team Spirit; 2 Agents, No Team Spirit; 4 Agents, Team Spirit; 2 Agents, Team Spirit]
Figure 3.5: Multi-agent performance. For the majority of tasks in Megaverse-8, increasing the number of
agents does not yield better results with our model-free RL framework. For all tasks except tower building,
a simple Team Spirit strategy of sharing rewards has either no impact or negative impact on performance
due to the increased difficulty of credit assignment. The results are reported for 3 random seeds.
ObstaclesHard is perhaps the most challenging scenario in the Megaverse-8 benchmark. To traverse
the obstacle course completely the agents need to master multiple skills and combine them in intelligent
ways to overcome obstacles. Learning individual skills via a curriculum of simpler environments may be
a promising research direction.
3.1.9 Multi-Agent Baseline
Setup. We continue by examining multi-agent performance with two and four agents on our proposed
benchmark. In all Megaverse-8 tasks, the agents must work together to perform well. To encourage
cooperation, we experiment with Team Spirit reward shaping, inspired by OpenAI Dota 2 experiments
[17]. Team Spirit modifies the credit assignment such that agents are rewarded both for their own ac-
tions and the actions of other agents in the team. Formally, the reward for agent $i$ at time $t$ is
$r_i^{(t)} = (1 - \mathrm{TeamSpirit})\,\tilde{r}_i^{(t)} + \frac{\mathrm{TeamSpirit}}{\mathrm{NumAgents}} \sum_j \tilde{r}_j^{(t)}$, where $\tilde{r}_i^{(t)}$ is the individual agent reward before incorporating
Team Spirit. This reward makes credit assignment harder, thus we gradually increase Team Spirit from 0.0
to 1.0 over the first one billion steps of training as a form of curriculum.
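For concreteness, the per-step reward computation with the linear Team Spirit curriculum can be written as a short vectorized function (an illustrative sketch, not the actual training code):

```python
import numpy as np

def team_spirit_rewards(individual_rewards, env_frames, anneal_frames=1e9):
    """individual_rewards: array [num_agents] with each agent's own reward at this step."""
    team_spirit = min(1.0, env_frames / anneal_frames)  # anneal 0.0 -> 1.0 over the first 1e9 frames
    shared = individual_rewards.mean()                  # (1 / NumAgents) * sum_j of individual rewards
    return (1.0 - team_spirit) * individual_rewards + team_spirit * shared
```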
Results. For two tasks, HexExplore and Collect, we find that having more agents is beneficial (Fig. 3.5).
In these tasks, relatively high score can be achieved even if agents ignore each other and focus on maximiz-
ing their own reward. This is confirmed by the fact that Team Spirit hurts performance. For HexMemory,
a team of agents has a higher chance of completing the task randomly. Even though the results are better
than for a single agent, teams of agents fail to make further progress. In TowerBuilding, we discover that
two agents perform approximately as well as one agent, and four agents consistently perform worse. The
agents end up competing for rewards with each other instead of working together. In this case the addition
of Team Spirit encourages cooperation and improves performance in both two- and four-agent teams.
3.1.10 Megaverse: Discussion
In Sec. 3.1 we presented a new research platform Megaverse, capable of achieving simulation throughput
over 1,000,000 observations per second on a single node – unprecedented for immersive 3D simulators.
Aside from the ability to simulate embodied agents tens of thousands of times faster than real time, our
engine can also match the throughput of existing simulators while using only a fraction of computational
resources. Our dedicated simulation platform can make large-scale RL experiments more accessible, thus
accelerating progress in AI research.
Hard problems and good metrics for evaluating progress on these problems are instrumental for sci-
ence. While the traditional benchmarks in Deep RL are definitely not trivial, they are getting tantalizingly
close to being solved [7, 99, 105]. We use the Megaverse platform to buildMegaverse-8, a new suite of hard
challenges for embodied AI. Complete solution of all tasks in Megaverse-8 requires the agents to master
object manipulation, rearrangement, and composition of different low-level skills. We hope that solving
these challenges in a robust and principled way will advance our understanding of embodied intelligence.
Extremely fast simulation provided by Megaverse can have impact beyond deep reinforcement learn-
ing. For example, contemporary derivative-free optimization methods are known for their supremacy in
Mujoco-like environments [124], but evaluating them in scenarios with high-dimensional observations has
previously been very costly. With more than an order-of-magnitude improvement in simulation through-
put, evaluation of derivative-free methods in immersive 3D environments may be feasible.
Another research direction that can leverage fast simulation is meta-learning. With highly optimized
learning systems [131, 34, 105], entire training sessions in simple Megaverse environments can be com-
pleted in mere seconds, and thus can be used as a part of a larger meta-learning process. While partial
learning of optimizer features and loss functions has been demonstrated [13], access to sufficiently fast
training may enable the optimization of whole learners parameterized by neural networks.
Megaverse opens new possibilities in the field of multi-agent learning. Megaverse is one of the first
open-source platforms that allows fast simulation of multiple agents interacting in immersive environ-
ments. The accessibility of such a platform can have important implications for studying multi-agent co-
operation, autocurricula emerging from self-play [8], and the emergence of communication and language
[90].
3.2 Vectorized Physics Engines
Figure 3.6: Examples of environments supported by Isaac Gym [79]
Physics engines are essential for robotic simulation. Traditionally, this class of software has been
dominated by CPU-based "scalar" physics engines, most notable examples being "MuJoCo" [134] and "Bul-
let/PyBullet" [29]. These engines are designed to simulate a single scene (i.e. a robot interacting with an
object) per instance of the simulator. Although simulation can be accelerated using OS-level parallelism
(i.e. running multiple copies of the simulator in different processes), the nature of modern compute hard-
ware dictates new demands for physics engines.
Modern GPUs contain thousands of individual cores and simulation running directly on the GPU
can provide radically higher throughput than a CPU-based simulation of comparable accuracy. Multiple
projects take advantage of this, most notably Brax [38] and Isaac Gym [79].
Brax is a differentiable physics engine based on JAX [18] (autograd engine and optimized linear algebra
compiler). Brax can be executed not only on modern GPUs, but also on TPU accelerators, simulating
simple environments at millions of steps per second on a single TPU. While Brax is a prominent emergent
technology, it is not extensively discussed in this thesis.
Unlike Brax, Isaac Gym is not a differentiable physics engine (i.e. more akin to traditional "black box"-
like simulators used in RL). It uses PhysX [94] technology and is accelerated by Nvidia GPUs, achieving
up to $10^{6}$ simulation steps per second on a single device. Isaac Gym provides a rich API which allows
developers to implement sophisticated scenes including high-DoF robots for locomotion and contact-rich
object manipulation tasks (see Fig. 3.6).
In the following chapters we take advantage of the massive throughput provided by Isaac Gym to tackle
challenging robotic manipulation problems in simulation and in the real world.
Chapter 4
Simulation-Only Applications
In this chapter we demonstrate how accelerated learning in simulation can be used to conduct research
in embodied navigation and robotic manipulation. We present two projects titled "Agents that Listen"
(Sec. 4.1) and "DexPBT" (Sec. 4.2) respectively.
"Agents that Listen" is an experiment based on a heterogeneous learning system with a CPU-based
simulator (VizDoom [60]). This is a typical example of a Type 1 RL system, as described in Sec. 2.2.
DexPBT is based on a homogeneous, entirely GPU-side simulation and learning setup, which makes it
a Type 2 learning system.
In both cases billions of simulated environment transitions were required to achieve successful results,
and these experiments would not be possible without high-throughput learning algorithms.
In DexPBT we additionally explore a scenario where we can use a very efficient learning system capable of generating billions of environment rollouts while at the same time not being constrained to a single machine. We use a Population-Based Training [55] algorithm to scale learning on the cluster, which significantly amplifies the power of the learning system.
Parts of this chapter appeared in:
• Shashank Hegde, Anssi Kanervisto, and Aleksei Petrenko. Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems. In IEEE Conference on Games (CoG), 2021.
• Aleksei Petrenko, Arthur Allshire, Gavriel State, Ankur Handa, and Viktor Makoviychuk. DexPBT: Scaling up Dexterous Manipulation for Hand-Arm Systems with Population Based Training. Under review at IEEE International Conference on Robotics and Automation (ICRA), 2023.
4.1 Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems
4.1.1 Introduction
Reinforcement learning (RL) algorithms have achieved tremendous success in the field of embodied intelligence, including human-level control in Atari games [87, 7] and in first-person games [143, 105], and super-human control in competitive games [138, 17]. These state-of-the-art learning methods allow artificial agents to discover efficient policies that map high-dimensional unstructured observations to actions. While the general framework of deep RL enables learning from arbitrary sources of data, so far the majority of research in embodied AI has focused on learning only from visual input (see, for example, the works cited above). We argue that another important sensor modality, sound, is largely overlooked.
Sound represents a highly salient signal rich with information about the environment. Sound cues
correspond to discrete events such as contacts and collisions which might be difficult to identify from
visual data alone. Stereo sound encodes important spatial information that can reveal objects and events
outside of the agent’s field of view. Finally, sound could be used to establish a natural communication
channel between agents in the form of speech and hearing, which is one of the distinguishing features of
higher forms of intelligence.
In computer games, especially in the first-person shooter (FPS) genre, the ability to perceive and un-
derstand game sounds is one of the essential skills. This is particularly important in tactical duel scenarios
in games like Quake or Doom: in order to gain an advantage, skilled players listen to their opponent's
actions to understand where they are on the game level and what resources they possess.
Figure 4.1: Trained agent follows visual and sound cues to reach the target object in a ViZDoom environ-
ment.
Reinforcement learning on combined auditory and visual inputs is complicated by the lack of infras-
tructure. The existing learning environments either do not support sound, or do not allow high-throughput
parallel simulation necessary for large-scale experiments. We attempt to improve the situation by releasing
an augmented version of the popular ViZDoom environment [60] where in-game stereo sound is avail-
able to the agents. Our implementation is decoupled from dedicated sound hardware typically used for
audio rendering, and thus allows faster-than-realtime parallel simulation. We proceed to train agents in
our environment in a series of increasingly complex scenarios designed to test various aspects of sound
perception.
4.1.2 Related Work: Reinforcement Learning with Audio Inputs
A number of prior projects have explored RL with audio observations. Gaina and Stephenson [40] augmented the General Video Game AI framework to support sound, focusing on 2D sprite-based games. Chen et al. introduced SoundSpaces [22], a version of the Habitat environment which focuses on audio-visual navigation in photorealistic scenes. SoundSpaces was further used in [77] to investigate the problem of separating sound sources from background noise. Park et al. [100] introduced a general-purpose simulation platform based on the Unity engine with both auditory and visual observations.
While ViZDoom [60] supports in-game stereo sounds, the default audio subsystem is not designed for
faster-than-realtime experience collection, and thus can only be used in relatively basic scenarios [142]. To
[Figure: diagrams of the full network (image 2D CNN and audio encoder features concatenated, FC 512, RNN, then actions and values) and of the three audio encoders applied to the left and right audio channels: the 1D Conv. encoder (1/8 subsampling, 1D conv. 16F/16K/8S, 1D conv. 32F/16K/8S), the Fourier transform encoder (FFT, 1D max-pool 8K, FC 256, FC 256), and the mel-spectrogram encoder (STFT, mel-filterbanks, 2D conv. 16F/3K/S1, 2D conv. 32F/3K/S1).]
Figure 4.2: Illustration of the network architecture and audio encoders used (see Appendix of [105] for complete details). Here K stands for kernel size, F for number of filters, S for stride, FC for fully connected layers, and STFT for short-time Fourier transform. All convolutional layers are followed by max-pool layers with kernel size two.
the best of our knowledge, the version of ViZDoom presented in this work is the first simulation platform that supports accelerated embodied simulation with sound at tens of thousands of actions per second, enabling large-scale training. Our experiments with the Doom duel scenario represent one of the first deployments of an agent with auditory and visual perception in a full first-person computer game.
4.1.3 ViZDoom Environment with Sound
We generate the audio observations for the agents through the OpenAL (https://openal.org/) sound subsystem supported by ViZDoom. The OpenAL implementation offers many modern features, such as 3D sounds, reverberation, Doppler shift, and dampening of sounds based on the agent's gaze direction with respect to the source. Normally the sound engine is designed for human perception and plays back the sound samples in real time, prohibiting fast simulation. We circumvent this issue by using OpenAL Soft (https://github.com/kcat/openal-soft) with the ALC_SOFT_loopback extension, which completely decouples the in-game sound from the device audio and enables
[Figure: success rate vs. environment frames (skip=4) learning curves for (a) Music Recognition, comparing the No sound, 1D Conv, Mel-spectrogram, and Fourier transform encoders over 500M frames, and for (b) Sound Instruction and (c) Sound Instruction Once, comparing No sound and Fourier transform over 1000M frames.]
Figure 4.3: Results on the main testing scenarios. Fig. 4.3a shows the comparison of different encoders in a sound source finding task. Figs. 4.3b and 4.3c show the performance of the FFT (Fourier transform) encoder on the Instruction and Instruction Once environments respectively. For each experiment we report mean and standard deviation of five independent training runs.
software rendering of sounds on the CPU. This allows us to generate both visual and auditory observations
at a maximum rate, enabling environment simulation in the lock-step fashion typical for an RL setup.
In addition to that, an ALC_EXT_thread_local_context extension allows us to spawn a large number
of game instances generating sound samples simultaneously. We leverage that in our experiments by
starting hundreds of concurrent processes to achieve high training throughput with an asynchronous RL
framework [105].
By directly accessing OpenAL sound buffers we expose raw audio observations through the ViZDoom API. To give the agents access to all available sound data, we implement configurable audio frame-stacking, independent of the ViZDoom frame-skipping parameters. By default, if the agent chooses its action in the environment once every $N$ simulation steps, we provide the audio observation containing the sounds for the previous $N$ steps. The length of this window can be increased if needed, for example to facilitate training of feed-forward policies.
Another configuration parameter we expose is the audio sampling rate. A larger sampling frequency is analogous to a higher screen resolution: it enables more detailed observations at the cost of increased computation time. In this work we used a fixed sampling rate of 22050 Hz, which provides fast rendering and high sound quality.
"Go to the
Green Pillar "
Figure 4.4: Illustrations of theMusicRecognition (left) andSoundInstruction scenarios (right). Locations of
target objects and the player starting position are sampled randomly in each episode.
4.1.4 Audio Encoder Architectures
Our focus is on finding a general approach for processing sound with neural network-based policies. We seek models that are powerful and general enough to solve different, complex tasks, yet compact enough to facilitate fast learning. Using deep learning to process raw image pixels has been successful in RL [87, 143]; however, processing raw audio samples efficiently usually requires very large models [95], and to this day many state-of-the-art audio systems rely on some form of feature engineering (see Garcia et al. [41] for an example). These features are applicable to different tasks, with varying levels of performance depending on the task at hand.
For this reason, we propose three different encoders, which we compare in our experiments. The task
of the audio encoder is to generate a compact representation of the raw sound data. This representation is
then concatenated with the features from the image processing network. The resulting vector of features
is fed to the rest of the network to generate actions and value estimates (see Fig. 4.2).
The raw audio input is a vector $s \in \mathbb{R}^n$, $s_i \in [-1, 1]$, containing $n$ normalized audio samples. ViZDoom runs at a fixed 35 frames per (realtime) second, so for each simulation step this input contains audio corresponding to 29 ms of gameplay. With the fixed 22050 Hz sampling rate and standard 4-frameskip, our audio observation consists of 2520 samples, or 114 ms of audio. We process both left and right audio channels separately and concatenate channel features into a single output vector. Fig. 4.2 illustrates the high-level structure of the encoders.
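As a quick sanity check, the size of this audio window follows directly from the frame rate, sampling rate, and frameskip described above; a minimal calculation:

    # Audio observation size for ViZDoom with software-rendered sound.
    # All constants are taken from the setup described in the text above.
    FPS = 35               # ViZDoom simulation rate, frames per second
    SAMPLE_RATE = 22050    # audio sampling rate, Hz
    FRAMESKIP = 4          # the agent acts once every 4 simulation steps

    samples_per_frame = SAMPLE_RATE // FPS                 # 630 samples, ~29 ms
    samples_per_action = samples_per_frame * FRAMESKIP     # 2520 samples
    window_ms = 1000 * samples_per_action / SAMPLE_RATE    # ~114 ms

    print(samples_per_frame, samples_per_action, round(window_ms, 1))  # 630 2520 114.3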
1D Conv. We downsample the audio by taking every 8th sample and then feed the samples through two 1D convolutional layers. While this removes high-frequency components (anything above $\approx 3000$ Hz), most of the information lies below this frequency threshold. This downsampling allows us to reduce computational complexity. The convolutional encoder can be considered a naive baseline approach.
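A minimal PyTorch sketch of this baseline, with kernel sizes and filter counts following Fig. 4.2; the max-pool after the second convolution is omitted here because the single-action 2520-sample window is already reduced to length one at that point, and other details may differ from the released implementation. One encoder instance is applied per audio channel.

    import torch
    from torch import nn

    class Conv1DAudioEncoder(nn.Module):
        """Naive baseline: subsample the waveform, then two strided 1D convolutions."""
        def __init__(self, subsample: int = 8):
            super().__init__()
            self.subsample = subsample
            self.conv = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=16, stride=8), nn.ReLU(), nn.MaxPool1d(2),
                nn.Conv1d(16, 32, kernel_size=16, stride=8), nn.ReLU(),
            )

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: [batch, samples] for one (left or right) channel, values in [-1, 1]
            x = waveform[:, :: self.subsample].unsqueeze(1)    # keep every 8th sample
            return self.conv(x).flatten(start_dim=1)           # compact audio feature vector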
Fourier transform. We transform the audio buffer to the frequency domain using the fast Fourier transform (FFT) and take the natural logarithm of the magnitudes, $s_{\text{FFT}} = \log |\text{FFT}(s)| \in \mathbb{R}^{n/2}$, downsample with a 1D max-pool layer, and then feed the result through a two-layer fully connected network. This discards the temporal information inside the 114 ms of audio, but enables robust performance and a simple network architecture.
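A minimal PyTorch sketch of this encoder, assuming the layer sizes shown in Fig. 4.2 (a 1D max-pool with kernel 8 followed by two fully connected layers of 256 units); the released implementation may differ in details such as the exact pooling and normalization.

    import torch
    from torch import nn

    class FFTAudioEncoder(nn.Module):
        """Log-magnitude FFT features, max-pooled and passed through a small MLP."""
        def __init__(self, num_samples: int = 2520, pool_kernel: int = 8, hidden: int = 256):
            super().__init__()
            self.pool = nn.MaxPool1d(pool_kernel)
            pooled_bins = (num_samples // 2 + 1) // pool_kernel   # rfft length after pooling
            self.mlp = nn.Sequential(
                nn.Linear(pooled_bins, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: [batch, samples] for one audio channel
            spectrum = torch.fft.rfft(waveform)                   # complex spectrum, [batch, samples//2 + 1]
            log_mag = torch.log(spectrum.abs() + 1e-5)            # log-magnitude, discards phase
            pooled = self.pool(log_mag.unsqueeze(1)).squeeze(1)   # coarse frequency bins
            return self.mlp(pooled)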
Mel-spectrogram. Motivated by the success of the mel-spectrogram approach in speech processing [41], we transform the samples into a frequency-domain spectrogram with the short-time Fourier transform (STFT). STFT works by sliding a window over the audio samples, computing an FFT on that window, and then moving the window by a given hop. Depending on the window size and the hop length, the resulting spectrogram can have a large number of feature vectors (depending on the length of the audio) and frequency bins, with most high-frequency components containing only a minimal amount of useful information. Motivated by studies of human audio perception, the mel-frequency scale emphasises higher resolution at lower frequencies, usually computed by using triangular overlapping windows [30]. This comes with the additional benefit of reducing the size of the spectrogram. We compute the spectrogram using parameters from [41], with a window size of 25 ms, a 10 ms hop size, and 80 mel-frequency components. See Fig. 4.2 for an example of the spectrograms produced with these hyperparameters. The resulting spectrogram is processed by two 2D convolutional layers.
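A minimal PyTorch sketch of this encoder using torchaudio for the mel-spectrogram front end (an assumed dependency; the thesis implementation may compute the spectrogram differently), with window, hop, and mel parameters from the text and convolution sizes following Fig. 4.2:

    import torch
    from torch import nn
    import torchaudio

    class MelAudioEncoder(nn.Module):
        """Log mel-spectrogram front end followed by a small 2D CNN."""
        def __init__(self, sample_rate: int = 22050, n_mels: int = 80):
            super().__init__()
            self.melspec = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate,
                n_fft=1024,
                win_length=int(0.025 * sample_rate),   # 25 ms window
                hop_length=int(0.010 * sample_rate),   # 10 ms hop
                n_mels=n_mels,
            )
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, stride=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, stride=1), nn.ReLU(), nn.MaxPool2d(2),
            )

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: [batch, samples] for one audio channel
            spec = torch.log(self.melspec(waveform) + 1e-5)     # [batch, n_mels, time]
            return self.cnn(spec.unsqueeze(1)).flatten(start_dim=1)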
4.1.5 Experimental Setup
We train our agents using the asynchronous RL framework Sample Factory. We follow the ViZDoom experiments in the original paper [105] and use the same algorithm, hyperparameters, and model architectures. In particular, we use the asynchronous proximal policy optimization (PPO) algorithm with clip loss [123] and V-trace off-policy correction [35]. Our model consists of a three-layer convolutional network to process the RGB image, a chosen audio encoder, a gated recurrent unit layer [26], and a fully connected layer to produce action probabilities and value estimates. We ran all our experiments on a single 36-core server with four Nvidia RTX 2080Ti GPUs.
4.1.6 Environment Scenarios
In order to test the audio encoders and the agent's overall problem-solving abilities we designed three different scenarios based on the map layout depicted in Figs. 4.1 and 4.4, where six visually distinct pillars are placed in four different rooms. The ordering of pillars and the agent's starting position are randomized in each episode. Our fourth and final scenario is a self-play duel in a full game of Doom.
Music Recognition. Each pillar plays a different music track in a loop throughout the episode. One pillar is randomly chosen to play a unique target track. The agent is given a +1 reward upon touching the pillar that plays the target track. Touching other pillars terminates the episode with a 0 reward. The attenuation value of the sound sources is high, therefore the agent has to move close to a pillar to hear the sound. Thus the agent's strategy should be to move from pillar to pillar and listen until it finds the pillar playing the target track.
Sound Instruction. During the episode the agent repeatedly hears a command in spoken English, which
instructs it to go to a particular object. The agent is rewarded for touching the correct object and receives
zero reward otherwise. Unlike the previous scenario where the decision to move close to an object was
purely based on the sound, here the agent has to use both visual and auditory input to complete the task.
Sound Instruction Once. A more complex version of the Sound Instruction environment where the in-
struction is only given once at the beginning of the episode. This scenario tests a combination of multi-
modal perception and the ability to memorize instructions.
Duel. Finally, we train our agents in a 1v1 self-play matchup in the full game of Doom, following a setup
similar to [105] except with full access to in-game sounds. We evaluate the agent against a separately
trained agent that is not equipped with the sound encoder, and we hypothesise that the agent with access
to sound can outperform the deaf agent.
4.1.7 Training Settings
We test all audio encoders in the Music Recognition scenario, where training converges within $5\times10^8$ environment steps. We then choose the best-performing encoder for other experiments, where we train for $10^9$ steps in the Sound Instruction scenarios and for $2\times10^9$ steps in the Duel scenario. We fixed the image resolution to 128×72 and set the frameskip to 4 for all environments except Duel, which was run with 2-frameskip.
4.1.8 Results
After the initial testing on the Music Recognition scenario we found that the Fourier transform encoder was the most efficient (Fig. 4.3). We continued to test the FFT encoder on the Sound Instruction and Sound Instruction Once scenarios. In the majority of the training runs the agent was able to reach optimal performance in each of these scenarios. The high variance in the results suggests that $10^9$ steps of training are still not sufficient for all seeds to converge. We also found that the agent showed slightly better final performance on the supposedly harder task Sound Instruction Once. Although it is likely that this is just a statistical anomaly given the high variance and low number of independent runs (limited by the computation budget), we leave a full explanation of this surprising result for future work.
Table 4.1: Results of 1v1 matches between our agent that has access to sound and a vision-only agent. “Sound (dis.)” is the main agent with sound inputs disabled during the evaluation.
Match Wins Losses Draws
Sound vs No sound 53 31 16
Sound vs Sound (dis.) 74 17 9
In the Music Recognition scenario, we saw the agent achieve the expected behaviour, where it uses the stereo sound to navigate to the correct pillar. The agent explores the map listening to different music and moves closer to the source when the target music track is recognised. We also saw the agent move towards pillars backwards, showing that visual input is often superfluous in this task. In the Sound Instruction Once scenario, we noticed that the agent goes to the center of the map early to await the instructions, which helps minimize the average time to complete the task. While the agent is waiting, it keeps turning around, memorizing the locations of the objects. Once the instruction starts, the agent quickly turns and approaches the target object. This behaviour shows the agent's ability to combine auditory and visual cues to quickly explore its surroundings and map the sound instruction to the appropriate action. In addition, we noticed the agent's ability to memorize the instructions for the entire episode, courtesy of the recurrent model architecture.
Table 4.1 shows the benefit of having access to sound information in the Duel scenario. Here we trained
two sets of agents using population-based training and self-play, with a small population of 4 policies. The
main population (“Sound”) used the FFT encoder and had access to both auditory and visual observations.
Another set of agents (“No sound”) used only image observations. After training for $2\times10^9$ steps we chose
the best agents from both populations and ran two series of one hundred 4-minute matches between them.
In the first series of games we compared "Sound" and "No sound" versions of the agents. Our main agent
won in more games, demonstrating the advantage of the enhanced sensorium. In the second series of
games we tested our "Sound" agent against a version of itself with its auditory observations replaced with
silence (“Sound (dis.)”). The agent with disabled hearing played significantly worse, demonstrating the
strong reliance of our agent on sound cues.
When analysing the behaviour of the main agent in the duel environment we noticed a reduced usage of loud ammunition. We believe this allows the agent to conceal its position from the opponent, which facilitates surprise attacks. The agent also uses its spatial sound perception to discover the location of the enemy by listening to the opponent's gunfire.
4.1.9 Training Throughput
To measure the total computation cost added by rendering and processing sound in our experiments we tracked the average training throughput. In single-agent experiments we collected experience using 72 parallel workers, each worker sampling 8 environments sequentially for a total of $72\times8 = 576$ parallel environments per experiment. We ran 4 such experiments at a time on a 36-core machine with 4 GPUs to maximize the hardware utilization. We observed a training throughput of $1.5\times10^5$ game frames per second per experiment with disabled sound and $1.2\times10^5$ when sound is enabled.
We did not notice a significant difference in performance in the Duel scenario. Here we trained a population of 4 policies at a combined framerate of $6.7\times10^4$ both with and without sound. The training performance in multi-agent ViZDoom environments is bottlenecked by slow network-based communication between game instances, and thus the addition of the sound rendering workload does not have a significant effect.
4.1.10 Conclusions and Future Work
In this work we introduced an immersive environment based on ViZDoom that provides access to both auditory and visual observations while maintaining high simulation throughput. We introduced new scenarios that test the agent's ability to hear and identify sounds, as well as combine sound with visual cues. Our results indicate that transforming the audio samples into the frequency domain with the FFT is sufficient for fast and effective RL training when combined with a recurrent neural architecture. This is evident from the results of our experiments with sound separation and instruction execution, as well as the results on a full game where the agents with an augmented sensorium prevail. We hope that access to an efficient environment that simulates auditory experience will enable large-scale experiments and facilitate further research in this area.
As this is preliminary work, there are still a myriad of open questions and limitations to address. We only used one RL algorithm and three different audio encoders in our experiments, without extensive hyperparameter or architecture tuning. The scenarios used in our experiments could also be extended: we used a limited bank of sounds, and to assess how well the agent learned to “understand sound” instead of overfitting to specific cues, we need a larger bank of sounds to pick from. This can be done by adding more natural sounds and by augmenting the existing ones with random noise and other transformations to prevent the neural network from memorizing the exact samples.
4.2 DexPBT: Learning Dexterous Robotic Manipulation with Population-Based Training
4.2.1 Introduction
Figure 4.5: Tasks trained with DexPBT. Left-to-right: regrasping, throwing, single-handed reorientation,
two-handed reorientation.
In recent years researchers have started to apply deep reinforcement learning methods in an increasing
number of challenging continuous control domains. Some applications include impressive demonstrations
of skill in virtual environments, such as playing football by directly controlling humanoid characters using
joint torques [75]. In other domains such as agile drone flight [130, 88, 11] and quadruped locomotion [66,
84], control policies trained in simulation are deployed directly on the real hardware.
Learning-based approaches have been particularly promising in the domain of robotic manipulation.
Traditional methods such as direct trajectory optimization may struggle to model complex contact dynam-
ics. Due to these difficulties, researchers working on manipulation problems typically focus their efforts
on robotic arms with end-effectors that simplify contact handling, such as parallel jaw grippers [21, 93,
81]. Even though more capable, human-like robotic hands have the potential to endow robots with far
more advanced manipulation capabilities, this type of end-effector remains unpopular as a result of the
difficulty of controlling high degree-of-freedom (DoF) systems in contact-rich environments.
With the advent of modern deep RL methods that use large amounts of data and computation, it has become possible to learn control policies for multi-fingered robotic hands. At the time of writing, a system called "Dactyl" [5] remains state-of-the-art in dexterous robotic manipulation, having learned to rotate an object in-hand or even solve a Rubik's cube with a standalone statically mounted Shadow Hand manipulator [96].
In this work, we attempt to build towards the next step in dexterous manipulation. We demonstrate
how general learning methods can be utilized to solve complex object manipulation problems in simulation
with a fully actuated hand-arm system, in our case a four-finger 16-DoF Allegro Hand mounted on a 7-DoF Kuka arm. We then scale these methods to train agents that control a pair of arms and hands with a combined 46 degrees of freedom using a single neural network policy.
We observe that despite the relative success of straightforward end-to-end learning, our RL experiments are characterised by high variance in results and dependence on initial conditions, especially in tasks that require exploration in the vast space of possible behaviors. To that end, we develop a Population-Based Training (PBT) algorithm [55], an outer optimization loop that can significantly amplify the exploration capabilities of end-to-end RL. We find that the PBT approach demonstrates improved performance in all scenarios and becomes an enabling factor for successful training of ambidextrous agents that control two hand-arm systems simultaneously.
Our primary contributions can be summarized as follows:
• To the best of our knowledge, we provide the first RL-based approach that can learn grasping and dexterous
in-hand object manipulation on a high-DoF hand-arm system.
• We introduce a framework that combines Population Based Training with realistic vectorized robotic
simulation and use this framework to train robust high-performance policies solving continuous
control tasks.
• We plan to release our environments and Population-Based Training code to facilitate further re-
search in dexterous robotic manipulation.
4.2.2 Related Work
Producing control policies to perform complex, contact-rich tasks has been a long-standing challenge in
robotics. Classical methods for this have focused on directly leveraging robot kinematics [114, 80, 71].
While useful in free-space or pick-and-place style tasks, these methods struggle as the number of contacts
grows.
Recently, various systems have leveraged learning-based methods to perform contact-rich robotic ma-
nipulation, achieving impressive results both in simulation and reality [67, 57]. In particular, RL-based
methods in simulation have shown the ability to learn complex and robust behaviors in realistic scenarios
which transfer to the real world [66, 5]. The advent of high-throughput simulation on GPUs has improved
the speed at which such tasks can be learned using RL [79, 113, 3, 53, 38].
Multiple prior papers have explored the problem of in-hand manipulation. Andrychowicz et al. [5] and OpenAI et al. [96] showed that it was possible to train policies for in-hand cube manipulation and Rubik's cube solving entirely in simulation and deploy them on a real robot. Other work has shown multiple tasks with free-floating hands [25] or with object grasping [145]. Matl et al. [81] demonstrated methods for learning grasping policies with two robotic arms and different types of end-effectors such as suction cups and parallel jaw grippers. However, to date no work has tackled the problem of in-hand object manipulation with a single or dual multi-fingered hand-and-arm system.
Jaderberg et al. [55] popularized Population-Based Training in a variety of domains such as RL, adversarial learning, and machine translation; however, PBT methods turned out to be particularly promising in deep RL, where exploration is often the bottleneck. PBT provides a way to combine the exploration power of multiple learners and direct resources towards more promising behaviors. Since then, PBT algorithms have been exceptionally successful in RL for video game applications and helped to produce state-of-the-art agents for games such as Quake [54], Doom [105], and StarCraft II [138].
Recently Wan et al. [139] augmented PBT-style methods with trust-region based Bayesian Optimization and were able to optimize both hyperparameters and model architectures simultaneously. Flajolet et al. [36] proposed a highly efficient JAX implementation of PBT that enables evaluation of multiple agents on one accelerator. Both Wan et al. [139] and Flajolet et al. [36] demonstrated great results on standard continuous control benchmarks such as Half-Cheetah and Humanoid; in contrast, we apply our method in complex dexterous manipulation domains more similar to real robot settings. Additionally, our decentralized PBT implementation (Sec. 4.2.7) makes it easy to use the algorithm in volatile compute environments such as shared Slurm clusters.
[Figure: diagram of the decentralized PBT architecture. For each agent in population P, a simulation produces observations s_t and rewards r_t for an LSTM actor-critic that outputs actions a_t; RL training runs for N_iter iterations, saving checkpoints to and loading the population from a shared checkpoint directory. A periodic PBT evaluation splits the population into top 30%, mid 40%, and bottom 30% groups, which either continue training, mutate hyperparameters p, or replace their weights θ.]
Figure 4.6: An illustration of the system used to solve our complex manipulation tasks using a combination of RL, highly parallelized robotic simulation, and Population Based Training (PBT).
4.2.3 Problem Statement
The degree of usefulness of a robotic system can be characterized by its ability to perform generalized
rearrangement [10], i.e. bringing a given environment into a specified state. In household, factory, or
warehouse environments many interesting tasks will involve a type of rearrangement that requires human-
level object handling and manipulation. We propose using high-DoF anthropomorphic hand+arm systems
to achieve the level of dexterity that might be necessary to match human-level object manipulation skills.
We focus on a problem that can be seen as a special case of rearrangement: single-object reposing. This task requires changing the state of a single rigid body such that it matches the target position $x \in \mathbb{R}^3$ and, optionally, orientation $R \in SO(3)$. Dexterous single-object manipulation can be seen as an essential primitive required to perform general-purpose rearrangement. This task requires mastery of contact-rich grasping and in-hand manipulation and presents an exciting challenge for robotics research. In this chapter we approach a simulated version of this problem, which allows us to develop methods for learning control policies that can later be modified to attempt sim-to-real transfer.
We formalize dexterous manipulation as a discrete-time sequential decision making process. At each step the controller observes the environment state $s_t \in \mathbb{R}^{N_{\text{obs}}}$ (which includes the target object pose) and yields an action $a_t \in \mathbb{R}^{N_{\text{dof}}}$ specifying the desired angles of arm and finger joints. Whenever the object state matches the target within a specified tolerance the attempt is considered successful and the target state is reset. If the object is dropped during the attempt, or if the target state is not achieved within a time period $\tau$, we consider this attempt failed. The simulation proceeds until $N_{\text{max}}$ consecutive successes are reached or until the first failure. The performance on the task can thus be measured as the number of consecutive successes within an episode, $N_{\text{succ}} \leq N_{\text{max}}$, as done in prior work [5]. Similar to Allshire et al. [3] we use keypoints to represent both the observed and the desired object pose. We detect a successful execution of the task when all of the $N_{\text{kp}}$ keypoints are within the tolerance threshold of their corresponding target locations.
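A minimal sketch of this keypoint-based success check over a batch of vectorized environments (tensor shapes and variable names are illustrative assumptions):

    import torch

    def check_success(obj_keypoints: torch.Tensor,
                      target_keypoints: torch.Tensor,
                      tolerance: float) -> torch.Tensor:
        """Return a boolean mask of environments whose current attempt is successful.

        obj_keypoints, target_keypoints: [num_envs, num_keypoints, 3] tensors (meters).
        Success requires *all* keypoints to be within `tolerance` of their targets.
        """
        dist = torch.norm(obj_keypoints - target_keypoints, dim=-1)   # [num_envs, num_keypoints]
        return (dist < tolerance).all(dim=-1)                         # [num_envs]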
4.2.4 Scenarios
We developed three variants of our object manipulation task that highlight different challenges in dex-
terous manipulation. These scenarios are regrasping, grasp-and-throw, and reorientation (Fig. 4.5). At the
beginning of each episode, the object appears in a random position on the table and the hand-arm system
is reset to a random state $s_{\text{robot}} \in \mathbb{R}^{2N_{\text{dof}}}$ consisting of random joint angular velocities and initial angles within the DoF limits.
The regrasping task demands that the agent grasp the object, pick it up from the table, and hold it in a
specified location for a duration of time, after which both object and target positions are reset. To succeed
in this scenario the control policy must develop stable grasps that minimize the probability of dropping
the object during the attempt.
In grasp-and-throw the task is to pick up the object and displace it into a container that can be outside of the manipulator's reach. This scenario requires aiming for the container and throwing the object a significant distance. Here we test the ability of the control policy to understand the dynamic aspects of object manipulation: successful execution of this task requires releasing the grip at the right point of the trajectory, giving the object just the right amount of momentum to direct it towards the goal. After each attempt we reset the positions of the object and the container to make sure that the policy is able to complete the task repeatedly for a diverse set of initial conditions.
The reorientation task provides perhaps the most difficult object manipulation challenge. The goal is
to grasp the object and consecutively move it to different target positions and orientations. This scenario
requires maintaining a stable grip for minutes of simulated time, fine control of the joints of the robotic
arm, and occasional in-hand rotation when the reorientation cannot be performed by merely using the
affordances of the Kuka arm. Thus formulated, the reorientation task includes elements seen in previous
work such as dexterous in-hand manipulation [5, 96], but extends the capability of the manipulator to
perform reposing in a much larger volume.
In both regrasping and reorientation the completion of the task requires $N_{\text{kp}}$ keypoints to be within the final tolerance $\varepsilon^* = 1$ cm of the target, thus requiring very precise control. For grasp-and-throw the required tolerance is 7.5 cm since the main focus is on landing the object in the container.
In contrast to prior work [5], we train and test on a large set of target objects with randomised proportions, from small to large, and from cubes to highly elongated parallelepipeds. Using a multitude of manipulation targets reduces the chance of overfitting to any particular object shape: it is evident from our experiments that no single grasping behavior works for all objects, which requires our policies to develop a wide range of object manipulation capabilities.
4.2.5 Dual-Arm Scenarios
In an attempt to find the limits of end-to-end learning for continuous control, we introduce versions of the regrasping and reorientation scenarios for two hand-arm systems. It is likely not a coincidence that humans, sculpted into a form optimised for object manipulation by evolution, wield not one but a pair of arms and hands. Thus solving object reposing with two high-DoF manipulators in simulation is a clear milestone on the path towards robotic systems that can mimic or exceed human dexterity.
To produce dual-arm scenarios we double the number of simulated robots per task for the total of 46
DoF (Fig. 4.5). We change the sampling of initial and target object positions in a way that guarantees that
the task cannot be solved by any one robot. Thus, a complete solution requires grasping the object, passing
the object from one hand to another, as well as in-hand manipulation, combining the most challenging
elements from every scenario. We extend both the observation (Tab. 4.2) and action space ($a_t \in \mathbb{R}^{2N_{\text{dof}}}$) to allow a single policy to control both manipulators.
4.2.6 Reinforcement Learning
We formalize the problem as a Markov Decision Process (MDP) where the agent interacts with the environment to maximize the expected episodic discounted sum of rewards $\mathbb{E}\left[\sum_{t=0}^{T}\gamma^t r(s_t, a_t)\right]$. We use the Proximal Policy Optimization algorithm [123] to simultaneously learn the policy $\pi_\theta$ and the value function $V^{\pi_\theta}(s)$, both parameterized by a single parameter vector $\theta$. The model architecture is an LSTM [50] followed by a 3-layer MLP. Even though agents can observe all relevant parts of the environment state, we decided to follow prior work [5, 96] and train recurrent models, because any future real-world deployment will necessarily involve partial observability and require memory for test-time system identification.
The policy is trained using experience simulated in Isaac Gym [79], a highly parallelized GPU-accelerated physics simulator. To process the high volume of data generated by this simulator we use an efficient PPO implementation [78] which keeps the computation graph entirely on the GPU. Combined with a minibatch size of $2^{15}$ transitions, this allows us to maximize hardware utilization and learning throughput.
We utilize normalization of observations, advantages, and TD-returns to make the algorithm invariant to the absolute scale of observations and rewards. We also use an adaptive learning rate algorithm that maintains a constant KL-divergence between the trained policy $\pi_\theta$ and the behavior policy $\pi_{\theta_{\text{old}}}$ that collected the rollouts.
Both regrasping and reorientation demand very precise control ($\varepsilon^* = 1$ cm). In order to create a smooth learning curriculum we adaptively anneal the tolerance from a larger initial value $\varepsilon_0 = 7.5$ cm. We periodically check if the policy crosses the performance threshold $N_{\text{succ}} > 3$, and in this case we decrease the current success tolerance, $\varepsilon \leftarrow 0.9\varepsilon$, until it reaches the final value.
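As a concrete illustration, this curriculum amounts to a few lines of Python (the constant names and the exact check interval are illustrative, not the actual training code):

# Sketch of the success-tolerance curriculum described above.
EPS_INIT = 0.075    # initial tolerance: 7.5 cm
EPS_FINAL = 0.01    # final tolerance: 1 cm
SUCC_THRESHOLD = 3  # consecutive successes required before tightening

def update_tolerance(eps, avg_consecutive_successes):
    """Shrink the success tolerance by 10% whenever the policy clears the threshold."""
    if avg_consecutive_successes > SUCC_THRESHOLD:
        eps = max(EPS_FINAL, 0.9 * eps)
    return eps

# Called periodically during training, e.g. once per policy evaluation:
# eps = update_tolerance(eps, n_succ)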
Reward function. For the successful application of the reinforcement learning method, the reward should be dense enough to facilitate exploration yet should not distract the agent from the sparse final objective (which in our case is the number of consecutive successful manipulations).
We propose a reward function that naturally guides the agent through a sequence of motions required to complete the task, from reaching for the object to picking it up and moving it to the final location:

$r(s,a) = r_{\text{reach}}(s) + r_{\text{pick}}(s) + r_{\text{targ}}(s) - r_{\text{vel}}(a). \quad (4.1)$
Here $r_{\text{reach}}$ rewards the agent for moving the hand closer to the object at the start of the attempt:

$r_{\text{reach}} = \alpha_{\text{reach}} \cdot \max(d_{\text{closest}} - d,\, 0), \quad (4.2)$

where both $d$ and $d_{\text{closest}}$ are distances between the end-effector and the object: $d$ is the current distance, and $d_{\text{closest}}$ is the closest distance achieved during the attempt so far. Component $r_{\text{pick}}$ rewards the agent for picking up the object and lifting it off the table:
$r_{\text{pick}} = (1 - \mathbb{1}_{\text{picked}}) \cdot \alpha_{\text{pick}} \cdot h_t + r_{\text{picked}}. \quad (4.3)$
In this equation $\mathbb{1}_{\text{picked}}$ is an indicator function which becomes 1 once the height of the object relative to the table $h_t$ exceeds a predefined threshold. At this moment the agent receives an additional sparse reward $r_{\text{picked}}$. Once the object is picked up, $r_{\text{targ}}$ rewards the agent for moving the object closer to the target state:
$r_{\text{targ}} = \mathbb{1}_{\text{picked}} \cdot \alpha_{\text{targ}} \cdot \max(\hat{d}_{\text{closest}} - \hat{d},\, 0) + r_{\text{success}}. \quad (4.4)$
Here $\hat{d}$ is the maximum distance between corresponding pairs of object and target keypoints, and $\hat{d}_{\text{closest}}$ is the closest such distance achieved during the attempt so far. A large sparse reward $r_{\text{success}}$ is added when all of the $N_{\text{kp}}$ keypoints are within a tolerance threshold of their target locations, meaning that reposing/reorientation is complete. Finally, $r_{\text{vel}}$ in Eq. 4.1 is a simple joint velocity penalty that can be tuned to promote smoother movement, and in equations 4.2, 4.3, 4.4 the coefficients $\alpha_{\text{reach}}, \alpha_{\text{pick}}, \alpha_{\text{targ}}$ are relative reward weights. Note that we apply the exact same reward function in all scenarios.
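The following PyTorch-style sketch summarizes Eqs. 4.1-4.4 for a batch of environments. Tensor names, the exact form of the velocity penalty, and its coefficient are illustrative assumptions; the reward weights correspond to the initial values in Tab. 4.3:

import torch

def manipulation_reward(d, d_closest, d_hat, d_hat_closest, obj_height,
                        picked, newly_picked, newly_succeeded, joint_vel,
                        alpha_reach=50.0, alpha_pick=20.0, alpha_targ=200.0,
                        r_picked=300.0, r_success=1000.0, alpha_vel=0.01):
    """Sketch of Eqs. 4.1-4.4. All arguments are [num_envs] tensors except joint_vel,
    which is [num_envs, num_dof]; picked, newly_picked, newly_succeeded are 0/1 indicators."""
    # r_reach (Eq. 4.2): progress of the end-effector towards the object
    r_reach = alpha_reach * torch.clamp(d_closest - d, min=0.0)
    # r_pick (Eq. 4.3): lifting reward before pick-up plus a one-time sparse bonus
    r_pick = (1.0 - picked) * alpha_pick * obj_height + r_picked * newly_picked
    # r_targ (Eq. 4.4): keypoint progress towards the target after pick-up, plus the success bonus
    r_targ = picked * alpha_targ * torch.clamp(d_hat_closest - d_hat, min=0.0) \
             + r_success * newly_succeeded
    # r_vel: joint-velocity penalty promoting smoother motion (assumed L2 form)
    r_vel = alpha_vel * joint_vel.norm(dim=-1)
    return r_reach + r_pick + r_targ - r_vel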
Overall, our reward formulation follows a sequential pattern: the reward components $r_{\text{reach}}$, $r_{\text{pick}}$ and $r_{\text{targ}}$ are mutually exclusive and do not interfere with each other. For example, by the time the hand approaches the object, the component $r_{\text{reach}}$ is exhausted since $d = d_{\text{closest}} = 0$, therefore $r_{\text{reach}}$ does not contribute to the reward for the remainder of the trajectory. Likewise, $r_{\text{pick}} \neq 0$ if and only if $r_{\text{targ}} = 0$ and vice versa. The fact that only one major reward component guides the motion at each stage of the trajectory makes it easier to tune the rewards and avoid interference between reward components. This allows us to avoid many possible local minima: for example, if $r_{\text{pick}}$ and $r_{\text{targ}}$ are applied together, depending on the relative reward magnitudes the agent might choose to slide the object to the edge of the table closer to the target location and cease further attempts to pick it up and get closer to the target for fear of dropping the object.
In addition to that, the rewards in equations 4.2 and 4.4 have a predefined maximum total value depending on the initial distance between the hand and the object, and between the object and the target, respectively. This eliminates an entire class of reward-hacking behaviors where the agent would remain close, but not quite at the goal, in order to keep collecting the proximity reward. In our formulation, only movement towards the goal is rewarded while mere proximity to the goal is not.
Table 4.2: Actor/critic observations and their dimensionality. Here $N_{\text{arm}} = 1$ for single-arm and $N_{\text{arm}} = 2$ for dual-arm tasks. $N_{\text{kp}} = 1$ for regrasping and grasp-and-throw tasks (since rotation tracking is not required), and $N_{\text{kp}} = 4$ for reorientation.

Input                                                         Dimensionality
Joint angles                                                  23D $\times N_{\text{arm}}$
Joint velocities                                              23D $\times N_{\text{arm}}$
Hand position                                                 3D $\times N_{\text{arm}}$
Hand rotation                                                 3D $\times N_{\text{arm}}$
Hand velocity                                                 3D $\times N_{\text{arm}}$
Hand angular velocity                                         3D $\times N_{\text{arm}}$
Fingertip positions                                           12D $\times N_{\text{arm}}$
Object keypoints rel. to hand                                 3D $\times N_{\text{kp}} \times N_{\text{arm}}$
Object keypoints rel. to goal                                 3D $\times N_{\text{kp}}$
Object rotation                                               4D (quaternion)
Object velocity                                               3D
Object angular velocity                                       3D
Object dimensions                                             3D
$\mathbb{1}_{\text{picked}}, d_{\text{closest}}, \hat{d}_{\text{closest}}$   3D
Total, one arm regrasping/throw                               92D
Total, one arm reorientation                                  110D
Total, two arms regrasping                                    165D
Total, two arms reorientation                                 192D
Observations and actions. In our experiments both the actor $\pi_\theta$ and the critic $V^{\pi_\theta}(s)$ observe the environment state directly, including joint angles and velocities, positions of fingertips, object rotation, velocity, and angular velocity. Tab. 4.2 lists all observations available to the agent and their corresponding dimensionalities.
The policy $\pi_\theta$ outputs two vectors $\mu, \sigma \in \mathbb{R}^{N_{\text{dof}} \cdot N_{\text{arm}}}$ which are used as parameters of $N_{\text{dof}} \cdot N_{\text{arm}}$ independent Gaussian probability distributions. Actions are sampled from these distributions, $a \sim \mathcal{N}(\mu, \sigma)$, normalized to the corresponding joint limits, and interpreted as target joint angles. A PD controller then yields joint torques that drive the joints to the target angles specified by the policy.
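A small sketch of how sampled actions are turned into joint targets and, eventually, torques. The affine rescaling, PD gains, and tensor names are illustrative assumptions; in practice the PD controller runs inside the simulator:

import torch

def act(mu, sigma, joint_lower, joint_upper, q, q_dot, kp=100.0, kd=5.0):
    """Sample target joint angles from the policy's Gaussian outputs and compute PD torques.
    mu, sigma, q, q_dot: [num_envs, num_dof]; joint_lower, joint_upper: [num_dof]."""
    a = torch.normal(mu, sigma).clamp(-1.0, 1.0)                             # raw action in [-1, 1]
    q_target = joint_lower + 0.5 * (a + 1.0) * (joint_upper - joint_lower)   # rescale to joint limits
    torque = kp * (q_target - q) - kd * q_dot                                # PD control towards the target
    return q_target, torque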
4.2.7 Population-Based Training
A contact-rich continuous control problem with up to 192 observation dimensions (Tab. 4.2) and up to 46 action dimensions can be exceptionally challenging even for modern RL algorithms [78]. The main challenge is exploration: of the large number of possible behaviors that maximize rewards early in training, only relatively few lead to high-performance solutions at convergence.
Another dimension of complexity when using learning methods is hyperparameter tuning. Any reward
shaping scheme contains a number of coefficients that need to be carefully balanced in order to maximize
the objective. In addition to that, modern RL algorithms have a substantial number of settings, such as
learning rate, number of epochs, magnitudes of different losses, and so on. Choosing these parameters can
be quite challenging and heavily relies on the expertise of engineers and researchers.
In order to mitigate these problems we employ a Population-Based Training (PBT) approach [55]. The core idea is akin to an evolutionary algorithm: we train a population of agents $\mathcal{P}$, perform mutation to generate promising hyperparameter combinations, and use selection to prioritize agents with the best performance. Each agent $(\theta_i, p_i) \in \mathcal{P}$ is characterized by a parameter vector $\theta_i$ and a set of hyperparameters $p_i$, which includes settings of the RL algorithm as well as the reward coefficients $\alpha_{\text{reach}}, \alpha_{\text{pick}}, \alpha_{\text{targ}}$, etc. (see Tab. 4.3). Periodically, each agent is evaluated to obtain the target performance metric $N_{\text{succ}}$ (number of consecutive successful manipulations). Algorithm 1 provides a high-level overview of PBT operation; note that we choose to mutate the hyperparameters $p$ even when the weights are not replaced. This allows the method to keep trying diverse hyperparameter combinations.
One advantage of PBT is that it introduces an outer optimization loop that can meta-optimize for a final sparse scalar objective, as opposed to the inner RL loop which balances various dense reward components. Although meta-optimizing for $N_{\text{succ}}$ is an obvious choice, the adaptive tolerance annealing described in Sec. 4.2.6 creates a complication: because it is easier to achieve a higher $N_{\text{succ}}$ with a higher tolerance, some agents will receive an unfair advantage early in the training.
Table 4.3: RL hyperparameters and reward function coefficients. The rightmost column shows final parameter values for a single PBT experiment, dual-arm reorientation (parameters not optimized by PBT are omitted).

Parameter                                              Initial value            PBT-optimized value
LSTM size                                              768                      -
MLP layers                                             [768, 512, 256]          -
Nonlinearity                                           ELU                      -
Discount factor $\gamma$                               0.99                     0.9888
GAE discount $\lambda$ [122]                           0.95                     -
Learning rate                                          Adaptive (Sec. 4.2.6)    -
Adaptive LR $D_{\mathrm{KL}}(\pi\,\|\,\pi_{\text{old}})$   0.016                0.01432
Gradient norm                                          1.0                      1.028
PPO-clip $\epsilon$                                    0.1                      0.2564
Critic loss coeff.                                     4.0                      5.188
Entropy coeff.                                         0                        -
Num. agents                                            8192                     -
Minibatch size                                         32768                    -
Rollout length                                         16                       -
Epochs per iter.                                       2                        1
$\alpha_{\text{reach}}$                                50                       74.8
$\alpha_{\text{pick}}$                                 20                       22.1
$r_{\text{picked}}$                                    300                      414.4
$\alpha_{\text{targ}}$                                 200                      263.5
$r_{\text{success}}$                                   1000                     1322.7
Figure 4.7: Training curves with and without PBT for single-arm + hand tasks (panels: Regrasping, Grasp-And-Throw, Reorientation, and Reorientation success tolerance; number of successes or success tolerance in cm vs. environment steps). Shaded area is between the best and the worst policy among 8 agents in $\mathcal{P}$ or 8 seeds in non-PBT experiments.
To address this issue, we define our meta-optimization objective as follows:

$r_{\text{meta}} = \dfrac{\varepsilon_0 - \varepsilon}{\varepsilon_0 - \varepsilon^*} + 0.01 \cdot N_{\text{succ}} \;\; \text{if } \varepsilon > \varepsilon^*, \qquad r_{\text{meta}} = 1 + N_{\text{succ}} \;\; \text{if } \varepsilon = \varepsilon^*. \quad (4.5)$

Until the target tolerance $\varepsilon^*$ is reached, this objective is dominated by the term $\frac{\varepsilon_0 - \varepsilon}{\varepsilon_0 - \varepsilon^*}$, after which $N_{\text{succ}}$ is prioritized.
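Eq. 4.5 translates directly into code; a sketch with hypothetical argument names:

def pbt_meta_objective(eps, n_succ, eps_init=0.075, eps_final=0.01):
    """Meta-objective of Eq. 4.5: dominated by tolerance-annealing progress until the final
    tolerance is reached, after which the number of consecutive successes takes over."""
    if eps > eps_final:
        return (eps_init - eps) / (eps_init - eps_final) + 0.01 * n_succ
    return 1.0 + n_succ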
Decentralized PBT. A trait that characterizes our implementation is a complete lack of any central orchestrator, typically found in other algorithms [105, 74]. Instead, we propose a completely decentralized PBT architecture. In our implementation, agents in $\mathcal{P}$ interact exclusively through low-bandwidth access to a shared network directory containing histories of checkpoints and performance metrics for each agent. The lack of any central controller not only removes a point of failure, but also allows training in a volatile compute environment such as a contested cluster where jobs can be interrupted or remain in queue for a long time. In this case, agents that start training later will be at a disadvantage when compared to other members of $\mathcal{P}$ that started earlier. To mitigate that, we compare the performance of agents that started later only to historical checkpoints of other agents that correspond to the same amount of collected experience.
Algorithm 1 Population-Based Training
Require: $\mathcal{P}$ (initial population; $\theta, p$ sampled randomly)
1:  for $(\theta, p) \in \mathcal{P}$ do (async. and decentralized)
2:      while not end of training do
3:          $\theta \leftarrow \text{train}(\theta, p)$                     ▷ Do RL for $N_{\text{iter}}$ steps
4:          $N_{\text{succ}} \leftarrow \text{eval}(\theta)$
5:          $(\theta^*, p^*) \sim \mathcal{P}_{\text{top}} \subset \mathcal{P}$   ▷ Get agent from top 30%
6:          if $N_{\text{succ}}$ in bottom 30% of $\mathcal{P}$ then
7:              $p \leftarrow \text{mutate}(p, p^*)$
8:              $\theta \leftarrow \theta^*$                                ▷ Replace weights
9:          else if $N_{\text{succ}}$ not in $\mathcal{P}_{\text{top}}$ then
10:             $p \leftarrow \text{mutate}(p)$
11:         end if
12:     end while
13: end for
14: return $\theta_{\text{best}} \in \mathcal{P}$                           ▷ Agent $\theta$ with the highest $N_{\text{succ}}$
4.2.8 Experiments
We conduct our experiments using instances with 8 CPU cores and a single NVIDIA V100 GPU with 16 GB of VRAM. Using the Isaac Gym simulator [79] we are able to simulate 8192 parallel environments on each GPU. Combined with a GPU-based RL implementation [78], this allows us to reach a training throughput of $5\times 10^4$ samples per second for single-arm tasks and $3.5\times 10^4$ samples per second for dual-arm scenarios. At this rate, training successful policies on $5\times 10^9$ environment transitions takes 30 hours for single-arm and 40 hours for dual-arm tasks (PBT and non-PBT training have almost equivalent per-node throughput).
For non-PBT experiments we simply train policies starting from multiple random seeds, each trained
on a single instance. In our PBT experiments we use eight separate 1-GPU instances that exchange in-
formation using low-bandwidth access to a shared directory (in our case, an NFS-mounted folder on a
Slurm/NGC cluster). Tab. 4.3 lists RL parameters and reward shaping coefficients used in our experiments.
Fig. 4.7 and Fig. 4.8 demonstrate the performance of PBT compared to regular single-GPU training. We find that PBT improves training substantially in all scenarios, and in three of them PBT becomes an enabling factor allowing the algorithm to reach non-trivial performance. For example, in the single-arm reorientation task (Fig. 4.7), without PBT none of the eight single-GPU training sessions reached the target tolerance $\varepsilon^*$. Of the three types of scenarios, reorientation relies on exploration the most: finding the correct approach to in-hand manipulation is essential, and there are many local optima to get stuck in. PBT excels at overcoming these challenges. It greatly amplifies the exploration capabilities of RL by directing computational resources towards promising agents and applying diverse hyperparameter combinations.
Figure 4.8: Training curves with and without PBT for dual-arm + hand tasks (panels: Dual-Arm Regrasping and Dual-Arm Reorientation; number of successes vs. environment steps). Shaded area is between the best and the worst policy among 8 agents in $\mathcal{P}$ or 8 seeds in non-PBT experiments.
Surprisingly, a brute-force end-to-end learning approach scales well even to dual-arm tasks, despite the significantly increased overall complexity of the control and exploration problems. Moreover, our PBT dual-arm agent performed better in the reorientation task, reaching almost 40 out of $N_{\max} = 50$ successes. The dual-arm task requires the agent to constantly pass the object from one hand to the other; as a result, the policy became more confident at juggling and tossing the object in-hand, while single-handed agents tend to rely on more conservative in-hand rotations. Some of the difference can be attributed to experimental variance as well: we noticed that some PBT agents occasionally discover much more advanced strategies. One such strategy is putting the object back on the table and changing its orientation by rotating it on the surface without any risk of dropping it, then picking it up and finishing the task. We demonstrate these and other learned behaviors in supplementary videos.
4.2.9 Conclusions
In this project, we demonstrate the ability of end-to-end deep RL to learn control policies for sophisticated
dexterous manipulators. To the best of our knowledge, this is the first time this level of control has been achieved
using a learning approach on a high-DoF hand-arm system with a multi-fingered end-effector. We consider
the solution to this problem an important step on the path towards real-world deployment of robotic
systems with human-level dexterity.
While our agents demonstrate strong performance in simulation, many additional obstacles need to
be overcome before practical applications are feasible. Our policies demonstrate aggressive control at the limits of the robot's capabilities, which can lead to equipment damage in the real world. One promising approach
is to utilize Riemannian Motion Policies [68] or Geometric Fabrics [144] to improve safety on the real robot
by imposing conservative motion priors.
One of the most important future research directions is closing the sim-to-real gap. One approach
that has been shown to improve sim-to-real transfer is the randomization of physical parameters during training.
Projects such as Dactyl [5, 96] have demonstrated that training policies with domain randomization is
particularly challenging and requires extensive computational resources. Our approach based on paral-
lelized physics simulation and high-throughput GPU-accelerated learning has the potential to significantly
reduce computational requirements for such experiments and make them accessible to a wider research
community.
Chapter 5
Sim-to-Real Applications
In this part of the thesis we come back to the sim-to-real protocol described in Chapter 1. We show how
accelerating learning in simulation is not an end in itself, but can actually be used to bootstrap ambitious
robotics projects with real-world deployment.
We discuss two projects in this chapter. The "Decentralized Control of Quadrotor Swarms" project (Sec. 5.1) applies the sim-to-real approach to coordinated quadrotor drone flight, while "DeXtreme" (Sec. 5.2) focuses on
achieving dexterous robotic manipulation with an anthropomorphic multi-fingered hand.
Much like in Chapter 4, one of the projects is based on a heterogeneous learning system with a CPU-
based simulator and accelerated by Sample Factory (Sec. 2.1), while the other is based on a homogeneous
GPU-accelerated learning system.
Parts of this chapter appeared in:
• Sumeet Batra*, Zhehui Huang*, Aleksei Petrenko*, Tushar Kumar, Artem Molchanov, and Gaurav S. Sukhatme. Decentralized Control of Quadrotor Swarms with End-to-end Deep Reinforcement Learning. In Conference on Robot Learning (CoRL), 2021.
• Ankur Handa*, Arthur Allshire*, Viktor Makoviychuk*, Aleksei Petrenko*, Ritvik Singh*, Jingzhou Liu*, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State. DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality. Under review at IEEE International Conference on Robotics and Automation (ICRA), 2023.
5.1 Decentralized Control of Quadrotor Swarms with End-to-end Deep Reinforcement Learning
5.1.1 Introduction
Teams of unmanned aerial vehicles are widely applicable to tasks such as area coverage and reconnaissance. State-of-the-art approaches for planning and control of drone teams require full state information and extensive
computational resources to plan in advance [51]. This greatly limits their applicability since existing
planning algorithms are often brittle in partially-observable environments and challenging to execute in
real time on embedded hardware. Many classical methods also suffer from the curse of dimensionality as
the number of possible configurations grows combinatorially with team size. In addition, kinodynamic
planners typically require a precise model of drone dynamics.
Here, we present a fully learned approach that uses a small amount of computation during execution
and relies exclusively on local observations, yet results in effective controllers for large-scale, swarm-like,
teams that are zero-shot transferable to real quadrotors. We take advantage of multi-agent deep rein-
forcement learning (DRL) to train quadrotor drones on hundreds of millions of environment transitions in
realistic simulated environments. We find neural network architectures and observation spaces that allow
our neural controllers to achieve high performance on a diverse set of tasks. Our experiments demon-
strate that drone swarms controlled by our neural network policies generate highly effective, smooth, and
virtually collision-free trajectories. We show simulated dynamic obstacle avoidance and demonstrate that
our learned controllers successfully execute tasks on physical drones in team sizes of up to 8 quadrotors. To
the best of our knowledge, this is the first approach that a) demonstrates scalable coordinated flight of
drone swarms in a realistic physical simulation achieved through end-to-end reinforcement learning (RL)
with direct thrust control, and b) demonstrates that the learned policies transfer to physical drone teams.
Figure 5.1: Drones are trained on formations randomly sampled from one of six pre-defined scenarios on
the left. Drones employ a self- and neighbor encoder that learns a mapping from a combination of proprioceptive and exteroceptive inputs to thrust outputs. The combined intermediate embeddings enable policies to perform high-speed, aggressive maneuvers and learn collision avoidance behaviors, e.g. for-
mation creation, formation swaps, evader pursuit. These are shown on the right both in simulation and
physical experiments.
We view this as progress towards hardware-agnostic real-world deployment of quadrotor swarms with real-time replanning capabilities.
5.1.2 Coordinated Quadrotor Flight: Related Work
Classical methods such as RRT [129] and PRM [59] can generate collision-free piece-wise linear trajec-
tories. These do not directly account for robot dynamics and scale poorly with team size, limiting their
applicability for online re-planning. To account for the dynamics, a smooth trajectory refinement [51]
or optimization [136] is performed. Kinodynamic planning methods take advantage of the known dy-
namics of individual robots and are more suitable for agile and aggressive flight [83]. It is also possible
to convert a geometric route to a dynamically feasible trajectory [111] resulting in an agile system or to
directly find dynamically feasible paths and refine them with gradient based optimization [147]. Prior
learning-based approaches include [2] which estimates reachable states given the quadrotor dynamics.
However, these methods require knowledge of the entire state space a priori and do not generalize well
to other high dimensional environments. Additionally, kinodynamic planning is typically intractable in
the multi-robot setting. Collision avoidance is particularly difficult where drones perform aggressive ma-
neuvers. Traditional techniques use observations from neighboring drones to constrain the geometric free
space or the action space of each drone. Prior work includes an $n$-body collision avoidance approach [16]
that constrains the velocity space of a robot using velocity information from neighboring robots. Another
approach utilizes pose information from neighbors to construct a Voronoi Cell, and each robot plans a
collision free trajectory through its respective cell [149]. The approach in [76] minimally constrains the
control space of an agent by constructing chance-constrained safety sets that account for measurement
noise and disturbances present in real world environments. While these approaches guarantee collision
free trajectories and are capable of running in realtime, with the exception of [76] (PrSBC), they signifi-
cantly constrain the configuration space of the robot, preventing aggressive maneuvers and resulting in
occasional deadlock. While PrSBC minimally constrains the control space of any given policy, it requires
a model of neighboring agents in order to construct chance-constrained safety sets. Our controllers rely
on a reward formulation with a high emphasis on collision avoidance and our policies are not constrained,
allowing for aggressive maneuvers in a variety of rich, dynamic environments.
Simulation trained neural networks have been used for single-agent and multi-agent quadrotor con-
trol to address shortcomings of kinodynamic planners. [112] and [116] both utilize Imitation Learning (IL)
from a global centralized planner, [116] further utilizes RL to balance the tradeoff between actions opti-
mal to the team and individuals. [128, 127] learn controllers robust to aerodynamic interactions. Graph
Neural Networks (GNNs) are promising architectures for swarm control and they have been proposed as
a parameterization for imitation learning [69, 135] and RL [62, 61] algorithms.
Reinforcement learning has shown promise in learning policies for UAV flight [88]. Deep RL [130]
has been used to learn minimum-time trajectory generation for quadrotors. Similar to [88] (but in a multi-
agent setting) we train from scratch using DRL via an end-to-end approach: the policies directly control
the motor thrust. This is similar to [89] which uses Soft Actor Critic to teach a single quadrotor how to
fly. Multi-Agent Reinforcement Learning (MARL) has recently been applied to autonomous driving, path
finding, and cooperative multi-agent control [126, 116, 45] and UAV team control (e.g. [98] – a new gym
environment that models drone dynamics, taking into account aerodynamic effects, and trains policies
for basic hovering and leader-following tasks). Khan et al. [63] train a centralized Q-network to solve
multi-agent motion planning problems. This approach has limited scalability as the input space of the Q-
network increases with the number of drones. We address scaling by relying only on local neighborhood
information both during training and inference.
5.1.3 Problem Formulation
We consider a team of $N$ quadrotor drones in a simulated 3D environment. Our task is to learn a control
policy that directly maps proprioceptive observations of an individual drone to motor thrusts with the goal
of minimizing the distances to positions in the desired formation while avoiding collisions. We analyze
the case of online decentralized control, i.e. instead of a centralized system solving the joint trajectory
optimization problem offline, we consider policies which simultaneously (and implicitly) plan trajectories
and control individual quadrotors in real-time. The decentralized approach assumes no access to the global
state during evaluation and scales better with the size of the swarm. In the real world, this is analogous
to individual quadrotors in the swarm generating their own action sequences using only on-board com-
putations. We demonstrate the efficacy of this approach by showing that our learned policies do indeed
transfer to a physical setting. Formally, the state of the environment at time $t$ is described by the tuple $S^{(t)} = (g_1^{(t)}, \ldots, g_N^{(t)}, s_1^{(t)}, \ldots, s_N^{(t)})$. Here $g_i^{(t)} \in \mathbb{R}^3$ are the goal locations for the quadrotors that together define the desired swarm formation at time $t$, and $s_i^{(t)}$ are the states of individual quadrotors. We describe the state of a single quadrotor by the tuple $(p, v, R, \omega)$, where $p \in \mathbb{R}^3$ is the position of the quadrotor relative to the goal location, $v \in \mathbb{R}^3$ is the linear velocity in the world frame, $\omega \in \mathbb{R}^3$ is the angular velocity in the body frame, and $R \in SO(3)$ is the rotation matrix from the quadrotor's body coordinate frame to the world frame. Each quadrotor is controlled by a learned policy $\pi_\theta(a^{(t)} \mid o^{(t)})$ which maps observations $o^{(t)}$ to Gaussian distributions over continuous actions $a^{(t)}$. To effectively maneuver and avoid
collisions, the quadrotors need to be able to measure their own position and orientation in space $s_i^{(t)}$, and the relative positions $\tilde{p}_{ij}^{(t)}$ and relative velocities $\tilde{v}_{ij}^{(t)}$ of their neighbors. We therefore represent each quadrotor's observations as $o_i^{(t)} = (s_i^{(t)}, \eta_i^{(t)})$, where $\eta_i^{(t)} = (\tilde{p}_{i1}^{(t)}, \ldots, \tilde{p}_{iK}^{(t)}, \tilde{v}_{i1}^{(t)}, \ldots, \tilde{v}_{iK}^{(t)})$ is a tuple containing the neighborhood information. Here $K \leq N-1$ is the number of neighbors that each individual quadrotor can observe. We use $K \ll N$ for larger swarms to improve scalability during both training and evaluation. Simulated quadrotors, similar to their real-world counterparts, are powered by motors that spin in one direction and generate only non-negative thrust. We thus transform actions $a^{(t)} \in \mathbb{R}^4$ sampled by the policy to control inputs $f^{(t)} \in [0,1]^4$ via $f^{(t)} = \frac{1}{2}\left(\text{clip}(a^{(t)}, -1, 1) + 1\right)$, where $f_{1\ldots4}^{(t)} = 0$ corresponds to no thrust, and $f_{1\ldots4}^{(t)} = 1$ corresponds to maximum thrust on motors $1\ldots4$.
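This action-to-thrust mapping is a one-liner; a NumPy sketch (array names assumed):

import numpy as np

def action_to_thrust(a):
    """Map a raw policy action a (4-vector) to normalized motor thrusts f in [0, 1]^4
    via f = 0.5 * (clip(a, -1, 1) + 1)."""
    return 0.5 * (np.clip(a, -1.0, 1.0) + 1.0)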
5.1.4 Simulation and Sim-to-Real Considerations
We train and evaluate our control policies in simulated environments with realistic quadrotor dynamics.
We adopt a simulation engine with drone dynamics [88], and augment it to support quadrotor swarms.
Virtual quadrotors are modeled on the Crazyflie 2.0 [37] – our physical demonstration platform. A key
feature of this engine is the model of hardware imperfections previously shown to facilitate zero-shot
sim-to-real transfer of stabilizing policies for single quadrotors [88]. Non-ideal motors are modeled using
motor-lag and thrust noise and the simulator provides imperfect observations, with noise injected into
position, orientation, and velocity estimations. Noise parameters are estimated from data collected on real
quadrotors. While motor and sensor noise create a challenging learning environment, they are instru-
mental to prevent overfitting to otherwise unrealistic idealized conditions of the simulator. To facilitate
the emergence of collision-avoidance behaviors, in addition to modeling dynamics, we simulate collisions
between individual quadrotors. In reality, collision outcomes depend on many factors, e.g. rigidity and
materials of the quadrotor frames, whether the blades of two colliding quadrotors touch, etc. Instead of
modeling these complex processes, we adopt a simple randomized collision model. When a collision be-
tween quadrotors is detected, we briefly apply random force and torque to both quadrotors with opposite
signs, preserving linear and angular momenta.
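A sketch of this collision response, assuming NumPy and illustrative magnitude bounds (the actual ranges and the duration of the random perturbation are not reproduced here):

import numpy as np

def resolve_drone_collision(rng, max_force=1.0, max_torque=0.05):
    """Sample a random force/torque pair for a detected quadrotor-quadrotor collision and
    apply it with opposite signs to the two drones, preserving linear and angular momenta."""
    force = rng.uniform(-max_force, max_force, size=3)
    torque = rng.uniform(-max_torque, max_torque, size=3)
    # (force, torque) is applied to drone i and (-force, -torque) to drone j for a few timesteps.
    return (force, torque), (-force, -torque)

# Example: rng = np.random.default_rng(0); wrench_i, wrench_j = resolve_drone_collision(rng)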
5.1.5 Training Setup
For training we use a policy gradient RL algorithm. In our setup, the learning algorithm updates the parameters $\theta$ of the policy $\pi_\theta(a^{(t)} \mid o^{(t)})$ to maximize the expected discounted sum of rewards

$\mathbb{E}_{\pi_\theta}\left[ \sum_{t'=t}^{T} \gamma^{t'-t} R(S^{(t')}, a^{(t')}) \right] \quad (5.1)$
For our experiments we choose Proximal Policy Optimization (PPO) [123]. In particular, we use the implementation from the high-throughput asynchronous RL framework "Sample Factory" [105] which supports multi-agent learning and enables large-scale experiments. In each training episode, the goal of every quadrotor in the swarm is to reach its designated position in the formation while avoiding collisions with the ground and with other quadrotors. In order to provide rich and diverse training experience, we train our policies in a mixture of scenarios featuring a variety of geometric 2D and 3D formations. Further training details are in Sec. 5.1.8. The reward function optimized by the RL algorithm for quadrotor $i$ consists of three major components:

$r^{(t)} = r_{\text{pos}}^{(t)} + r_{\text{col}}^{(t)} + r_{\text{aux}}^{(t)} \quad (5.2)$
Here, $r_{\text{pos}}^{(t)} = -\alpha_{\text{pos}} \big\| p_i^{(t)} \big\|_2$ rewards the quadrotors for minimizing the distance to their target locations. $r_{\text{col}}^{(t)}$ is responsible for penalizing collisions between quadrotors and is defined as:

$r_{\text{col}}^{(t)} = -\alpha_{\text{col}}\, \mathbb{1}_{\text{col}}^{(t)} - \alpha_{\text{prox}} \sum_{j=1}^{K} \max\left(1 - \frac{\big\|\tilde{p}_{ij}^{(t)}\big\|_2}{d_{\text{prox}}},\, 0\right) \quad (5.3)$

The indicator function $\mathbb{1}_{\text{col}}^{(t)}$ is equal to 1 for timesteps where a new collision involving the $i$-th quadrotor is detected. The second term represents the smooth proximity penalty with linear falloff distance $d_{\text{prox}}$. Here $\big\|\tilde{p}_{ij}^{(t)}\big\|_2$ is the distance between the centers of mass of the $i$-th and $j$-th quadrotors, and $d_{\text{prox}}$ in our experiments is double the size of the quadrotor frame, which encourages them to keep minimal distance in tight formations. In addition to the first two terms $r_{\text{pos}}^{(t)}$ and $r_{\text{col}}^{(t)}$ that convey our main objective, we also adopt an auxiliary reward function similar to [88] to facilitate the initial learning of stabilizing controllers:

$r_{\text{aux}}^{(t)} = -\alpha_\omega \big\|\omega_i^{(t)}\big\|_2 - \alpha_f \big\|f_i^{(t)}\big\|_2 + \alpha_{\text{rot}} R_{33}^{(t)} \quad (5.4)$

This penalizes high angular velocity, high motor thrusts, and large rotations about the horizontal ($x$- and $y$-) axes, respectively.
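A NumPy sketch of the per-drone reward of Eqs. 5.2-5.4 (coefficient values are placeholders rather than the tuned ones, and variable names are assumptions):

import numpy as np

def drone_reward(pos_rel_goal, neighbor_dists, collided, omega, thrusts, R,
                 d_prox=0.2, a_pos=1.0, a_col=1.0, a_prox=1.0,
                 a_omega=0.01, a_f=0.01, a_rot=0.1):
    """Reward for one quadrotor. pos_rel_goal: position p relative to the goal (3-vector);
    neighbor_dists: distances to the K visible neighbors; collided: bool; omega: body rates;
    thrusts: f in [0, 1]^4; R: 3x3 body-to-world rotation matrix."""
    r_pos = -a_pos * np.linalg.norm(pos_rel_goal)                                   # Eq. 5.2 distance term
    r_col = -a_col * float(collided) \
            - a_prox * np.sum(np.maximum(1.0 - neighbor_dists / d_prox, 0.0))       # Eq. 5.3
    r_aux = -a_omega * np.linalg.norm(omega) - a_f * np.linalg.norm(thrusts) \
            + a_rot * R[2, 2]                                                       # Eq. 5.4, R_33 term
    return r_pos + r_col + r_aux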
5.1.6 Model architectures
During training and evaluation, the quadrotors' actions $a$ are sampled from a parametric stochastic policy $\pi_\theta(a \mid o)$. We omit time indices and quadrotor identity for simplicity where possible. We represent $\pi_\theta$ with a Gaussian distribution, $a \sim \pi_\theta(a \mid o) \stackrel{d}{=} \mathcal{N}(\mu_a, \sigma^2)$, where the mean $\mu_a \in \mathbb{R}^4$ is a function of the quadrotor's observation at time $t$ parameterized by a feed-forward neural network, and the variance $\sigma^2 \in \mathbb{R}$ is a single learned parameter independent of the state. To compute the distribution means we embed the state of the quadrotor and its neighborhood before regressing $\mu_a$: $e_s = \phi_s(s)$, $e_\eta = f_\eta(s, \eta)$, $\mu_a = \phi_a(e_s, e_\eta)$. Here $e_s$ and $e_\eta$ are the embedding vectors that encode each quadrotor's own state and the state of its neighborhood respectively, $\phi_s$ and $\phi_a$ are fully-connected neural networks, and $f_\eta$ is the neighborhood encoder.
Figure 5.2: Detailed neural network architectures. Left: deep sets architecture used in simulation. Middle: attention architecture used in simulation. Right: smaller deep sets architecture deployed on the Crazyflie 2.0.
We analyze two types of neighborhood encoders: deep sets and attention-based (Sec. 5.1.7). The value function $V^\pi$ uses the same architecture as the policy, except that it regresses a single deterministic value estimate instead of the parameters of the action distribution. Weights are not shared between the $\pi$ and $V^\pi$ models.
5.1.7 Neighborhood Encoder
Deep sets. The task of the encoder $f_\eta$ is to generate a compact and expressive representation $e_\eta$ of each quadrotor's local neighborhood. Since the individual quadrotors are agnostic to the identity of their neighbors, this representation must be permutation invariant. In addition, scale invariance is a desirable property since the size $K$ of the observable neighborhood fluctuates over time, i.e. when a sufficient number of quadrotors in the formation move beyond the sensor range.
The deep sets architecture proposed by Zaheer et al. [146] has these required properties. We apply the same learned transformation $\psi_\eta$ to the observed features of each quadrotor in the neighborhood, after which a permutation-invariant aggregation function is applied to the neighbor embeddings $e_j$. We calculate the mean of the neighbor embeddings to achieve scale invariance: $e_j = \psi_\eta(\tilde{p}_{ij}, \tilde{v}_{ij})$, $e_\eta = \frac{1}{K}\sum_{j=1}^{K} e_j$. Not all neighbors are equally important for trajectory planning and decision making. For example, distant and stationary drones are less likely to influence a drone's behavior compared to closely located and fast-moving neighbors. The mean operation in the deep sets encoder does not allow it to convey the relative importance of different neighbors; this motivates a more sophisticated encoder architecture.
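A PyTorch sketch of the deep sets neighborhood encoder (the layer count and module names are assumptions; the per-neighbor input is the 6D relative position and velocity, and the hidden width matches the 256-unit tanh MLPs mentioned in this section):

import torch
import torch.nn as nn

class DeepSetsNeighborhoodEncoder(nn.Module):
    """Permutation- and scale-invariant encoder: e_eta = mean_j psi_eta(p_ij, v_ij)."""
    def __init__(self, neighbor_obs_dim=6, hidden=256):
        super().__init__()
        self.psi_eta = nn.Sequential(
            nn.Linear(neighbor_obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )

    def forward(self, neighbor_obs):
        # neighbor_obs: [batch, K, 6] relative positions and velocities of K neighbors
        e_j = self.psi_eta(neighbor_obs)   # per-neighbor embeddings: [batch, K, hidden]
        return e_j.mean(dim=1)             # mean over neighbors gives permutation and scale invariance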
Attention-based. The attention mechanism [137] provides a natural way to express the relative importance of individual neighbors. The attention-based neighborhood encoder is based on [23], adapted for quadrotors in 3D. The current quadrotor's state $s_i$ and the neighbor observations $(\tilde{p}_{ij}, \tilde{v}_{ij})$ for the $j$-th neighbor are used to compute the attention weights: $e_j = \psi_e(s_i, \tilde{p}_{ij}, \tilde{v}_{ij})$, $e_m = \frac{1}{K}\sum_{j=1}^{K} e_j$, $\alpha_j = \psi_\alpha(e_j, e_m)$. Here $e_j$ are the embedding vectors of individual neighbors, $e_m$ is the summary of the whole neighborhood, and $\psi_e$ and $\psi_\alpha$ are fully-connected neural networks. We use the softmax operation over $\alpha_j$ to compute the attention scores, which sum up to 1. The neighborhood embedding is thus produced as $e_\eta = \sum_{j=1}^{K} \text{Softmax}(\alpha_j)\, \psi_h(e_j)$, where $\psi_h$ represents an additional hidden layer. In both the deep sets and attention encoders, we used multi-layer perceptrons (MLPs) with 256 neurons and $\tanh$ activations. Additional details are provided in the supplementary materials.
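The attention-based variant can be sketched analogously (dimensions, layer counts, and module names are assumptions; only the structure of the computation mirrors the description above):

import torch
import torch.nn as nn

class AttentionNeighborhoodEncoder(nn.Module):
    """e_j = psi_e(s_i, p_ij, v_ij); e_m = mean_j e_j; alpha_j = psi_alpha(e_j, e_m);
    e_eta = sum_j softmax(alpha)_j * psi_h(e_j)."""
    def __init__(self, self_obs_dim=18, neighbor_obs_dim=6, hidden=256):
        super().__init__()
        self.psi_e = nn.Sequential(nn.Linear(self_obs_dim + neighbor_obs_dim, hidden), nn.Tanh())
        self.psi_alpha = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.psi_h = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())

    def forward(self, self_obs, neighbor_obs):
        # self_obs: [batch, self_obs_dim]; neighbor_obs: [batch, K, neighbor_obs_dim]
        K = neighbor_obs.shape[1]
        s = self_obs.unsqueeze(1).expand(-1, K, -1)                 # broadcast own state to each neighbor
        e_j = self.psi_e(torch.cat([s, neighbor_obs], dim=-1))      # [batch, K, hidden]
        e_m = e_j.mean(dim=1, keepdim=True).expand(-1, K, -1)       # neighborhood summary
        alpha = self.psi_alpha(torch.cat([e_j, e_m], dim=-1))       # unnormalized scores: [batch, K, 1]
        w = torch.softmax(alpha, dim=1)                             # attention weights sum to 1 over K
        return (w * self.psi_h(e_j)).sum(dim=1)                     # e_eta: [batch, hidden]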
5.1.8 Experiments and Results
We train our control policies in episodes, in diverse randomized scenarios with static and dynamic forma-
tions. Our virtual experimental arena is a $10 \times 10 \times 10$ m room. At the beginning of each episode, we
randomly spawn the quadrotors in a 3 m radius around the central axis of the room, at a height between
0.25 and 2 m. We randomly initialize their orientation, linear and angular velocities to facilitate learning
robust recovery behaviors. To provide a diverse and challenging training environment, we procedurally
generate scenarios of different types, listed below.
Static formations. The target formation is fixed throughout the episode. The formation takes various geometric shapes, e.g. a 2D grid, circle, cylinder, and cube (Fig. 5.1). The separation $r$ between goals in the formation is chosen randomly. A special case $r = 0$, where the goal locations for all quadrotors coincide, demands very dense configurations with high probabilities of collisions. We refer to this task as the same goal scenario.
Figure 5.3: (Left) Evader pursuit, (Middle) $N = 16$ quadrotors in a dense formation after fine-tuning (see
Sec. 5.1.11), and (Right) A swarm breaking formation to avoid a collision with an obstacle. Videos of the
learned swarm behaviors in different scenarios are at: https://sites.google.com/view/swarm-rl.
Figure 5.4: Comparison of different model architectures for $N = 8$ quadrotors with $K = 6$ neighbors visible to each (panels: total reward, average distance to the target, flight performance, and average collisions between drones vs. simulation steps; architectures: blind agent, MLP, deep sets, attention). For each architecture we show the mean and standard deviation of four independent training runs.
Dynamic formations. We modify the separation between the quadrotors, and the position of the
formation origin. We explore gradually shrinking the inter-quadrotor separation over time, and randomly
teleporting the formation around the room. To train policies to avoid head-on collisions at high speed, we
include scenarios where the swarm is split into two groups. We swap the target formations of quadrotors
in these two groups several times per episode, which requires two teams of quadrotors to fly ’through’
each other. We refer to this as the swarm-vs-swarm scenario.
Evader pursuit. For the evader pursuit task, the team is given a shared goal that moves according to
some policy (simulating an evader that the team must pursue). We use two evader trajectory parameteri-
zations: a 3D Lissajous curve and a randomly sampled Bezier curve.
5.1.9 Model Architecture Study
We compare training performance with different neighborhood encoders with $N = 8$ quadrotors (Fig. 5.4) and a fixed number $K = 6$ of visible neighbors. In addition to the architectures described in Sec. 5.1.7 we train two baselines. The first is a blind quadrotor, for which we remove the neighborhood encoder $f_\eta$ entirely. While blind quadrotors get close to their targets, they are not able to avoid collisions with each other. The second uses a plain multi-layer perceptron, which concatenates neighbor observations as input.
This policy, which is not permutation invariant, fails to avoid collisions in most scenarios, suggesting that a
permutation-invariant architecture is needed. The difference between the attention and deep sets encoders
is most prominent in tasks that require dense swarm configurations, such as the same goal scenario. In
addition, quadrotors with the deep sets encoder do not get as close to their target locations, sacrificing
formation density to minimize collisions. Since the attention-based architecture demonstrated the best
performance in both goal reaching and collision avoidance, we use it in all further simulated experiments.
5.1.10 Attention Weights Study
We investigate the results of training an attention-based architecture for encoding relative scores of neigh-
boring drones. We ask whether the attention mechanism learns to assign higher attention scores to neigh-
bors that are closer and whose velocity vector points towards the current agent. In addition, we investigate
to what degree distance and velocity individually affect the scores. We modify the swarm-vs-swarm sce-
nario to contain two teams of two drones whose goals are 1 m apart and situated in the same horizontal
plane. The drones are allowed to settle at their respective goals following which the goals are swapped. We
take a snapshot of the experiment and record the softmax attention weights for each drone. We manually
set the relative velocities of all neighbors to (0,0,0) for each drone and pass the modified observations
to the attention encoder. The results (Fig. 5.5) show that the red quadrotor assigns the highest attention
weight of 0.61 to the blue quadrotor, which is on a collision course with it. Similarly, the blue quadrotor
Figure 5.5: Attention weights between drones (red, grey, green, blue). Left: original velocities; right: velocities set to 0.
# agents    Collisions per minute per drone    Distance to target, m    Collisions per minute per drone (+200M training)    Distance to target, m (+200M training)
8           0.02                               0.42                     0.00                                                0.41
16          0.09                               0.57                     0.08                                                0.63
32          0.90                               0.86                     0.16                                                0.81
48          2.43                               1.11                     0.23                                                0.92
64          5.36                               1.70                     0.29                                                1.12
128         8.63                               4.32                     1.37                                                3.05

Table 5.1: Scaling up the attention policy with and without additional training. The size of the visible local neighborhood is fixed at $K = 6$ drones. The metrics are averaged over 20 episodes.
assigns the highest weight of 0.57 to its red neighbor. For the gray and green quadrotors, all neighbors
are assigned a roughly equal weighting, with neighbors closer in distance having slightly higher weights.
When the relative velocity observations are manually set to 0 and fed to the attention encoder, we observe
a drastically different, much more uniform distribution of attention weights. We conclude that neighbor-
ing quadrotors with small relative distance and high relative velocity vectors in the direction of the viewer
are prioritized over drones further away with velocities in other directions. Drones with relative velocities
pointing towards the viewer seem to be prioritized higher than drones that are closer but with velocity
vectors away from the viewer, implying that velocity is considered more important than distance.
5.1.11 Scaling
We investigate the ability of our policies to scale to larger swarm sizes without re-training from scratch. We introduce a second, fixed-cost training step in which policies trained with $N = 8, K = 6$ are trained for an additional $2 \times 10^8$ steps with the new target number of quadrotors $N$ and the same $K = 6$, i.e. the baseline policies are copied and tuned separately in the environment with the larger swarm size. The results (Tab. 5.1) show that without additional training, the number of collisions increases with the number of quadrotors because the state distribution changes significantly compared to the 8-drone case. Additional tuning has a significant positive effect. Even with 128 drones, the quadrotors can avoid collisions in dynamic environments, and the higher number of collisions is largely explained by cascade effects: when a collision happens, it affects multiple drones. The additional tuning amounts to only 20% of the original training session ($\sim$4 hours).
5.1.12 Obstacle Avoidance
We experiment with a harder version of the environment by introducing a spherical obstacle moving through the formations at random angles multiple times throughout the episode. At the beginning of each episode, we randomly sample the obstacle size, as well as its velocity and the parameters of its trajectory. To incorporate obstacles into our training protocol, we augment the quadrotor observations $o_i^{(t)}$ to contain information about the obstacle state $\zeta_i^{(t)}$. This information includes the radius $\hat{r}$ of the obstacle, and its position $\hat{p}_i^{(t)}$ and velocity $\hat{v}_i^{(t)}$ relative to the $i$-th quadrotor. We process the obstacle observations with an additional MLP $\phi_o$ to produce the obstacle embedding $e_o$, which is used in conjunction with the neighborhood encoder to generate the action distributions (see Sec. 5.1.6 for details): $e_o = \phi_o(\zeta)$, $\mu_a = \phi_a(e_s, e_\eta, e_o)$. The collision physics and the penalties are modeled in the same way as for quadrotor-vs-quadrotor collisions. Fig. 5.7 shows the training performance in obstacle avoidance scenarios. Despite the increased complexity, we achieve performance comparable to the baseline, keeping dense formations close to the target locations.
Figure 5.6: Time lapse of eight drones swapping goals (left) and performing a formation change (right).
Figure 5.7: Learning to maintain dense formations while avoiding collisions with moving external obstacles (average distance to the target and average collisions between the obstacle and drones per minute vs. simulation steps).
5.1.13 Additional Baselines
Classical trajectory optimization and control methods have been proposed for tasks similar to ours. [109]
uses graph search in discretized space followed by trajectory smoothing to switch formations with large
teams of quadrotors. It relies on full prior information about the (static) environment and an offline op-
timization process taking up to tens of seconds. We test our controller in randomized dynamic environ-
ments without access to global information, which makes it hard to make a direct comparison between
our method and [109]. Graph Neural Networks [69, 135] and RL [62, 61] are prime candidates for an alter-
native model architecture. GNNs can capture global information about the swarm while relying only on
local communication by performing multiple consecutive graph convolution operations and communicat-
ing intermediate representations between adjacent robots. While this enables decentralized operation, the
reliance on multiple message exchanges between UAVs on each step is prohibitively expensive for nano-
quadrotor platforms such as the Crazyflie. Our controllers run at 500 Hz on the real drones, which makes
the communication protocol latency requirements exceptionally tight for multi-layer GNNs. Nonetheless,
GNNs remain an attractive option for more powerful platforms or tasks that do not require high-frequency
reactive control. [149] uses Buffered Voronoi Cells to compute safe regions around quadrotors (with mar-
gins between cells to account for kinodynamic constraints) and a PID controller to achieve positions within
safe regions that are the closest to the target. This method relies only on local neighborhood information.
We implement it on our hardware platform and compare its performance to our method (Sec. 5.1.14).
5.1.14 Physical Deployment
We deployed our policies on the Crazyflie 2.0, a small, lightweight, open-source quadrotor platform. We
tested the neural controller in several scenarios: hovering in close proximity to a shared goal (same
goal), following a moving target, maintaining dynamic geometric formations, and flying through a team
of moving drones (swarm-vs-swarm), with up to 8 quadrotors in the latter experiments (Fig. 5.6). Video demonstrations are available at https://sites.google.com/view/swarm-rl.
Figure 5.8: (Left) Two teams of four quadrotors swap goals twice and then land at their original starting positions. (Middle) Four quadrotors follow a moving goal, starting from the bottom right going counter-clockwise. The goal positions are marked with red X's. Both plots show 3D positions over time of the four quadrotors. (Right) Two teams of four quadrotors swap goals using PID controllers and buffered Voronoi cells for collision avoidance. (Panel titles: Swap Goals (NN) and Swap Goals (Voronoi Cells + PID); axes $x$, $y$, $z$ in meters.)
The Crazyflie 2.0 is a low-power nano-quadrotor platform where on-board computation is provided by a microcontroller with 168 MHz and 192 KB of RAM. To run a team-aware neural policy on such limited hardware we trained a much smaller version of our deep sets model with only 16 and 8 neurons in the hidden layers of the self- and neighbor encoders respectively. Surprisingly, even such tiny policies with $\sim 10^3$ parameters performed well on real quadro-
tors. We use a Vicon system to provide position and velocity updates at 100 Hz with an added low-pass
exponential filter for neighbor positions to reduce noise. Each drone’s controller runs at 500 Hz incorpo-
rating the IMU measurements atop the latest available Vicon data. Despite the fact that downwash was
not explicitly modeled in the simulator, our policies recover from the aerodynamic disturbances caused by
the proximity of other drones. We observed recovery from non-destructive collisions with each other or
with the ground wherein drones re-stabilize after collisions with teammates, and "bounce back" from the
ground and resume flight.
For comparison, we implemented an online planning and collision avoidance algorithm [149], based
on buffered Voronoi cells and PID control, to understand how our controllers behave compared to a tra-
ditional approach. We tested with 8 drones on the swarm-vs-swarm scenario. The trajectories generated
by the classical algorithm are longer than the ones produced by our neural network controller. Although
there were no restrictions on the configuration space, we observed that the classical method generates
trajectories predominantly in a single 2D $xy$-plane, whereas the neural controllers utilize all 3 spatial dimensions (Fig. 5.8). Finally, we observe that our controllers execute more aggressive maneuvers, reaching goals faster with a max velocity (acceleration) of up to 4 m/s (7 m/s$^2$), compared to a max of 1 m/s (3 m/s$^2$) for the classical method. Additional details are provided in the supplementary materials.
5.1.15 Conclusion
Our results demonstrate that drones trained with deep reinforcement learning can achieve strong goal-
reaching and collision avoidance performance across a diverse range of training scenarios with realistic
quadrotor dynamics. We present evidence of successful swarm control in simulation and demonstrate
the zero-shot transfer of policies learned in simulation to the Crazyflie platform, on which we are able to
perform successful trials on multiple tasks. Our policies learn to fly from scratch, without the use of tuned
PID controllers. Our method is thus model-agnostic, i.e. we can learn policies for quadrotors with different
physical parameters (e.g. mass, size, inertia matrix, thrust) by simply re-running the training in the updated
simulator. In contrast with classical planning methods, we do not introduce any constraints on the velocity
or acceleration, which allows the controller to take advantage of the full capabilities of the quadrotor. This enables
agile flight with aggressive maneuvers. Our pursuit-evasion experiments are the most representative of
this. In order to stay close to the fast evader, simulated quadrotors exceed speeds of 7 m/s and reach
accelerations of up to 1.7g. Permutation- and scale-agnostic model architectures used in our method allow us
to switch between different team sizes. We found that after additional training to adjust to the larger team
size, our policies can control swarms of up to 128 members without a significant increase in computation
burden per quadrotor.
5.2 DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality
Figure 5.9: The DeXtreme system using an Allegro Hand in action in the real world.
In the previous section we demonstrated successful sim-to-real transfer in the domain of quadro-
tor swarm coordination. This result was achieved with an asynchronous learning system "Sample Fac-
tory" [105] through efficient parallelization of an inherently scalar (i.e. simulating one scene at a time)
CPU-based flight simulator. Since CPU-based simulation and GPU-accelerated inference and learning are
performed on different computational devices, it is crucial to use an asynchronous algorithm with double-
buffered sampling that can reach high enough throughput to enable large-scale learning.
The advent of vectorized GPU-accelerated physics engines like Isaac Gym [79] has enabled a different
approach to the design of reinforcement learning systems. Since all major computational tasks (simulation,
inference, and backpropagation) are now performed on the GPU, it is actually more efficient to follow a
simplistic synchronous and sequential design of the learning algorithm: with large enough batch sizes we
can simply chain all computational steps in a sequence and achieve very high GPU utilization.
In this section we show how this approach, manifested in the system called "DeXtreme" [47], can be
used to train policies to perform object reorientation with a multi-fingered Allegro Hand and transfer
these policies to the real world. We emphasize the importance of high-throughput vectorized physics
simulation [79], entirely GPU-side implementation of PPO [123, 78], and vectorized automatic domain
randomization (VADR) engine that allows us to keep very high simulation performance while generating
thousands of environments with diverse physics parameters.
5.2.1 Robotic Manipulation with Multi-Fingered Hands: Introduction
Multi-fingered robotic hands offer an exciting platform to develop and enable human-level dexterity. Not
only do they provide kinematic redundancy for stable grasps, but they also enable the repertoire of skills
needed to interact with a wide range of day-to-day objects. However, controlling such high-DoF end-
effectors has remained challenging. Even in research, most robotic systems today use parallel-jaw grippers.
In 2018, OpenAI et al. [97] showed for the first time that multi-fingered hands with a purely end-to-end
deep-RL based approach could endow robots with unprecedented capabilities for challenging contact-rich
in-hand manipulation. However, due to the complexity of their training architecture, and the sui generis
nature of their work on sim-to-real transfer, reproducing and building upon their success has proven to
be a challenge for the community. Recent advancements in in-hand manipulation with RL have made
progress with multiple objects and an anthropomorphic hand [24], but those results have only been in
simulation.
We build on top of the prior work in [97]. We use a comparatively affordable Allegro Hand with a
locked wrist and four fingers, using only position encoders on servo motors; the Shadow Hand used in
OpenAI’s experiments costs an order of magnitude more than the Allegro Hand. Furthermore, we use the
GPU-based Isaac Gym physics simulator [79] as opposed to the CPU-based MuJoCo [134], which allows us
to reduce the amount of computational resources used and the complexity of the training infrastructure.
Our best models required only 8 NVIDIA A40 GPUs to train, as opposed to OpenAI’s use of a CPU cluster
composed of 400 servers with 32 CPU cores each, as well as 32 NVIDIA V100 GPUs [96] (compute require-
ments for block reorientation). Our more affordable hand, in combination with the simple vision system
architecture and accessible compute, dramatically simplifies the process of developing and deploying agile
and dexterous manipulation.
5.2.2 Reorientation Task
We propose a method for performing object reorientation on an anthropomorphic hand. Initially the
object to be manipulated is placed on the palm of the hand and a random target orientation is sampled in
SO(3). The policy then orchestrates the motion of the fingers so as to bring the object to its desired target
orientation.
Similar to OpenAI et al. [97], if the object orientation is within a specified threshold of 0.4 radians of
the target orientation, we sample a new target orientation. This process is repeated until the robot drops
the object or gets stuck. Our performance metric is the number of consecutive successes achieved in one
uninterrupted experiment. Fig. 5.10 shows the reorientation task setup both in simulation and in the real
world. Please also refer to https://dextreme.org for video demonstrations.
Figure 5.10: High level overview of the training and inference systems. (a) Policy training. (b) Vision data generation and training pipeline. (c) Functioning in the real world.
5.2.3 Reinforcement Learning in Simulation
The task of manipulating the cube to the desired orientation is modelled as a sequential decision making
problem where the agent interacts with the environment in order to maximise the sum of discounted
rewards from any given moment to the end of the episode. In our case, we formulate it as a discrete-
time, partially observable Markov Decision Process (POMDP). We use the Proximal Policy Optimisation (PPO)
algorithm [123] to learn a parametric stochastic policy π_θ (actor), mapping from observations o ∈ O to
actions a ∈ A. PPO additionally learns a function V^π_ϕ(s, o) (critic) to approximate the on-policy value
function. Following Pinto et al. [108], the critic does not take in the same observations as the actor, but
receives additional observations including the state s ∈ S of the POMDP. The actor and critic observations
are detailed in Tab. 5.2.
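To make the asymmetric actor-critic setup concrete, a minimal sketch is shown below (assuming PyTorch; the actual model in our work is LSTM-based and trained by rl-games, so this simplified feed-forward version is illustrative only):

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Actor consumes only observations available on the real robot (50D);
    the critic also receives privileged simulator state (265D total)."""

    def __init__(self, actor_obs=50, critic_obs=265, act_dim=16, hidden=512):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(actor_obs, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )
        self.critic = nn.Sequential(
            nn.Linear(critic_obs, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, actor_obs, critic_obs):
        action_mean = self.actor(actor_obs)   # mean of the Gaussian action distribution
        value = self.critic(critic_obs)       # value estimate from privileged state
        return action_mean, value

model = AsymmetricActorCritic()
mu, v = model(torch.zeros(1, 50), torch.zeros(1, 265))
```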
We use a high-performance PPO implementation from rl-games [78] to train an LSTM-based [50] policy
and value function. This PPO implementation, crucially, keeps all intermediate data and calculations
entirely on the GPU, thus eliminating costly CPU-GPU communication. Combined with fully GPU-side
simulation in IsaacGym, this enables high hardware utilization and throughput.
The action space A of our policy is the PD controller target for each of the 16 joints (4 per finger) of
the robot hand. The output of the policy is low-pass filtered with an exponential moving average (EMA)
smoothing factor.
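As an illustration, the EMA filter applied to the policy's joint targets can be written as follows (a sketch with a hypothetical smoothing factor; the convention for which term the factor weights may differ in the actual implementation):

```python
import numpy as np

def ema_filter(prev_target, new_target, alpha=0.9):
    """Low-pass filter for PD targets: larger alpha keeps more of the previous
    target, yielding smoother but slower finger motion."""
    return alpha * prev_target + (1.0 - alpha) * new_target

targets = np.zeros(16)                             # 16 joint targets, 4 per finger
for _ in range(5):                                 # a few control steps
    raw_action = np.random.uniform(-1.0, 1.0, 16)  # stand-in for the policy output
    targets = ema_filter(targets, raw_action)
```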
The reward function of the POMDP used to train reorientation policies is described in Tab. 5.3.
5.2.4 Domain Randomisation
It is widely known that there is a "sim-to-real" gap between physics simulators and real life [56]. Com-
pounding this is the fact that the robot as a system can change from day to day (e.g., due to wear-and-tear)
and even from timestep to timestep (e.g., stochastic noise). To help overcome this, we introduce various
kinds of randomisations [103] into the simulated environment as listed in Tab. 5.4.
Input Dimensionality Actor Critic
Object position 3D ✓ ✓
Object orientation 4D (quaternion) ✓ ✓
Target position 3D ✓ ✓
Target orientation 4D (quaternion) ✓ ✓
Relative target orientation 4D (quaternion) ✓ ✓
Last actions 16D ✓ ✓
Hand joints angles 16D ✓ ✓
Stochastic delays 4D × ✓
Fingertip positions 12D × ✓
Fingertip rotations 16D (quaternions) × ✓
Fingertip velocities 24D × ✓
Fingertip forces and torques 24D × ✓
Hand joints velocities 16D × ✓
Hand joints generalised forces 16D × ✓
Object scale, mass, friction 3D × ✓
Object linear velocity 3D × ✓
Object angular velocity 3D × ✓
Object position with noise 3D × ✓
Object rotation with noise 4D × ✓
Random forces on object 3D × ✓
Domain randomisation params 78D × ✓
Gravity vector 3D × ✓
Rotation distances 2D × ✓
Hand scale 1D × ✓
Table 5.2: Observations of the policy and value networks. The input vector is 50D in size for policy and
265D for the value function.
Reward Formula Weight Justification
Rotation Close to Goal 1/(d + 0.1) 1.0 Shaped reward to bring cube close to goal
Position Close to Fixed Target ||p_object − p_goal|| -10.0 Encourage the cube to stay in the hand
Action Penalty ||a||² -0.001 Prevent actions that are too large
Action Delta Penalty ||targ_curr − targ_prev||² -0.25 Prevent rapid changes in joint target
Joint Velocity Penalty ||v_joints||² -0.003 Stop fingers from moving too quickly
Reset Reward Condition Value
Reach Goal Bonus d < 0.1 250.0 Large reward for getting the cube to the target
Table 5.3: Reward terms are computed, multiplied by their weight, and summed to produce the reward
at each timestep. d represents the rotational distance from the object’s current to the target orientation.
p_object and p_goal are the positions of the object and goal respectively. a is the current action. targ_curr and
targ_prev are the current and previous joint position targets. v_joints is the current joint velocity vector.
Parameter Type Distribution Initial Range ADR-Discovered Range
Hand
Mass Scaling uniform [0.4, 1.5] [0.4, 1.5]
Scale Scaling uniform [0.95, 1.05] [0.95, 1.05]
Friction Scaling uniform [0.8, 1.2] [0.54, 1.58]
Armature Scaling uniform [0.8, 1.02] [0.31, 1.24]
Effort Scaling uniform [0.9, 1.1] [0.9, 2.49]
Joint Stiffness Scaling loguniform [0.3, 3.0] [0.3, 3.52]
Joint Damping Scaling loguniform [0.75, 1.5] [0.43, 1.6]
Restitution Additive uniform [0.0, 0.4] [0.0, 0.4]
Object
Mass Scaling uniform [0.4, 1.6] [0.4, 1.6]
Friction Scaling uniform [0.3, 0.9] [0.01, 1.60]
Scale Scaling uniform [0.95, 1.05] [0.95, 1.05]
External Forces Additive Refer to [97] – –
Restitution Additive uniform [0.0, 0.4] [0.0, 0.4]
Observation
Obj. Pose Delay Prob. Set Value uniform [0.0, 0.05] [0.0, 0.47]
Obj. Pose Freq. Set Value uniform [1.0, 1.0] [1.0, 6.0]
Obs Corr. Noise Additive gaussian [0.0, 0.04] [0.0, 0.12]
Obs Uncorr. Noise Additive gaussian [0.0, 0.04] [0.0, 0.14]
Random Pose Injection Set Value uniform [0.3, 0.3] [0.3, 0.3]
Action
Action Delay Prob. Set Value uniform [0.0, 0.05] [0.0, 0.31]
Action Latency Set Value uniform [0.0, 0.0] [0.0, 1.5]
Action Corr. Noise Additive gaussian [0.0, 0.04] [0.0, 0.32]
Action Uncorr. Noise Additive gaussian [0.0, 0.04] [0.0, 0.48]
RNAα Set Value uniform [0.0, 0.0] [0.0, 0.16]
Environment
Gravity (each coord.) Additive normal [0, 0.5] [0, 0.5]
Table 5.4: Domain randomisation parameter ranges for policy learning
Vectorised Automatic Domain Randomisation. We set the parameters of the domain randomisations
via a vectorised implementation of Automatic Domain Randomisation (ADR, introduced in [96]). ADR
automatically adjusts the range of each domain randomisation to keep it as wide as possible while keeping
policy performance above a certain threshold. Training therefore starts with little randomisation, which
enables behaviour exploration, and ends with policies that are robust and adaptive to the widest range of
environment conditions reached during training, which improves sim-to-real transfer. We implement a
vectorised variant of the ADR algorithm, which we call Vectorised Automatic Domain Randomisation (VADR):
instead of running a large number of parallel CPU-based simulators with different settings, we use the
capabilities of IsaacGym to instantiate up to 2^14 diverse environments
on a single GPU (please see Handa et al. [47] for additional details).
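The core boundary-update logic of ADR can be sketched as follows (thresholds, step sizes, and data structures here are illustrative assumptions rather than the exact algorithm of [96, 47]; in VADR this bookkeeping is applied to vectors of environments resident on the GPU):

```python
import numpy as np

class ADRParam:
    """One randomisation dimension whose [lo, hi] range is widened when
    boundary-sampled environments perform well and narrowed when they do not."""

    def __init__(self, lo, hi, lo_limit, hi_limit, step=0.01):
        self.lo, self.hi = lo, hi
        self.lo_limit, self.hi_limit = lo_limit, hi_limit
        self.step = step

    def update(self, perf_lo, perf_hi, expand=20.0, shrink=5.0):
        # perf_* is e.g. mean consecutive successes of environments pinned
        # to the current lower/upper boundary value of this parameter.
        if perf_hi > expand:
            self.hi = min(self.hi + self.step, self.hi_limit)
        elif perf_hi < shrink:
            self.hi = max(self.hi - self.step, self.lo)
        if perf_lo > expand:
            self.lo = max(self.lo - self.step, self.lo_limit)
        elif perf_lo < shrink:
            self.lo = min(self.lo + self.step, self.hi)

    def sample(self, n_envs):
        # Regular (non-boundary) environments draw uniformly from the range.
        return np.random.uniform(self.lo, self.hi, size=n_envs)

friction = ADRParam(lo=0.8, hi=1.2, lo_limit=0.01, hi_limit=2.0)
friction.update(perf_lo=25.0, perf_hi=3.0)   # widen the lower bound, shrink the upper
scales = friction.sample(16384)              # one value per simulated environment
```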
Physics Randomisations. We apply physics randomisations to account for both changing real-world
dynamics and the inevitable gaps between physics in simulation and reality. These include basic properties
such as mass, friction and restitution of the hand and object. We also randomly scale the hand and object to
avoid over-reliance on exact morphology. On the hand, joint stiffness, damping, and limits are randomised.
Furthermore, we add random forces to the cube in a similar fashion to [97].
Joint stiffness, joint damping, and joint motor max torque ("effort") are scaled using the value sampled
directly from the ADR-given uniform distribution. Mass and scale of the object and the hand are randomised
within a fixed range due to current API limitations. Gravity cannot be randomised per-environment in
the current Isaac Gym API, but a new gravity value is sampled every 720 concurrent simulation steps for
all environments to reduce the probability of overfitting.
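Applying the scaling-type randomisations then amounts to multiplying the nominal per-joint properties by freshly sampled factors, roughly as in the sketch below (property names, nominal values, and the flat dictionary layout are illustrative, not the actual Isaac Gym API; joint stiffness and damping use loguniform ranges in Tab. 5.4, simplified to uniform here):

```python
import numpy as np

def randomise_hand_physics(nominal, ranges, rng=np.random):
    """Return randomised per-DOF properties: nominal value times a scaling
    factor drawn from the current ADR range for that property."""
    randomised = {}
    for name, values in nominal.items():
        lo, hi = ranges[name]
        randomised[name] = values * rng.uniform(lo, hi, size=values.shape)
    return randomised

nominal = {"stiffness": np.full(16, 3.0), "damping": np.full(16, 0.1), "effort": np.full(16, 0.35)}
ranges = {"stiffness": (0.3, 3.0), "damping": (0.75, 1.5), "effort": (0.9, 1.1)}  # initial ranges, Tab. 5.4
props = randomise_hand_physics(nominal, ranges)
```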
Non-physics Randomisations. In addition to normal physics randomisations, Tab. 5.4 lists action and
observation randomisations, which we found to be critical to achieving good real-world performance. To
make our policies more robust to the changing inference frequency and jitter resulting from our ROS-based
inference system, we add stochastic delays to cube pose and action delivery time as well as fixed-for-
an-episode action latency. To the actions and observations, we add correlated and uncorrelated additive
Gaussian noise. To account for unmodelled dynamics, we use a Random Network Adversary (RNA, see
below).
We apply Gaussian noise to the observations and actions with the noise function f_δ,ϵ(x) = x + δ + ϵ,
where δ and ϵ are sampled from Gaussian distributions parameterised by the ADR values p_i and p_j:
δ ∼ N(·; 0, var(p_i)), ϵ ∼ N(·; 0, var(p_j)), with var(a) = exp(a²) − 1.
For δ, this sampling happens once at the beginning of each episode, corresponding to correlated noise. For ϵ,
sampling happens at every timestep. Note that var(a) is exactly zero at a = 0. This allows ADR to set a
certain fraction of environments to have zero noise, an important case not covered in previous works that
use a fixed or above-zero variance cutoff (since during inference no white noise is added).
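A sketch of this noise model in NumPy (p_corr and p_uncorr stand for the ADR parameters p_i and p_j):

```python
import numpy as np

def var(a):
    """Noise variance as a function of the ADR parameter; var(0) = 0, so ADR
    can drive a fraction of environments to exactly zero added noise."""
    return np.exp(a ** 2) - 1.0

class NoisyChannel:
    """Additive Gaussian noise with a correlated (per-episode) component delta
    and an uncorrelated (per-step) component epsilon: f(x) = x + delta + epsilon."""

    def __init__(self, p_corr, p_uncorr, dim, rng=np.random):
        self.p_uncorr, self.dim, self.rng = p_uncorr, dim, rng
        # delta is drawn once at episode start -> correlated noise
        self.delta = rng.normal(0.0, np.sqrt(var(p_corr)), size=dim)

    def __call__(self, x):
        # epsilon is redrawn at every timestep -> uncorrelated noise
        eps = self.rng.normal(0.0, np.sqrt(var(self.p_uncorr)), size=self.dim)
        return x + self.delta + eps

obs_noise = NoisyChannel(p_corr=0.04, p_uncorr=0.04, dim=16)
noisy_obs = obs_noise(np.zeros(16))
```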
We apply three forms of delay. The first is an exponential delay, where the chance of applying a delay at
each step is p_i, given by f(x; x_last) = x_last · d + x · (1 − d), where d ∼ Bern(·; p_i) is the Bernoulli
distribution parametrised by the i-th ADR variable, p_i ∈ [0, 1). This delay case, applied to both
observations of cube pose and actions, mimics random jitter in latency times.
The second form of delay is action latency, where the action from n timesteps ago is executed. For this
parameter, we slightly modify the vanilla ADR formulation to allow a smooth increase in delay with the
ADR value despite the discretisation of timesteps. The bounds are still continuously modified, but the
sampling from the range is done from a categorical distribution. Specifically, let ϵ ∼ U(0, b) + U(−0.5, 0.5)
be the sampled ADR value (with random noise added to allow probabilistic blending of delay steps when
sampling at the ADR boundary). Then the delay k is k = round(ϵ).
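Both delay mechanisms can be sketched in a few lines (a simplified illustration of the sampling, not the full ADR bookkeeping):

```python
import numpy as np

def jitter_delay(x, x_last, p, rng=np.random):
    """With probability p reuse the previous value instead of the new one:
    f(x; x_last) = x_last * d + x * (1 - d), d ~ Bern(p)."""
    d = rng.binomial(1, p)
    return x_last * d + x * (1 - d)

def sample_action_latency(b, rng=np.random):
    """Sample an integer latency (in control steps) from the continuous ADR
    bound b, blending neighbouring step counts near the boundary."""
    eps = rng.uniform(0.0, b) + rng.uniform(-0.5, 0.5)
    return max(0, int(round(eps)))   # guard against the -0.5 edge case

delayed_action = jitter_delay(np.ones(16), np.zeros(16), p=0.05)
k = sample_action_latency(b=1.5)     # execute the action from k steps ago
```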
Figure 5.11: The functioning of the Random Network Adversary.
A third form of delay, this time on observations, is caused by the refresh rate of the cameras in the real
world. To compensate for this, we randomise the refresh rate. Similarly to the aforementioned action
latency, we use ADR to sample a categorical delay d ∈ {1, ..., delay_max}. We then only update the cube
pose observation if (t + r) mod d = 0, effectively mimicking a pose estimation frequency of d · ∆t (where
r is a randomly sampled alignment variable that offsets updates from the beginning of the episode).
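In pseudocode, the pose-refresh randomisation reduces to a modulo check per control step (a sketch; variable names are illustrative):

```python
def observed_pose(t, r, d, latest_pose, cached_pose):
    """Expose a new cube pose only every d control steps, offset by a per-episode
    random alignment r, emulating a pose-estimation frequency of d * dt."""
    if (t + r) % d == 0:
        return latest_pose       # a fresh measurement becomes visible
    return cached_pose           # otherwise keep returning the stale pose
```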
We noticed that due to heavy occlusion and caging from the fingers of the robotic hand, our cube pose
estimator occasionally produces erroneous values that are significantly different from the true pose. To
ensure that the policy performance did not deteriorate we occasionally inject completely random cube
poses into the observation space of the policy.
Random Network Adversary, introduced in [96], uses a randomly-generated neural network each episode
to introduce much more structured, state-varying noise patterns into the environment, in contrast to plain
Gaussian noise. As we run the simulation on the GPU rather than the CPU, instead of using a new network
per environment-episode and wasting memory on thousands of individual MLPs, we generate a single
network shared across all environments. Actions from the RNA network are blended with those from the
policy by a = α · a_RNA + (1 − α) · a_policy, where α is controlled by ADR.
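A sketch of a shared Random Network Adversary and the blending step (assuming PyTorch; the architecture and re-initialisation scheme are illustrative assumptions, not the exact design of [96]):

```python
import torch
import torch.nn as nn

class RandomNetworkAdversary(nn.Module):
    """A frozen, randomly initialised MLP shared across all environments;
    its weights are re-drawn at episode boundaries rather than per step."""

    def __init__(self, obs_dim=50, act_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
        for p in self.parameters():
            p.requires_grad_(False)

    def resample(self):
        # Draw a fresh set of random weights for the next episode.
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def blend(self, obs, policy_action, alpha):
        # a = alpha * a_RNA + (1 - alpha) * a_policy, with alpha set by ADR
        return alpha * self.net(obs) + (1.0 - alpha) * policy_action

rna = RandomNetworkAdversary()
a = rna.blend(torch.zeros(16384, 50), torch.zeros(16384, 16), alpha=0.1)
```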
Experiment Cons. Success Trials (sorted) Average Median
Best Model
1, 6, 6, 10, 10, 18, 18, 36, 61, 112 27.8± 19.0 14.0
3, 4, 7, 16, 19, 22, 29, 31, 58, 77 26.6± 13.2 20.5
1, 5, 5, 11, 12, 12, 33, 36, 42, 51 20.8± 9.8 12.0
Best Model
(Goal frame count=10)
6, 8, 10, 16, 16, 17, 20, 33, 39, 45 21.0± 7.4 16.5
9, 11, 13, 13, 15, 16, 27, 29, 32, 36 20.1± 5.4 15.5
2, 3, 3, 9, 11, 12, 14, 15, 43, 44 16.6± 8.4 11.5
Non-ADR Model
2, 3, 7, 7, 13, 16, 22, 23, 26, 29 14.8± 5.4 14.5
1, 1, 3, 7, 8, 11, 14, 17, 22, 35 11.9± 5.8 9.5
0, 7, 8, 8, 9, 10, 10, 11, 17, 20 10.0± 3.0 9.5
Table 5.5: The results of running different models on the real robot. We run 10 trials per policy [97] to
benchmark the average consecutive successes. Individual rows within each experiment indicate running
the experiment on different days [42] and ± indicates 90% confidence interval. Our best model was trained
with ADR while non-ADR experiments had DR ranges manually tuned. The second experiment shows
results when the cube is held at a goal for additional consecutive frames once the target cube pose is
reached.
5.2.5 Experiments and Real-World Performance
In this section, we present the results we achieved in object reorientation in the simulations and then real
world. We then follow it up with tests of policy robustness in reality and simulation.
For all of our experiments, we use a simulation dt of 1/60 s and a control dt of 1/30 s. We train with 2^14
(16384) agents per GPU and use a goal-reaching orientation threshold of 0.1 rad but test with 0.4 rad as in
[97, 79] for all experiments both in simulation and the real world. All policies are trained with randomi-
sations described in Tab. 5.4. Most importantly, our ADR policies using the same compute resources — 8
NVIDIA A40s — achieved the best performance in the real world after training for only 2.5 days in contrast
to [96] that trained for 2 weeks to months for the task of block reorientation.
Training with manual DR takes roughly 32 hours to converge on 8 NVIDIA A40s, generating a combined
(across all GPUs) frame rate of 700K frames/sec. With a dt = 1/60 s, this amounts to
(32 × 700000) / (60 × 24 × 365) ≈ 42
years of real-world experience (the factor of 3600 seconds per hour cancels between the hours of training
in the numerator and the seconds per year in the denominator).
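As a quick sanity check of that figure:

```python
hours = 32
fps = 700_000             # combined simulation frames per second across 8 GPUs
sim_dt = 1.0 / 60.0       # each frame advances simulated time by 1/60 s

sim_seconds = hours * 3600 * fps * sim_dt
years = sim_seconds / (365 * 24 * 3600)
print(round(years, 1))    # -> 42.6 years of simulated experience
```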
Real-World Performance. Similar to [97], we observe a large range of different behaviours in the policies
deployed on the real robot. Our real-world quantitative results measuring average consecutive successes
are illustrated in Tab. 5.5. We collect 10 trials for each policy to obtain the average consecutive successes
and also collect different sets of trials across different days to understand the inter-day variability that
may arise due to various environmental factors. We believe such inter-day variations are important to
benchmark in robotics [42] and have endeavoured to highlight this specifically in this challenging task.
We find that our policies do not show a dramatic drop in average performance, indicating that they are
mostly robust to variations between experimental conditions on different dates.
Figure 5.12 (frames at t = 1.0 s, 5.0 s, 9.0 s, 12.0 s): Policies trained with manual DR exhibited ‘stuck’
behaviours, where the cube remained stuck in certain configurations and was unable to recover. An example
of such behaviour can be observed here: https://www.youtube.com/watch?v=tJgq18VbL3k.
We benchmark both ADR and non-ADR (manually-tuned DR ranges) policies in Tab. 5.5 and like [97]
find that the policies trained with ADR perform the best, suggesting that the sheer diversity of data gleaned
from the simulator endows the policies with the extreme robustness needed in the real world. Importantly,
we observed that policies trained without ADR exhibited ‘stuck’ behaviours (as shown in Fig. 5.12), which
ADR-based policies were able to overcome due to the increased diversity of their training data. We also find
that, on average, trials with ADR achieve more consecutive successes than those with non-ADR policies.
5.2.6 Conclusion
In this section we saw how high-throughput simulation and learning methods can enable ambitious robotics
projects at a fraction of the compute hardware cost of the previous SOTA [97, 96]. Vectorized
physics simulation based on IsaacGym [79] and PhysX [94] allowed us to train policies with Automatic
Domain Randomization at 700,000 simulation steps per second, yielding robust policies in only 2 days of
training. This was crucial to the success of the project, as it allowed us to reduce the experiment turnaround
time and rapidly iterate on ideas, learning parameters, and domain randomization techniques.
In the DeXtreme project we still used the traditional training paradigm of training a single policy to
maximize a single RL objective: the shaped reward (see Tab. 5.3). The metric that we actually care
about is the sim-to-real transfer performance, i.e. the expected number of successes in the real world. The
width of the ADR ranges can be used as a proxy for this metric, and we can try to optimize for it directly
using meta-optimization algorithms like Population-Based Training [55], similarly to how it was used in
DexPBT.
We also noticed how useful it was to be able to change the behavior of the policy when we experimented
on the real robot, for example by changing the EMA filtering strength. Algorithms from the
"Quality Diversity" family of methods [92] yield not a single policy but an array of different behaviors that
potentially could be quickly tested on the real robot, maximizing the chance that a robust and safe policy
can be found.
Chapter 6
Conclusion
Performance of algorithms matters. This is especially true in the realm of deep reinforcement learning
where experiments sometimes require hundreds of millions or billions of environment rollouts to be col-
lected and analyzed. Increasing the performance of learning algorithms can reduce hardware requirements
for ambitious projects and make RL experimentation more accessible. With faster RL systems researchers
can reduce the experiment turnaround times and iterate on their ideas more quickly.
In recent years the demand for reinforcement learning applications has been steadily increasing. Re-
searchers and practitioners in the industry are applying RL to train AI opponents in video games [39].
Emerging systems that use large language models for planning [1] require libraries of low-level manipu-
lation skills that can be trained with Deep RL. Overall, reinforcement learning can have a very significant
impact across many domains if it can be used as a building block in larger systems. Unfortunately, at this
stage of development this can be prohibitively expensive as any kind of non-trivial RL application usually
requires months of research and experimentation.
In order to turn reinforcement learning from a field of study into an off-the-shelf technology we need
to provide researchers and practitioners with learning systems that make RL experimentation seamless,
practical, and fast. In this thesis, we discussed how such systems can be built (Chapter 2, Chapter 3)
and considered multiple applications of these systems in simulated and real-world domains (Chapter 4,
Chapter 5). We presented Sample Factory 2.0 [115] (see Sec. 2.2), a new optimized codebase that can assist
RL practitioners who work with any kind of simulation platform, whether it is a traditional CPU-based or
a vectorized GPU-accelerated simulator.
Humanity is currently facing tremendous challenges and existential risks: fighting poverty, disease
and aging, climate change, energy crisis, and many others. It seems increasingly clear that in the future
Artificial Intelligence will play an important role in overcoming these challenges by assisting us in scientific
discovery and automation. It is the author’s hope that fast and efficient reinforcement learning can become
an essential building block in the foundation of this powerful future AI infrastructure that will guide
humans into safe and prosperous millennia ahead.
Bibliography
[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,
Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho,
Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano,
Kyle Jeffrey, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,
Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor,
Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet,
Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu,
Sichun Xu, and Mengyuan Yan. “Do As I Can, Not As I Say: Grounding Language in Robotic
Affordances”. In: CoRR abs/2204.01691 (2022). doi: 10.48550/arXiv.2204.01691. arXiv: 2204.01691.
[2] Ross Allen and Marco Pavone. “A real-time framework for kinodynamic planning with
application to quadrotor obstacle avoidance”. In: AIAA Guidance, Navigation, and Control
Conference. 2016, p. 1374.
[3] Arthur Allshire, Mayank Mittal, Varun Lodaya, Viktor Makoviychuk, Denys Makoviichuk,
Felix Widmaier, Manuel Wüthrich, Stefan Bauer, Ankur Handa, and Animesh Garg. “Transferring
Dexterous Manipulation from GPU Simulation to a Remote Real-World TriFinger”. In: CoRR
abs/2108.09779 (2021). arXiv: 2108.09779. url: https://arxiv.org/abs/2108.09779.
[4] Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta,
Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and
Amir Roshan Zamir. “On Evaluation of Embodied Navigation Agents”. In: arXiv:1807.06757 (2018).
[5] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew,
Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider,
Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. “Learning
dexterous in-hand manipulation”. In: Int. J. Robotics Res. 39.1 (2020). doi:
10.1177/0278364919887447.
[6] Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz.
“Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU”. In: ICLR.
2017.
[7] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi,
Zhaohan Daniel Guo, and Charles Blundell. “Agent57: Outperforming the atari human
benchmark”. In: ICML. 2020, pp. 507–517.
[8] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and
Igor Mordatch. “Emergent Tool Use From Multi-Agent Autocurricula”. In: ICLR. 2020.
[9] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. “Emergent
Complexity via Multi-Agent Competition”. In: ICLR. 2018.
[10] Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun,
Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, and Hao Su.
“Rearrangement: A Challenge for Embodied AI”. In: arXiv:2011.01975 (2020).
[11] Sumeet Batra, Zhehui Huang, Aleksei Petrenko, Tushar Kumar, Artem Molchanov, and
Gaurav S. Sukhatme. “Decentralized Control of Quadrotor Swarms with End-to-end Deep
Reinforcement Learning”. In: Conference on Robot Learning, 8-11 November 2021, London, UK.
Ed. by Aleksandra Faust, David Hsu, and Gerhard Neumann. Vol. 164. Proceedings of Machine
Learning Research. PMLR, 2021, pp. 576–586. url:
https://proceedings.mlr.press/v164/batra22a.html.
[12] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright,
Heinrich Küttler, Andrew Lefrancq, Simon Green, Victor Valdés, Amir Sadik, Julian Schrittwieser,
Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King,
Demis Hassabis, Shane Legg, and Stig Petersen. “DeepMind Lab”. In: CoRR abs/1612.03801 (2016).
[13] Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti,
Gaurav Sukhatme, and Franziska Meier. “Meta Learning via Learned Loss”. In: International
Conference on Pattern Recognition. 2021.
[14] Edward Beeching, Christian Wolf, Jilles Dibangoye, and Olivier Simonin. “Deep Reinforcement
Learning on a Budget: 3D Control and Reasoning Without a Supercomputer”. In: CoRR
abs/1904.01806 (2019).
[15] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. “The Arcade Learning
Environment: An Evaluation Platform for General Agents”. In: IJCAI. 2013.
[16] Jur van den Berg, Stephen J. Guy, Ming C. Lin, and Dinesh Manocha. “Reciprocal n-Body
Collision Avoidance”. In: Robotics Research - The 14th International Symposium, ISRR 2009, August
31 - September 3, 2009, Lucerne, Switzerland. Ed. by Cédric Pradalier, Roland Siegwart, and
Gerhard Hirzinger. Vol. 70. Springer Tracts in Advanced Robotics. Springer, 2009, pp. 3–19. doi:
10.1007/978-3-642-19457-3_1.
[17] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak,
Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz,
Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto,
Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever,
Jie Tang, Filip Wolski, and Susan Zhang. “Dota 2 with Large Scale Deep Reinforcement Learning”.
In: CoRR abs/1912.06680 (2019).
[18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary,
Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs. Version 0.3.13. 2018.
url: http://github.com/google/jax.
[19] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. “OpenAI Gym”. In: CoRR abs/1606.01540 (2016).
[20] Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner,
Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. “Matterport3D: Learning from
RGB-D Data in Indoor Environments”. In: 3DV. 2017.
[21] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan D. Ratliff,
and Dieter Fox. “Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real
World Experience”. In: International Conference on Robotics and Automation, ICRA 2019, Montreal,
QC, Canada, May 20-24, 2019. IEEE, 2019, pp. 8973–8979. doi: 10.1109/ICRA.2019.8793789.
[22] Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah,
Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. “Soundspaces: Audio-visual
navigation in 3D environments”. In: ECCV. 2020, pp. 17–36.
[23] Changan Chen, Yuejiang Liu, Sven Kreiss, and Alexandre Alahi. “Crowd-robot interaction:
Crowd-aware robot navigation with attention-based deep reinforcement learning”. In: 2019
International Conference on Robotics and Automation (ICRA). IEEE. 2019, pp. 6015–6022.
[24] Tao Chen, Jie Xu, and Pulkit Agrawal. “A System for General In-Hand Object Re-Orientation”. In:
Conference on Robot Learning (2021).
[25] Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuang Jiang,
Stephen Marcus McAleer, Hao Dong, Zongqing Lu, and Song-Chun Zhu. Towards Human-Level
Bimanual Dexterous Manipulation with Reinforcement Learning. 2022. eprint: arXiv:2206.08686.
[26] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Translation”. In: EMNLP. 2014, pp. 1724–1734.
[27] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. “Leveraging Procedural
Generation to Benchmark Reinforcement Learning”. In: arXiv:1912.01588 (2019).
[28] Erwin Coumans et al. “Bullet physics library”. In: Open source: bulletphysics.org 15.49 (2013), p. 5.
[29] Erwin Coumans and Yunfei Bai. “PyBullet, a Python module for physics simulation for games,
robotics and machine learning, 2016”. In: URL http://pybullet.org (2016).
[30] Steven Davis and Paul Mermelstein. “Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences”. In: IEEE Transactions on Acoustics, Speech,
and Signal Processing 28.4 (1980), pp. 357–366.
[31] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert,
Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines.
https://github.com/openai/baselines. 2017.
[32] Alexey Dosovitskiy and Vladlen Koltun. “Learning to Act by Predicting the Future”. In: ICLR.
2017.
[33] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. “CARLA:
An Open Urban Driving Simulator”. In: Conference on Robot Learning (CoRL). 2017.
[34] Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. “SEED RL:
Scalable and Efficient Deep-RL with Accelerated Central Inference”. In: CoRR abs/1910.06591
(2019).
[35] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward,
Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu.
“IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner
Architectures”. In: ICML. 2018.
[36] Arthur Flajolet, Claire Bizon Monroc, Karim Beguir, and Thomas Pierrot. “Fast Population-Based
Reinforcement Learning on a Single Machine”. In: International Conference on Machine Learning,
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. Ed. by Kamalika Chaudhuri,
Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato. Vol. 162. Proceedings of
Machine Learning Research. PMLR, 2022, pp. 6533–6547. url:
https://proceedings.mlr.press/v162/flajolet22a.html.
[37] Julian Förster. “System Identification of the Crazyflie 2.0 Nano Quadrocopter”. BA Thesis. ETH
Zurich, 2015.
[38] C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem.
“Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation”. In: Proceedings of
the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets
and Benchmarks 2021, December 2021, virtual. Ed. by Joaquin Vanschoren and Sai-Kit Yeung. 2021.
url: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/d1f491a404d6854880943e5c3cd9ca25-Abstract-round1.html.
[39] Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, and Peter Dürr. “Super-Human
Performance in Gran Turismo Sport Using Deep Reinforcement Learning”. In: IEEE Robotics
Autom. Lett. 6.3 (2021), pp. 4257–4264. doi: 10.1109/LRA.2021.3064284.
[40] Raluca D Gaina and Matthew Stephenson. ““Did You Hear That?” Learning to Play Video Games
from Audio Cues”. In: COG. 2019, pp. 1–4.
[41] Daniel Garcia-Romero, Greg Sell, and Alan Mccree. “MagNetO: X-vector Magnitude Estimation
Network plus Offset for Improved Speaker Recognition”. In: Proc. Odyssey 2020 The Speaker and
Language Recognition Workshop. 2020, pp. 1–8.
[42] Google. The Importance of A/B Testing in Robotics. 2021. url:
https://ai.googleblog.com/2021/06/the-importance-of-ab-testing-in-robotics.html.
[43] Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber,
David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver,
Timothy Lillicrap, and Victor Valdes. An investigation of model-free planning: Boxoban levels.
https://github.com/deepmind/boxoban-levels/. 2018.
[44] Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A Pires, and
Rémi Munos. “Neural predictive belief representations”. In: arXiv:1811.06407 (2018).
[45] Jayesh K. Gupta, Maxim Egorov, and Mykel J. Kochenderfer. “Cooperative Multi-agent Control
Using Deep Reinforcement Learning”. In: Autonomous Agents and Multiagent Systems - AAMAS
2017 Workshops, Best Papers, São Paulo, Brazil, May 8-12, 2017, Revised Selected Papers. Ed. by
Gita Sukthankar and Juan A. Rodriguez-Aguilar. Vol. 10642. Lecture Notes in Computer Science.
Springer, 2017, pp. 66–83. doi: 10.1007/978-3-319-71682-4_5.
[46] William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel,
Manuela Veloso, and Ruslan Salakhutdinov. “MineRL: A Large-Scale Dataset of Minecraft
Demonstrations”. In: IJCAI. 2019.
[47] Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh,
Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich,
Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State.
“DeXtreme: Transfer of Agile In-Hand Manipulation from Simulation to Reality”. In: arXiv (2022).
[48] Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, and Rémi Munos. “Q(λ) with Off-Policy
Corrections”. In: Algorithmic Learning Theory, ALT. 2016.
[49] Shashank Hegde, Anssi Kanervisto, and Aleksei Petrenko. “Agents that Listen: High-Throughput
Reinforcement Learning with Multiple Sensory Systems”. In: IEEE Conference on Games. 2021.
[50] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Comput. 9.8
(1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
[51] Wolfgang Hönig, James A. Preiss, T. K. Satish Kumar, Gaurav S. Sukhatme, and Nora Ayanian.
“Trajectory Planning for Quadrotor Swarms”. In: IEEE Trans. Robotics 34.4 (2018), pp. 856–869.
[52] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt,
and David Silver. “Distributed Prioritized Experience Replay”. In: ICLR. 2018.
[53] Jemin Hwangbo, Joonho Lee, and Marco Hutter. “Per-contact iteration method for solving contact
dynamics”. In: IEEE Robotics and Automation Letters 3.2 (2018), pp. 895–902. url: www.raisim.com.
[54] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever,
Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos,
Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver,
Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. “Human-level performance in
first-person multiplayer games with population-based deep reinforcement learning”. In: CoRR
abs/1807.01281 (2018). arXiv: 1807.01281. url: http://arxiv.org/abs/1807.01281.
[55] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue,
Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and
Koray Kavukcuoglu. “Population Based Training of Neural Networks”. In: CoRR abs/1711.09846
(2017). arXiv: 1711.09846. url: http://arxiv.org/abs/1711.09846.
[56] Nick Jakobi, Phil Husbands, and Inman Harvey. “Noise and the reality gap: The use of simulation
in evolutionary robotics”. In: Advances in Artificial Life. Ed. by Federico Morán, Alvaro Moreno,
Juan Julián Merelo, and Pablo Chacón. 1995.
[57] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine.
“QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation”. In:
CoRR abs/1806.10293 (2018). arXiv: 1806.10293. url: http://arxiv.org/abs/1806.10293.
[58] Steven Kapturowski, Georg Ostrovski, John Quan, Rémi Munos, and Will Dabney. “Recurrent
Experience Replay in Distributed Reinforcement Learning”. In: ICLR. 2019.
[59] Sertac Karaman and Emilio Frazzoli. “Sampling-based algorithms for optimal motion planning”.
In: Int. J. Robotics Res. 30.7 (2011), pp. 846–894. doi: 10.1177/0278364911406761.
[60] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski.
“ViZDoom: A Doom-based AI research platform for visual reinforcement learning”. In: CIG. 2016,
pp. 1–8.
[61] Arbaaz Khan, Vijay Kumar, and Alejandro Ribeiro. “Large Scale Distributed Collaborative
Unlabeled Motion Planning With Graph Policy Gradients”. In: IEEE Robotics Autom. Lett. 6.3
(2021), pp. 5340–5347. doi: 10.1109/LRA.2021.3074885.
[62] Arbaaz Khan, Ekaterina I. Tolstaya, Alejandro Ribeiro, and Vijay Kumar. “Graph Policy Gradients
for Large Scale Robot Control”. In: 3rd Annual Conference on Robot Learning, CoRL 2019, Osaka,
Japan, October 30 - November 1, 2019, Proceedings. Ed. by Leslie Pack Kaelbling, Danica Kragic,
and Komei Sugiura. Vol. 100. Proceedings of Machine Learning Research. PMLR, 2019,
pp. 823–834. url: http://proceedings.mlr.press/v100/khan20a.html.
[63] Arbaaz Khan, Chi Zhang, Shuo Li, Jiayue Wu, Brent Schlotfeldt, Sarah Y. Tang, Alejandro Ribeiro,
Osbert Bastani, and Vijay Kumar. “Learning Safe Unlabeled Multi-Robot Planning with Motion
Constraints”. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
2019, Macau, SAR, China, November 3-8, 2019. IEEE, 2019, pp. 7558–7565. doi:
10.1109/IROS40897.2019.8968483.
[64] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti,
Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. “AI2-THOR: An Interactive 3D
Environment for Visual AI”. In: arXiv:1712.05474 (2017).
[65] Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar,
Tim Rocktäschel, and Edward Grefenstette. “TorchBeast: A PyTorch Platform for Distributed RL”.
In: CoRR abs/1910.03552 (2019).
[66] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. “Learning
quadrupedal locomotion over challenging terrain”. In: Sci. Robotics 5.47 (2020), p. 5986. doi:
10.1126/scirobotics.abc5986.
[67] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. “End-to-End Training of Deep
Visuomotor Policies”. In: Journal of Machine Learning Research 17.39 (2016), pp. 1–40. url:
http://jmlr.org/papers/v17/15-522.html.
[68] Anqi Li, Ching-An Cheng, Muhammad Asif Rana, Man Xie, Karl Van Wyk, Nathan D. Ratliff, and
Byron Boots. “RMP2: A Structured Composable Policy Class for Robot Learning”. In: Robotics:
Science and Systems XVII, Virtual Event, July 12-16, 2021. Ed. by Dylan A. Shell, Marc Toussaint,
and M. Ani Hsieh. 2021. doi: 10.15607/RSS.2021.XVII.092.
[69] Qingbiao Li, Fernando Gama, Alejandro Ribeiro, and Amanda Prorok. “Graph Neural Networks
for Decentralized Multi-Robot Path Planning”. In: IEEE/RSJ International Conference on Intelligent
Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020 - January 24, 2021. IEEE, 2020,
pp. 11785–11792. doi: 10.1109/IROS45743.2020.9341668.
[70] Yuxi Li and Dale Schuurmans. “MapReduce for Parallel Reinforcement Learning”. In: European
Workshop on Reinforcement Learning. 2011.
[71] Zexiang Li, Ping Hsu, and Shankar Sastry. “Grasping and Coordinated Manipulation by a
Multifingered Robot Hand”. In: The International Journal of Robotics Research 8.4 (1989),
pp. 33–50. doi: 10.1177/027836498900800402. eprint: https://doi.org/10.1177/027836498900800402.
[72] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg,
Joseph Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib: Abstractions for Distributed
Reinforcement Learning”. In: ICML. 2018.
[73] Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and
Dieter Fox. “GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning”. In:
Conference on Robot Learning (CoRL). 2018.
[74] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica.
“Tune: A Research Platform for Distributed Model Selection and Training”. In: CoRR
abs/1807.05118 (2018). arXiv: 1807.05118. url: http://arxiv.org/abs/1807.05118.
[75] Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S. M. Ali Eslami, Daniel Hennes,
Wojciech M. Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y. Siegel,
Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H. Francis Song, Markus Wulfmeier,
Paul Muller, Tuomas Haarnoja, Brendan D. Tracey, Karl Tuyls, Thore Graepel, and Nicolas Heess.
“From motor control to team play in simulated humanoid football”. In: Sci. Robotics 7.69 (2022).
doi: 10.1126/scirobotics.abo0235.
[76] Wenhao Luo, Wen Sun, and Ashish Kapoor. “Multi-Robot Collision Avoidance under Uncertainty
with Probabilistic Safety Barrier Certificates”. In: Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual. Ed. by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell,
Maria-Florina Balcan, and Hsuan-Tien Lin. 2020. url: https:
//proceedings.neurips.cc/paper/2020/hash/03793ef7d06ffd63d34ade9d091f1ced-Abstract.html.
[77] Sagnik Majumder, Ziad Al-Halah, and Kristen Grauman. “Move2Hear: Active Audio-Visual
Source Separation”. In: arXiv preprint arXiv:2105.07142 (2021).
[78] Denys Makoviichuk and Viktor Makoviychuk. rl-games: A High-performance Framework for
Reinforcement Learning. https://github.com/Denys88/rl_games. 2022.
[79] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey,
Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State.
“Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning”. In: CoRR
abs/2108.10470 (2021). arXiv: 2108.10470. url: https://arxiv.org/abs/2108.10470.
[80] Matthew T. Mason and J. Kenneth Salisbury. Robot Hands and the Mechanics of Manipulation.
Cambridge, MA, USA: MIT Press, 1985. isbn: 0262132052.
[81] Matthew Matl, Vishal Satish, Michael Danielczuk, Bill DeRose, Stephen McKinley, and
Ken Goldberg. “Learning ambidextrous robot grasping policies”. In: Sci. Robotics 4.26 (2019). doi:
10.1126/scirobotics.aau4984.
[82] Sam McCandlish, Jared Kaplan, Dario Amodei, et al. “An Empirical Model of Large-Batch
Training”. In: CoRR abs/1812.06162 (2018).
[83] Daniel Mellinger and Vijay Kumar. “Minimum snap trajectory generation and control for
quadrotors”. In: IEEE International Conference on Robotics and Automation, ICRA 2011, Shanghai,
China, 9-13 May 2011. IEEE, 2011, pp. 2520–2525. doi: 10.1109/ICRA.2011.5980409.
[84] Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and
Marco Hutter. “Learning robust perceptive locomotion for quadrupedal robots in the wild”. In:
Sci. Robotics 7.62 (2022). doi: 10.1126/scirobotics.abk2822.
[85] Mehdi Mirza, Andrew Jaegle, Jonathan J. Hunt, Arthur Guez, Saran Tunyasuvunakool,
Alistair Muldal, Théophane Weber, Péter Karkus, Sébastien Racanière, Lars Buesing,
Timothy P. Lillicrap, and Nicolas Heess. “Physically Embedded Planning Problems: New
Challenges for Reinforcement Learning”. In: arXiv:2009.05524 (2020).
[86] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep
Reinforcement Learning”. In: ICML. 2016.
[87] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,
Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski,
Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,
Daan Wierstra, Shane Legg, and Demis Hassabis. “Human-level control through deep
reinforcement learning”. In: Nature 518.7540 (2015).
[88] Artem Molchanov, Tao Chen, Wolfgang Hönig, James A. Preiss, Nora Ayanian, and
Gaurav S. Sukhatme. “Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to
Multiple Quadrotors”. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems,
IROS 2019, Macau, SAR, China, November 3-8, 2019. IEEE, 2019, pp. 59–66. doi:
10.1109/IROS40897.2019.8967695.
[89] Gabriel Moraes Barros and Esther Luna Colombini. “Using Soft Actor-Critic for Low-Level UAV
Control”. In: arXiv e-prints (2020), arXiv–2010.
[90] Igor Mordatch and Pieter Abbeel. “Emergence of Grounded Compositional Language in
Multi-Agent Populations”. In: AAAI. 2018.
[91] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang,
Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. “Ray: A Distributed
Framework for Emerging AI Applications”. In: USENIX Symposium on Operating Systems Design
and Implementation. 2018.
[92] Jean-Baptiste Mouret and Jeff Clune. “Illuminating search spaces by mapping elites”. In: CoRR
abs/1504.04909 (2015). arXiv: 1504.04909. url: http://arxiv.org/abs/1504.04909.
[93] Yashraj S. Narang, Kier Storey, Iretiayo Akinola, Miles Macklin, Philipp Reist,
Lukasz Wawrzyniak, Yunrong Guo, Ádám Moravánszky, Gavriel State, Michelle Lu,
Ankur Handa, and Dieter Fox. “Factory: Fast Contact for Robotic Assembly”. In: CoRR
abs/2205.03532 (2022). doi: 10.48550/arXiv.2205.03532. arXiv: 2205.03532.
[94] NVIDIA. NVIDIA PhysX. 2020. url: https://developer.nvidia.com/physx-sdk.
[95] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. “Wavenet: A generative model for
raw audio”. In: arXiv:1609.03499 (2016).
[96] OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew,
Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider,
Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and
Lei Zhang. “Solving Rubik’s Cube with a Robot Hand”. In: arXiv preprint arXiv:1910.07113 (2020).
[97] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew,
Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray,
Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba.
“Learning Dexterous In-Hand Manipulation”. In: CoRR abs/1808.00177 (2018). arXiv: 1808.00177.
url: http://arxiv.org/abs/1808.00177.
[98] Jacopo Panerati, Hehui Zheng, Siqi Zhou, James Xu, Amanda Prorok, and Angela P. Schoellig.
“Learning to Fly - a Gym Environment with PyBullet Physics for Reinforcement Learning of
Multi-agent Quadcopter Control”. In: CoRR abs/2103.02142 (2021). arXiv: 2103.02142. url:
https://arxiv.org/abs/2103.02142.
[99] Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Çaglar Gülçehre,
Siddhant M. Jayakumar, Max Jaderberg, Raphaël Lopez Kaufman, Aidan Clark, Seb Noury,
Matthew Botvinick, Nicolas Heess, and Raia Hadsell. “Stabilizing Transformers for Reinforcement
Learning”. In: ICML. 2020.
[100] Kwanyoung Park, Hyunseok Oh, and Youngki Lee. “VECA: A Toolkit for Building Virtual
Environments to Train and Test Human-like Agents”. In: arXiv preprint arXiv:2105.00762 (2021).
[101] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,
Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library”. In: Advances in Neural Information Processing
Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett.
Vol. 32. Curran Associates, Inc., 2019. url:
https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[102] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. “Curiosity-driven
Exploration by Self-supervised Prediction”. In: ICML. 2017.
[103] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. “Sim-to-real transfer
of robotic control with dynamics randomization”. In: 2018 IEEE international conference on
robotics and automation (ICRA). IEEE. 2018, pp. 3803–3810.
[104] Ken Perlin. “An image synthesizer”. In: SIGGRAPH. 1985.
[105] Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav Sukhatme, and Vladlen Koltun. “Sample
Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement
Learning”. In: ICML. 2020.
[106] Aleksei Petrenko and Tushar Kumar. A Faster Alternative to Python’s multiprocessing.Queue.
https://github.com/alex-petrenko/faster-fifo. 2020.
[107] Aleksei Petrenko, Erik Wijmans, Brennan Shacklett, and Vladlen Koltun. “Megaverse: Simulating
Embodied Agents at One Million Experiences per Second”. In: ICML. 2021.
[108] Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel.
“Asymmetric Actor Critic for Image-Based Robot Learning”. In: CoRR (2017). url:
http://arxiv.org/abs/1710.06542.
[109] James A. Preiss, Wolfgang Hönig, Nora Ayanian, and Gaurav S. Sukhatme. “Downwash-aware
trajectory planning for large quadrotor teams”. In: 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017. IEEE,
2017, pp. 250–257. doi: 10.1109/IROS.2017.8202165.
[110] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. “Hogwild: A Lock-Free
Approach to Parallelizing Stochastic Gradient Descent”. In: Neural Information Processing
Systems. 2011.
[111] Charles Richter, Adam Bry, and Nicholas Roy. “Polynomial trajectory planning for aggressive
quadrotor flight in dense indoor environments”. In: Robotics research. Springer, 2016, pp. 649–666.
[112] Benjamin Rivière, Wolfgang Hönig, Yisong Yue, and Soon-Jo Chung. “GLAS: Global-to-Local Safe
Autonomy Synthesis for Multi-Robot Motion Planning With End-to-End Learning”. In: IEEE
Robotics Autom. Lett. 5.3 (2020), pp. 4249–4256. doi: 10.1109/LRA.2020.2994035.
[113] Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. “Learning to Walk in Minutes
Using Massively Parallel Deep Reinforcement Learning”. In: CoRR abs/2109.11978 (2021). arXiv:
2109.11978. url: https://arxiv.org/abs/2109.11978.
[114] J. Kenneth Salisbury and John J. Craig. “Articulated Hands: Force Control and Kinematic Issues”.
In: The International Journal of Robotics Research 1.1 (1982), pp. 4–17. doi:
10.1177/027836498200100102. eprint: https://doi.org/10.1177/027836498200100102.
[115] Sample Factory 2.0. 2022. url: https://www.samplefactory.dev/.
[116] Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, T. K. Satish Kumar, Sven Koenig, and
Howie Choset. “PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning”.
In: IEEE Robotics Autom. Lett. 4.3 (2019), pp. 2378–2385. doi: 10.1109/LRA.2019.2903261.
[117] Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys,
Timothy Lillicrap, and Sylvain Gelly. “Episodic Curiosity through Reachability”. In: ICLR. 2019.
[118] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun.
“MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments”. In:
arXiv:1712.03931 (2017).
[119] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain,
Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. “Habitat: A
Platform for Embodied AI Research”. In: ICCV. 2019.
[120] Simon Schmitt, Matteo Hessel, and Karen Simonyan. “Off-Policy Actor-Critic with Shared
Experience Replay”. In: CoRR abs/1909.11583 (2019).
[121] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. “Trust
Region Policy Optimization”. In: ICML. 2015.
[122] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel.
“High-Dimensional Continuous Control Using Generalized Advantage Estimation”. In: 4th
International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4,
2016, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2016. url:
http://arxiv.org/abs/1506.02438.
[123] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy
Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347. url:
http://arxiv.org/abs/1707.06347.
[124] Ozan Sener and Vladlen Koltun. “Learning to Guide Random Search”. In: ICLR. 2020.
[125] Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun,
and Kayvon Fatahalian. “Large Batch Simulation for Deep Reinforcement Learning”. In: ICLR.
2021.
[126] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. “Safe, Multi-Agent, Reinforcement
Learning for Autonomous Driving”. In: CoRR abs/1610.03295 (2016). arXiv: 1610.03295. url:
http://arxiv.org/abs/1610.03295.
[127] Guanya Shi, Wolfgang Hönig, Xichen Shi, Yisong Yue, and Soon-Jo Chung. “Neural-Swarm2:
Planning and Control of Heterogeneous Multirotor Swarms using Learned Interactions”. In: CoRR
abs/2012.05457 (2020). arXiv: 2012.05457. url: https://arxiv.org/abs/2012.05457.
[128] Guanya Shi, Wolfgang Hönig, Yisong Yue, and Soon-Jo Chung. “Neural-Swarm: Decentralized
Close-Proximity Multirotor Control Using Learned Interactions”. In: IEEE International Conference on Robotics and Automation, ICRA 2020. IEEE, 2020, pp. 3241–3247. doi:
10.1109/ICRA40945.2020.9196800.
[129] Kiril Solovey, Oren Salzman, and Dan Halperin. “Finding a needle in an exponential haystack:
Discrete RRT for exploration of implicit roadmaps in multi-robot motion planning”. In: Int. J.
Robotics Res. 35.5 (2016), pp. 501–513.
[130] Yunlong Song, Mats Steinweg, Elia Kaufmann, and Davide Scaramuzza. “Autonomous Drone
Racing with Deep Reinforcement Learning”. In: IEEE/RSJ International Conference on Intelligent
Robots and Systems, IROS 2021, Prague, Czech Republic, September 27 - Oct. 1, 2021. IEEE, 2021,
pp. 1205–1212. doi: 10.1109/IROS51168.2021.9636053.
[131] Adam Stooke and Pieter Abbeel. “rlpyt: A Research Code Base for Deep Reinforcement Learning
in PyTorch”. In: CoRR abs/1909.01500 (2019).
[132] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1992.
[133] Gerald Tesauro. “Temporal Difference Learning and TD-Gammon”. In: Commun. ACM 38.3 (1995).
[134] Emanuel Todorov, Tom Erez, and Yuval Tassa. “Mujoco: A physics engine for model-based
control”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2012,
pp. 5026–5033.
[135] Ekaterina I. Tolstaya, Fernando Gama, James Paulos, George J. Pappas, Vijay Kumar, and
Alejandro Ribeiro. “Learning Decentralized Controllers for Robot Swarms with Graph Neural
Networks”. In: 3rd Annual Conference on Robot Learning, CoRL 2019, Osaka, Japan, October 30 -
November 1, 2019, Proceedings. Ed. by Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura.
Vol. 100. Proceedings of Machine Learning Research. PMLR, 2019, pp. 671–682. url:
http://proceedings.mlr.press/v100/tolstaya20a.html.
[136] Jesus Tordesillas and Jonathan P. How. “MADER: Trajectory Planner in Multi-Agent and
Dynamic Environments”. In: CoRR abs/2010.11061 (2020). arXiv: 2010.11061. url:
https://arxiv.org/abs/2010.11061.
[137] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. “Attention is All you Need”. In: NIPS. 2017, pp. 5998–6008.
[138] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik,
Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh,
Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai,
John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen,
Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre,
Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch,
Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu,
Demis Hassabis, Chris Apps, and David Silver. “Grandmaster level in StarCraft II using
multi-agent reinforcement learning”. In: Nature 575.7782 (2019), pp. 350–354.
[139] Xingchen Wan, Cong Lu, Jack Parker-Holder, Philip J. Ball, Vu Nguyen, Binxin Ru, and
Michael Osborne. “Bayesian Generational Population-Based Training”. In: First Conference on
Automated Machine Learning (Main Track). 2022. url:
https://openreview.net/forum?id=HW4-ZaHUg5.
[140] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva,
and Dhruv Batra. “DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion
Frames”. In: ICLR. 2020.
[141] Ronald J. Williams. “Simple Statistical Gradient-Following Algorithms for Connectionist
Reinforcement Learning”. In: Mach. Learn. 8.3-4 (1992).
[142] Abraham Woubie, Anssi Kanervisto, Janne Karttunen, and Ville Hautamaki. “Do Autonomous
Agents Benefit from Hearing?” In: arXiv preprint arXiv:1905.04192 (2019).
[143] Marek Wydmuch, Michal Kempka, and Wojciech Jaskowski. “ViZDoom Competitions: Playing
Doom From Pixels”. In: IEEE Transactions on Games 11.3 (2019).
[144] Karl Van Wyk, Mandy Xie, Anqi Li, Muhammad Asif Rana, Buck Babich, Bryan Peele, Qian Wan,
Iretiayo Akinola, Balakumar Sundaralingam, Dieter Fox, Byron Boots, and Nathan D. Ratliff.
“Geometric Fabrics: Generalizing Classical Mechanics to Capture the Physics of Behavior”. In:
IEEE Robotics Autom. Lett. 7.2 (2022), pp. 3202–3209. doi: 10.1109/LRA.2022.3143311.
[145] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. “Masked Visual Pre-training for
Motor Control”. In: arXiv preprint arXiv:2203.06173 (2022).
[146] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov, and
Alexander J. Smola. “Deep Sets”. In: Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA,
USA. Ed. by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus,
S. V. N. Vishwanathan, and Roman Garnett. 2017, pp. 3391–3401. url: https:
//proceedings.neurips.cc/paper/2017/hash/f22e4747da1aa27e363d86d40ff442fe-Abstract.html.
[147] Boyu Zhou, Fei Gao, Luqi Wang, Chuhao Liu, and Shaojie Shen. “Robust and efficient quadrotor
trajectory generation for fast autonomous flight”. In: IEEE Robotics and Automation Letters 4.4
(2019), pp. 3529–3536.
[148] Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. “Does computer vision matter for action?”
In: Science Robotics 4.30 (2019).
[149] Dingjiang Zhou, Zijian Wang, Saptarshi Bandyopadhyay, and Mac Schwager. “Fast, On-line
Collision Avoidance for Dynamic Vehicles Using Buffered Voronoi Cells”. In: IEEE Robotics
Autom. Lett. 2.2 (2017), pp. 1047–1054. doi: 10.1109/LRA.2017.2656241.
Abstract
Advances in computing hardware and machine learning have enabled a data-driven approach to robotic autonomy where control policies are discovered by analyzing raw data via interactive experience collection and learning. In this thesis we discuss a specific implementation of this approach: we show how policies can be trained in simulated environments using deep reinforcement learning techniques and then deployed on real robotic systems via the sim-to-real paradigm.
We build towards this vision by developing tools for efficient simulation and learning under a constrained computational budget. We improve the systems design of reinforcement learning algorithms and simulators to create high-throughput GPU-accelerated infrastructure for rapid experimentation.
This learning infrastructure is then applied to continuous control problems in challenging domains. We scale up training in a CPU-based quadrotor flight simulator to find robust policies that can control physical quadrotors flying in tight formations. We then use large-batch reinforcement learning in the massively parallel physics simulator Isaac Gym to learn dexterous object manipulation with a multi-fingered robotic hand, and we transfer these skills from simulation to reality using automatic domain randomization.
We distill the lessons learned in these and other projects into a high-throughput learning system, Sample Factory (https://samplefactory.dev/), and release it as an open-source codebase to facilitate and accelerate further progress in the field, as well as to democratize reinforcement learning research and make it accessible to a wider community.
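To give a concrete sense of how the released framework is meant to be used, the short sketch below registers a toy Gymnasium environment with Sample Factory and launches training. The module paths, function names (register_env, parse_sf_args, parse_full_cfg, run_rl), and the environment-factory signature are taken from the public Sample Factory 2.0 documentation rather than from the text of this thesis and may differ between releases, so the snippet should be read as an illustrative sketch rather than canonical usage.

import sys

import gymnasium as gym  # older Sample Factory releases used the classic gym package instead

from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
from sample_factory.envs.env_utils import register_env
from sample_factory.train import run_rl


def make_cartpole(full_env_name, cfg=None, env_config=None, render_mode=None):
    # Sample Factory calls this factory function to create every environment instance.
    return gym.make("CartPole-v1", render_mode=render_mode)


def main():
    # Make the environment known to the framework under a custom name.
    register_env("my_cartpole", make_cartpole)

    # Parse the standard Sample Factory command-line arguments
    # (e.g. --env=my_cartpole --experiment=cartpole_test --num_workers=8).
    parser, _ = parse_sf_args(evaluation=False)
    cfg = parse_full_cfg(parser)

    # Launch the asynchronous PPO training loop.
    return run_rl(cfg)


if __name__ == "__main__":
    sys.exit(main())

Saved as, for example, train_cartpole.py (a hypothetical filename), the script would typically be launched with a command such as: python train_cartpole.py --env=my_cartpole --experiment=cartpole_test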